RelayForge™ · Research Foundations · 26 Sources
The papers, systems, benchmarks, and standards that informed the RelayForge architecture. Our 46-source research survey was commissioned May 2026 and validated against our live codebase. Every design decision links back to something here.
Park et al. (2023). Introduced the memory stream, retrieval scoring, and reflection pass that showed agents can build coherent behavior from accumulated observations. Direct inspiration for our post-session reflection architecture.
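As a rough illustration of the retrieval scoring idea in Park et al., the sketch below ranks memories by a weighted sum of recency, importance, and relevance. It is not RelayForge's implementation: the `MemoryItem` fields, the decay rate, and the equal weights are all assumptions chosen for the example, and importance and relevance are assumed to be pre-normalized to the 0..1 range.

```typescript
// Hypothetical shape of one memory-stream entry; field names are illustrative.
interface MemoryItem {
  text: string;
  lastAccessedHoursAgo: number; // hours since the memory was last retrieved
  importance: number; // assumed 0..1, e.g. rated by an LLM when the memory is stored
  relevance: number; // assumed 0..1, e.g. embedding similarity to the current query
}

// Assumed decay rate and equal weights; a real system would tune both.
const DECAY_PER_HOUR = 0.995;
const WEIGHTS = { recency: 1, importance: 1, relevance: 1 };

// Weighted sum of the three signals; recency decays exponentially with time.
function retrievalScore(m: MemoryItem): number {
  const recency = Math.pow(DECAY_PER_HOUR, m.lastAccessedHoursAgo);
  return (
    WEIGHTS.recency * recency +
    WEIGHTS.importance * m.importance +
    WEIGHTS.relevance * m.relevance
  );
}

// Return the k highest-scoring memories without mutating the input array.
function topK(memories: MemoryItem[], k: number): MemoryItem[] {
  return [...memories]
    .sort((a, b) => retrievalScore(b) - retrievalScore(a))
    .slice(0, k);
}
```

Only the top-ranked memories would be placed into the context window; everything else stays in persistent storage until a later query makes it relevant.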
Shinn et al. (2023). Demonstrated that agents improve dramatically when they reflect on failures and store that reflection as memory rather than discarding it. Foundation of our consolidation pass.
Wang et al. (2023). First clear demonstration of a compounding procedural skill library in an LLM agent. Lobster skill memory (agent_skills) is directly informed by Voyager's approach, tempered by the SWE-Skills-Bench caveat that skills must be selectively routed, not blindly loaded.
ACT-R, SOAR, and the broader cognitive science literature established the working / episodic / semantic / procedural memory taxonomy. Every serious persistent agent converges on this shape. Our six-layer architecture is this taxonomy made concrete.
Packer et al. (2023). Stateful agent runtime with self-editing memory and explicit page-in/page-out from persistent storage. Demonstrated that LLMs can manage their own memory across sessions without human curation.
Production agent runtime built on the MemGPT research. Our context engineering design — specifically the current-session cutoff and structured fact injection — is informed by Letta's in-context storage model.
Memory extraction and consolidation pipeline for AI agents. Extracts structured facts from conversation history through an LLM pass. Our reflection route (memory-reflection.ts) follows the Mem0 consolidation pattern.
Temporal graph memory system for AI agents. Stores facts as a time-aware knowledge graph where you can query by entity, time, and causality — not just semantic similarity. Phase 4 target for lobster temporal reasoning.
Checkpointed state machine model for long-running agent workflows. LangGraph's thesis — that agents are state machines, not chatbots — directly informed our session ledger and episode design.
Published research and engineering guidance on managed agents, context window management, and memory tooling. The 39% performance improvement and 84% token reduction findings cited throughout our architecture come from Anthropic's agentic-search benchmarks.
Benchmark for long-term memory in conversational AI. Evaluates whether agents can correctly recall user-stated facts across sessions. Our agent_memory_facts table design is partly a response to LongMemEval's failure modes.
Extended version with harder temporal reasoning tasks and larger session gaps.
Evaluates agent memory attribution — can the agent correctly identify where a fact came from (user-stated, inferred, tool-returned)? Drove our decision to store source and confidence on every memory fact.
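To make the source-and-confidence idea concrete, here is a minimal sketch of a memory fact record and one way conflicting facts might be resolved. The `MemoryFact` fields, the `SOURCE_RANK` trust ordering, and `resolveConflict` are hypothetical names invented for this example; the actual agent_memory_facts schema is not shown on this page.

```typescript
// The three provenance categories named in the entry above.
type FactSource = "user_stated" | "inferred" | "tool_returned";

// Hypothetical record shape; not the real agent_memory_facts schema.
interface MemoryFact {
  key: string;
  value: string;
  source: FactSource;
  confidence: number; // assumed 0..1
}

// Assumed trust ordering: what the user said outranks tool output,
// which outranks the agent's own inferences.
const SOURCE_RANK: Record<FactSource, number> = {
  user_stated: 3,
  tool_returned: 2,
  inferred: 1,
};

// Pick the fact to keep when two facts disagree about the same key:
// the higher-ranked source wins, and confidence breaks ties.
function resolveConflict(a: MemoryFact, b: MemoryFact): MemoryFact {
  if (SOURCE_RANK[a.source] !== SOURCE_RANK[b.source]) {
    return SOURCE_RANK[a.source] > SOURCE_RANK[b.source] ? a : b;
  }
  return a.confidence >= b.confidence ? a : b;
}
```

Storing provenance on every fact is what makes a rule like this possible: without the `source` field, a confident inference could silently overwrite something the user stated directly.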
Personal context and tool-use benchmark. Tests whether agents can use personal context (preferences, history, relationships) correctly when calling tools. Directly relevant to the lobster use case.
Evaluates agents over extended time periods with evolving user goals. One of the few benchmarks that explicitly tests the personal agent scenario RelayForge is targeting.
General AI assistant benchmark covering real-world tasks requiring multi-step reasoning, tool use, and knowledge retrieval. Sets a general quality baseline.
Multi-environment benchmark for LLM agents covering web, database, OS, and coding tasks.
Benchmark for policy adherence and user interaction fidelity in enterprise agent workflows. Tests whether agents follow scoped permissions and approval gates correctly.
Computer use benchmark on real OS environments. Tests GUI navigation, file management, and desktop application control.
Evaluated injected skill files for software engineering agents. Key finding: many skill injections added token overhead without improving task performance. Drove our decision to selectively route skills rather than bulk-load them.
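The difference between selective routing and bulk loading can be sketched as follows. This is an illustrative example under simple assumptions, not the routing logic RelayForge ships: the `Skill` shape, the keyword-overlap heuristic, and the `maxSkills` and `minOverlap` parameters are all hypothetical.

```typescript
// Hypothetical skill record; a real skill file would carry more metadata.
interface Skill {
  name: string;
  keywords: string[];
  body: string; // the instructions that would be injected into context
}

// Select only skills whose keywords overlap with the task description,
// capped at maxSkills, instead of injecting every skill into the prompt.
function routeSkills(
  task: string,
  skills: Skill[],
  maxSkills = 2,
  minOverlap = 1,
): Skill[] {
  const taskWords = new Set(task.toLowerCase().split(/\W+/).filter(Boolean));
  return skills
    .map((skill) => ({
      skill,
      overlap: skill.keywords.filter((k) => taskWords.has(k.toLowerCase())).length,
    }))
    .filter((scored) => scored.overlap >= minOverlap)
    .sort((a, b) => b.overlap - a.overlap)
    .slice(0, maxSkills)
    .map((scored) => scored.skill);
}
```

A production router would more likely use embedding similarity or a learned classifier than keyword overlap, but the principle is the same: a skill that does not match the task costs tokens without helping, so it stays out of context.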
Vulnerability and prompt injection benchmark. Tests robustness of agent tool-call pipelines against adversarial inputs. Informed our Clawmark security gate design.
Benchmark measuring supply-chain attack surface through injected skill files and third-party tool surfaces. Confirmed that third-party tools must be verified before lobsters can access them.
RelayForge proprietary benchmark. The industry's first rigorous evaluation of AI models for unassisted industrial deployment (refinery, manufacturing, process control). Air-gapped and confidential. Published failure data only — no inflated scores.
Open specification for tool ecosystems and agent-to-tool authorization. Every lobster tool is MCP-discoverable. The Clawmark G2 gate requires MCP-compatible manifests.
NIST AI RMF. The Carapace protocol was developed with NIST AI governance principles in mind. We've submitted standards contributions to NIST and ISA.
Industrial cybersecurity standard. Informs the DAWES benchmark's security requirements and the authorization model for lobsters deployed in operational technology environments.
This list covers the primary influences on our architecture. The full research survey commissioned for Phase 3 spans 46 sources including additional benchmark papers, production system analyses, and cognitive science literature not listed here.
Read how we apply this research →
© 2026 RelayForge, Inc. · Anacortes, WA · relayforge.tools