RelayForge™ · Research Foundations · 26 Sources

What We're Built On

The papers, systems, benchmarks, and standards that informed the RelayForge architecture. Our 46-source research survey was commissioned in May 2026 and validated against our live codebase. Every design decision links back to something here.

PAPERS (4) · SYSTEMS (6) · BENCHMARKS (13) · STANDARDS (3)
PAPERS · Foundational Research

Park et al. (2023). Introduced the memory stream, retrieval scoring, and reflection pass that showed agents can build coherent behavior from accumulated observations. Direct inspiration for our post-session reflection architecture.
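As a rough illustration of the retrieval scoring idea, the sketch below combines recency, importance, and relevance into a single score, following the general shape described by Park et al. The names (`MemoryRecord`, `retrievalScore`, `topK`), the equal weights, and the hourly decay value are assumptions made for this example; it is not RelayForge's implementation.

```typescript
// Hypothetical record shape; field names are illustrative only.
interface MemoryRecord {
  text: string;
  importance: number;       // assumed pre-normalized to [0, 1]
  relevance: number;        // assumed similarity to the query, in [0, 1]
  hoursSinceAccess: number; // time since the memory was last retrieved
}

// Exponential recency decay: 1.0 when just accessed, falling toward 0.
function recency(hoursSinceAccess: number, decayPerHour = 0.995): number {
  return Math.pow(decayPerHour, Math.max(0, hoursSinceAccess));
}

// Weighted sum of the three signals; equal weights are an assumption.
function retrievalScore(
  m: MemoryRecord,
  w = { recency: 1, importance: 1, relevance: 1 }
): number {
  return (
    w.recency * recency(m.hoursSinceAccess) +
    w.importance * m.importance +
    w.relevance * m.relevance
  );
}

// Return the k highest-scoring memories without mutating the input array.
function topK(memories: MemoryRecord[], k: number): MemoryRecord[] {
  return memories
    .slice()
    .sort((a, b) => retrievalScore(b) - retrievalScore(a))
    .slice(0, k);
}
```

A reflection pass could then run over the top-scoring memories rather than the full stream, which keeps the prompt small as observations accumulate.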

Shinn et al. (2023). Demonstrated that agents improve dramatically when they reflect on failures and store that reflection as memory rather than discarding it. Foundation of our consolidation pass.

Wang et al. (2023). First clear demonstration of a compounding procedural skill library in an LLM agent. Lobster skill memory (agent_skills) is directly informed by Voyager's approach, tempered by the SWE-Skills-Bench caveat that skills must be selectively routed, not blindly loaded.

Cognitive Architecture (40 Years)
Paper

ACT-R, SOAR, and the broader cognitive science literature established the working / episodic / semantic / procedural memory taxonomy. Every serious persistent agent converges on this shape. Our six-layer architecture is this taxonomy made concrete.

SYSTEMS · Memory Systems & Frameworks

Packer et al. (2023). Stateful agent runtime with self-editing memory and explicit page-in/page-out from persistent storage. Demonstrated that LLMs can manage their own memory across sessions without human curation.

Production agent runtime built on the MemGPT research. Our context engineering design — specifically the current-session cutoff and structured fact injection — is informed by Letta's in-context storage model.

System

Memory extraction and consolidation pipeline for AI agents. Extracts structured facts from conversation history through an LLM pass. Our reflection route (memory-reflection.ts) follows the Mem0 consolidation pattern.

Temporal graph memory system for AI agents. Stores facts as a time-aware knowledge graph that can be queried by entity, time, and causality, not just semantic similarity. Phase 4 target for lobster temporal reasoning.
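To make "query by entity and time" concrete, here is a minimal sketch of a time-aware fact store. The `TemporalFact` shape and the `factsAt` helper are hypothetical names chosen for this example, and the sketch covers only entity and time lookups, not causality or any particular product's API.

```typescript
// Hypothetical time-aware fact: an edge that holds during [validFrom, validTo).
interface TemporalFact {
  subject: string;
  predicate: string;
  object: string;
  validFrom: number;      // epoch milliseconds
  validTo: number | null; // null means the fact is still valid
}

// Return the facts that mention `entity` and were valid at time `at`.
function factsAt(facts: TemporalFact[], entity: string, at: number): TemporalFact[] {
  return facts.filter(
    (f) =>
      (f.subject === entity || f.object === entity) &&
      f.validFrom <= at &&
      (f.validTo === null || at < f.validTo)
  );
}

// Example data: the same question has different answers at different times.
const exampleFacts: TemporalFact[] = [
  { subject: "alice", predicate: "works_at", object: "acme",
    validFrom: Date.UTC(2022, 0, 1), validTo: Date.UTC(2024, 0, 1) },
  { subject: "alice", predicate: "works_at", object: "globex",
    validFrom: Date.UTC(2024, 0, 1), validTo: null },
];
```

Storing a validity interval on each fact, rather than overwriting it, is what lets a query distinguish what was true in 2023 from what is true now.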

Checkpointed state machine model for long-running agent workflows. LangGraph's thesis — that agents are state machines, not chatbots — directly informed our session ledger and episode design.
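The "agents are state machines" idea can be sketched in a few lines. This is a toy illustration under stated assumptions: the `Step` names, the `Checkpoint` shape, and the `run` helper are hypothetical, and none of it is LangGraph's API or the RelayForge session ledger.

```typescript
type Step = "plan" | "act" | "review" | "done";

// Hypothetical checkpoint: the current step plus a log of completed steps.
interface Checkpoint {
  step: Step;
  log: string[];
}

// Fixed transition table for this toy workflow.
const nextStep: Record<Step, Step> = {
  plan: "act",
  act: "review",
  review: "done",
  done: "done",
};

// Advance one step and return a new checkpoint; the previous one is left intact.
function advance(cp: Checkpoint): Checkpoint {
  return { step: nextStep[cp.step], log: cp.log.concat(cp.step) };
}

// Run from any checkpoint until "done", persisting after every transition.
function run(start: Checkpoint, save: (cp: Checkpoint) => void): Checkpoint {
  let cp = start;
  while (cp.step !== "done") {
    cp = advance(cp);
    save(cp);
  }
  return cp;
}
```

Because every transition is saved, a workflow interrupted after any step can be resumed by passing the last saved checkpoint back to `run`.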

Published research and engineering guidance on managed agents, context window management, and memory tooling. The 39% performance improvement and 84% token reduction findings cited throughout our architecture come from Anthropic's agentic-search benchmarks.

BENCHMARKS · Benchmarks & Evaluation
LongMemEval
Benchmark

Benchmark for long-term memory in conversational AI. Evaluates whether agents can correctly recall user-stated facts across sessions. Our agent_memory_facts table design is partly a response to LongMemEval's failure modes.

LongMemEval-V2
Benchmark

Extended version with harder temporal reasoning tasks and larger session gaps.

AMA-Bench
Benchmark

Evaluates agent memory attribution — can the agent correctly identify where a fact came from (user-stated, inferred, tool-returned)? Drove our decision to store source and confidence on every memory fact.
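A minimal sketch of what storing source and confidence on every fact might look like. The three source categories come from the description above; the `MemoryFact` field names, the [0, 1] confidence range, and the helper functions are assumptions for illustration and do not reflect the actual agent_memory_facts schema.

```typescript
// Source categories taken from the attribution question above.
type FactSource = "user_stated" | "inferred" | "tool_returned";

// Hypothetical fact shape; the real table columns may differ.
interface MemoryFact {
  text: string;
  source: FactSource;
  confidence: number; // assumed range [0, 1]
}

// Build a fact, rejecting confidence values outside [0, 1] (including NaN).
function makeFact(text: string, source: FactSource, confidence: number): MemoryFact {
  if (!(confidence >= 0 && confidence <= 1)) {
    throw new RangeError("confidence must be in [0, 1], got " + confidence);
  }
  return { text, source, confidence };
}

// Keep only facts from allowed sources at or above a confidence floor.
function trustedFacts(
  facts: MemoryFact[],
  allowed: FactSource[],
  minConfidence: number
): MemoryFact[] {
  return facts.filter(
    (f) => allowed.indexOf(f.source) !== -1 && f.confidence >= minConfidence
  );
}
```

With attribution recorded at write time, a caller can, for example, inject only user-stated facts into a prompt and treat inferred ones as lower-priority hints.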

ASTRA-bench
Benchmark

Personal context and tool-use benchmark. Tests whether agents can use personal context (preferences, history, relationships) correctly when calling tools. Directly relevant to the lobster use case.

LifelongAgentBench
Benchmark

Evaluates agents over extended time periods with evolving user goals. One of the few benchmarks that explicitly tests the personal agent scenario RelayForge is targeting.

Benchmark

General AI assistant benchmark covering real-world tasks requiring multi-step reasoning, tool use, and knowledge retrieval. Sets a general quality baseline.

Benchmark

Multi-environment benchmark for LLM agents covering web, database, OS, and coding tasks.

tau-bench
Benchmark

Benchmark for policy adherence and user interaction fidelity in enterprise agent workflows. Tests whether agents follow scoped permissions and approval gates correctly.

Benchmark

Computer use benchmark on real OS environments. Tests GUI navigation, file management, and desktop application control.

SWE-Skills-Bench
Benchmark

Evaluated injected skill files for software engineering agents. Key finding: many skill injections added token overhead without improving task performance. Drove our decision to selectively route skills rather than bulk-load them.
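The difference between routing and bulk-loading can be shown with a small sketch. The keyword-overlap scoring below is a deliberately simple stand-in chosen for this example, and the `Skill` shape and `routeSkills` function are hypothetical names; the real router may use a different matching method.

```typescript
// Hypothetical skill descriptor; field names are illustrative only.
interface Skill {
  name: string;
  keywords: string[];
}

// Score a skill by how many of its keywords appear as words in the task.
function overlap(task: string, skill: Skill): number {
  const words = task.toLowerCase().split(/\W+/);
  return skill.keywords.filter((k) => words.indexOf(k.toLowerCase()) !== -1).length;
}

// Select only matching skills, capped at maxSkills; non-matching skills are
// never returned, so they add no tokens to the prompt.
function routeSkills(task: string, skills: Skill[], maxSkills: number): Skill[] {
  return skills
    .map((skill) => ({ skill, score: overlap(task, skill) }))
    .filter((s) => s.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, maxSkills)
    .map((s) => s.skill);
}
```

The cap and the zero-score filter are the two levers here: the first bounds token overhead, the second keeps irrelevant skills out entirely.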

VPI-Bench
Benchmark

Vulnerability and prompt injection benchmark. Tests robustness of agent tool-call pipelines against adversarial inputs. Informed our Clawmark security gate design.

Skill-Inject
Benchmark

Benchmark measuring supply-chain attack surface through injected skill files and third-party tool surfaces. Confirmed that third-party tools must be verified before lobsters can access them.

DAWES — Domain Anchored Workplace Expertise Standard
Benchmark

RelayForge proprietary benchmark. The industry's first rigorous evaluation of AI models for unassisted industrial deployment (refinery, manufacturing, process control). Air-gapped and confidential. Published failure data only — no inflated scores.

STANDARDS · Standards & Specifications

Open specification for tool ecosystems and agent-to-tool authorization. Every lobster tool is MCP-discoverable. The Clawmark G2 gate requires MCP-compatible manifests.

NIST AI RMF. The Carapace protocol was developed with NIST AI governance principles in mind. We've submitted standards contributions to NIST and ISA.

ISA / IEC 62443
Standard

Industrial cybersecurity standard. Informs the DAWES benchmark's security requirements and the authorization model for lobsters deployed in operational technology environments.

This list covers the primary influences on our architecture. The full research survey commissioned for Phase 3 spans 46 sources including additional benchmark papers, production system analyses, and cognitive science literature not listed here.

© 2026 RelayForge, Inc. · Anacortes, WA · relayforge.tools