RelayForge™ · Research Foundations · 26 Sources

What We're Built On

The papers, systems, benchmarks, and standards that informed the RelayForge architecture. The 26 sources listed here are the primary influences from our 46-source research survey, which was commissioned in May 2026 and validated against our live codebase. Every design decision links back to something here.

PAPERS (4) · SYSTEMS (6) · BENCHMARKS (13) · STANDARDS (3)

PAPERS · Foundational Research

Generative Agents: Interactive Simulacra of Human Behavior ↗ 2023

Paper

Park et al. (2023). Introduced the memory stream, retrieval scoring, and reflection pass that showed agents can build coherent behavior from accumulated observations. Direct inspiration for our post-session reflection architecture.
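The retrieval scoring described in the paper can be sketched as follows. This is a simplified illustration, not RelayForge code: the `MemoryRecord` shape and the function names are hypothetical, and relevance is assumed to be a precomputed similarity score. The paper combines recency (exponential decay with a factor of 0.995 per hour since last retrieval), importance, and relevance, each min-max normalized and weighted equally.

```typescript
// Simplified sketch of memory-stream retrieval scoring (after Park et al., 2023).
// All type and function names are hypothetical, chosen for illustration.
interface MemoryRecord {
  text: string;
  lastAccessedHoursAgo: number; // hours since the memory was last retrieved
  importance: number;           // 1..10 rating (assigned by an LLM in the paper)
  relevance: number;            // similarity to the query, assumed precomputed
}

// Min-max normalize values into [0, 1]; a constant list maps to all zeros.
function minMax(values: number[]): number[] {
  const lo = Math.min(...values);
  const hi = Math.max(...values);
  return values.map((v) => (hi === lo ? 0 : (v - lo) / (hi - lo)));
}

// Rank memories by the equally weighted sum of normalized
// recency, importance, and relevance, highest score first.
function rankMemories(memories: MemoryRecord[], decay = 0.995): MemoryRecord[] {
  const recency = minMax(memories.map((m) => Math.pow(decay, m.lastAccessedHoursAgo)));
  const importance = minMax(memories.map((m) => m.importance));
  const relevance = minMax(memories.map((m) => m.relevance));
  return memories
    .map((m, i) => ({ m, score: recency[i] + importance[i] + relevance[i] }))
    .sort((a, b) => b.score - a.score)
    .map((x) => x.m);
}
```

Calling `rankMemories` with a list of records returns the same records ordered by combined score, so a caller can take the first few entries as the context to inject.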

Reflexion: Language Agents with Verbal Reinforcement Learning ↗ 2023

Paper

Shinn et al. (2023). Demonstrated that agents improve dramatically when they reflect on failures and store that reflection as memory rather than discarding it. Foundation of our consolidation pass.

Voyager: An Open-Ended Embodied Agent with Large Language Models ↗ 2023

Paper

Wang et al. (2023). First clear demonstration of a compounding procedural skill library in an LLM agent. Lobster skill memory (agent_skills) is directly informed by Voyager's approach, tempered by the SWE-Skills-Bench caveat that skills must be selectively routed, not blindly loaded.

Cognitive Architecture (40 Years)

Paper

ACT-R, SOAR, and the broader cognitive science literature established the working / episodic / semantic / procedural memory taxonomy. Persistent agent designs tend to converge on this shape. Our six-layer architecture is this taxonomy made concrete.

SYSTEMS · Memory Systems & Frameworks

MemGPT / Letta ↗ 2023

System

Packer et al. (2023). Stateful agent runtime with self-editing memory and explicit page-in/page-out from persistent storage. Demonstrated that LLMs can manage their own memory across sessions without human curation.

Letta (MemGPT successor) ↗

System

Production agent runtime built on the MemGPT research. Our context engineering design — specifically the current-session cutoff and structured fact injection — is informed by Letta's in-context storage model.

Mem0 ↗

System

Memory extraction and consolidation pipeline for AI agents. Extracts structured facts from conversation history through an LLM pass. Our reflection route (memory-reflection.ts) follows the Mem0 consolidation pattern.

Graphiti / Zep ↗

System

Temporal graph memory system for AI agents. Stores facts as a time-aware knowledge graph where you can query by entity, time, and causality — not just semantic similarity. Phase 4 target for lobster temporal reasoning.

LangGraph ↗

System

Checkpointed state machine model for long-running agent workflows. LangGraph's thesis — that agents are state machines, not chatbots — directly informed our session ledger and episode design.

Anthropic: Agents and Context Engineering ↗ 2024–2026

System

Published research and engineering guidance on managed agents, context window management, and memory tooling. The 39% performance improvement and 84% token reduction findings cited throughout our architecture come from Anthropic's agentic-search benchmarks.

BENCHMARKS · Benchmarks & Evaluation

LongMemEval

Benchmark

Benchmark for long-term memory in conversational AI. Evaluates whether agents can correctly recall user-stated facts across sessions. Our agent_memory_facts table design is partly a response to LongMemEval's failure modes.

LongMemEval-V2

Benchmark

Extended version with harder temporal reasoning tasks and larger session gaps.

AMA-Bench

Benchmark

Evaluates agent memory attribution — can the agent correctly identify where a fact came from (user-stated, inferred, tool-returned)? Drove our decision to store source and confidence on every memory fact.
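A minimal sketch of what storing source and confidence on every fact could look like. The field names, the `FactSource` values, and the 0.8 threshold are assumptions made for illustration; the actual agent_memory_facts schema is not shown here.

```typescript
// Hypothetical shape of a memory fact; field names are illustrative assumptions.
type FactSource = "user_stated" | "inferred" | "tool_returned";

interface MemoryFact {
  subject: string;
  statement: string;
  source: FactSource;  // where the fact came from
  confidence: number;  // 0..1
}

// Build a fact, rejecting confidence values outside [0, 1].
function makeFact(
  subject: string,
  statement: string,
  source: FactSource,
  confidence: number,
): MemoryFact {
  if (!Number.isFinite(confidence) || confidence < 0 || confidence > 1) {
    throw new RangeError(`confidence must be in [0, 1], got ${confidence}`);
  }
  return { subject, statement, source, confidence };
}

// Keep facts that are user-stated or meet the confidence threshold;
// everything else would need hedging or re-verification before use.
function assertableFacts(facts: MemoryFact[], minConfidence = 0.8): MemoryFact[] {
  return facts.filter(
    (f) => f.source === "user_stated" || f.confidence >= minConfidence,
  );
}
```

Keeping provenance on the record lets the filtering policy change later without re-extracting the facts themselves.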

ASTRA-bench

Benchmark

Personal context and tool-use benchmark. Tests whether agents can use personal context (preferences, history, relationships) correctly when calling tools. Directly relevant to the lobster use case.

LifelongAgentBench

Benchmark

Evaluates agents over extended time periods with evolving user goals. One of the few benchmarks that explicitly tests the personal agent scenario RelayForge is targeting.

GAIA ↗ 2023

Benchmark

General AI assistant benchmark covering real-world tasks requiring multi-step reasoning, tool use, and knowledge retrieval. Sets a general quality baseline.

AgentBench ↗ 2023

Benchmark

Multi-environment benchmark for LLM agents covering web, database, OS, and coding tasks.

tau-bench

Benchmark

Benchmark for policy adherence and user interaction fidelity in enterprise agent workflows. Tests whether agents follow scoped permissions and approval gates correctly.

OSWorld ↗ 2024

Benchmark

Computer use benchmark on real OS environments. Tests GUI navigation, file management, and desktop application control.

SWE-Skills-Bench

Benchmark

Evaluated injected skill files for software engineering agents. Key finding: many skill injections added token overhead without improving task performance. Drove our decision to selectively route skills rather than bulk-load them.
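Selective routing, as opposed to bulk loading, can be sketched as below. This is an illustrative assumption, not the RelayForge router: the `Skill` shape, the keyword-overlap scoring, and the token budget are all hypothetical stand-ins for whatever matching the real system uses.

```typescript
// Hypothetical skill record; names and fields are illustrative assumptions.
interface Skill {
  name: string;
  keywords: string[];
  tokenCost: number; // estimated tokens the skill adds to the prompt
}

// Select only skills whose keywords appear in the task, best matches first,
// and stop adding skills that would exceed the token budget.
function routeSkills(task: string, skills: Skill[], tokenBudget: number): Skill[] {
  const words = new Set(task.toLowerCase().split(/\W+/).filter(Boolean));
  const scored = skills
    .map((s) => ({
      s,
      hits: s.keywords.filter((k) => words.has(k.toLowerCase())).length,
    }))
    .filter((x) => x.hits > 0) // skills with no match are never loaded
    .sort((a, b) => b.hits - a.hits);

  const chosen: Skill[] = [];
  let used = 0;
  for (const { s } of scored) {
    if (used + s.tokenCost <= tokenBudget) {
      chosen.push(s);
      used += s.tokenCost;
    }
  }
  return chosen;
}

// Usage: only the matching skill is loaded for this task.
const exampleSkills: Skill[] = [
  { name: "sql-migrations", keywords: ["sql", "migration"], tokenCost: 300 },
  { name: "react-forms", keywords: ["react", "form"], tokenCost: 400 },
  { name: "git-rebase", keywords: ["git", "rebase"], tokenCost: 200 },
];
const exampleChosen = routeSkills("Fix the failing SQL migration", exampleSkills, 500);
console.log(exampleChosen.map((s) => s.name)); // → ["sql-migrations"]
```

The point of the sketch is the shape of the decision: a skill that does not match the task costs zero tokens, which addresses the overhead the benchmark reported for bulk injection.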

VPI-Bench

Benchmark

Vulnerability and prompt injection benchmark. Tests robustness of agent tool-call pipelines against adversarial inputs. Informed our Clawmark security gate design.

Skill-Inject

Benchmark

Benchmark measuring supply-chain attack surface through injected skill files and third-party tool surfaces. Confirmed that third-party tools must be verified before lobsters can access them.

DAWES — Domain Anchored Workplace Expertise Standard

Benchmark

RelayForge proprietary benchmark. The industry's first rigorous evaluation of AI models for unassisted industrial deployment (refinery, manufacturing, process control). Air-gapped and confidential. We publish failure data only, with no inflated scores.

STANDARDS · Standards & Specifications

Model Context Protocol (MCP) ↗

Spec

Open specification for tool ecosystems and agent-to-tool authorization. Every lobster tool is MCP-discoverable. The Clawmark G2 gate requires MCP-compatible manifests.

NIST AI Risk Management Framework ↗

Standard

NIST AI RMF. The Carapace protocol was developed with NIST AI governance principles in mind. We've submitted standards contributions to NIST and ISA.

ISA / IEC 62443

Standard

Industrial cybersecurity standard. Informs the DAWES benchmark's security requirements and the authorization model for lobsters deployed in operational technology environments.

This list covers the primary influences on our architecture. The full research survey commissioned for Phase 3 spans 46 sources including additional benchmark papers, production system analyses, and cognitive science literature not listed here.
