Benchmarks

Engram ships three reproducible benchmark scripts covering progressively harder long-term memory tasks. The numbers below come from running them against real databases with publicly available datasets. All three use on-device embeddings (free, no API cost at ingest) and the same retrieval pipeline.

[!WARNING] Honest setup note: all three benchmarks ingest memories via add_batch() — raw episodic turns stored verbatim, no LLM extraction at ingest time. This is a deliberate floor measurement of Engram's retrieval layer. add_conversation() (full LLM-based extraction, deduplication, and supersession) is expected to score higher on structured fact types like knowledge-update and preference-following but was not used here. The composer (claude-sonnet-4-6) and judge (claude-opus-4-8) are different models, but both from Anthropic. A stronger model grading a weaker one is stricter than self-judging, but it is still same-vendor; for full independence, re-judge with a non-Anthropic model using --rejudge-only.

Results at a glance

Latest run set: 2026-07-08 (benchmark/runs/lme-final, locomo-final, beam-1m-final).

Benchmark	Dataset	Questions	Accuracy
LongMemEval-S (ICLR 2025)	isolated per-question haystacks	500	90.8%
LoCoMo-10 (ACL 2024)	10 long-running conversations	1,540	94.1%
BEAM 1M (ICLR 2026)	35 conversations, 10 question types	700	79.7%

The pipeline is question-type-blind end to end: retrieval, evidence budgets, and the composer never see the benchmark's question-type label, so every number is directly comparable to a baseline without oracle access to the question type.

Figure: Engram benchmarks at a glance (accuracy, cost, and context savings)

LongMemEval-S: 90.8%

Dataset: 500 questions, each with its own isolated haystack of chat histories
Composer: claude-sonnet-4-6 · Judge: claude-opus-4-8
Embeddings: sentence-transformers/all-MiniLM-L6-v2 (384-d, on-device)
Retrieval: hybrid search + cross-encoder rerank, session-diversified (max 4 turns/session), 60 memories/question
Graph depth: 0 (disabled)

Question type	Accuracy	Raw
knowledge-update	98.7%	77 / 78
single-session-user	98.6%	69 / 70
abstention	96.7%	29 / 30
single-session-assistant	94.6%	53 / 56
single-session-preference	90.0%	27 / 30
temporal-reasoning	89.5%	119 / 133
multi-session	82.0%	109 / 133
Overall	90.8%	454 / 500

The per-type rows sum to 530 rather than 500 because the 30 abstention questions are drawn from the other six types and appear in both places; the Overall row counts each question once.

The 46 failures are essentially all reader-side. A prompt-free retrieval check (--score-retrieval) surfaces the gold answer session for all 470 answerable questions (100% hit-rate, 99.7% mean session recall). Every failure had its evidence sitting in the context block, so the composer is what missed, not the retriever.

Session-level retrieval quality per type (retrieval_quality in summary.json, hybrid-search surface, 470 answerable questions scored):

Question type	Session hit-rate	Mean session recall
knowledge-update	100%	100%
single-session-user	100%	100%
single-session-assistant	100%	100%
single-session-preference	100%	100%
temporal-reasoning	100%	99.6%
multi-session	100%	99.4%

Hit-rate asks "did at least one gold answer session make it into the evidence?"; mean recall asks "what fraction of all gold sessions did?" — only the two multi-gold-session types drop fractionally below 100% on the recall side.
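
The two definitions can be made concrete with a small sketch. The function name and the session ids below are illustrative and not taken from the harness; the only assumption is that each question carries a list of gold session ids and a list of retrieved session ids.

```python
from statistics import mean


def session_metrics(gold_sessions, retrieved_sessions):
    """Return (hit, recall) for one question.

    hit    = 1.0 if at least one gold session was retrieved, else 0.0
    recall = fraction of gold sessions that were retrieved
    """
    gold = set(gold_sessions)
    found = gold & set(retrieved_sessions)
    hit = 1.0 if found else 0.0
    recall = len(found) / len(gold)
    return hit, recall


# Hypothetical questions: (gold session ids, retrieved session ids)
questions = [
    (["s1"], ["s1", "s7", "s9"]),        # single gold session, surfaced
    (["s2", "s3"], ["s2", "s8"]),        # two gold sessions, only one surfaced
    (["s4", "s5"], ["s4", "s5", "s6"]),  # two gold sessions, both surfaced
]

scores = [session_metrics(gold, retrieved) for gold, retrieved in questions]
hit_rate = mean(hit for hit, _ in scores)
mean_recall = mean(recall for _, recall in scores)
print(f"hit-rate {hit_rate:.1%}, mean recall {mean_recall:.1%}")
```

In this toy set every question is a hit, yet mean recall is below 100% because one multi-gold question is missing a session. That is the same shape as the temporal-reasoning and multi-session rows above.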

LoCoMo-10: 94.1%

LoCoMo (ACL 2024) uses 10 long-running synthetic two-person conversations, each spanning hundreds of turns over up to a few dozen sessions. We evaluate categories 1–4 (1,540 questions); category 5 (adversarial) is excluded per the benchmark spec.

Dataset: 1,540 questions across 10 conversations
Composer: claude-sonnet-4-6 · Judge: claude-opus-4-8
Embeddings: sentence-transformers/all-MiniLM-L6-v2 (384-d, on-device)
Retrieval: hybrid search + cross-encoder rerank + lineage traversal, session-diversified (max 6 turns/session), 100 memories/question
Graph depth: 0 (disabled; auto-ingest creates no graph edges, so traversal is a no-op)
Judge rubric: single unified rubric (partial-credit / semantic-equivalence), consistent with published baselines

Category	Accuracy	Raw
single-hop	95.6%	804 / 841
temporal	95.0%	305 / 321
multi-hop	93.6%	264 / 282
open-domain	79.2%	76 / 96
Overall	94.1%	1,449 / 1,540

Open-domain (79.2%) is the honest weak spot. These questions ask about world knowledge that was never stored in the conversation, and no retrieval tuning fills a gap that was never in the corpus.

Turn-level retrieval quality

LoCoMo cites the answer-bearing turns by dia_id, which allows a turn-level IR measurement — strictly finer-grained than LongMemEval's session-level check. The harness intersects those gold turn ids with the ids the hybrid-search surface actually returned, separating a retrieval miss (gold turn never surfaced) from a reading error (surfaced but answered wrong). 1,536 of 1,540 questions carry gold evidence and are scored; turns skipped at ingest (empty/photo-only) can never be retrieved, so these numbers are a conservative floor.

Category	Turn hit-rate	Mean turn recall
temporal	96.3%	95.1%
single-hop	95.8%	95.1%
multi-hop	94.0%	79.7%
open-domain	79.4%	69.0%
Overall	94.6%	90.7%

Two things worth reading off this table:

Multi-hop: mean turn recall is only 79.7% (a multi-hop question cites several gold turns and the evidence budget doesn't always hold all of them), yet accuracy is 93.6% — the composer routinely bridges partial evidence when at least one gold turn (94.0% hit-rate) plus related context is present.
Open-domain: the 79.4% hit-rate mirrors the 79.2% accuracy almost exactly. These questions ask about world knowledge, so their "gold" turns are only obliquely related to the query text — the retrieval gap and the accuracy gap are the same gap.
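
The split between a retrieval miss and a reading error described above can be sketched as a set intersection. This is a minimal illustration, not the harness code: the function name, the outcome labels, and the example dia_id values are all assumptions.

```python
def classify_outcome(gold_turn_ids, retrieved_turn_ids, judged_correct):
    """Label one question as correct, a reading error, or a retrieval miss."""
    if judged_correct:
        return "correct"
    # Wrong answer: was any gold turn in the evidence the composer read?
    surfaced = set(gold_turn_ids) & set(retrieved_turn_ids)
    return "reading_error" if surfaced else "retrieval_miss"


# Hypothetical questions: (gold dia_ids, retrieved dia_ids, judge verdict)
examples = [
    (["D1:3"], ["D1:3", "D2:8"], True),           # answered correctly
    (["D4:1", "D4:7"], ["D4:7", "D9:2"], False),  # gold turn surfaced, answer wrong
    (["D6:5"], ["D2:8", "D9:2"], False),          # gold turn never surfaced
]

outcomes = [classify_outcome(*example) for example in examples]
print(outcomes)
```

Note that the second example counts as a hit even though only one of its two gold turns was retrieved, which is how hit-rate can stay high while mean turn recall drops on multi-hop questions.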

BEAM 1M: 79.7%

BEAM (ICLR 2026) is the hardest of the three. It tests ten distinct question types, including some that require Engram to do things raw retrieval systems fundamentally cannot: identify contradictions between turns, infer chronological event order, and produce full-span conversation summaries. We ran the full 1M-token split: 35 conversations × 20 questions = 700 total.

Dataset: 700 questions across 35 conversations (1M token scale)
Composer: claude-sonnet-4-6 · Judge: claude-opus-4-8
Embeddings: sentence-transformers/all-MiniLM-L6-v2 (384-d, on-device)
Retrieval: hybrid search + cross-encoder rerank + lineage; candidate pool 500 pre-rerank, 100 post-rerank
Scoring: rubric nugget scoring per question (0 / 0.5 / 1.0 per nugget, mean ≥ 0.5 = pass)
Graph depth: 0 (disabled; auto-ingest creates no graph edges, so traversal is a no-op)

Question type	Accuracy	Raw
preference_following	94.3%	66 / 70
instruction_following	88.6%	62 / 70
abstention	82.9%	58 / 70
knowledge_update	82.9%	58 / 70
multi_session_reasoning	82.9%	58 / 70
summarization	80.0%	56 / 70
information_extraction	78.6%	55 / 70
event_ordering	71.4%	50 / 70
contradiction_resolution	68.6%	48 / 70
temporal_reasoning	67.1%	47 / 70
Overall	79.7%	558 / 700

Overall average nugget score: 0.738 (rubric nuggets scored 0 / 0.5 / 1.0; mean ≥ 0.5 = pass). Average nugget recall — the fraction of a question's rubric nuggets the judge scored ≥ 0.5 in the generated answer — is 77.4% (answer_quality.avg_nugget_recall in summary.json).

[!NOTE] Nugget recall is an answer/generation metric, not a retrieval metric: BEAM ships no per-nugget gold evidence ids, so a missing nugget cannot be attributed to retrieval versus composition. For true retrieval-level IR, see LongMemEval's session-level and LoCoMo's turn-level retrieval_quality blocks.
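
The pass rule and the nugget-recall definition above are simple enough to write down. The sketch below assumes the judge returns one score per rubric nugget; the function and key names are illustrative and are not the harness's own.

```python
def score_question(nugget_scores, pass_threshold=0.5):
    """Apply the rubric described above to one question's nugget scores."""
    if not nugget_scores:
        raise ValueError("a question needs at least one rubric nugget")
    if any(score not in (0.0, 0.5, 1.0) for score in nugget_scores):
        raise ValueError("nugget scores must be 0, 0.5, or 1.0")

    mean_score = sum(nugget_scores) / len(nugget_scores)
    # Nugget recall: share of nuggets the judge scored at 0.5 or higher.
    covered = sum(1 for score in nugget_scores if score >= 0.5)
    return {
        "mean_score": mean_score,
        "passed": mean_score >= pass_threshold,
        "nugget_recall": covered / len(nugget_scores),
    }


# Hypothetical judge outputs for two questions
borderline = score_question([1.0, 0.5, 0.0])
weak = score_question([0.5, 0.0, 0.0, 0.0])
print(borderline)
print(weak)
```

The first question passes exactly at the threshold (mean 0.5) with two of three nuggets covered; the second fails with a mean of 0.125 and a nugget recall of 25%.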

Earlier iterations of the BEAM script applied per-type retrieval scaffolding: type-specific evidence budgets, supplemental sub-queries for event_ordering and contradiction_resolution, corpus stratification for summarization, and question-type injection into the composer prompt. All of that has been removed. A production agent doesn't know its question type in advance, so a harness that routes on the type label is grading a different (easier) task.

The current script runs one uniform pipeline for every question: hybrid search over a 500-candidate pool, cross-encoder rerank, MMR diversification (--diversify), and a flat 100-memory evidence slice. The composer sees only the question and the retrieved memories — no type label, no per-type coaching. The 79.7% headline is lower than the old scaffolded number (81.9%) precisely because the leak is gone; it is the honest, apples-to-apples score.

What replaced the scaffolding lives inside Engram itself, where a production agent would actually get it: the verify recall intent runs a counter-evidence search for did-it-happen questions (so a lone "I never did X" turn isn't buried under bulk affirmative evidence), and the composer prompt carries a universal contradiction check that applies to every question rather than firing on a type label.

Where BEAM still fails

Temporal reasoning (67.1%) misses when BEAM carries date information only inside the turn text ("[May 15, 2023] USER: ...") rather than as structured metadata. The recall operator resolves temporal phrases correctly, but when a question asks for date arithmetic ("how many days between X and Y"), Engram has to extract two dates from two separate turns and compute the interval in the composer pass. The temporal_chain recall intent (parallel search per event anchor, evidence merged chronologically) was built for this and helps, but the gap persists whenever the dates are not available as structured metadata.
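
To make the date-arithmetic step concrete, here is a minimal sketch of what "extract two dates from two turns and subtract" involves when the date is an inline prefix. It is not Engram's implementation: the regex, the helper names, and the example turns are assumptions, and month-name parsing with `%B` assumes an English locale.

```python
import re
from datetime import datetime

# Matches an inline prefix such as "[May 15, 2023]" at the start of a turn.
DATE_PREFIX = re.compile(r"^\[([A-Z][a-z]+ \d{1,2}, \d{4})\]")


def turn_date(turn_text):
    """Return the date in the turn's inline prefix, or None if there is none."""
    match = DATE_PREFIX.match(turn_text)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%B %d, %Y").date()


def days_between(turn_a, turn_b):
    """Days between two dated turns; None if either turn has no inline date."""
    date_a, date_b = turn_date(turn_a), turn_date(turn_b)
    if date_a is None or date_b is None:
        return None
    return abs((date_b - date_a).days)


# Hypothetical turns in the "[date] SPEAKER: text" shape quoted above
start = "[May 15, 2023] USER: I started the new job today."
end = "[June 5, 2023] USER: Onboarding is finally done."
relative = "USER: That was three weeks after I started."

print(days_between(start, end))       # 21
print(days_between(start, relative))  # None: no explicit date to anchor on
```

The second call shows why relative phrasing is harder: there is nothing to parse, so the interval has to be inferred from another turn.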

Contradiction resolution (68.6%) requires surfacing a minority turn that says the opposite of what many other turns say. Similarity ranking buries that lone denial under bulk affirmative evidence. The verify recall intent attacks exactly this (a dedicated negation-query search, superseded rows included), and the composer's always-on contradiction check reports both sides when they land in evidence — but when the denial never enters the candidate pool at all, the contradiction is invisible.

Event ordering (71.4%) needs many specific event turns simultaneously in evidence. The broad question query retrieves the topic cluster, but individual events phrased unlike the question can rank below the evidence cutoff; without per-event sub-queries (removed as type scaffolding), ordering rests on what the uniform slice happens to cover.

Latency & context efficiency

Figure: Engram efficiency at a glance (BEAM 1M: 79.7% accuracy, 60,646 evidence tokens per query, 94.1% less than full context, 17× reduction)

Per-question wall-clock and token economics from the 2026-07-08 runs. "Evidence tokens" is what the composer actually reads; "full context" is the size of the raw conversation(s) that evidence was distilled from. The gap between them is the compression Engram buys you: the composer never sees the haystack, only the reranked evidence block.

Benchmark	Ingest	Retrieval	Generation	Total — p50 / avg / p95
LongMemEval-S	12.0 s	5.6 s	8.3 s	25.7 / 25.9 / 36.5 s
LoCoMo-10	amortized *	2.6 s	10.9 s	12.7 / 13.4 / 21.2 s
BEAM 1M	amortized *	2.2 s	27.4 s	22.5 / 29.6 / 79.9 s

LongMemEval ingests a fresh haystack for every question, so its ingest (~12 s) is charged per question. LoCoMo and BEAM ingest each long conversation once and amortize it across all of that conversation's questions, so ingest is not part of the per-question total.

Benchmark	Evidence tokens	Full context	Compression	Search hits
LongMemEval-S	17,707	101,601	82.6%	57
LoCoMo-10	4,619	17,888	74.2%	100
BEAM 1M	60,646	1,033,243	94.1%	100

The BEAM row is the headline: Engram hands the composer 61K evidence tokens distilled from a 1.03M-token conversation, a 94.1% reduction, and the composer answers from that block alone. At Anthropic list prices the composer costs about $0.21 per question on BEAM, versus a projected $3.13 per question if you fed the whole conversation in every time, roughly 15× cheaper. Generation dominates latency on BEAM (27.4 s of the 29.6 s average) because that evidence block is large and dense; retrieval itself is only 2.2 s.
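
The compression column can be recomputed from the two token columns in the table above. The sketch below only restates that arithmetic; the variable and function names are illustrative.

```python
# (evidence tokens, full-context tokens), copied from the table above
TOKEN_ROWS = {
    "LongMemEval-S": (17_707, 101_601),
    "LoCoMo-10": (4_619, 17_888),
    "BEAM 1M": (60_646, 1_033_243),
}


def compression_pct(evidence_tokens, full_tokens):
    """Percentage of the full context the composer never has to read."""
    return (1 - evidence_tokens / full_tokens) * 100


def reduction_factor(evidence_tokens, full_tokens):
    """How many times smaller the evidence block is than the full context."""
    return full_tokens / evidence_tokens


compression = {
    name: round(compression_pct(evidence, full), 1)
    for name, (evidence, full) in TOKEN_ROWS.items()
}
for name, (evidence, full) in TOKEN_ROWS.items():
    factor = reduction_factor(evidence, full)
    print(f"{name}: {compression[name]}% smaller, {factor:.1f}x reduction")
```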

Cost savings

The composer only ever reads the reranked evidence block, never the full conversation. That is where the money is. The table below puts Engram's per-question composer cost next to the no-retrieval baseline: what it would cost to send the entire conversation as context for every question. Both use Anthropic list prices; dollar figures come from provider-billed token usage, and the compression baseline is counted with tiktoken (o200k_base).

Benchmark	Compression	Composer $/q	Full-context $/q	Cheaper by
LongMemEval-S	82.6%	$0.074	$0.325	4.4×
LoCoMo-10	74.2%	$0.028	$0.068	2.4×
BEAM 1M	94.1%	$0.210	$3.13	14.9×

The savings track conversation length. On BEAM's 1M-token conversations the full-context baseline is brutal — $3.13 per question, composer alone — so retrieval pays off 15×. On LoCoMo's short conversations the whole history is already small, so retrieval only buys 2.4×. That is the honest shape of it: the longer your history, the more retrieval saves, and on a short chat it barely matters.

Two things the per-question composer cost leaves out, both in Engram's favor:

Ingest is free. Embeddings run on-device (all-MiniLM-L6-v2), so there is no LLM bill at write time. The full-context baseline has no ingest step to compare against, but it re-reads the entire conversation on every single question, which is the cost the table captures.
The judge is excluded. It runs only to score the benchmark, not in a real deployment. Counting it, the all-in cost per question was $0.089 (LongMemEval), $0.034 (LoCoMo), and $0.267 (BEAM).

Whole-run totals: $44.60 for LongMemEval (500 q), $51.79 for LoCoMo (1,540 q), and $186.62 for BEAM (700 q), composer plus judge.
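
The "Cheaper by" column and the all-in per-question figures follow directly from the numbers already quoted. A short sketch of that arithmetic, using only values from this section (the names are illustrative):

```python
# (composer $/question, full-context $/question), from the cost table above
COST_ROWS = {
    "LongMemEval-S": (0.074, 0.325),
    "LoCoMo-10": (0.028, 0.068),
    "BEAM 1M": (0.210, 3.13),
}

# (whole-run total in $, number of questions), composer plus judge
RUN_TOTALS = {
    "LongMemEval-S": (44.60, 500),
    "LoCoMo-10": (51.79, 1_540),
    "BEAM 1M": (186.62, 700),
}

cheaper_by = {
    name: round(full / composer, 1)
    for name, (composer, full) in COST_ROWS.items()
}
all_in_per_question = {
    name: round(total / questions, 3)
    for name, (total, questions) in RUN_TOTALS.items()
}

for name in COST_ROWS:
    print(f"{name}: {cheaper_by[name]}x cheaper, ${all_in_per_question[name]:.3f} all-in per question")
```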

The shared pipeline

Figure: Shared benchmark pipeline (zero-LLM ingest, three fused retrieval surfaces, one composer pass, independent judge)

All three benchmarks run the same core pipeline:

add_batch() → search() + recall() + get_lineage() → composer LLM

Ingest (add_batch()): raw conversation turns are embedded on-device and written to pgvector. No LLM is called at this stage. Ingestion takes roughly 12 seconds per question on LongMemEval.

Retrieve — three surfaces, all called per question:

API	What it does
`search(mode='hybrid', rerank=True)`	pgvector cosine + PostgreSQL full-text, fused with Reciprocal Rank Fusion, then cross-encoder reranked against the question
`recall(compose_answer=False)`	intent-classified retrieval (current / historical / event / lineage / temporal_chain / verify); passes structured lineage evidence — current value, superseded predecessors, conflict notes — without generating a prose answer. temporal_chain runs one search per event anchor; verify adds a counter-evidence search for the negation of the claim
`get_lineage()`	follows supersession chains so corrected values carry their history into the evidence block
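
For readers unfamiliar with Reciprocal Rank Fusion, the idea behind the hybrid mode can be sketched in a few lines. This is a generic illustration of the technique, not Engram's code: the function name and memory ids are made up, and k=60 is the constant commonly used in the RRF literature rather than a documented Engram setting.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked id lists into one list, best first.

    Each list contributes 1 / (k + rank) for every id it contains,
    with rank starting at 1. Ids missing from a list get nothing from it.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id for a stable order.
    return sorted(scores, key=lambda item_id: (-scores[item_id], item_id))


# Hypothetical result lists from the two retrievers
vector_hits = ["m3", "m1", "m2"]    # cosine-similarity order
fulltext_hits = ["m1", "m4", "m3"]  # full-text rank order

fused = reciprocal_rank_fusion([vector_hits, fulltext_hits])
print(fused)
```

Because RRF uses only ranks, it needs no calibration between cosine scores and full-text scores; the cross-encoder rerank then reorders the fused pool against the question.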

[!NOTE] Engram also exposes traverse_many() for multi-hop graph traversal, but it is not exercised by these benchmarks: the add_batch() ingest path creates no graph edges, so traversal is a no-op. It is available for applications that populate their own relations via add_relation().

Generate: one composer LLM call assembles the evidence block into an answer. The judge runs separately on the same output.

[!NOTE] What this measures: All three benchmarks bypass add_conversation() (Engram's full LLM-extraction pipeline). The scores reflect the retrieval layer as a raw substrate — episodic turns stored verbatim, with all reasoning deferred to query time. add_conversation() adds semantic extraction, fact deduplication, and conflict resolution at ingest; these are expected to improve structured fact types. The benchmark numbers are a floor, not a ceiling.

What each component contributes (LongMemEval ablation)

Configuration	Composer	Rerank	Accuracy
Hybrid search only	Haiku	no	77.8%
+ cross-encoder rerank	Haiku	yes	87.0%
+ stronger composer	Sonnet	yes	90.8%

Reranking is the biggest single lever. The 9.2-point gap between no-rerank and rerank (77.8% to 87.0%) is retrieval quality: irrelevant turns are cut before the composer sees them. The additional 3.8 points from Haiku to Sonnet (87.0% to 90.8%) are reasoning quality over evidence that's already clean.

Evidence budget interacts with question type. Tightening below 60 memories regressed aggregation and multi-session questions. 60 memories over a reranked pool outperformed 30 memories with higher nominal precision, because counting and cross-session reasoning need every relevant turn in context.

Retrieval vs. reader: how much is the prompt?

Two tools isolate where accuracy actually comes from, because end-to-end accuracy conflates retrieval, composer prompt, model, and judge:

--score-retrieval <traces.jsonl> computes a prompt-free retrieval hit-rate: it joins the retrieved session ids against the dataset's gold answer_session_ids. No LLM, no judge. On the 90.8% run, retrieval surfaces the gold answer session 100% of the time (470 / 470 answerable; 99.7% mean session recall). Retrieval is not the bottleneck on LongMemEval; wherever an answer is wrong, the evidence was present.
--dumb-reader swaps the tuned composer for a neutral one-paragraph reader, holding ingest, retrieval, and judge identical. The accuracy delta isolates the prompt's contribution.

On a 100-question Sonnet slice, the dumb reader scores 86% and the tuned composer 91%: a directional +5 points (not statistically significant at this sample size, McNemar p≈0.23), concentrated entirely in hard multi-session questions. The same tuned prompt was net-negative on Haiku. Read together: the substrate (retrieval + the model reading clean evidence) carries ~95% of the result; the 300-line composer prompt is a model-specific top-up, not the engine, and should not be treated as portable accuracy.

A caveat on the retrieval number: hit-rate is measured at session granularity, so it is an upper bound on evidence adequacy (the answer-bearing turn within a retrieved session may still be trimmed by the budget). LoCoMo closes exactly that gap: its dataset cites gold turns by id, so its retrieval_quality block measures turn-level hit-rate (94.6%) and recall (90.7%); see Turn-level retrieval quality.
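
McNemar's test compares two readers on the same questions and looks only at the questions where they disagree. The sketch below implements the exact (binomial) form. The discordant counts are hypothetical: the source reports only the 86% and 91% totals, so 8 and 3 are simply one split consistent with a net +5 on 100 questions, not the run's actual counts.

```python
from math import comb


def mcnemar_exact(only_a_correct, only_b_correct):
    """Two-sided exact McNemar p-value from the two discordant counts.

    Under the null hypothesis each disagreement is equally likely to
    favour either reader, so the smaller count follows Binomial(n, 0.5).
    """
    n = only_a_correct + only_b_correct
    if n == 0:
        return 1.0
    smaller = min(only_a_correct, only_b_correct)
    tail = sum(comb(n, i) for i in range(smaller + 1)) / 2**n
    return min(1.0, 2 * tail)


# Hypothetical split: tuned composer alone correct on 8, dumb reader alone on 3
p_value = mcnemar_exact(8, 3)
print(f"p = {p_value:.3f}")
```

With only 11 disagreements, a net gain of 5 is well within what chance would produce, which is why the +5 points is described as directional.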

Where each benchmark still fails

LongMemEval (46 failures): every failure had the right session in the retrieved evidence, so these are composer errors, not retrieval (the prompt-free check scores retrieval at 100% session hit-rate, 470 / 470 answerable).

LoCoMo open-domain (21% miss rate): world knowledge the system never ingested. Retrieval cannot fill facts that were never stored.

BEAM temporal reasoning (33% miss rate): two-hop date arithmetic. Both event dates are usually in the evidence block, but computing the interval requires the composer to extract two dates from different turns and subtract. Accuracy here depends heavily on how explicitly dates are stated in the conversation. When dates appear as explicit inline text ([May 15, 2023]), the composer usually handles it. When they're implicit ("that was three weeks after I started"), the chain breaks.

BEAM contradiction resolution (31% miss rate): the minority denial turn must enter the candidate pool for the contradiction to be visible at all. The verify recall intent's counter-evidence search finds it when the classifier routes the question as a yes/no verification; when it doesn't, similarity ranking alone can bury the lone denial below the evidence cutoff.

Reproduce it

All scripts are in benchmark/. Data files go in data/.

[!WARNING] LLM API calls for composer and judge are billable. On-device embeddings are free. Set ENGRAM_ANTHROPIC_API_KEY in your .env.

LongMemEval: 90.8% run

python benchmark/longmemeval_benchmark.py \
  --llm-model claude-sonnet-4-6 \
  --judge-model claude-opus-4-8 \
  --rerank \
  --search-limit 60 \
  --max-per-session 4 \
  --local-embedding --embedding-model sentence-transformers/all-MiniLM-L6-v2 --embedding-dimension 384 \
  --concurrency 8 \
  --clean-db \
  --output-dir benchmark/runs/lme-final

LongMemEval: cheaper run (Haiku composer, 87.0%)

python benchmark/longmemeval_benchmark.py \
  --rerank \
  --search-limit 60 \
  --max-per-session 4 \
  --judge-model claude-opus-4-8 \
  --local-embedding --embedding-model sentence-transformers/all-MiniLM-L6-v2 --embedding-dimension 384 \
  --concurrency 8 \
  --graph-depth 0 \
  --clean-db \
  --output-dir benchmark/runs/lme-cheap

LoCoMo-10: 94.1% run

python benchmark/locomo_benchmark.py \
  --conversations 0,1,2,3,4,5,6,7,8,9 \
  --search-limit 100 \
  --max-per-session 6 \
  --rerank \
  --concurrency 8 \
  --local-embedding --embedding-model sentence-transformers/all-MiniLM-L6-v2 --embedding-dimension 384 \
  --llm-model claude-sonnet-4-6 \
  --judge-model claude-opus-4-8 \
  --clean-db \
  --output-dir benchmark/runs/locomo-final

BEAM 1M: 79.7% run

python benchmark/beam_benchmark.py \
  --chat-sizes 1M \
  --llm-model claude-sonnet-4-6 \
  --judge-model claude-opus-4-8 \
  --rerank \
  --diversify \
  --search-limit 100 \
  --candidate-limit 500 \
  --cutoffs 100 \
  --event-ordering-tau \
  --answer-max-tokens 4000 \
  --local-embedding --embedding-model sentence-transformers/all-MiniLM-L6-v2 --embedding-dimension 384 \
  --concurrency 8 \
  --judge-concurrency 10 \
  --clean-db \
  --output-dir benchmark/runs/beam-1m-final-v2

Stress Test

python benchmark/engram_stress_benchmark.py \
  --rerank \
  --local-embedding --embedding-model sentence-transformers/all-MiniLM-L6-v2 --embedding-dimension 384 \
  --concurrency 8 \
  --clean-db \
  --output-dir benchmark/runs/stress-final

Re-score without re-running

python benchmark/longmemeval_benchmark.py \
  --rejudge-only benchmark/runs/lme-final/traces.jsonl \
  --judge-model claude-sonnet-4-6 \
  --output-dir benchmark/runs/lme-rejudge

Output files

Each run writes three files to the output directory:

File	Contents
`traces.jsonl`	Question, gold answer, retrieved evidence, composer answer, retrieval stats — one JSON object per question
`judgments.jsonl`	Per-question verdict with reasoning
`summary.json`	Overall and per-type accuracy, full configuration
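
Because judgments.jsonl holds one JSON object per line, per-type accuracy can be recomputed without re-running anything. The sketch below is only a pattern for streaming such a file: the key names `question_type` and `correct` are assumptions, not the documented schema, so check a real line from your run and adjust them.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path


def accuracy_by_type(judgments_path, type_key="question_type", verdict_key="correct"):
    """Stream a JSONL file and return {question type: accuracy}.

    type_key and verdict_key are hypothetical field names.
    """
    totals = defaultdict(int)
    passed = defaultdict(int)
    with open(judgments_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            row = json.loads(line)
            totals[row[type_key]] += 1
            passed[row[type_key]] += bool(row[verdict_key])
    return {qtype: passed[qtype] / totals[qtype] for qtype in totals}


# Demo on a throwaway file with made-up rows
demo_rows = [
    {"question_type": "multi-session", "correct": True},
    {"question_type": "multi-session", "correct": False},
    {"question_type": "temporal-reasoning", "correct": True},
]
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "judgments.jsonl"
    demo_path.write_text(
        "\n".join(json.dumps(row) for row in demo_rows) + "\n", encoding="utf-8"
    )
    demo_accuracy = accuracy_by_type(demo_path)

print(demo_accuracy)
```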

Retrieval tuning flags

All four scripts (the three benchmarks longmemeval, locomo, and beam, plus the stress test) accept these flags to A/B-test retrieval strategies:

Flag	Default	Effect
`--rerank`	off	Cross-encoder rerank the candidate pool before selection. Biggest accuracy lever.
`--diversify`	off	Replace session round-robin (or top-N slicing in BEAM) with Maximal Marginal Relevance: search() overfetches + reranks internally, then MMR picks the final evidence budget for semantic coverage.
`--mmr-lambda`	0.7	Relevance/diversity trade-off for `--diversify`. 1.0 = pure relevance, 0.0 = pure diversity.

[!TIP] --diversify is most useful for question types that need coverage across many sessions (e.g., BEAM summarization, LoCoMo multi-hop). For single-fact lookups, session round-robin and MMR perform similarly.
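
The trade-off that --mmr-lambda controls is easiest to see in a toy Maximal Marginal Relevance loop. This is a generic sketch of the technique under stated assumptions, not Engram's selection code: candidates are hypothetical (id, relevance, embedding) tuples, and redundancy is taken as the highest cosine similarity to anything already selected.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))


def mmr_select(candidates, budget, lam=0.7):
    """Greedily pick up to `budget` ids, balancing relevance against redundancy.

    lam = 1.0 is pure relevance; lam = 0.0 is pure diversity.
    """
    remaining = list(candidates)
    selected = []
    while remaining and len(selected) < budget:

        def mmr_score(candidate):
            _, relevance, embedding = candidate
            redundancy = max(
                (cosine(embedding, chosen[2]) for chosen in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return [candidate[0] for candidate in selected]


# Hypothetical pool: "b" is a near-duplicate of "a"; "c" covers a different topic
pool = [
    ("a", 0.95, [1.0, 0.0]),
    ("b", 0.90, [0.99, 0.14]),
    ("c", 0.70, [0.0, 1.0]),
]

diverse = mmr_select(pool, budget=2, lam=0.7)
relevance_only = mmr_select(pool, budget=2, lam=1.0)
print(diverse, relevance_only)
```

At the default 0.7 the near-duplicate is skipped in favour of the less relevant but novel candidate; at 1.0 the selection collapses to plain top-N by relevance.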

Notes for the community

Judge stronger than composer: all three headline runs use claude-sonnet-4-6 as the composer and claude-opus-4-8 as the judge. A stronger model grading a weaker one's output is stricter than self-judging, but composer and judge still share a vendor. For an independence check, re-judge with a non-Anthropic model via --rejudge-only.

BEAM is a newer and harder benchmark: unlike LongMemEval and LoCoMo, BEAM includes question types that test the retrieval system's ability to surface contradictions, reconstruct event orderings, and summarize across full conversation spans. The 79.7% headline comes from a pipeline that is question-type-blind end to end — no per-type retrieval tricks, budgets, or composer coaching — so it is directly comparable to any baseline without oracle access to the question type. (An earlier scaffolded harness scored 81.9%; that number was retired as a type-label leak.)

**add_batch() vs add_conversation()**: these benchmarks deliberately use add_batch() (raw episodic turn storage, zero ingest LLM calls) to isolate the retrieval layer. Production use of add_conversation() performs LLM-based fact extraction, deduplication, and supersession at write time, which reduces retrieval noise for structured fact types. The benchmark scores are a lower bound on what the full Engram pipeline can achieve.

Reproducibility: given the same model versions and configuration, runs reproduce within ~1%. Accuracy changes meaningfully with embedding model choice, reranking, evidence budget, and composer strength — all exact parameters are stored in summary.json alongside the scores.