How Good Is Agent Memory, Really? We Benchmarked Datost Against the Field

Agent-memory benchmark numbers are everywhere and almost none are comparable. We ran Datost Brain on LoCoMo and LongMemEval with fixed, disclosed protocols: best-in-class retrieval (98.6 recall@5), an honest 79.2 on end-to-end QA, and the enforcement layer nobody else benchmarks.

Every agent-memory company claims state of the art. Mem0 reports 92.5 on LoCoMo and 94.4 on LongMemEval. ByteRover says 92.2. Zep said 84 on LoCoMo, until Mem0 recomputed it to 58, Zep re-litigated to 75, and then formally admitted the error. The numbers are everywhere, and almost none of them are comparable. We ran our memory system, Datost Brain, on the field’s own benchmarks with disclosed, fixed protocols, and tried to reproduce everyone else’s. Here is the honest result, including the parts that don’t flatter us: best-in-class retrieval, an end-to-end QA number we’re proud of for the right reasons, and an enforcement layer no memory leaderboard measures at all.

There aren’t two metrics, there are three regimes

The reason memory numbers don’t line up is that “memory benchmark” covers three different things, and vendors slide numbers between them.

Relational retrieval (precision/recall@k): “Who invested in Acme? Who was in that meeting?” Pure memory, no LLM reader, no judge.
Conversational retrieval recall@k: did the system surface the right session in the top-k? Still pure memory.
End-to-end QA accuracy: retrieve, then an LLM writes an answer, then an LLM judge grades it. This one is dominated by the reader model and the judge, not the memory.
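
To make the first two regimes concrete, here is a minimal sketch of precision@k and recall@k for a single query. The function names and the toy session IDs are illustrative assumptions, not code from the gbrain-evals or LongMemEval harnesses, and a real harness may define a "hit" differently (for example at session level rather than item level).

```python
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k slots filled by relevant items."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top-k."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


# Toy example (hypothetical IDs): 2 relevant sessions, 5 retrieved.
ranked = ["s7", "s2", "s9", "s4", "s1"]
relevant = {"s2", "s4"}
print(precision_at_k(ranked, relevant, 5))  # 0.4
print(recall_at_k(ranked, relevant, 5))     # 1.0
```

Neither function calls a reader or a judge, which is why these two regimes isolate the memory system.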

When a vendor says “94 on LongMemEval,” they rarely tell you which regime, or which reader and judge model. That ambiguity is where the marketing lives.

Three tells that a headline QA number isn’t what it looks like

The oracle tell. On LoCoMo’s standard protocol, a gpt-4o-mini reader handed the full conversation scores 72.9%. That’s the ceiling for that reader. Yet ByteRover claims 92.2, MemMachine 84.9, and Mem0 (2026) 92.5, all above the ceiling. The only way past it is a stronger reader or judge than the baselines they’re plotted against. ByteRover’s best run uses Gemini 3 Pro to write the answers and Gemini 3 Flash to grade them. That measures Gemini 3’s reading comprehension, not the memory.

The Category-5 tell. LoCoMo has 1,986 questions, but the standard protocol scores only 1,540. It excludes 446 adversarial “Category 5” questions that are easy to ace. Zep’s 84% counted them; recomputed without them, it was about 58. We report our own number both ways below, because it’s the cleanest illustration there is.

The recall-isn’t-accuracy tell. A system can retrieve the right evidence 99% of the time and still answer 60% of questions correctly. The reader is a separate axis. As the gbrain README puts it: “100% retrieval recall can coexist with 60% QA accuracy.” We prove it on our own system in a minute.
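
The recall-versus-accuracy point can be made with simple arithmetic. The sketch below assumes end-to-end accuracy decomposes as recall × a + (1 − recall) × b, where a is the reader's accuracy when the evidence was retrieved and b is its accuracy when it was not. Since b is unknown but bounded by 0 and 1, a is pinned to a narrow range. The function name is hypothetical, and the decomposition assumes recall and accuracy are measured over the same question set.

```python
from typing import Tuple


def reader_accuracy_bounds(recall: float, qa_accuracy: float) -> Tuple[float, float]:
    """Bounds on P(correct | evidence retrieved) given recall and QA accuracy.

    Solves qa = recall * a + (1 - recall) * b for a, with b in [0, 1].
    """
    if not 0 < recall <= 1 or not 0 <= qa_accuracy <= 1:
        raise ValueError("recall must be in (0, 1], qa_accuracy in [0, 1]")
    miss = 1 - recall
    upper = min(1.0, qa_accuracy / recall)           # b = 0: misses never answered
    lower = max(0.0, (qa_accuracy - miss) / recall)  # b = 1: misses always answered
    return lower, upper


# The illustrative figures from the paragraph above: 99% recall, 60% accuracy.
low, high = reader_accuracy_bounds(recall=0.99, qa_accuracy=0.60)
print(f"reader accuracy on retrieved evidence: {low:.1%} to {high:.1%}")
```

Under these assumptions the reader turns retrieved evidence into a correct answer only about 60% of the time, however the misses are scored. Almost all of the lost accuracy sits in the reader.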

Regime 1: relational retrieval, our home turf

This is the axis Garry Tan’s gbrain-evals measures, and we ran Datost Brain on gbrain’s own sealed harness.

System	Per-corpus schema?	P@5	R@5
Datost Brain (typed graph)	yes	93.2	99.0
gbrain	yes (4 templates)	49.1	97.9
Datost Brain (schema-free retriever)	no	64.5	76.9

Brain beats gbrain on precision by 44 points, and on F1, on gbrain’s own benchmark. gbrain stays within about a point on raw recall (97.9 against 99.0), but only through a grep dragnet that leaves its precision at roughly half of Brain’s. And a schema-free version of our retriever, with no per-corpus rules at all, reaches R@5 88.7 on a held-out domain it was never tuned on.
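
For readers who want the F1 comparison spelled out, the sketch below computes the harmonic mean of the P@5 and R@5 columns in the table above. This is F1 of the aggregate figures, which is an assumption: a harness that averages per-query F1 can report slightly different values.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same scale in, same scale out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P@5, R@5) pairs copied from the table above.
rows = {
    "Datost Brain (typed graph)": (93.2, 99.0),
    "gbrain": (49.1, 97.9),
    "Datost Brain (schema-free retriever)": (64.5, 76.9),
}

for name, (p, r) in rows.items():
    print(f"{name}: F1@5 = {f1(p, r):.1f}")
```

On these aggregates the typed graph comes out near 96, gbrain near 65, and the schema-free retriever near 70.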

Regime 2: conversational retrieval recall, also best in class

LongMemEval_S, 500 questions, retrieval only:

Metric	Datost Brain	Field reference
LongMemEval recall@5	98.6	gbrain 97.6, MemPalace 96.6
recall@1 / recall@10	90.2 / 99.6	—

98.6 with a cheap cosine rerank and no LLM. On the metric that actually isolates memory, Brain is at the top of the field. This is the retrieval half of the problem, and it’s effectively solved.

Regime 3: end-to-end QA, where we found the gap and closed most of it honestly

Here’s the part nobody else publishes cleanly. We ran Brain retrieval, then a standard disclosed reader, then the official judge prompts. First with a thin single-shot reader, which exposed the real lesson, then with a reasoning reader (decompose, re-retrieve per hop, step-by-step dated reasoning) on the same disclosed gpt-4o / gpt-4o-mini models, no frontier judge.

Benchmark	Thin reader	Reasoning reader	Honest field references
LongMemEval_S (500Q, gpt-4o)	66.2	79.2	full-context 60.6, with retrieval 65.7, gold-evidence oracle 82.4, Supermemory 81.6
LoCoMo (cats 1-4, gpt-4o-mini)	60.0	63.2	oracle 72.9, Mem0 66.9, Zep 66.0, Letta-fs 74.0
LoCoMo including Category 5	66.3	—	the inflated number we refuse to quote
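
The difference between the two reader columns is easier to see in code. What follows is a toy sketch of the two shapes described above (single-shot versus decompose, re-retrieve per hop, then reason over dated evidence). It is not Datost Brain's implementation: every name here (`Evidence`, `thin_reader`, `reasoning_reader`, the `toy_*` stand-ins and the tiny index) is hypothetical, and the model calls are replaced by plain functions so the example runs offline.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, List


@dataclass(frozen=True)
class Evidence:
    date: str  # ISO date (YYYY-MM-DD) of the source session
    text: str


Retrieve = Callable[[str], List[Evidence]]
Answer = Callable[[str, List[Evidence]], str]


def thin_reader(question: str, retrieve: Retrieve, answer: Answer) -> str:
    """Single shot: retrieve once on the raw question, then answer."""
    return answer(question, retrieve(question))


def reasoning_reader(
    question: str,
    decompose: Callable[[str], List[str]],
    retrieve: Retrieve,
    answer: Answer,
) -> str:
    """Decompose into hops, retrieve per hop, answer over dated evidence."""
    hops = decompose(question) or [question]
    gathered: List[Evidence] = []
    for hop in hops:
        for ev in retrieve(hop):
            if ev not in gathered:  # drop duplicates across hops
                gathered.append(ev)
    gathered.sort(key=lambda ev: ev.date)  # chronological order for date math
    return answer(question, gathered)


# Toy stand-ins (assumptions) for the index and the model calls.
TOY_INDEX = {
    "trip start": [Evidence("2023-05-02", "Flew out to Lisbon.")],
    "trip end": [Evidence("2023-05-09", "Got home from Lisbon.")],
}


def toy_decompose(question: str) -> List[str]:
    return ["trip start", "trip end"]


def toy_retrieve(query: str) -> List[Evidence]:
    return TOY_INDEX.get(query, [])


def toy_answer(question: str, evidence: List[Evidence]) -> str:
    if len(evidence) < 2:
        return "unknown"
    first = date.fromisoformat(evidence[0].date)
    last = date.fromisoformat(evidence[-1].date)
    return f"{(last - first).days} days"


question = "How long was the trip?"
print(thin_reader(question, toy_retrieve, toy_answer))
print(reasoning_reader(question, toy_decompose, toy_retrieve, toy_answer))
```

In this toy setup the single-shot path finds no dated evidence and answers "unknown", while the per-hop path collects both dates and can do the subtraction.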

Four honest conclusions:

The thin reader proved the thesis on our own system. Retrieval recall was 98.6, but thin QA accuracy was only 66.2, a gap of about 32 points that is the reader, not the memory. Every vendor sitting 20 points above the thin number got there with reader engineering or a frontier judge.

So we did the reader engineering, honestly, and it worked. The reasoning reader lifts LongMemEval from 66.2 to 79.2. That clears the full-context oracle (60.6) by about 19 points and lands within about 3 points of the gold-evidence oracle (82.4), competitive with Supermemory’s 81.6, all on a disclosed gpt-4o reader. The lift comes from reasoning, not a stronger judge: temporal questions went from 40.6 to 82.7 on explicit dated arithmetic.

The gains scale with reader strength, and we report that honestly. The same pipeline lifts LoCoMo only from 60.0 to 63.2, because the Mem0 paper protocol mandates a gpt-4o-mini reader that can’t exploit multi-step reasoning as fully. We’re still a touch behind Mem0 and Zep on LoCoMo, and we say so. For context, LoCoMo’s answer key has been independently measured as 6.4% wrong, and its judge accepts 63% of deliberately wrong answers, so sub-6-point gaps there are noise.

We still refuse the Category-5 trick. Including LoCoMo’s adversarial split would inflate our thin 60.0 to 66.3, the same move behind Zep’s 84, which recomputed to about 58 without it. We quote the 60.0 and 63.2.
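
The size of the Category-5 effect can be backed out from the two thin-reader figures and the question counts quoted earlier (1,986 total, 446 in Category 5). The sketch below assumes both accuracies are per-question averages, with 60.0 taken over the 1,540 scored questions and 66.3 over all 1,986. The function name is hypothetical.

```python
TOTAL_QUESTIONS = 1986  # all LoCoMo questions
CAT5_QUESTIONS = 446    # adversarial Category 5
SCORED_QUESTIONS = TOTAL_QUESTIONS - CAT5_QUESTIONS  # 1540, categories 1-4


def implied_cat5_accuracy(acc_without: float, acc_with: float) -> float:
    """Accuracy on Category 5 implied by the scores with and without it."""
    correct_with = acc_with * TOTAL_QUESTIONS         # correct answers, all questions
    correct_without = acc_without * SCORED_QUESTIONS  # correct answers, categories 1-4
    return (correct_with - correct_without) / CAT5_QUESTIONS


cat5 = implied_cat5_accuracy(acc_without=0.600, acc_with=0.663)
print(f"implied Category 5 accuracy: {cat5:.1%}")
```

Under those assumptions the thin reader scores roughly 88% on Category 5 against 60% on the rest, which is why folding it in moves the headline by about six points without any change to the memory system.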

Everything here is reproducible in one command

Self-reported numbers are worthless in this field. Ours are a script: public datasets, official judge prompts, checkpointed runs, the model named every time.

pnpm brain:bench:longmemeval        # Regime 2 retrieval recall@5 -> 98.6
pnpm brain:bench:qa:longmemeval     # Regime 3 end-to-end QA (gpt-4o) -> 66.2 thin
pnpm brain:bench:qa:locomo          # Regime 3 LoCoMo QA (gpt-4o-mini) -> 60.0 (cats 1-4)
# gbrain head-to-head runs against the sealed garrytan/gbrain-evals harness.

If you can reproduce a competitor’s 92.5 on the official protocol with a disclosed judge, we want to see the command. We couldn’t, and the oracle math says it isn’t there.

The part that isn’t on any leaderboard

Mem0, Zep, MemPalace, gbrain: they are all retrieval. Datost Brain is the only one that also enforces. It blocks a bad SQL query before it touches a customer’s warehouse. It isolates memory by scope (org, database, table, column, thread, user). It ladders authority, so a confirmed contract outranks a stale guess. And it audits every retrieval and every block. For an analytics agent writing SQL against your database, that capability gap, not a recall decimal, is the whole game. It’s the same reason grounding matters so much for text-to-SQL accuracy: the model has to use your metric definitions and semantic layer, and memory is how those corrections stick.
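
To give a feel for what scope isolation and an authority ladder mean in practice, here is a small illustrative sketch. It is not Datost Brain's code or API. The `Authority` levels, the `Memory` record, the prefix-based scope rule, and the tie-break on scope depth are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List, Optional, Tuple

Scope = Tuple[str, ...]  # e.g. ("org:acme", "db:sales", "table:orders")


class Authority(IntEnum):
    GUESS = 1      # inferred by the agent, unverified
    OBSERVED = 2   # seen in data or past queries
    CONFIRMED = 3  # explicitly confirmed by a person


@dataclass(frozen=True)
class Memory:
    scope: Scope
    key: str
    value: str
    authority: Authority


def visible(memory_scope: Scope, request_scope: Scope) -> bool:
    """A memory is visible only if its scope is a prefix of the request scope."""
    return request_scope[: len(memory_scope)] == memory_scope


def resolve(memories: List[Memory], request_scope: Scope, key: str) -> Optional[Memory]:
    """Pick the highest-authority visible memory; break ties by narrower scope."""
    candidates = [m for m in memories if m.key == key and visible(m.scope, request_scope)]
    if not candidates:
        return None
    return max(candidates, key=lambda m: (m.authority, len(m.scope)))


memories = [
    Memory(("org:acme",), "revenue", "gross bookings", Authority.GUESS),
    Memory(("org:acme", "db:sales"), "revenue", "net of refunds", Authority.CONFIRMED),
    Memory(("org:other",), "revenue", "unrelated definition", Authority.CONFIRMED),
]

hit = resolve(memories, ("org:acme", "db:sales", "table:orders"), "revenue")
print(hit.value if hit else None)  # net of refunds
```

The confirmed definition wins over the stale guess, and the other organization's memory never becomes a candidate, whatever its authority.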

The honest bottom line

Brain has best-in-class retrieval (it beats gbrain on gbrain’s own harness and tops LongMemEval recall at 98.6), and a reasoning reader that turns that retrieval into 79.2 on LongMemEval, within about 3 points of the gold-evidence oracle and competitive with the honest leaders, on a disclosed gpt-4o reader with zero metric-gaming. We were transparent when the thin reader put us mid-pack, we showed exactly why (the reader, not the memory), and then we closed most of the gap the honest way: by reasoning better, not by swapping in a frontier judge or counting the excluded questions. Plus a category of enforcement work no memory system on the leaderboard does at all.

We’ll publish a defensible 79 over a fragile 92 every time. A number that dies under scrutiny is worse than no number. This is the same standard we held ourselves to on BIRD-Interact. Want to see it for yourself? Run the commands, or read how the memory feeds grounding on every query.