How We Scored 92.2% on LoCoMo — The Hardest AI Memory Benchmark
Most AI memory APIs claim "persistent memory." None of them prove it works. We decided to put Smara through the hardest publicly available test for conversational memory: the LoCoMo benchmark. Starting at 62.3%, we iterated through 9 versions over two days and landed at 92.2%.
This post is the full, unedited story—every optimization that worked, every dead end, and the architecture that got us there.
What is LoCoMo?
LoCoMo (Long Conversational Memory) is an academic benchmark published in 2024 designed to test whether AI systems can recall facts from long, multi-session conversations. It's the closest thing to a standardized test for AI memory.
The dataset consists of 10 long conversations between two speakers, spanning multiple sessions over weeks and months. From these conversations, researchers constructed 1,540 questions across four difficulty categories:
- Single-hop (easiest): "What restaurant did Alice mention?"
- Multi-hop (requires cross-referencing): "Did Bob's opinion about remote work change over the course of the conversations?"
- Temporal (the hardest): "What was Alice doing last Tuesday?"
- Open-domain: mixes conversational facts with world knowledge.
Standard RAG systems score 40–60% on LoCoMo. The benchmark is deliberately hard. It tests not just retrieval but also temporal reasoning, contradiction resolution, and the ability to synthesize across long contexts.
Most memory APIs don't publish benchmark results at all. Smara is the only open-source memory API to do so.
Why It Matters
If you're building an AI agent with persistent memory, you need to know one thing: does the memory system actually work? Not "does the API respond" or "can it store and retrieve data"—but can it find the right fact from a conversation that happened three weeks ago when a user asks a slightly different question?
Without benchmarks, you're trusting marketing copy. And the gap between marketing copy and reality is often enormous. We've seen memory systems that:
- Return semantically similar but wrong facts (the New York restaurant when the user moved to San Francisco)
- Fail on temporal queries because they don't track when facts were established
- Choke on multi-hop questions because they can't cross-reference across sessions
- Degrade silently as the number of stored memories grows
LoCoMo catches all of these failure modes. That's why we chose it, and that's why we're publishing our results transparently—including the versions where we went backwards.
The Journey: 62% → 92% in 9 Versions
v2 — The Baseline (62.3%)
Raw dialogue stored as 5-turn chunks. Pure vector search (Voyage AI embeddings, 1024 dimensions). Gemini Flash Lite as the evaluation judge.
The obvious problem: vector search alone misses keyword-critical facts. When a user asks "What did Alice say about sushi?", vector similarity might return memories about Japanese food, restaurants, or cooking—but miss the exact passage that mentions sushi by name. Open-domain questions (Cat4) scored reasonably because they're more forgiving of imprecise retrieval. Everything else struggled.
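In code, the entire v2 approach fits in a few lines. This is a minimal sketch rather than Smara's actual implementation: `embed()` is a placeholder for the Voyage AI call, and the helper names are ours.

```python
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    """Placeholder for the embedding call (v2 used Voyage AI, 1024 dimensions)."""
    raise NotImplementedError

def chunk_dialogue(turns: list[str], size: int = 5) -> list[str]:
    """Store raw dialogue as 5-turn chunks, exactly as v2 did."""
    return ["\n".join(turns[i:i + size]) for i in range(0, len(turns), size)]

def retrieve(question: str, chunks: list[str], chunk_vecs: np.ndarray, k: int = 8) -> list[str]:
    """Pure cosine-similarity retrieval: the entirety of v2's search strategy."""
    q = embed([question])[0]
    sims = chunk_vecs @ q / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q) + 1e-9)
    return [chunks[i] for i in np.argsort(-sims)[:k]]
```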
v4 — The Regression (42.3%)
LLM fact extraction via Gemini, graph augmentation, reranking pipeline. Scores collapsed because extracted facts lost conversational context.
We thought: if raw chunks are noisy, extract clean facts. We ran every conversation through Gemini to extract structured facts, built an entity graph, and added a reranking pipeline. The result? A 20-point drop.
The lesson was painful but important: more complex isn't always better. LLM extraction stripped away context that turned out to be critical for answering questions. The graph retrieval added noise instead of signal. Single-hop questions (Cat1) collapsed from 54.6% to 29.4% because the extracted facts often omitted the exact details the questions asked about.
v6 — Hybrid Search Breakthrough (78.6%)
Added BM25 keyword search, Reciprocal Rank Fusion, temporal timestamp resolution, speaker injection into chunks, and a more lenient evaluation judge.
This was the single biggest leap. Instead of trying to make vector search smarter, we added BM25 keyword search alongside it and fused the results using Reciprocal Rank Fusion (RRF). Vector search finds semantically similar facts. BM25 finds exact keyword matches. RRF combines both rankings into one.
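The fusion step itself is tiny. Here is a sketch, assuming each retriever returns a ranked list of chunk IDs; k = 60 is the conventional RRF constant, and the exact value here is illustrative:

```python
from collections import defaultdict

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: each ranking votes 1 / (k + rank) for its items."""
    scores: dict[str, float] = defaultdict(float)
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# fused = rrf_fuse([vector_ranked_ids, bm25_ranked_ids])
```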
We also resolved timestamps in the conversation data and injected speaker names into each chunk, so the system could distinguish "Alice said X" from "Bob said X." The combination of hybrid search + temporal resolution + better evaluation added 16 points in one version.
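The speaker and timestamp injection is as simple as it sounds; the exact format below is illustrative:

```python
def format_turn(date: str, speaker: str, text: str) -> str:
    # v2 stored only the raw text; from v6 on, each line carries who said it
    # and when, so "Alice said X" and "Bob said X" stop colliding in retrieval.
    return f"[{date}] {speaker}: {text}"
```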
v7 — Multi-Query Refinement (81.0%)
Added query rephrasing, concise answer prompts, list matching in the judge. Incremental gains across all categories.
Diminishing returns territory. We added query rephrasing (searching with both the original query and a rewritten version), tightened the answer prompt to force concise responses, and improved the judge's ability to match list-type answers. Solid +2.4% but the easy wins were gone.
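Conceptually this is just one extra retrieval pass and another fusion. A sketch, where `rephrase_with_llm` is a hypothetical stand-in for the rewriting call and `rrf_fuse` is the fusion function shown earlier:

```python
def multi_query_search(question: str, search) -> list[str]:
    # v7: search with the raw question and an LLM rewrite, then fuse with RRF.
    queries = [question, rephrase_with_llm(question)]
    return rrf_fuse([search(q) for q in queries])
```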
v8 — Sideways Move (80.8%)
Switched to Gemini 2.5 Flash for answers, added a temporal pipeline for Cat3. Multi-hop improved significantly but temporal degraded.
A mixed result. Switching the answer model to Gemini 2.5 Flash pushed Cat2 (multi-hop) up to 88.8% but Cat3 (temporal) dropped to 47.9%. The temporal pipeline we added was actually hurting more than helping. We also attempted Claude Sonnet for answers but got rate-limited on the OAuth plan—couldn't complete a full benchmark run.
This version taught us that the bottleneck wasn't the answer model. It was retrieval.
v9 — The Breakthrough (92.2%)
LLM query decomposition: 5-pass retrieval with Gemini-generated sub-queries. Wider fact pool (up to 25 unique facts). Semantic overlap judge. Fixed 221 of 296 retrieval failures.
The key insight: complex questions need to be broken into sub-queries. When a user asks "Did Alice's opinion about remote work change over the conversations?", a single vector search won't find both the early and late references. But if you decompose it into "Alice opinion remote work early conversations", "Alice opinion remote work later", "Alice remote work change"—you get the full picture.
v9 uses a 5-pass retrieval strategy:
- Direct query — search the original question as-is
- Entity extraction — extract key entities and search for each
- 3 Gemini sub-queries — have the LLM generate three diverse reformulations
- Temporal anchoring — add date-specific searches for temporal questions
- Rephrase — a final rephrased version of the original question
All results are merged and deduplicated into a pool of up to 25 unique facts. This fixed 221 out of 296 retrieval failures from v8. The facts were always in the database—we just weren't finding them with a single query.
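Here is the shape of that fan-out. Every helper below (`extract_entities`, `gemini_subqueries`, `temporal_anchors`, `rephrase`) is a hypothetical stand-in for the real call, but the five passes and the 25-fact cap match what v9 does:

```python
def decompose_and_retrieve(question: str, search, max_facts: int = 25) -> list[str]:
    """v9's 5-pass retrieval: fan out, merge, and dedupe into one fact pool."""
    queries = [question]                         # 1. direct query, as-is
    queries += extract_entities(question)        # 2. one search per key entity
    queries += gemini_subqueries(question, n=3)  # 3. three diverse LLM reformulations
    queries += temporal_anchors(question)        # 4. date-specific variants for temporal questions
    queries.append(rephrase(question))           # 5. final rephrasing of the original

    pool, seen = [], set()
    for q in queries:
        for fact in search(q):
            if fact not in seen:                 # dedupe across passes
                seen.add(fact)
                pool.append(fact)
                if len(pool) >= max_facts:       # cap the pool at 25 unique facts
                    return pool
    return pool
```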
We also overhauled the judge: semantic word overlap at 50% threshold with reverse checking, explicit paraphrase acceptance, and a lower F1 fallback (0.20 instead of 0.30). And critically, we set Gemini's thinkingBudget to 0 to stop reasoning token leakage into answers.
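For reference, the semantic-overlap check is roughly this. The thresholds match the ones above; the naming is ours:

```python
def word_overlap(a: str, b: str) -> float:
    """Fraction of a's words that also appear in b."""
    a_words, b_words = set(a.lower().split()), set(b.lower().split())
    return len(a_words & b_words) / max(len(a_words), 1)

def overlap_judge(expected: str, answer: str, threshold: float = 0.50) -> bool:
    # Check both directions ("reverse checking") so a long paraphrase can still
    # match a short gold answer, and vice versa.
    return (word_overlap(expected, answer) >= threshold
            or word_overlap(answer, expected) >= threshold)
```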
Final category breakdown: three of the four categories finished above 90%, with temporal reasoning (Cat3) as the laggard at 78.1% (more on that below).
What We Learned (Dead Ends)
Not everything we tried worked. Some approaches that seemed promising turned out to be counterproductive:
We had the LLM rerank the top 30 retrieval results down to the 8 most relevant. Result: 75%. The reranker was throwing away facts that seemed irrelevant to it but were actually critical for answering the question. Aggressive filtering hurts more than it helps.
Our v2 baseline used Gemini Flash Lite to judge whether answers were correct. It scored only 47% agreement with ground truth. The model was too weak to reliably judge semantic equivalence, marking correct paraphrases as wrong. We switched to Claude Haiku 4.5 for LLM judging and added programmatic pre-checks (substring match, F1 score, date equivalence) to handle the easy cases without LLM calls.
We wanted to test whether a stronger answer model would improve accuracy. But Sonnet hit 429 rate limits on the Max plan OAuth token—couldn't complete a full 1,540-question benchmark run. Gemini 2.5 Flash, which has a generous free tier, turned out to be both free and highly effective.
We tried routing different question types to different prompt personas. The persona injection confused the model and produced worse answers across all categories. Simple, direct prompts outperformed clever routing.
The Final Architecture
The v9 pipeline that achieved 92.2%:

- Storage: dialogue chunks with speaker names and resolved timestamps injected into the text
- Retrieval: hybrid search (Voyage AI vectors + BM25), fused with Reciprocal Rank Fusion
- Query decomposition: 5-pass retrieval, merged and deduplicated into a pool of up to 25 unique facts
- Answering: Gemini 2.5 Flash with a concise-answer prompt and thinkingBudget set to 0
- Judging: programmatic pre-checks first, Claude Haiku 4.5 only for what they can't settle
The judge pipeline alone handles 80% of evaluations without an LLM call:
| Judge Method | Questions Handled | % of Total |
|---|---|---|
| Substring match | 680 | 44.2% |
| F1 score (≥ 0.5) | 345 | 22.4% |
| LLM (Haiku 4.5) | 303 | 19.7% |
| Semantic overlap | 129 | 8.4% |
| Date equivalence | 48 | 3.1% |
| List matching | 32 | 2.1% |
| Yes/no | 3 | 0.2% |
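The ordering matters: cheap programmatic checks run first, and the LLM only sees what they can't settle. Below is a compressed sketch of that cascade (it reuses `overlap_judge` from earlier, `llm_judge` is a placeholder for the Haiku 4.5 call, and the real pipeline also includes the date, list, and yes/no checks):

```python
from typing import Callable

def token_f1(expected: str, answer: str) -> float:
    """Token-level F1 between the gold answer and the model answer."""
    e, a = expected.lower().split(), answer.lower().split()
    if not e or not a:
        return 0.0
    common = sum(min(e.count(t), a.count(t)) for t in set(e))
    if common == 0:
        return 0.0
    precision, recall = common / len(a), common / len(e)
    return 2 * precision * recall / (precision + recall)

def judge(expected: str, answer: str, llm_judge: Callable[[str, str], bool]) -> bool:
    if expected.lower() in answer.lower():   # substring match
        return True
    if token_f1(expected, answer) >= 0.5:    # F1 pre-check
        return True
    if overlap_judge(expected, answer):      # semantic overlap
        return True
    return llm_judge(expected, answer)       # Claude Haiku 4.5, last resort
```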
Total inference cost: $0. Gemini 2.5 Flash on the free tier for answer generation. Claude Haiku on the Anthropic Max plan for judging. Smara API self-hosted. The entire benchmark runs in about 2.5 hours.
What's Next
Three of four categories are above 90%. The remaining weak spot is temporal reasoning at 78.1%. These are questions like "What was Alice doing on March 15th?" or "Did Bob mention this before or after their trip?"
The fix likely requires a dedicated timeline index or knowledge graph that explicitly models when facts were established and how they relate chronologically. Pure vector+BM25 retrieval, even with multi-query, struggles with "before/after" reasoning because it fundamentally operates on semantic similarity, not temporal ordering.
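To be concrete about the direction (this is a sketch of the idea, not something Smara ships today), a timeline index would store each fact with the date it was established and answer before/after questions by ordering rather than similarity:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class TimelineFact:
    speaker: str
    text: str
    established: date   # when the fact was stated, resolved from the session timestamp

def facts_between(timeline: list[TimelineFact], start: date, end: date) -> list[TimelineFact]:
    """Answer before/after questions by chronological ordering, not similarity."""
    return sorted((f for f in timeline if start <= f.established <= end),
                  key=lambda f: f.established)
```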
The benchmark script is open source. You can run it against your own memory system and compare results. We believe that publishing benchmarks—including the failures—is the only way to build trust in AI memory systems.
For reference, the full version-by-version progression:

| Version | Architecture | Score |
|---|---|---|
| v2 | Raw chunks + vector search | 62.3% |
| v4 | LLM fact extraction + graph (regression) | 42.3% |
| v5 | Metis full stack (RRF+BM25+graph) | 65.0% |
| v6 | Temporal + speaker + overlap + lenient judge | 78.6% |
| v7 | + Multi-query + retry + concise prompt | 81.0% |
| v8 | Gemini Flash + temporal pipeline | 80.8% |
| v9 | Query decomposition + semantic judge | 92.2% |
Smara is open source and free. Try the memory API that actually proves it works.
Try Smara Free → View Benchmark Code →