EchoMem Research · LongMemEval
Evidence-First Memory: EchoMem Reaches 95.8% on LongMemEval
Reliable AI memory is not remembering everything. It is knowing what the evidence actually supports.
Long-term memory for AI assistants is usually framed as a retrieval problem. Can the system find the right fact? Can it surface the right conversation? Can it pull a relevant memory into context?
But retrieval is only the beginning. A useful memory system has to do something harder: it has to know what the evidence actually supports. It has to distinguish old facts from new ones, resolve details spread across sessions, infer preferences without hallucinating them, and refuse to answer when the available memory does not justify the question.
That is the lens we used to evaluate EchoMem on LongMemEval.
EchoMem reached 95.8% accuracy across 500 LongMemEval questions. A second run with GPT-4o scored 90.4% on the same benchmark. The result is encouraging, but the more interesting lesson is not simply that EchoMem remembered more. It is that reliable memory depends on evidence.
Why LongMemEval Matters
Many memory benchmarks test whether a system can retrieve a fact hidden in a long context. That is useful, but it does not fully capture what real AI assistants need to do.
Real assistant memory is messy. A user's preferences change. Facts become outdated. Important details appear across multiple sessions. Some questions contain false premises. Some answers require temporal reasoning rather than direct lookup. A good memory system has to reason over all of that without turning plausible guesses into confident answers.
| Category | What it tests |
|---|---|
| Knowledge update | Whether the system can track changed facts |
| Multi-session | Whether it can combine evidence across conversations |
| Temporal reasoning | Whether it understands time, order, and duration |
| Single-session user | Whether it recalls user-side facts from one session |
| Single-session assistant | Whether it recalls assistant-side facts from one session |
| Single-session preference | Whether it can infer preferences from evidence |
That makes LongMemEval more than a retrieval benchmark. It is closer to a behavioral test for memory: can the system answer only when the memory actually warrants the answer?
The Result
EchoMem scored 95.80% across 500 LongMemEval questions in the headline run. The GPT-4o run scored 90.40% on the same 500-question benchmark.
Figure: Category scores across the same 500-question benchmark.
The headline run answered 479/500 questions correctly, with 21 misses and 0 unscored questions. The GPT-4o run answered 452/500 questions correctly, with 48 misses.
Both runs used the same EchoMem memory stack and were scored with the official LongMemEval GPT-4o scorer.
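The percentages follow directly from the reported counts. The short check below is only an arithmetic sketch: the dictionary layout and the helper name `accuracy_percent` are illustrative choices, not part of any EchoMem or LongMemEval tooling.

```python
# Counts reported in this article for the two runs.
TOTAL_QUESTIONS = 500
RUNS = {
    "headline": {"correct": 479, "missed": 21},
    "gpt-4o": {"correct": 452, "missed": 48},
}


def accuracy_percent(correct: int, total: int) -> float:
    """Return accuracy as a percentage, rounded to two decimals."""
    return round(100 * correct / total, 2)


for name, counts in RUNS.items():
    # Correct answers plus misses should cover all 500 questions.
    assert counts["correct"] + counts["missed"] == TOTAL_QUESTIONS
    print(name, accuracy_percent(counts["correct"], TOTAL_QUESTIONS))
```

The gap between the two runs is 27 questions (479 versus 452), which is the 5.4-point difference in the headline numbers.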
Benchmark Comparison
LongMemEval results are easiest to read in context. The table below is the public comparison behind our homepage claim, reproduced here so the article carries the benchmark detail directly.
Column key: KU = knowledge update, MS = multi-session, TR = temporal reasoning, SSU = single-session user, SSA = single-session assistant, SSP = single-session preference.
| System | KU | MS | TR | SSU | SSA | SSP | Overall |
|---|---|---|---|---|---|---|---|
| EchoMem | **100.0%** | 92.5% | 94.0% | **98.6%** | **100.0%** | 93.3% | **95.8%** |
| Supermemory | 99.0% | **93.0%** | 91.0% | 97.0% | **100.0%** | 90.0% | 95.0% |
| Mastra | 96.2% | 87.2% | 95.5% | 95.7% | 94.6% | **100.0%** | 94.9% |
| Mem0 | 93.6% | 88.0% | **97.0%** | **98.6%** | 98.2% | 96.7% | 94.4% |
| Zep | 83.3% | 57.9% | 62.4% | 92.9% | 80.4% | 56.7% | 71.2% |
Per-system scores are sourced from each vendor's published research. Best score per column is in bold, with ties marked on every tied system.
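For readers who want to check the per-column leaders themselves, here is a small sketch that recomputes them from the table values. The numbers are copied from the comparison table. The names `COLUMNS`, `SCORES`, and `best_per_column` are illustrative only.

```python
# Scores copied from the comparison table above (percent).
COLUMNS = ["KU", "MS", "TR", "SSU", "SSA", "SSP", "Overall"]
SCORES = {
    "EchoMem": [100.0, 92.5, 94.0, 98.6, 100.0, 93.3, 95.8],
    "Supermemory": [99.0, 93.0, 91.0, 97.0, 100.0, 90.0, 95.0],
    "Mastra": [96.2, 87.2, 95.5, 95.7, 94.6, 100.0, 94.9],
    "Mem0": [93.6, 88.0, 97.0, 98.6, 98.2, 96.7, 94.4],
    "Zep": [83.3, 57.9, 62.4, 92.9, 80.4, 56.7, 71.2],
}


def best_per_column(scores):
    """Map each column to (best score, systems reaching it), keeping ties."""
    best = {}
    for i, column in enumerate(COLUMNS):
        top = max(row[i] for row in scores.values())
        leaders = [name for name, row in scores.items() if row[i] == top]
        best[column] = (top, leaders)
    return best


for column, (top, leaders) in best_per_column(SCORES).items():
    print(f"{column}: {top:.1f}% ({', '.join(leaders)})")
```

Two columns are ties: EchoMem and Mem0 share the top SSU score, and EchoMem and Supermemory share the top SSA score.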
The Core Lesson: Evidence Comes First
The biggest lesson from LongMemEval was not that memory systems need to store more information. It was that they need to be more careful about evidence.
A memory system can fail in several ways. It can fail to retrieve the right memory. It can retrieve the right memory but answer the wrong question. It can overgeneralize from related evidence. It can treat an outdated fact as current. It can answer a false-premise question instead of abstaining.
A memory-backed answer should only be given when the available evidence supports the specific question being asked.
That sounds obvious, but it changes the shape of the system. It means memory is not just a pile of facts. It has to preserve evidence, select it carefully, reason over it, and decide whether it is enough.
How EchoMem Thinks About Memory
EchoMem treats memory as a pipeline rather than a single retrieval call.
First, conversations are converted into durable memory records that preserve enough context to be useful later. This matters because many real questions depend on more than semantic similarity. They depend on whether the evidence is current, specific, and relevant to the user's actual intent.
At query time, EchoMem evaluates whether the available memories are specific enough to answer responsibly. That includes handling changed facts, time-sensitive questions, user preferences, and cases where the premise of the question is not supported.
- Preserve useful memories from conversations.
- Retrieve evidence for the current question.
- Check whether that evidence actually supports the answer.
- Answer only when the evidence is sufficient.
- Abstain when it is not.
- Review failures to improve reliability.
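The retrieve, check, answer-or-abstain steps above can be illustrated with a toy sketch. This is not EchoMem's implementation: `MemoryRecord`, `Answer`, `retrieve`, and `answer_from_memory` are hypothetical names, and retrieval is reduced to an exact match on a subject key so the control flow stays visible.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class MemoryRecord:
    """Hypothetical durable memory record with enough context to judge currency."""
    subject: str       # what the fact is about, e.g. "home_city"
    value: str         # the stated fact
    recorded_on: date  # when the user stated it


@dataclass(frozen=True)
class Answer:
    text: Optional[str]
    abstained: bool
    evidence: Tuple[MemoryRecord, ...] = ()


def retrieve(store: List[MemoryRecord], subject: str) -> List[MemoryRecord]:
    """Toy retrieval: exact subject match (a real system would rank candidates)."""
    return [record for record in store if record.subject == subject]


def answer_from_memory(store: List[MemoryRecord], subject: str) -> Answer:
    """Answer only when some evidence supports the question, else abstain."""
    evidence = retrieve(store, subject)
    if not evidence:
        return Answer(text=None, abstained=True)
    # Prefer the most recently stated fact so outdated values are not reported.
    latest = max(evidence, key=lambda record: record.recorded_on)
    return Answer(text=latest.value, abstained=False, evidence=(latest,))


store = [
    MemoryRecord("home_city", "Lisbon", date(2023, 1, 5)),
    MemoryRecord("home_city", "Porto", date(2024, 3, 2)),
]
print(answer_from_memory(store, "home_city").text)
print(answer_from_memory(store, "favorite_airline").abstained)
```

Returning the evidence alongside the answer is a deliberate choice in this sketch: it keeps every answer traceable to the record that justified it, which also makes failure review easier.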
What the Misses Taught Us
One reason LongMemEval was useful for EchoMem is that we did not treat the score as the only output. We also reviewed where the system succeeded, where it correctly abstained, and where it still missed.
| Outcome | Count | Meaning |
|---|---|---|
| Evidence-supported answers | 449 | The system answered from sufficient memory evidence |
| Correct abstentions | 30 | The system refused unsupported questions |
| Remaining misses | 21 | The system either lacked enough evidence or failed to use it correctly |
Together, the 449 evidence-supported answers and 30 correct abstentions account for the 479 correct responses in the headline run. Most successful answers were supported by clear evidence. That is what we want: the system found enough memory context and answered from it. But the 30 correct abstentions are just as important. They show that the system can resist the temptation to answer when the premise is unsupported.
Giving the model more memories can help when the right evidence was missing. But it can also create a new problem: the model may combine related memories and produce an answer that sounds right, even though no single piece of evidence actually supports it. The goal is not simply more answers. The goal is answers that are truly warranted.
Why Abstention Matters
Abstention is easy to undervalue because it does not feel like a capability. But for memory systems, abstention is one of the clearest signs of reliability.
A personal memory system will often face questions where the premise is slightly wrong. The user may ask about the wrong person, the wrong location, the wrong time period, or a preference that was never actually expressed. In those cases, a system that always tries to answer can sound helpful while connecting memories that do not actually support the question.
LongMemEval includes these traps. A good score requires not only recalling facts, but also recognizing when the available memories do not support the requested answer.
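A false-premise trap can be pictured as a question whose qualifiers do not all match the stored memory. The sketch below is a simplified illustration under that assumption. The flat dictionary memory and the function names `unsupported_qualifiers` and `answer_or_abstain` are hypothetical, not how LongMemEval or EchoMem represent questions.

```python
from typing import Dict, List, Optional


def unsupported_qualifiers(asked: Dict[str, str], memory: Dict[str, str]) -> List[str]:
    """Return the qualifiers the question assumes but the memory does not confirm."""
    return [key for key, value in asked.items() if memory.get(key) != value]


def answer_or_abstain(
    asked: Dict[str, str], requested_field: str, memory: Dict[str, str]
) -> Optional[str]:
    """Return the requested value only if every assumed qualifier is supported.

    None means "abstain": the premise is unsupported or the field is unknown.
    """
    if unsupported_qualifiers(asked, memory):
        return None
    return memory.get(requested_field)


# Hypothetical memory: the user visited their brother in Boston in June.
memory = {"person": "brother", "city": "Boston", "visit_month": "June"}

# "When did I visit my sister in Boston?" assumes the wrong person.
print(answer_or_abstain({"person": "sister", "city": "Boston"}, "visit_month", memory))

# "When did I visit my brother in Boston?" matches the stored evidence.
print(answer_or_abstain({"person": "brother", "city": "Boston"}, "visit_month", memory))
```

The first question is close to the stored memory, which is exactly why it is risky: the city matches, so a system that only checks relatedness would answer "June" for a visit that never happened.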
What Still Needs Work
The remaining 21 misses are concentrated in multi-session and temporal reasoning questions.
That is not surprising. These are the categories where real memory becomes hardest. Multi-session questions may require stitching together evidence from multiple conversations. Temporal questions require knowing whether a fact was true before, after, currently, or for a duration. Small wording differences can change the answer.
- Evidence exists, but is spread across multiple memories.
- The retrieved memory is related but missing a key qualifier.
- An older fact conflicts with a newer one.
- The answer requires a calculation over time.
- A preference is implied but not explicitly stated.
- A plausible answer is not the same as a supported answer.
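Two of the patterns above, conflicting older and newer facts and calculations over time, come down to keeping timestamps with each fact. The sketch below shows one minimal way to do that, assuming each fact is stored with the date it was stated. The sample data and the helpers `value_as_of` and `days_between` are hypothetical.

```python
from datetime import date
from typing import List, Optional, Tuple

# Hypothetical timestamped facts: (subject, value, date the user stated it).
Fact = Tuple[str, str, date]

FACTS: List[Fact] = [
    ("employer", "Acme", date(2022, 3, 1)),
    ("employer", "Globex", date(2024, 1, 15)),
]


def value_as_of(facts: List[Fact], subject: str, as_of: date) -> Optional[str]:
    """Return the latest value for subject stated on or before as_of, else None."""
    candidates = [
        (stated, value)
        for fact_subject, value, stated in facts
        if fact_subject == subject and stated <= as_of
    ]
    if not candidates:
        return None  # nothing was known yet, so there is nothing to report
    return max(candidates, key=lambda item: item[0])[1]


def days_between(start: date, end: date) -> int:
    """Duration in whole days between two dated events."""
    return (end - start).days


print(value_as_of(FACTS, "employer", date(2023, 6, 1)))
print(value_as_of(FACTS, "employer", date(2024, 6, 1)))
print(value_as_of(FACTS, "employer", date(2021, 1, 1)))
print(days_between(date(2024, 1, 1), date(2024, 3, 1)))
```

The same question ("where do I work?") has a different supported answer depending on the reference date, and no supported answer at all before the first fact was recorded.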
How This Fits Into the Memory Landscape
Recent LongMemEval work has taken several different angles. Some approaches argue that strong retrieval pipelines can still reach impressive results. Others focus on production tradeoffs like latency, cost, and token efficiency. Others introduce new memory architectures inspired by how humans observe, compress, and reflect on experience.
EchoMem's emphasis is complementary: evidence discipline. We care less about whether the system surfaced something that sounds relevant and more about whether the final answer is warranted.
A memory system is not reliable because it always has something to say. It is reliable because it can separate evidence from vibes.
Methodology Notes
The result reported here comes from EchoMem's LongMemEval evaluation across all six active LongMemEval categories. Results were scored with the official LongMemEval GPT-4o scorer.
As with any benchmark, the score should be interpreted alongside methodology and failure analysis. LongMemEval is useful because it stresses memory behavior across multiple categories, but no benchmark fully captures production memory quality. Production systems also need to be evaluated for latency, cost, privacy, robustness, observability, and user experience.
Reliable memory means remembering with evidence.
The future of AI memory will not be defined only by larger context windows or more aggressive retrieval. It will be defined by systems that know the difference between a supported answer and a plausible guess.