Evidence-First Memory: EchoMem Reaches 95.8% on LongMemEval

Reliable AI memory is not remembering everything. It is knowing what the evidence actually supports.

Research note · LongMemEval · Updated Jun 9, 2026

Long-term memory for AI assistants is usually framed as a retrieval problem. Can the system find the right fact? Can it surface the right conversation? Can it pull a relevant memory into context?

But retrieval is only the beginning. A useful memory system has to do something harder: it has to know what the evidence actually supports. It has to distinguish old facts from new ones, resolve details spread across sessions, infer preferences without hallucinating them, and refuse to answer when the available memory does not justify the question.

That is the lens we used to evaluate EchoMem on LongMemEval.

EchoMem reached 95.8% accuracy across 500 LongMemEval questions. A second run with GPT-4o scored 90.4% on the same benchmark. The result is encouraging, but the more interesting lesson is not simply that EchoMem remembered more. It is that reliable memory depends on evidence.

Why LongMemEval Matters

Many memory benchmarks test whether a system can retrieve a fact hidden in a long context. That is useful, but it does not fully capture what real AI assistants need to do.

Real assistant memory is messy. A user's preferences change. Facts become outdated. Important details appear across multiple sessions. Some questions contain false premises. Some answers require temporal reasoning rather than direct lookup. A good memory system has to reason over all of that without turning plausible guesses into confident answers.

LongMemEval groups its questions into six categories:

  • Knowledge update: whether the system can track changed facts.
  • Multi-session: whether it can combine evidence across conversations.
  • Temporal reasoning: whether it understands time, order, and duration.
  • Single-session user: whether it recalls user-side facts from one session.
  • Single-session assistant: whether it recalls assistant-side facts from one session.
  • Single-session preference: whether it can infer preferences from evidence.

That makes LongMemEval more than a retrieval benchmark. It is closer to a behavioral test for memory: can the system answer only when the memory actually warrants the answer?

The Result

EchoMem scored 95.80% across 500 LongMemEval questions in the headline run. The GPT-4o run scored 90.40% using the same 500-question benchmark.

Model comparison

Category scores across the same 500-question benchmark.

[Chart: GPT-4o run vs. headline run, scored per category (KU knowledge update, MS multi-session, TR temporal reasoning, SSU single-session user, SSA single-session assistant, SSP single-session preference) and overall (aggregate score).]

The headline run answered 479/500 questions correctly, with 21 misses and 0 unscored questions. The GPT-4o run answered 452/500 questions correctly, with 48 misses.

Both runs used the same EchoMem memory stack and were scored with the official LongMemEval GPT-4o scorer.

Benchmark Comparison

LongMemEval results are easiest to read in context. The table below is the public comparison behind our homepage claim, reproduced here so the article carries the benchmark detail directly.

System        KU       MS      TR      SSU     SSA      SSP      Overall
EchoMem       100.0%   92.5%   94.0%   98.6%   100.0%   93.3%    95.8%
Supermemory   99.0%    93.0%   91.0%   97.0%   100.0%   90.0%    95.0%
Mastra        96.2%    87.2%   95.5%   95.7%   94.6%    100.0%   94.9%
Mem0          93.6%    88.0%   97.0%   98.6%   98.2%    96.7%    94.4%
Zep           83.3%    57.9%   62.4%   92.9%   80.4%    56.7%    71.2%

KU = Knowledge update, MS = Multi-session, TR = Temporal reasoning, SSU = Single-session user, SSA = Single-session assistant, SSP = Single-session preference.

Per-system scores are sourced from each vendor's published research. Best score per column is underlined.
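
As a rough consistency check, the EchoMem row can be converted back into correct-answer counts. The sketch below rests on one assumption that this article does not state: the per-category question counts in `CATEGORY_SIZES` are the sizes commonly cited for the 500-question LongMemEval set. Treat the output as an estimate, not as reported data.

```python
# ASSUMPTION: per-category question counts for the 500-question LongMemEval set.
# These sizes are not given in this article.
CATEGORY_SIZES = {"KU": 78, "MS": 133, "TR": 133, "SSU": 70, "SSA": 56, "SSP": 30}

# EchoMem per-category accuracy in percent, copied from the comparison table.
ECHOMEM_SCORES = {"KU": 100.0, "MS": 92.5, "TR": 94.0, "SSU": 98.6, "SSA": 100.0, "SSP": 93.3}


def estimated_correct(scores, sizes):
    """Round each category's accuracy back to a whole number of correct answers."""
    return {cat: round(scores[cat] / 100 * sizes[cat]) for cat in sizes}


correct = estimated_correct(ECHOMEM_SCORES, CATEGORY_SIZES)
misses = {cat: CATEGORY_SIZES[cat] - correct[cat] for cat in CATEGORY_SIZES}

total_questions = sum(CATEGORY_SIZES.values())
total_correct = sum(correct.values())

print(f"{total_correct}/{total_questions} estimated correct")
print("estimated misses per category:", misses)
```

Under those assumed sizes, the reconstructed total agrees with the reported 479/500, and most of the estimated misses fall in MS and TR.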

The Core Lesson: Evidence Comes First

The biggest lesson from LongMemEval was not that memory systems need to store more information. It was that they need to be more careful about evidence.

A memory system can fail in several ways. It can fail to retrieve the right memory. It can retrieve the right memory but answer the wrong question. It can overgeneralize from related evidence. It can treat an outdated fact as current. It can answer a false-premise question instead of abstaining.

A memory-backed answer should only be given when the available evidence supports the specific question being asked.

That sounds obvious, but it changes the shape of the system. It means memory is not just a pile of facts. It has to preserve evidence, select it carefully, reason over it, and decide whether it is enough.

How EchoMem Thinks About Memory

EchoMem treats memory as a pipeline rather than a single retrieval call.

First, conversations are converted into durable memory records that preserve enough context to be useful later. This matters because many real questions depend on more than semantic similarity. They depend on whether the evidence is current, specific, and relevant to the user's actual intent.

At query time, EchoMem evaluates whether the available memories are specific enough to answer responsibly. That includes handling changed facts, time-sensitive questions, user preferences, and cases where the premise of the question is not supported.

  1. Preserve useful memories from conversations.
  2. Retrieve evidence for the current question.
  3. Check whether that evidence actually supports the answer.
  4. Answer only when the evidence is sufficient.
  5. Abstain when it is not.
  6. Review failures to improve reliability.
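
Steps 2 through 5 can be sketched as a small evidence gate. This is an illustrative toy, not EchoMem's implementation: `Question`, `Answer`, `answer_with_evidence`, `retrieve_everything`, and `mentions_subject` are hypothetical names, and the support check is a deliberately naive substring match standing in for whatever check a real system would use.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Question:
    text: str
    subject: str  # hypothetical field: what the question is actually about


@dataclass(frozen=True)
class Answer:
    text: Optional[str]        # None means the system abstained
    evidence: tuple[str, ...]  # memories the answer rests on


def answer_with_evidence(
    question: Question,
    memories: Sequence[str],
    retrieve: Callable[[Question, Sequence[str]], list[str]],
    supports: Callable[[Question, str], bool],
) -> Answer:
    # Step 2: retrieve candidate evidence for the current question.
    candidates = retrieve(question, memories)
    # Step 3: keep only candidates that support this specific question.
    supporting = tuple(m for m in candidates if supports(question, m))
    # Step 5: abstain when nothing supports the question.
    if not supporting:
        return Answer(text=None, evidence=())
    # Step 4: answer from the supporting evidence only.
    return Answer(text=supporting[0], evidence=supporting)


def retrieve_everything(question: Question, memories: Sequence[str]) -> list[str]:
    # Toy retriever: treats every stored memory as "related".
    return list(memories)


def mentions_subject(question: Question, memory: str) -> bool:
    # Toy support check: the memory must mention the question's subject.
    return question.subject.lower() in memory.lower()


memories = ["The user adopted a dog named Biscuit."]
dog = answer_with_evidence(
    Question("What is my dog called?", "dog"), memories, retrieve_everything, mentions_subject
)
cat = answer_with_evidence(
    Question("What is my cat called?", "cat"), memories, retrieve_everything, mentions_subject
)
print(dog.text)  # answered from the supporting memory
print(cat.text)  # None: related memory exists, but it does not support the question
```

The point of the design is that retrieval and support are separate decisions: the retriever is allowed to be generous, while the gate decides whether anything it returned justifies an answer.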

What the Misses Taught Us

One reason LongMemEval was useful for EchoMem is that we did not treat the score as the only output. We also reviewed where the system succeeded, where it correctly abstained, and where it still missed.

Outcome                      Count   Meaning
Evidence-supported answers   449     The system answered from sufficient memory evidence
Correct abstentions          30      The system refused unsupported questions
Remaining misses             21      The system either lacked enough evidence or failed to use it correctly

Most successful answers (449 of 479) were supported by clear evidence. That is what we want: the system found enough memory context and answered from it. But the 30 correct abstentions are just as important. They show that the system can resist the temptation to answer when the premise is unsupported.
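
The outcome counts can be tied back to the headline score with a few lines of arithmetic. The dictionary keys are illustrative labels; the numbers are the ones reported in the outcome table.

```python
# Outcome counts as reported in the outcome table.
outcomes = {
    "evidence_supported_answers": 449,
    "correct_abstentions": 30,
    "remaining_misses": 21,
}

total = sum(outcomes.values())
# Both supported answers and correct abstentions count as correct.
correct = outcomes["evidence_supported_answers"] + outcomes["correct_abstentions"]
accuracy = correct / total
# Share of the correct results that were abstentions rather than answers.
abstention_share = outcomes["correct_abstentions"] / correct

print(f"{total} questions, {correct} correct, accuracy {accuracy:.2%}")
print(f"abstentions as a share of correct results: {abstention_share:.1%}")
```

The three outcomes sum to the full 500 questions, and supported answers plus correct abstentions give the 479 behind the 95.80% figure.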

Giving the model more memories can help when the right evidence was missing. But it can also create a new problem: the model may combine related memories and produce an answer that sounds right, even though no single piece of evidence actually supports it. The goal is not simply more answers. The goal is answers that are truly warranted.

Why Abstention Matters

Abstention is easy to undervalue because it does not feel like a capability. But for memory systems, abstention is one of the clearest signs of reliability.

A personal memory system will often face questions where the premise is slightly wrong. The user may ask about the wrong person, the wrong location, the wrong time period, or a preference that was never actually expressed. In those cases, a system that always tries to answer can sound helpful while connecting memories that do not actually support the question.

LongMemEval includes these traps. A good score requires not only recalling facts, but also recognizing when the available memories do not support the requested answer.

What Still Needs Work

The remaining misses are concentrated in multi-session and temporal reasoning.

That is not surprising. These are the categories where real memory becomes hardest. Multi-session questions may require stitching together evidence from multiple conversations. Temporal questions require knowing whether a fact was true before, after, currently, or for a duration. Small wording differences can change the answer.

  • Evidence exists, but is spread across multiple memories.
  • The retrieved memory is related but missing a key qualifier.
  • An older fact conflicts with a newer one.
  • The answer requires a calculation over time.
  • A preference is implied but not explicitly stated.
  • A plausible answer is not the same as a supported answer.
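
Two of these patterns, an older fact conflicting with a newer one and an answer that needs a calculation over time, can be illustrated with dated records. This is a minimal sketch under stated assumptions: `FactRecord`, `current_value`, and `days_between` are hypothetical names, the sample records are invented for illustration, and nothing here describes how EchoMem stores or resolves memories.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Sequence


@dataclass(frozen=True)
class FactRecord:
    attribute: str     # e.g. "city"
    value: str
    recorded_on: date  # when the fact was stated in conversation


def current_value(
    records: Sequence[FactRecord], attribute: str, as_of: date
) -> Optional[str]:
    """Return the most recently recorded value on or before `as_of`, or None."""
    relevant = [
        r for r in records if r.attribute == attribute and r.recorded_on <= as_of
    ]
    if not relevant:
        return None  # no evidence for this attribute: the caller should abstain
    return max(relevant, key=lambda r: r.recorded_on).value


def days_between(start: date, end: date) -> int:
    """Duration in whole days between two recorded dates."""
    return (end - start).days


# Invented sample data for illustration only.
records = [
    FactRecord("city", "Boston", date(2024, 1, 10)),
    FactRecord("city", "Denver", date(2025, 6, 2)),
]

print(current_value(records, "city", date(2025, 12, 31)))      # newer fact wins
print(current_value(records, "city", date(2024, 12, 31)))      # older fact was current then
print(current_value(records, "employer", date(2025, 12, 31)))  # None: nothing recorded
print(days_between(records[0].recorded_on, records[1].recorded_on))
```

Keeping the date on every record is what makes both questions answerable: "where does the user live now?" and "where did the user live at the end of 2024?" draw on the same memories but have different supported answers.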

How This Fits Into the Memory Landscape

Recent LongMemEval work has taken several different angles. Some approaches argue that strong retrieval pipelines can still reach impressive results. Others focus on production tradeoffs like latency, cost, and token efficiency. Others introduce new memory architectures inspired by how humans observe, compress, and reflect on experience.

EchoMem's emphasis is complementary: evidence discipline. We care less about whether the system surfaced something that sounds relevant and more about whether the final answer is warranted.

A memory system is not reliable because it always has something to say. It is reliable because it can separate evidence from vibes.

Methodology Notes

The result reported here comes from EchoMem's LongMemEval evaluation across all six active LongMemEval categories. Results were scored with the official LongMemEval GPT-4o scorer.

Benchmark            LongMemEval
Questions            500
Scorer               Official LongMemEval GPT-4o
Headline result      479/500 · 95.80%
GPT-4o result        452/500 · 90.40%
Unscored questions   0

As with any benchmark, the score should be interpreted alongside methodology and failure analysis. LongMemEval is useful because it stresses memory behavior across multiple categories, but no benchmark fully captures production memory quality. Production systems also need to be evaluated for latency, cost, privacy, robustness, observability, and user experience.

Reliable memory means remembering with evidence.

The future of AI memory will not be defined only by larger context windows or more aggressive retrieval. It will be defined by systems that know the difference between a supported answer and a plausible guess.