
Building a memory system for OpenClaw

Tags: openclaw, memory, context-engine, implementation

We built a memory plugin for OpenClaw that takes over the context lifecycle through the ContextEngine interface. We got a lot of decisions wrong before getting them right. This post walks through what we built and why.

The previous post ended with the observation that OpenClaw's memory system asks too much of its users: manual file management, structured templates, pruning routines, retrieval rules. This post is about what we built instead.


Two layers, one common mistake

The first thing we had to get straight: a memory system is solving two completely different problems.

Context engineering is about getting the right content in front of the model at the right time. The model can only work with what's in the context window right now. A million memories sitting in a database are worthless if they don't make it into context. This layer is a delivery problem.

Knowledge lifecycle is about keeping the delivered information accurate. User preferences change. Old decisions get reversed. Two records might contradict each other. This layer is an information quality problem.

Both matter. But there's an asymmetry that's easy to miss: if the underlying information is wrong or stale, better context engineering just delivers bad information more efficiently.

Most memory solutions in the ecosystem right now — swapping storage backends, optimizing retrieval, changing injection strategies — are working on the context engineering layer. That work has value. But almost nobody is doing serious work on knowledge lifecycle. That's the real reason agent memory feels like it's not quite there yet.


Break memory into types first

Early on we tried handling all memory the same way. That didn't last. Different types have different triggers, lifespans, and update patterns. One strategy for all of them will be mediocre at everything.

Durable facts. User preferences, decisions, long-term context. These need to persist indefinitely. When they get updated, the old version shouldn't be deleted — it should be marked as superseded. History stays, but retrieval skips it by default.

Execution state. Current task progress, intermediate steps, unfinished work. Lifespan is one task. Clean up when the task ends, but if the task gets interrupted, you need to be able to resume.

Session context. The last few turns of conversation. Temporary inside the context window — invisible after compaction — but the raw transcript is always in the session archive. The model just can't see it anymore.

Distilled knowledge. Stable judgments extracted from many fragmented interactions, like "this user always discusses design before coding." These can't be generated in real time. They accumulate in the background, asynchronously.
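As a sketch, the four types could be modeled as a tagged record with a per-type lifecycle policy. All of the names here are illustrative, not our actual schema:

```typescript
// Hypothetical taxonomy for memory records; names are illustrative.
type MemoryKind = "durable" | "execution" | "session" | "distilled";

interface MemoryRecord {
  kind: MemoryKind;
  content: string;
  createdAt: number;      // epoch ms
  taskId?: string;        // execution state is scoped to one task
  supersededBy?: string;  // durable facts get superseded, never deleted
}

// The lifecycle policy differs per kind: what happens when its scope ends.
function onScopeEnd(record: MemoryRecord): "keep" | "archive" | "delete" {
  switch (record.kind) {
    case "durable":   return "keep";    // persists indefinitely
    case "execution": return "delete";  // cleaned up when the task ends
    case "session":   return "archive"; // raw transcript stays in the archive
    case "distilled": return "keep";    // accumulated asynchronously
  }
}
```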


Context engineering: three hooks, three jobs

OpenClaw's ContextEngine interface exposes three hooks: assemble() before each turn, afterTurn() after each turn, and compact() when compaction fires. Each one has a specific job. Don't mix them.

[Figure: ContextEngine flow: assemble → LLM → afterTurn → compact]

assemble(): what goes in

Don't stuff every memory into context. Use the last user message as a query, over-retrieve (three times the target count), rerank, then truncate to the token budget.

Over-retrieval matters. If you grab the top 8 directly, you'll miss things. Grab the top 24, rerank, cut to 8. Recall quality is noticeably better.

Token budget depends on what the agent does. A coding assistant doesn't need memory eating much context (10-15% is enough) because the code itself takes up space. A knowledge-base agent needs a higher share (20-25%) because its whole job is answering from memory. There's no universal default.
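A minimal sketch of that selection step, assuming candidates have already been over-retrieved (roughly 3x the target) and scored by a reranker. The names are illustrative:

```typescript
// Sketch of assemble()'s selection: rerank the over-retrieved candidates,
// then cut to both the target count and the token budget.
interface Scored { text: string; tokens: number; score: number }

function selectMemories(
  candidates: Scored[],   // already over-retrieved: ~3x the target count
  targetCount: number,
  tokenBudget: number
): Scored[] {
  const reranked = [...candidates].sort((a, b) => b.score - a.score);
  const picked: Scored[] = [];
  let used = 0;
  for (const m of reranked) {
    if (picked.length >= targetCount) break;
    if (used + m.tokens > tokenBudget) continue; // skip what doesn't fit
    picked.push(m);
    used += m.tokens;
  }
  return picked;
}
```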

One detail that's easy to miss: bootstrap files (SOUL.md, AGENTS.md, etc.) are already injected through OpenClaw's bootstrap mechanism. Downweight them during reranking. If they show up through retrieval too, you're wasting context space on duplicates.

afterTurn(): three tiers, processed async

Three things need to happen after each turn. They don't all need to happen immediately.

Tier one, synchronous: extract structured facts from new content in memory files. This has to be synchronous because the next turn might need those facts right away.

Tier two, batched: accumulate conversation content until you hit a threshold (say, 10 turns or 50KB), then trigger LLM extraction in bulk. Calling an LLM on every single turn is expensive, and fragmented context doesn't produce good extractions anyway. Batching works better.

Tier three, periodic maintenance: merge duplicate memories, clean up redundant relationships, promote frequently accessed content to durable facts. Least urgent. Run it every few dozen turns.

Separating "must do now" from "can do later" is how you keep memory from slowing down the main loop.
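The three tiers can be sketched as a small pipeline. The thresholds are the ones mentioned above (10 turns or 50KB, maintenance every few dozen turns); everything else is illustrative:

```typescript
// Sketch of the three-tier afterTurn() split. Tier 1 runs inline; tiers 2
// and 3 are fired without awaiting so they stay off the main loop.
const BATCH_TURNS = 10;
const BATCH_BYTES = 50_000;
const MAINTENANCE_EVERY = 40; // "every few dozen turns"

class AfterTurnPipeline {
  private buffer: string[] = [];
  private bufferBytes = 0;
  private turnCount = 0;

  constructor(
    private extractSync: (text: string) => void,              // tier 1
    private extractBatch: (texts: string[]) => Promise<void>, // tier 2 (LLM)
    private maintain: () => Promise<void>                     // tier 3
  ) {}

  async afterTurn(text: string): Promise<void> {
    this.turnCount++;
    this.extractSync(text); // must run now: the next turn may need these facts

    this.buffer.push(text);
    this.bufferBytes += text.length;
    if (this.buffer.length >= BATCH_TURNS || this.bufferBytes >= BATCH_BYTES) {
      const batch = this.buffer.splice(0);
      this.bufferBytes = 0;
      void this.extractBatch(batch); // not awaited: bulk LLM extraction
    }
    if (this.turnCount % MAINTENANCE_EVERY === 0) void this.maintain();
  }
}
```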

compact(): protect information, don't own truncation

This is where it's easiest to get the design wrong. We made this mistake early: we declared that the plugin owned compaction, but didn't implement truncation logic. OpenClaw assumed the plugin was handling it. Neither side did anything.

The correct design: the plugin does not own truncation. It only protects information before truncation happens. When compact() fires, do three things — flush the batch buffer and run extraction, extract and store memories from the messages that are about to be compressed, snapshot the current memory file. Then hand actual truncation back to OpenClaw's native mechanism.

If your real-time extraction missed something, the information is already saved structurally. One compaction event can't cause permanent loss.
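A sketch of that protective sequence, with the three helpers as hypothetical stand-ins for the real extraction and snapshot code:

```typescript
// Sketch of compact(): protect information, then let OpenClaw truncate.
// flushBatch / extractAndStore / snapshotMemoryFile are hypothetical helpers.
async function compact(
  toCompress: string[],
  deps: {
    flushBatch: () => Promise<void>;
    extractAndStore: (msgs: string[]) => Promise<void>;
    snapshotMemoryFile: () => Promise<void>;
  }
): Promise<void> {
  await deps.flushBatch();                // 1. flush the batch buffer, run extraction
  await deps.extractAndStore(toCompress); // 2. save what's about to be compressed
  await deps.snapshotMemoryFile();        // 3. snapshot the current memory file
  // Deliberately no truncation here: OpenClaw's native mechanism owns it.
}
```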


Knowledge lifecycle: the actual hard part

Context engineering is an engineering problem with clear solution paths. Knowledge lifecycle is where things get genuinely difficult.

Extraction should be layered, not all-LLM

Our first version sent everything through an LLM for extraction. Cost scaled linearly with usage, and results were unstable — the same content produced different extractions on different runs.

We switched to three layers.

Layer one, rules: deterministic signals first. Patterns like "user prefers...", "working on...", "decided to..." need no semantic understanding, just pattern matching. Fast, stable, covers maybe 60-70% of extraction scenarios.

Layer two, LLM: semantically ambiguous cases, batched whenever possible.

Layer three, fallback: when neither of the first two layers extracts anything, a more permissive rule set runs with lower confidence scores, downweighted during retrieval.
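A minimal sketch of the rule and fallback layers. The patterns and confidence values are illustrative, and the LLM layer is elided:

```typescript
// Sketch of layered extraction: deterministic rules first, permissive
// fallback last. Patterns and confidence values are illustrative.
interface Extracted { fact: string; confidence: number; layer: "rule" | "fallback" }

const RULES: Array<[RegExp, number]> = [
  [/\buser prefers (.+)/i, 0.9],
  [/\bworking on (.+)/i, 0.85],
  [/\bdecided to (.+)/i, 0.85],
];

const FALLBACK: Array<[RegExp, number]> = [
  [/\bI (?:like|use|want) (.+)/i, 0.5], // permissive, low confidence
];

function extract(text: string): Extracted[] {
  const hits = RULES.flatMap(([re, conf]) => {
    const m = text.match(re);
    return m ? [{ fact: m[0], confidence: conf, layer: "rule" as const }] : [];
  });
  if (hits.length > 0) return hits;
  // (The batched LLM layer would run here for ambiguous cases; omitted.)
  return FALLBACK.flatMap(([re, conf]) => {
    const m = text.match(re);
    return m ? [{ fact: m[0], confidence: conf, layer: "fallback" as const }] : [];
  });
}
```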

Noise filtering has to happen before any extraction logic runs. Refusal messages, greetings, "let me search that for you" — process narration that carries no information. If you let this through, you'll slowly accumulate low-quality memories that pollute retrieval results over time.

Update by superseding, not overwriting

User changed jobs. The old "works at Company A" record shouldn't be deleted. Mark it superseded. History stays, but default retrieval skips it.

Here's what that looks like in practice. Turn 5 of a conversation:

user: "I work at Stripe on the payments API team"
  → extracted: { works_at: "Stripe", team: "payments API" }
  → confidence: 0.95, source: user-stated

Forty turns later:

user: "I just joined Linear, starting next Monday"
  → extracted: { works_at: "Linear" }
  → confidence: 0.95, source: user-stated
  → supersedes: works_at: "Stripe"

The Stripe record is still in the database. But when assemble() retrieves memories, it skips superseded entries by default. The model sees "works at Linear" without being confused by the old record. If someone asks "where did this user work before?" the history is there to answer.

The value is traceability. You can see when information changed. You can look at historical state when you need to. Overwriting is simpler, but you lose the time dimension.
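In code, superseding is a small amount of bookkeeping. A sketch with illustrative field names:

```typescript
// Sketch of supersede-instead-of-overwrite. Field names are illustrative.
interface Fact {
  id: string;
  key: string;            // e.g. "works_at"
  value: string;
  supersededBy?: string;  // id of the newer record, if any
}

function supersede(store: Fact[], next: Fact): void {
  // Mark any live record for the same key as superseded; never delete.
  for (const f of store) {
    if (f.key === next.key && !f.supersededBy) f.supersededBy = next.id;
  }
  store.push(next);
}

// Default retrieval skips superseded entries; history stays queryable.
const current = (store: Fact[], key: string): Fact[] =>
  store.filter(f => f.key === key && !f.supersededBy);
```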

Contradiction detection is annotation, not arbitration

When you detect two memories that contradict each other, create a contradiction relationship between them. Don't prevent both from being retrieved. The system shouldn't decide which one is correct on the user's behalf.

But this means the caller has to handle contradictions. If you don't downweight contradicted memories, the model might see "user prefers TypeScript" and "user prefers Python" simultaneously and pick one at random. Contradiction detection only works when paired with retrieval-time handling logic.
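A sketch of that retrieval-time handling, assuming each memory carries the ids of the memories it contradicts. The penalty value is illustrative:

```typescript
// Sketch of retrieval-time contradiction handling: the store annotates,
// the caller downweights, the model never picks at random.
interface Memory { id: string; text: string; score: number; contradicts: string[] }

const CONTRADICTION_PENALTY = 0.5; // illustrative

function rerankWithContradictions(results: Memory[]): Memory[] {
  const ids = new Set(results.map(m => m.id));
  return results
    .map(m => {
      // Downweight only when both sides of the contradiction were retrieved.
      const clash = m.contradicts.some(id => ids.has(id));
      return { ...m, score: clash ? m.score * CONTRADICTION_PENALTY : m.score };
    })
    .sort((a, b) => b.score - a.score);
}
```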

Every memory needs provenance and confidence

User stated it directly? Highest confidence. Inferred from context? Medium. Auto-summarized by the system? Lowest. When two memories conflict, the one with more reliable provenance wins — not just the newer one.

This metadata also gives fallback extraction its value. Low-confidence memories don't interfere with high-quality information, but when nothing else is available, they're better than nothing.
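A sketch of provenance-weighted conflict resolution. The source weights are illustrative:

```typescript
// Sketch: more reliable provenance wins a conflict; recency only breaks ties.
// The weights are illustrative.
type Source = "user-stated" | "inferred" | "auto-summary";

const SOURCE_WEIGHT: Record<Source, number> = {
  "user-stated": 1.0,
  "inferred": 0.6,
  "auto-summary": 0.3,
};

interface Mem { value: string; source: Source; confidence: number; at: number }

function resolve(a: Mem, b: Mem): Mem {
  const wa = SOURCE_WEIGHT[a.source] * a.confidence;
  const wb = SOURCE_WEIGHT[b.source] * b.confidence;
  if (wa !== wb) return wa > wb ? a : b; // provenance beats recency
  return a.at >= b.at ? a : b;          // tie: newer wins
}
```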


Observability has to ship with the feature

There's a class of memory system bugs that don't show up in short-term testing. They only surface after a few weeks: contradictions accumulating silently, old information not being superseded correctly, a write failure that quietly stops all updates. The agent still looks "normal" — it's just answering with increasingly stale information.

These problems are all gradual. No error gets thrown. The only way to catch them is to build observability in from the start, not bolt it on after something breaks.

In our implementation, most critical paths have structured event logs: extraction stats per message (how many facts extracted, whether fallback triggered, how many contradictions found), write failure events, before/after snapshots around compaction, destructive-overwrite warnings when memory file line counts drop more than 50%. These events persist to the database for trend analysis. Are contradictions growing or being cleaned up? Is primary extraction frequently returning zero results? How much is fallback catching?
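The destructive-overwrite warning is the simplest of those signals to sketch. The event shape is illustrative:

```typescript
// Sketch of the destructive-overwrite check: warn when a memory file's
// line count drops by more than 50% between writes. Event shape is illustrative.
interface OverwriteEvent { kind: "destructive-overwrite"; before: number; after: number }

function checkOverwrite(beforeLines: number, afterLines: number): OverwriteEvent | null {
  if (beforeLines > 0 && afterLines < beforeLines * 0.5) {
    return { kind: "destructive-overwrite", before: beforeLines, after: afterLines };
  }
  return null; // within tolerance: no event
}
```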

Even with all of that, retrieval degradation is the hardest problem to observe. You know how many results came back. You don't know if those results are getting worse. Are relevance score distributions shifting? Is match quality against actual user questions declining over time? These gradual signals don't have structured event coverage yet.

That's a real lesson from building this. It's not enough to add observability. The things that are hardest to observe tend to be the most important. A sudden write failure is easy to spot. Retrieval quality slowly declining over months is nearly invisible. If we were starting over, we'd instrument retrieval quality from day one.


Does any of this actually matter?

We ran our plugin, db0, against two published conversational memory benchmarks: LoCoMo (Snap Research) and LongMemEval (Wu et al., 2025). Both test whether a memory system can answer questions about past conversations accurately.

On LoCoMo (199 queries, LLM-judged): db0 scored 76.9%, Mem0 scored 66.9%. On LongMemEval (50-question sample): db0 scored 80.0%, Mem0 scored 29%.

The gap is largest on knowledge-update questions, the category that specifically tests whether a system handles superseded facts correctly. That's not a coincidence. Most of the design decisions in this post (superseding, contradiction detection, provenance tracking) exist to handle exactly those cases. The context engineering layer gets you baseline recall. The knowledge lifecycle layer is what separates "usually right" from "right about the things that changed."

Full benchmark methodology and results are in the db0 benchmark package.