Why your AI agent forgets everything between sessions
Every developer building AI agents hits the same wall. The agent is brilliant within a conversation. It tracks context, follows instructions, remembers what was said ten messages ago. Then the user closes the tab and comes back tomorrow. Clean slate. The agent has no idea who they are.
The models themselves don't have persistent state. Claude, GPT-5, Gemini: they all start every request from zero. Everything the agent "knows" has to be assembled fresh from external storage and stuffed into the context window before each call. If you didn't build that assembly pipeline, the agent forgets.
This sounds like a simple engineering problem. It isn't. I keep watching teams go through the same painful progression, and they keep getting stuck at the same points.
The progression everyone goes through
Step 1: Save messages to a database. Works for a simple chatbot. Store the full conversation, load it next time. Then the user comes back 50 times. 50 sessions, 50 messages each, 200 tokens per message: 500,000 tokens. At GPT-5.4-mini pricing, that's $0.375 per request just on input tokens. Multiply by a hundred requests a day and you're at roughly $1,125/month for a single user. The context window can't even hold it, and most of it is irrelevant anyway.
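The arithmetic is worth making explicit, because it scales linearly with history length. A quick sanity check (the per-token price is an assumption for illustration):

```typescript
// Back-of-envelope cost of replaying the full history on every request.
const sessions = 50;
const messagesPerSession = 50;
const tokensPerMessage = 200;
const inputTokens = sessions * messagesPerSession * tokensPerMessage; // 500,000

// Assumed price: $0.75 per million input tokens.
const pricePerMillionTokens = 0.75;
const costPerRequest = (inputTokens / 1_000_000) * pricePerMillionTokens;

const requestsPerDay = 100;
const daysPerMonth = 30;
const monthlyCost = costPerRequest * requestsPerDay * daysPerMonth;
```

Every new session makes `inputTokens` bigger, so the per-request cost keeps climbing with no ceiling.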
Step 2: Add a vector database for semantic search. Now you can search by meaning instead of loading everything. But you have two systems to synchronize. And here's the subtle bug that takes weeks to notice: the user said "I prefer Python" in session 3. In session 47, they said "Actually, I switched to TypeScript." Both are embedded. Both exist in the vector store. A similarity search for "preferred language" returns whichever is closer to the query embedding, which might be the outdated one. Vector similarity has no concept of time.
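The bug is easy to reproduce in miniature. This toy sketch uses hand-crafted 3-dimensional vectors instead of real embeddings, but the failure mode is the same: cosine similarity ranks by meaning alone, so the older fact can win.

```typescript
// Toy illustration: pure similarity search has no notion of which fact is newer.
type Fact = { text: string; session: number; vec: number[] };

const cosine = (a: number[], b: number[]): number => {
  const dot = a.reduce((s, x, i) => s + x * b[i], 0);
  const norm = (v: number[]) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return dot / (norm(a) * norm(b));
};

const store: Fact[] = [
  { text: "I prefer Python", session: 3, vec: [0.9, 0.1, 0.0] },
  { text: "Actually, I switched to TypeScript", session: 47, vec: [0.6, 0.2, 0.5] },
];

// Suppose "preferred language" embeds closer to the older, simpler phrasing.
const query = [0.95, 0.05, 0.0];

const best = [...store].sort(
  (a, b) => cosine(b.vec, query) - cosine(a.vec, query),
)[0];
console.log(best.text); // the stale session-3 fact wins
```

Nothing in the ranking function ever looks at `session`, so the contradiction is invisible to it.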
Step 3: Add session management, scoping, deduplication, decay. Now you have three systems and three failure modes. A write that succeeds in two and fails in the third leaves state inconsistent. You need scoping logic (which facts are session-specific vs. permanent), deduplication (don't store the same fact twice), decay (recent facts should rank higher), and extraction (what counts as a "fact" worth storing). Every team implementing this makes different tradeoffs and accumulates different technical debt.
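Here is what one of those hand-rolled pieces tends to look like. The weights and half-life below are illustrative, not a recommendation; the point is that recency decay can flip the ranking that pure similarity gets wrong:

```typescript
// DIY recency decay: a fresh fact outranks a contradictory old one
// even when its raw similarity is slightly lower.
type Scored = { text: string; similarity: number; ageDays: number };

const halfLifeDays = 30; // illustrative: relevance halves every 30 days
const decay = (ageDays: number) => Math.pow(0.5, ageDays / halfLifeDays);

// Blend similarity with recency; 0.7/0.3 is an arbitrary example split.
const score = (m: Scored) => 0.7 * m.similarity + 0.3 * decay(m.ageDays);

const candidates: Scored[] = [
  { text: "I prefer Python", similarity: 0.99, ageDays: 300 },
  { text: "Actually, I switched to TypeScript", similarity: 0.76, ageDays: 2 },
];

const ranked = [...candidates].sort((a, b) => score(b) - score(a));
```

Every team picks its own weights and half-life, which is exactly how the divergent technical debt accumulates.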
This is the DIY tax. And almost every team pays it, because there isn't a standard answer yet.
Why the standard answers aren't working
Even the companies with the most resources haven't solved this. ChatGPT's memory feature still accumulates contradictions. Users regularly share screenshots of their memory stores holding entries like "User is a software engineer" alongside "User is a product manager" because they changed jobs. Claude's memory is better but still inconsistent about updates. Google's Gemini has the same problems.
The LOCOMO benchmark (Snap Research, 2024) tested this systematically. They evaluated multiple memory systems on conversations spanning 600+ turns, including questions about facts that changed mid-conversation. Every system they tested, including GPT-4 with the full conversation in context, showed significant degradation on temporal and update questions compared to simple factual recall.
The open-source memory tools have the same structural issue. Mem0 uses an LLM to decide on every write whether to add, update, or delete a memory. That makes the conflict resolution non-deterministic and expensive. Zep builds temporal knowledge graphs but requires running a separate server. Letta (formerly MemGPT) has the agent manage its own memory, which works until the agent's summaries start drifting from what actually happened. The CoALA paper (Sumers et al., 2024) provides a good taxonomy of these architectures through the lens of cognitive science.
None of these are bad tools. But they all share a blind spot.
The blind spot: this is a data management problem
Here's what I think the AI community keeps getting wrong. Agent memory is being treated as a retrieval problem (how do I find the right memory?) when it's actually a data management problem (how do I keep the stored information accurate over time?).
The retrieval side is well-understood. Embed facts, build an index, do similarity search, rank by relevance. There are good tools for this. The context engineering side, which Andrej Karpathy describes as "the art of filling the context window with just the right information," is getting a lot of attention.
But if the underlying data is wrong, better retrieval just delivers bad information faster.
The database community solved analogous problems decades ago. Slowly Changing Dimensions in data warehousing classify exactly this: Type 1 overwrites the old value, Type 2 keeps a version history, Type 3 stores both old and new. Bitemporal modeling tracks both when a fact was true and when the system learned about it. Event sourcing appends every change and derives the current state from the log.
These patterns haven't been adopted by the agent memory ecosystem. Instead, most memory tools treat every piece of information the same way: embed it, store it, retrieve it by similarity. That's a retrieval architecture applied to a data management problem.
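To make one of those database patterns concrete, here is a minimal event-sourcing sketch, in the spirit described above: every change is appended to a log, nothing is overwritten, and "current state" is derived by folding over the log. The types and field names are illustrative.

```typescript
// Event sourcing in miniature: an append-only log of fact changes.
type FactEvent =
  | { kind: "asserted"; key: string; value: string; at: number }
  | { kind: "retracted"; key: string; at: number };

const log: FactEvent[] = [
  { kind: "asserted", key: "language", value: "Python", at: 3 },
  { kind: "asserted", key: "language", value: "TypeScript", at: 47 },
];

// Replaying the log in order yields only the latest value per key,
// while the full history stays in the log for audit.
function currentState(events: FactEvent[]): Map<string, string> {
  const state = new Map<string, string>();
  for (const e of [...events].sort((a, b) => a.at - b.at)) {
    if (e.kind === "asserted") state.set(e.key, e.value);
    else state.delete(e.key);
  }
  return state;
}
```

The time ordering that vector similarity discards is structural here: the later assertion always wins, by construction.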
Memory is actually four different problems
Part of why this is hard: "memory" is not one problem. It's four distinct problems with different triggers, retention policies, and query patterns.
Episodic context is the last few conversation turns. Short-lived, slides with the token budget. This is what the context window already handles, and what gets lost when you start a new session.
Durable facts are user preferences, decisions, and stated information. "Prefers TypeScript." "Based in Berlin." "Working on a Next.js project." These need to persist indefinitely, scoped to the right entity (user, project, team). When they change, the old value shouldn't be deleted. It should be superseded: preserved for history, excluded from retrieval.
Execution state is where a task was when it was interrupted. Checkpoints, intermediate tool results, pending actions. Lifespan is one task. Clean up when the task ends, but if the task gets interrupted, you need to resume.
Consolidated knowledge is patterns extracted from many interactions. "This user always discusses design before coding." These can't be generated in real time. They accumulate in the background from repeated observations.
Most memory tools collapse these into one abstraction. Everything is a "memory" with the same storage, the same retrieval, the same retention. That forces bad tradeoffs. Episodic context gets extracted into "facts" on every message, triggering LLM calls that mostly produce nothing useful. Durable facts get treated as ephemeral, lost when the context window slides. Execution state gets mixed with preferences. You can't checkpoint a task without also treating "user prefers dark mode" as a checkpoint.
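The four problems can be stated as four distinct shapes rather than one "memory" record. This is a sketch, with illustrative field names, of what keeping them separate looks like at the type level:

```typescript
// Four memory kinds, each with its own trigger, retention, and query pattern.
type EpisodicTurn = { role: "user" | "assistant"; text: string; tokens: number };

type DurableFact = {
  text: string;
  scope: "user" | "project" | "team"; // who should see this, and for how long
  supersededBy?: string; // id of the fact that replaced this one, if any
};

type ExecutionState = {
  taskId: string;
  checkpoint: unknown;   // intermediate tool results, pending actions
  expiresWithTask: true; // cleaned up when the task ends
};

type ConsolidatedPattern = {
  observation: string; // e.g. "discusses design before coding"
  supportCount: number; // accumulated in the background, not per-message
};

const fact: DurableFact = { text: "Prefers TypeScript", scope: "user" };
```

With separate types, the bad tradeoffs become type errors: a checkpoint can't accidentally be stored as a permanent preference.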
What a good solution looks like
Before naming tools, here are the properties that matter:
Scoped. Different facts have different lifetimes. A scratch note from a sub-task should not persist permanently. A user preference should be visible in every future session. Scope should be a first-class concept, not an afterthought.
Superseding. When facts change, the old fact should be marked as replaced, not duplicated alongside the new one. "User prefers Python" and "User switched to TypeScript" should not coexist in search results. The old fact stays in the database for audit, but default retrieval skips it. This is Type 2 Slowly Changing Dimension, applied to agent memory.
Token-budget aware. Context assembly should be explicit and bounded. "Give me the most relevant memories for this query, within 4,000 tokens" is a better primitive than "give me the top 10 memories and hope they fit."
Framework-agnostic. If your memory breaks when you upgrade your orchestration framework, your memory was coupled to the wrong layer. Memory is a data concern, not an application concern.
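The superseding property in particular is small enough to sketch in full. This is a minimal in-memory version of SCD Type 2 applied to agent memory; the class and method names are illustrative, not db0's API:

```typescript
// Fact superseding: the old value is kept for audit but excluded
// from the default read path.
type FactRow = { id: number; text: string; supersededBy: number | null };

class FactStore {
  private rows: FactRow[] = [];
  private nextId = 1;

  add(text: string): number {
    const id = this.nextId++;
    this.rows.push({ id, text, supersededBy: null });
    return id;
  }

  supersede(oldId: number, newText: string): number {
    const newId = this.add(newText);
    const old = this.rows.find((r) => r.id === oldId);
    if (old) old.supersededBy = newId;
    return newId;
  }

  // Default retrieval: only facts that were never superseded.
  current(): string[] {
    return this.rows.filter((r) => r.supersededBy === null).map((r) => r.text);
  }

  // Full history stays queryable for audit.
  history(): string[] {
    return this.rows.map((r) => r.text);
  }
}

const facts = new FactStore();
const py = facts.add("User prefers Python");
facts.supersede(py, "User prefers TypeScript");
```

After the supersede, `current()` returns only the TypeScript preference, while `history()` still shows both: exactly the split between "excluded from retrieval" and "preserved for audit."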
A concrete implementation
Here's what this looks like with db0, an open-source storage engine for AI agents:
import { db0 } from "@db0-ai/core";
import { createSqliteBackend } from "@db0-ai/backends-sqlite";
const backend = await createSqliteBackend();
// Session 1 — user tells the agent about their project
const harness = db0.harness({
  agentId: "assistant",
  sessionId: "session-1",
  userId: "alice",
  backend,
});
await harness.context().ingest(
  "Alice is building a TypeScript project with Next.js",
  { scope: "user" },
);
// Stored, embedded, associated with userId "alice"
// Visible in every future session for this user
harness.close();
Weeks later, new session:
// Session 2 — different sessionId, same userId
const harness2 = db0.harness({
  agentId: "assistant",
  sessionId: "session-47",
  userId: "alice",
  backend,
});
const ctx = await harness2.context().pack(
  "help me set up my project",
  { tokenBudget: 2000 },
);
console.log(ctx.text);
// → "Alice is building a TypeScript project with Next.js"
// Retrieved semantically, ranked, within the token budget
context().pack() handles the semantic search, the ranking (similarity * 0.7 + recency * 0.2 + popularity * 0.1), and the token budgeting. The developer doesn't write retrieval logic. They specify the budget and get the most relevant context back.
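Internally, a pack()-style primitive has two jobs: score every candidate, then greedily fill the token budget. This sketch mirrors the weighted formula quoted above, but everything else (the greedy strategy, the rough chars-divided-by-4 token estimate) is an illustrative assumption, not db0's actual implementation:

```typescript
// Score candidates, then pack the highest-scoring ones into a token budget.
type Candidate = {
  text: string;
  similarity: number; // 0..1, from vector search
  recency: number;    // 0..1, newer is higher
  popularity: number; // 0..1, how often the fact was retrieved
};

// Rough heuristic: ~4 characters per token.
const estimateTokens = (text: string) => Math.ceil(text.length / 4);

const score = (c: Candidate) =>
  c.similarity * 0.7 + c.recency * 0.2 + c.popularity * 0.1;

function pack(candidates: Candidate[], tokenBudget: number): string[] {
  const picked: string[] = [];
  let used = 0;
  for (const c of [...candidates].sort((a, b) => score(b) - score(a))) {
    const cost = estimateTokens(c.text);
    if (used + cost > tokenBudget) continue; // skip what doesn't fit
    picked.push(c.text);
    used += cost;
  }
  return picked;
}

const memories: Candidate[] = [
  { text: "Alice is building a TypeScript project with Next.js", similarity: 0.9, recency: 0.8, popularity: 0.5 },
  { text: "User prefers dark mode", similarity: 0.2, recency: 0.9, popularity: 0.1 },
];
const packed = pack(memories, 15);
```

The budget is a hard bound: with only 15 tokens available, the lower-scoring memory is dropped rather than overflowing the context.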
The local-first argument
Most memory solutions default to a hosted setup. Mem0 has an open-source mode, but the primary path pushes you toward their cloud API. Supermemory is open source but nontrivial to self-host. Zep requires running a separate server process.
db0 runs on SQLite by default. Zero API keys for the memory layer. Zero network calls. Data stays on your machine.
This matters more for memory than for other infrastructure. Memory contains the most sensitive data in your system: user preferences, personal context, behavioral patterns. It's the one layer where "just use a cloud service" deserves more scrutiny. A SQLite file on your machine is a file you control, back up on your terms, and delete when you want.
When you need production persistence, swap to PostgreSQL with one line:
import { createPostgresBackend } from "@db0-ai/backends-postgres";
const backend = await createPostgresBackend("postgresql://user:pass@host/db0");
// Same API, same memory, different storage
Any hosted Postgres works: Neon, Supabase, Railway. The memory layer doesn't change.
The bottom line
Every AI agent will eventually need to remember things between sessions. The model won't do it for you. You have to build (or adopt) a storage layer that treats agent memory as a data management problem, not just a retrieval problem. That means scoped visibility, fact superseding, token-budget context assembly, and framework independence.
You can build it from scratch: Redis for sessions, a vector DB for semantic search, Postgres for logs, and custom glue to hold it together.
Or you can use a layer that already handles these concerns. db0 is one option: open source, SQLite for dev, Postgres for production, works with the AI SDK, LangChain.js, or any TypeScript agent.
db0 is open source (MIT). A working Next.js chat app with persistent memory is at examples/chat-agent.