
How to give your Vercel AI SDK chatbot long-term memory

ai-sdk · memory · vercel · tutorial

Your chatbot has a problem. A user tells it they prefer concise responses. They come back two weeks later, start a new chat, and the agent has no idea. It greets them like a stranger.

This isn't a model problem. Claude and GPT-5 have excellent memory within a conversation. It's a storage problem. The AI SDK saves the current chat's messages. It does not connect one chat to another.

This post adds long-term memory to a standard useChat app in about 30 lines of code using db0, an open-source memory system for AI agents.

What AI SDK v5 already handles well

Let's be clear about what's not broken. AI SDK v5 has solid persistence for the current chat:

return result.toUIMessageStreamResponse({
  originalMessages: messages,
  onFinish: async ({ messages }) => {
    await db.save(chatId, messages) // works — includes the assistant reply, no stale closure
  },
})

validateUIMessages() handles schema drift on load. consumeStream() ensures messages are saved even if the client disconnects. If all you need is to save and restore a single conversation, the SDK handles it.

db0 does not replace this. We're going to use the SDK's persistence exactly as intended.

The gap: chat history is not memory

Chat history is what happened in a conversation. Memory is what the agent knows about a user, regardless of which conversation they're in.

The AI SDK team knows this is a gap. Their memory docs describe three approaches: provider-specific tools (Anthropic's Memory Tool), third-party memory providers (Mem0, Letta), and building your own. All three are DIY. The SDK provides hooks like prepareStep and onFinish, but no built-in memory primitives. That's the right boundary. A streaming/generation SDK shouldn't own knowledge management. But it means someone else has to.

Here's the math on why you can't just "send everything." A user across 50 sessions, averaging 50 messages at ~200 tokens per message, accumulates 500,000 tokens of history. At GPT-5.4-mini pricing ($0.75/M input tokens), that's $0.375 per request, just on input. A hundred requests a day: over $1,100/month for a single user. The context window literally can't hold it, and even if it could, you're paying to send the agent a wall of text where 95% is irrelevant to the current question.
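The arithmetic is worth sanity-checking. The constants below are the assumptions from the paragraph above, not measurements:

```typescript
// Back-of-envelope cost of resending full history on every request.
const sessions = 50;
const messagesPerSession = 50;
const tokensPerMessage = 200;
const dollarsPerMillionInputTokens = 0.75; // assumed GPT-5.4-mini input price
const requestsPerDay = 100;

const historyTokens = sessions * messagesPerSession * tokensPerMessage;
const costPerRequest = (historyTokens / 1_000_000) * dollarsPerMillionInputTokens;
const costPerMonth = costPerRequest * requestsPerDay * 30;

console.log(historyTokens, costPerRequest, costPerMonth); // 500000 0.375 1125
```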

The naive workaround (load the last 20 messages) loses everything before that. The user's preference from session 3 is gone by session 25.

Keyword search doesn't help either. "I like short answers" doesn't match a search for "preferred communication style." You need semantic understanding.

What you actually need is:

  1. Extract durable facts from each conversation ("user prefers concise responses")
  2. Store them with embeddings so they're semantically searchable
  3. Retrieve the most relevant facts for the current query, within a token budget
  4. Inject them into the system prompt before the LLM call

That's a memory system. db0 provides it.

Adding memory in 30 lines

Install

npm install @db0-ai/core @db0-ai/backends-sqlite

Set up the harness

Create a lib/db0.ts that initializes db0 with SQLite storage:

// lib/db0.ts
import { db0, defaultEmbeddingFn, PROFILE_CONVERSATIONAL } from "@db0-ai/core";
import { createSqliteBackend } from "@db0-ai/backends-sqlite";

let backend: Awaited<ReturnType<typeof createSqliteBackend>> | null = null;

export async function getHarness({ sessionId, userId }: { sessionId: string; userId: string }) {
  if (!backend) backend = await createSqliteBackend({ dbPath: "./memory.sqlite" });

  return db0.harness({
    agentId: "chat",
    sessionId,
    userId,
    backend,
    embeddingFn: defaultEmbeddingFn,
    profile: PROFILE_CONVERSATIONAL,
  });
}

That's the entire setup. SQLite, zero API keys for the memory layer, works offline.

Update your API route

Two calls added to your existing streamText route:

// app/api/chat/route.ts
import { openai } from "@ai-sdk/openai";
import { convertToModelMessages, streamText, type UIMessage } from "ai";
import { getHarness } from "@/lib/db0";

export async function POST(req: Request) {
  const { messages, chatId, userId } = (await req.json()) as {
    messages: UIMessage[];
    chatId: string;
    userId: string;
  };
  const harness = await getHarness({ sessionId: chatId, userId });

  // 1. Pack relevant memories into the system prompt
  const lastUserMessage =
    messages
      .at(-1)
      ?.parts.map((part) => (part.type === "text" ? part.text : ""))
      .join("") ?? "";
  const ctx = await harness.context().pack(lastUserMessage, { tokenBudget: 2000 });

  const result = streamText({
    model: openai("gpt-5.4-mini"),
    system: ctx.count > 0
      ? `You are a helpful assistant.\n\nContext from past conversations:\n${ctx.text}`
      : "You are a helpful assistant.",
    messages: convertToModelMessages(messages),
    async onFinish({ text }) {
      // Your existing chat persistence here (AI SDK handles this)

      // 2. Extract durable facts for future sessions
      const extraction = harness.extraction();
      for (const fact of await extraction.extract(lastUserMessage)) {
        await harness.context().ingest(fact.content, { scope: fact.scope, tags: fact.tags });
      }
      for (const fact of await extraction.extract(text)) {
        await harness.context().ingest(fact.content, { scope: fact.scope, tags: fact.tags });
      }
    },
  });

  return result.toUIMessageStreamResponse();
}

Two additions, marked 1 and 2:

context().pack() runs before the LLM call. It searches all past memories for this user, ranks them by relevance to the current message (semantic similarity * 0.7 + recency * 0.2 + popularity * 0.1), and assembles the most relevant ones within a 2,000-token budget. This is what Andrej Karpathy and Tobi Lütke call context engineering: filling the context window with exactly the right information.
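The weighting can be written down directly. This is an illustrative reimplementation of the formula described above, not db0's internal code; all three inputs are assumed to be normalized to [0, 1]:

```typescript
// Score a memory for a query: semantic similarity dominates,
// with smaller boosts for recency and popularity.
function rankScore(similarity: number, recency: number, popularity: number): number {
  return similarity * 0.7 + recency * 0.2 + popularity * 0.1;
}

// An old but highly relevant fact still outranks a fresh, loosely related one.
const oldExactFact = rankScore(0.95, 0.1, 0.5); // ≈ 0.735
const freshLooseFact = rankScore(0.4, 1.0, 1.0); // ≈ 0.58
```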

extraction().extract() runs after the LLM responds. It scans both the user message and the assistant response for durable facts: "user prefers TypeScript," "always use bun," "working on a Next.js project." This uses rules-based pattern matching. Zero LLM calls, deterministic, near-zero latency. That last point matters more than you'd think. Mem0 and similar tools call an LLM on every message to decide what to extract. That's an extra 200-500ms and $0.001-0.01 per turn, and the results aren't deterministic. Rules handle 60-70% of extraction scenarios without any of that overhead.
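To make "rules-based pattern matching" concrete, here is a toy extractor built on a few regexes. The patterns are illustrative, far cruder than db0's actual rule set, but they show why this path needs no LLM call:

```typescript
// Match durable, first-person statements worth remembering.
// These patterns are examples, not db0's real rules.
const RULES: { pattern: RegExp; tag: string }[] = [
  { pattern: /\bmy name is (\w+)/i, tag: "identity" },
  { pattern: /\bI (?:always|never) use ([^.,!?]+)/i, tag: "preference" },
  { pattern: /\bI prefer ([^.,!?]+)/i, tag: "preference" },
  { pattern: /\bI(?:'m| am) working on ([^.,!?]+)/i, tag: "project" },
];

function extractFacts(message: string): { content: string; tag: string }[] {
  const facts: { content: string; tag: string }[] = [];
  for (const { pattern, tag } of RULES) {
    const m = message.match(pattern);
    if (m) facts.push({ content: m[0], tag });
  }
  return facts;
}

extractFacts("My name is Alice and I always use TypeScript with strict mode.");
// → [{ content: "My name is Alice", tag: "identity" },
//    { content: "I always use TypeScript with strict mode", tag: "preference" }]
```

Deterministic by construction: the same message always yields the same facts, at string-matching speed.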

The extracted facts are stored with user scope, meaning they're visible in every future session for this user.

See it working

Session 1:

User: My name is Alice and I always use TypeScript with strict mode.
Agent: Got it, Alice — I'll remember your TypeScript preference.

db0 extracts: "User's name is Alice," "User always uses TypeScript with strict mode." Both stored with user scope.

Session 2 (new chat, different chatId):

User: What do you know about me?

Before the LLM call, context().pack("What do you know about me?") searches memory and returns:

- User's name is Alice
- User always uses TypeScript with strict mode

Injected into the system prompt. The agent responds with context it otherwise wouldn't have:

Agent: You're Alice, and you prefer TypeScript with strict mode enabled.

No manual retrieval code. No loading all historical messages. The relevant facts crossed the session boundary because they were extracted in session 1 and packed in session 2.

Context window management

Discussion #2639 asked "Does the AI SDK support context window management?" It was locked with no official answer. Discussion #8192 asked where to insert compaction logic. The maintainer pointed to prepareStep, but acknowledged it's entirely DIY.

context().pack() is one answer. Instead of sending the full message history and hitting the limit:

This model's maximum context length is 128000 tokens.
However, your messages resulted in 209275 tokens.

You send only what's relevant, within a budget you control:

const ctx = await harness.context().pack(currentMessage, { tokenBudget: 4000 });
// ctx.count: 8 memories
// ctx.estimatedTokens: 3,847
// Always within budget. Most relevant history ranked first.

No manual truncation. No arbitrary sliding window. The 4,000-token budget means you're spending ~$0.003 per request on memory context instead of $0.375 on raw history. Two orders of magnitude.
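Conceptually, budgeted assembly is a greedy loop: walk memories in rank order and keep what fits. A sketch, where both the 4-characters-per-token heuristic and the shape of `Memory` are assumptions, not db0's implementation:

```typescript
type Memory = { content: string; score: number };

// Rough heuristic: ~4 characters per token for English text.
const estimateTokens = (s: string) => Math.ceil(s.length / 4);

// Greedily pack the highest-ranked memories that fit within the budget.
function pack(memories: Memory[], tokenBudget: number) {
  const ranked = [...memories].sort((a, b) => b.score - a.score);
  const picked: string[] = [];
  let used = 0;
  for (const m of ranked) {
    const cost = estimateTokens(m.content);
    if (used + cost > tokenBudget) continue; // skip what doesn't fit
    picked.push(m.content);
    used += cost;
  }
  return { text: picked.join("\n"), count: picked.length, estimatedTokens: used };
}
```

Skipping oversized entries rather than stopping at the first one lets smaller, lower-ranked facts still use leftover budget.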

When facts change

Users change their minds. This is where most memory systems break. ChatGPT's memory feature still accumulates contradictions. Users on r/ChatGPT regularly share screenshots of their memory stores holding entries like "User is a software engineer" alongside "User is a product manager" because they changed jobs between conversations. The LOCOMO benchmark (Snap Research) tested this systematically and found that all evaluated systems, including GPT-4 with full context, showed significant degradation on temporal/update questions.

The root cause is structural: vector similarity search is atemporal. When you embed "User prefers Python" and later "User switched to TypeScript," both embeddings are highly similar to the query "preferred language." The retriever returns both. Which one the model picks is a coin flip.

db0 handles this with superseding. When context().ingest() detects a contradiction, the old fact is preserved for audit but excluded from future searches:

Session 1: "User prefers Python"        → stored
Session 5: "User switched to Rust"      → stored, supersedes the Python preference
Session 6: context().pack()             → returns "User switched to Rust" (not both)

The old record is still there for history. But retrieval only returns the current fact. Deterministic, no coin flip.
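A minimal sketch of the superseding mechanic, assuming contradictions are detected by a shared topic key (db0's actual contradiction detection is more involved than this):

```typescript
type Fact = { id: number; content: string; topic: string; supersededBy: number | null };

const store: Fact[] = [];
let nextId = 1;

// Ingest a fact; any earlier active fact on the same topic is superseded:
// kept for audit, hidden from retrieval.
function ingest(content: string, topic: string): Fact {
  const fact: Fact = { id: nextId++, content, topic, supersededBy: null };
  for (const old of store) {
    if (old.topic === topic && old.supersededBy === null) old.supersededBy = fact.id;
  }
  store.push(fact);
  return fact;
}

// Retrieval only ever sees the current fact per topic.
const active = () => store.filter((f) => f.supersededBy === null);

ingest("User prefers Python", "preferred-language");
ingest("User switched to Rust", "preferred-language");
// active() now returns only "User switched to Rust"; the Python fact
// stays in `store` for audit but is excluded from search.
```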

SQLite for dev, Postgres for production

The example uses SQLite: zero setup, data in a local file. For production on Vercel (serverless, no persistent disk), swap to Postgres with one line:

// lib/db0.ts
import { createPostgresBackend } from "@db0-ai/backends-postgres";

// In getHarness(), replace the SQLite line with:
if (!backend) backend = await createPostgresBackend(process.env.DATABASE_URL!);

Any hosted Postgres works: Neon, Supabase, Railway. Same API, same memory, different storage.

SQLite does not work on Vercel serverless functions (no persistent disk). It works fine for local development, Railway, Fly.io, and single-server deployments.

Try the working example

The complete Next.js example is at examples/chat-agent in the db0 repo. It includes memory attribution (which memories backed each response), a memory dashboard (browse stored facts), and a thinking view (watch the agent search memory before responding).

git clone https://github.com/db0-ai/db0.git
cd db0/examples/chat-agent
npm install
echo "OPENAI_API_KEY=sk-..." > .env.local
npm run dev

Tell the agent your name and preferences. Click + New to start a fresh chat. Ask "What do you know about me?"

What db0 is and isn't

db0 is a memory system that sits alongside the AI SDK. It handles what the SDK deliberately leaves open: long-term knowledge that spans conversations.

It is not a replacement for message persistence. The SDK's onFinish callback saves your chat history. db0 adds a layer on top: extracted facts, semantic search, scoped visibility, token-budget context assembly.

The two work together. AI SDK handles the current conversation. db0 handles everything the agent should know from past conversations.


db0 is open source (MIT). Works with any AI SDK provider: Anthropic, OpenAI, Google. Zero API keys required for the memory layer.