AI Engineering
Chatbot Memory: How LLMs Remember Context - Buffer, Summary, Vector Memory
Lesson objectives
- Understand why LLMs lack built-in memory and how the backend solves this
- Implement Buffer Memory and Window Memory for simple chatbots
- Master Summary Memory for long conversations with summarization
- Apply Vector (episodic) Memory as RAG over conversation history
- Design Hybrid Memory with PostgreSQL persistence for production
Prerequisites
- RAG pipeline and vector search
- Chat Completions API
LLMs remember nothing. Every request is a blank slate. The illusion of memory in ChatGPT is simply the entire conversation context being resent on each call. Character.AI does this for 20 billion messages per day. MemGPT (Letta) went further: the LLM manages its own memory, deciding what to record and what to forget. Chatbot retention without a proper memory architecture is 15%; with hybrid memory it is 60%.
- ChatGPT uses window memory + server-side persistence in PostgreSQL - every chat is stored and can be resumed a month later
- Character.AI processes 20B+ messages/day - personality memory across sessions is implemented via vector store + entity extraction
- Intercom AI Support - hybrid memory with vector search over client ticket history, the agent knows all past interactions
- MemGPT (Letta, 2023) - the LLM manages its own memory: main context + archival storage + recall storage, with entity extraction from conversation
Memory was born from the context-window limit
Tracking dialogue state is an old problem from task-oriented chatbots, but memory as an LLM concern arrived with the ChatGPT era. When ChatGPT launched on 30 November 2022, its context window was about 4,096 tokens - once a conversation outgrew it, the earliest turns were pushed out and the model simply forgot them. That hard limit is the reason every memory architecture exists. **LangChain** (Harrison Chase, October 2022) shipped the first widely used memory abstractions - buffer, window, and summary memory - giving developers ready patterns instead of hand-rolled history management. In 2023 **MemGPT** (Charles Packer and colleagues, UC Berkeley) pushed further: treat the context window like an operating system's RAM and let the LLM page facts in and out of external storage, deciding for itself what to keep. The project later became Letta.
Stateless LLM: Why the Model Forgets Everything
LLMs remember nothing. Every request starts from a blank slate. The illusion of memory in ChatGPT is simply the entire conversation history being sent again with each call. Character.AI does this for 20 billion messages a day. Not magic - engineering.
This isn't a bug or an implementation limitation. LLMs are **stateless by architecture** - like HTTP without cookies. Every `chat.completions.create()` is an independent mathematical operation: vectors in, vectors out, no state left behind. Memory is an engineering layer built on top of a stateless API, not a built-in feature of the model.
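A minimal sketch of what "stateless" means in practice. `fakeModel` is a hypothetical stand-in for a real model call, not an actual API: it is a pure function of the `messages` it receives, so the only way it can "know" the user's name is if the backend resends the earlier turn.

```typescript
// A chat message in the shape used by Chat Completions-style APIs.
type ChatRole = "system" | "user" | "assistant";
interface ChatMessage {
  role: ChatRole;
  content: string;
}

// Hypothetical stand-in for a model call: a pure function of its input.
// It keeps no state between calls, so it only "knows" what is in `messages`.
function fakeModel(messages: ChatMessage[]): string {
  for (const m of messages) {
    const match = m.role === "user" ? /my name is (\w+)/i.exec(m.content) : null;
    if (match) return `Your name is ${match[1]}.`;
  }
  return "I don't know your name.";
}

// Call 1: the user introduces themselves.
const firstTurn: ChatMessage[] = [{ role: "user", content: "Hi, my name is Anna." }];
const firstReply = fakeModel(firstTurn);

// Call 2 without history: the model has nothing to go on.
const withoutHistory = fakeModel([{ role: "user", content: "What is my name?" }]);

// Call 2 with history resent by the backend: the "memory" is just the payload.
const withHistory = fakeModel([
  ...firstTurn,
  { role: "assistant", content: firstReply },
  { role: "user", content: "What is my name?" },
]);
```

The second call differs from the third only in the payload. Nothing inside the function changed between them.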
The problem isn't just the principle - it's the scale. Passing the entire history on every call means paying for every token of every past message. 50K tokens in history x 1000 requests/day x `USD 2.5/1M` input tokens = **`USD 125/day`** on context alone. That's where real memory engineering begins.
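The arithmetic above can be checked with a few lines of code. The function name and parameters are illustrative; the price is the `USD 2.5/1M` input-token figure used in this lesson.

```typescript
// Input-token cost of resending `historyTokens` of history on every request.
function contextCostUsd(
  historyTokens: number,
  requests: number,
  usdPerMillionInputTokens: number,
): number {
  return (historyTokens * requests * usdPerMillionInputTokens) / 1000000;
}

// 50K tokens of history, 1000 requests per day, USD 2.5 per 1M input tokens.
const costPerDay = contextCostUsd(50000, 1000, 2.5);

// The same history on a single request.
const costPerRequest = contextCostUsd(50000, 1, 2.5);
```

Here `costPerDay` evaluates to 125 and `costPerRequest` to 0.125, matching the figures in the text and in the table.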
| Problem | Description | Consequence |
|---|---|---|
| Context window limit | GPT-4o: 128K tokens ~ 200 pages. A 500-message conversation can exceed the limit | API error or context truncation |
| Cost | Every token in messages is billed. 50K tokens x USD 2.5/1M = USD 0.125 per request | Costs grow linearly with conversation length |
| Latency | More tokens - longer processing. 100K tokens = 2-5 seconds of extra delay | UX degrades on long conversations |
| Lost in the middle | LLMs process information in the middle of long contexts worse | Early messages are 'forgotten' even if passed |
"Lost in the middle" - a Stanford study (2023): LLMs recall the beginning and end of context well but lose information in the middle. With 100+ messages, accuracy drops 20-30% for facts from the middle of the conversation.
Myth: LLMs remember conversations between sessions - like a human after a meeting.
Reality: The model is stateless by architecture. Any memory between sessions is an engineering layer: a database, Redis, and vectors. Without it, every session is a first introduction.
ChatGPT 'remembers' past conversations because OpenAI stores them in a database and injects relevant context into the system prompt. That's RAG over chat history, not built-in model memory. Remove the database - and GPT-4o is a fresh instance with no recollection whatsoever.
Why doesn't the LLM remember previous messages between API calls?
Buffer Memory: Full Message History
The simplest approach: store **all** messages and pass them with every request. This is **Buffer Memory** - the model sees the complete conversation history. No compression, no clever tricks - just a growing list.
Buffer Memory is honest: it loses nothing and hides nothing. That's exactly why it works well for short code-generation sessions - where every line of the dialogue matters. But the price of that honesty grows linearly. Message #100 drags messages #1-99 along like an anchor.
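A minimal sketch of Buffer Memory, assuming an in-process `Map` keyed by a conversation id. The class name `BufferMemory` and its methods are illustrative, not part of any library. The returned array is what a backend would pass as `messages` to the model call.

```typescript
type ChatRole = "system" | "user" | "assistant";
interface ChatMessage {
  role: ChatRole;
  content: string;
}

// In-process store keyed by conversation id. Prototype only: lost on restart.
class BufferMemory {
  private readonly conversations = new Map<string, ChatMessage[]>();

  add(conversationId: string, message: ChatMessage): void {
    const stored = this.conversations.get(conversationId) ?? [];
    stored.push(message);
    this.conversations.set(conversationId, stored);
  }

  // Full payload for the next model call: system prompt + every stored message.
  buildMessages(conversationId: string, systemPrompt: string): ChatMessage[] {
    const stored = this.conversations.get(conversationId) ?? [];
    return [{ role: "system", content: systemPrompt }, ...stored];
  }
}

const buffer = new BufferMemory();
buffer.add("chat-1", { role: "user", content: "Write a function that reverses a string." });
buffer.add("chat-1", { role: "assistant", content: "Here is a first version of reverse()." });
buffer.add("chat-1", { role: "user", content: "Now make it handle empty input." });

// Every call carries the whole conversation: 1 system prompt + 3 messages.
const bufferPayload = buffer.buildMessages("chat-1", "You are a coding assistant.");
```

Nothing is trimmed or compressed, so the payload grows by one entry with every stored message.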
| Pros | Cons |
|---|---|
| Complete context - nothing is lost | Cost grows linearly: 100 messages x USD 0.002 = USD 0.20 per request |
| Simplest implementation | Context window overflows on long conversations |
| Model sees the full conversation flow | Lost in the middle: early messages degrade |
| Works for short sessions (10-20 messages) | Not suitable for customer support (100+ messages per session) |
A Map in Node.js memory is for prototypes only. In production, use PostgreSQL or Redis. When the server restarts, the Map is wiped, and all conversations are lost.
A conversation is 200 messages long. With Buffer Memory, the cost of each new request...
Window Memory: Sliding Window of the Last N Messages
Why pass the entire conversation when the last 10-20 messages contain 90% of the needed context? **Sliding window** is the same logic transformers use when processing long documents: take a fixed-size window and move it through the text. **Window Memory** keeps only the last N messages (or K tokens) and drops everything outside the frame.
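A minimal sketch of a message-count window. The helper name `lastNMessages` is illustrative. It keeps the tail of the history and silently drops everything older.

```typescript
type ChatRole = "system" | "user" | "assistant";
interface ChatMessage {
  role: ChatRole;
  content: string;
}

// Keep only the last `n` messages. The guard matters: slice(-0) would return everything.
function lastNMessages(messages: ChatMessage[], n: number): ChatMessage[] {
  return n > 0 ? messages.slice(-n) : [];
}

// Build a 50-message conversation with alternating roles.
const fiftyMessages: ChatMessage[] = [];
for (let i = 1; i <= 50; i++) {
  fiftyMessages.push({ role: i % 2 === 1 ? "user" : "assistant", content: `message #${i}` });
}

// With a window of 20, the model sees messages #31 to #50 only.
const windowed = lastNMessages(fiftyMessages, 20);
const stillSeesMessageFive = windowed.some((m) => m.content === "message #5");
```

In this run `stillSeesMessageFive` is `false`: anything outside the frame is gone as far as the model is concerned.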
A more precise variant - limiting by tokens rather than message count:
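A minimal sketch of the token-based variant. The token count here is an assumption: `estimateTokens` uses a rough "4 characters per token" heuristic, whereas a production system would use the model's actual tokenizer. Function names are illustrative.

```typescript
type ChatRole = "system" | "user" | "assistant";
interface ChatMessage {
  role: ChatRole;
  content: string;
}

// Rough estimate (assumption): about 4 characters per token for English text.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Keep the newest messages whose combined estimated size fits into `maxTokens`.
function windowByTokens(messages: ChatMessage[], maxTokens: number): ChatMessage[] {
  const kept: ChatMessage[] = [];
  let used = 0;
  // Walk backwards from the newest message and stop once the budget is spent.
  for (let i = messages.length - 1; i >= 0; i--) {
    const message = messages[i];
    if (message === undefined) break;
    const cost = estimateTokens(message.content);
    if (used + cost > maxTokens) break;
    kept.push(message);
    used += cost;
  }
  return kept.reverse(); // restore chronological order
}

const sample: ChatMessage[] = [
  { role: "user", content: "a".repeat(40) },      // ~10 tokens, oldest
  { role: "assistant", content: "b".repeat(80) }, // ~20 tokens
  { role: "user", content: "c".repeat(40) },      // ~10 tokens, newest
];

// Budget of 30 tokens: the two newest messages fit, the oldest does not.
const trimmed = windowByTokens(sample, 30);
```

One design choice to be aware of: if the newest message alone exceeds the budget, this sketch returns an empty window, so a real implementation would need a fallback such as truncating that message.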
Window Memory is a good default for most chatbots. Recommendation: maxTokens = 25% of the model's context window. For GPT-4o (128K) - 32K tokens for history, the rest for system prompt + RAG context + generation.
With Window Memory limited to 20 messages: at message #50, the user references a topic from message #5. What happens?
Summary Memory: LLM as History Compressor
Window Memory silently drops the past. **Summarization memory** takes a different angle: instead of deleting old messages - compress them through the same LLM. A cheap model (gpt-4o-mini, `USD 0.15/1M` tokens) compresses 50 messages into 2-3 paragraphs, preserving key facts. The expensive model gets the compact version and never knows the difference.
Summarization fires on a threshold - that's the key design decision. An extra LLM call after every message would kill latency. But at a 2000-token threshold, it's an infrequent operation, nearly invisible to the user. Summarization errors accumulate like in a game of telephone - which is exactly why critical data needs a different approach.
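A minimal sketch of threshold-triggered summarization. Several things here are assumptions: the `Summarizer` type stands in for a call to a cheap model and is synchronous to keep the example runnable (a real call would be asynchronous), `naiveSummarizer` is a placeholder that only truncates text, and tokens are estimated at roughly 4 characters each.

```typescript
type ChatRole = "system" | "user" | "assistant";
interface ChatMessage {
  role: ChatRole;
  content: string;
}

// Rough estimate (assumption): about 4 characters per token.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Hypothetical hook: in production this would call a cheap LLM.
type Summarizer = (previousSummary: string, messages: ChatMessage[]) => string;

class SummaryMemory {
  private summary = "";
  private recent: ChatMessage[] = [];
  private readonly summarize: Summarizer;
  private readonly thresholdTokens: number;
  summarizerCalls = 0;

  constructor(summarize: Summarizer, thresholdTokens: number) {
    this.summarize = summarize;
    this.thresholdTokens = thresholdTokens;
  }

  add(message: ChatMessage): void {
    this.recent.push(message);
    const pending = this.recent.reduce((sum, m) => sum + estimateTokens(m.content), 0);
    // Summarize only when unsummarized messages exceed the threshold.
    if (pending > this.thresholdTokens) {
      this.summary = this.summarize(this.summary, this.recent);
      this.recent = [];
      this.summarizerCalls++;
    }
  }

  buildMessages(systemPrompt: string): ChatMessage[] {
    const system = this.summary
      ? `${systemPrompt}\n\nConversation summary so far:\n${this.summary}`
      : systemPrompt;
    return [{ role: "system", content: system }, ...this.recent];
  }
}

// Placeholder summarizer: keeps the first 30 characters of each message.
const naiveSummarizer: Summarizer = (previous, messages) => {
  const digest = messages.map((m) => `${m.role}: ${m.content.slice(0, 30)}`).join(" / ");
  return previous ? `${previous} / ${digest}` : digest;
};

const summaryMemory = new SummaryMemory(naiveSummarizer, 20);
const filler = "x".repeat(37); // each message below is 40 characters, ~10 tokens
for (const label of ["1", "2", "3", "4"]) {
  summaryMemory.add({ role: "user", content: `${label}: ${filler}` });
}
const summaryPayload = summaryMemory.buildMessages("You are helpful.");
```

With a 20-token threshold, the third message pushes the pending total to about 30 tokens and triggers the only summarizer call. The fourth message stays verbatim next to the summary.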
| Pros | Cons |
|---|---|
| Fixed cost - summary doesn't grow indefinitely | Loss of detail during summarization |
| Key facts are preserved | Extra LLM call for summarization (+latency, +cost) |
| Suitable for long conversations (100+ messages) | Summarization errors accumulate ('telephone game') |
| Controlled context size | Harder to debug - which facts were lost? |
Summarization loses nuance. The user said 'budget is roughly `USD 50K`, but could be `USD 70K` if marketing is included.' Summary: 'budget `USD 50-70K`.' The context 'if marketing is included' is gone. For critical data, vector memory is a better fit.
Summary Memory summarizes old messages when...
Vector Memory: Semantic Search Over History
Buffer drowns in cost. Window loses the past. Summary loses the details. **Episodic (vector) memory** takes a fundamentally different approach: each message is stored as an embedding via `text-embedding-3-small` (1536 dims, `USD 0.02/1M` tokens), and with every request an HNSW index finds **semantically close** fragments from the entire history in 3 ms. RAG, but over the conversation itself.
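A minimal sketch of retrieval over conversation history. Two deliberate simplifications, both assumptions: `toyEmbed` is a word-count stand-in for a real embedding model such as `text-embedding-3-small`, and the search is a brute-force cosine scan where a production system would use an HNSW index. It matches on shared words only, so unlike real embeddings it will not find paraphrases.

```typescript
type ChatRole = "system" | "user" | "assistant";
interface ChatMessage {
  role: ChatRole;
  content: string;
}

type SparseVector = Map<string, number>;

// Hypothetical stand-in for an embedding model: word counts as a sparse vector.
function toyEmbed(text: string): SparseVector {
  const vector: SparseVector = new Map();
  for (const word of text.toLowerCase().split(/[^a-z0-9]+/)) {
    if (word.length === 0) continue;
    vector.set(word, (vector.get(word) ?? 0) + 1);
  }
  return vector;
}

function cosineSimilarity(a: SparseVector, b: SparseVector): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((value, key) => {
    dot += value * (b.get(key) ?? 0);
    normA += value * value;
  });
  b.forEach((value) => {
    normB += value * value;
  });
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

interface StoredMemory {
  index: number; // position in the conversation, used to restore order
  message: ChatMessage;
  embedding: SparseVector;
}

class VectorMemory {
  private readonly items: StoredMemory[] = [];

  add(message: ChatMessage): void {
    this.items.push({ index: this.items.length, message, embedding: toyEmbed(message.content) });
  }

  // Brute-force scan over all stored messages; an ANN index would replace this at scale.
  search(query: string, topK: number): ChatMessage[] {
    const queryVector = toyEmbed(query);
    return this.items
      .map((item) => ({ item, score: cosineSimilarity(queryVector, item.embedding) }))
      .filter((entry) => entry.score > 0)
      .sort((a, b) => b.score - a.score)
      .slice(0, topK)
      .sort((a, b) => a.item.index - b.item.index) // back to chronological order
      .map((entry) => entry.item.message);
  }
}

const episodic = new VectorMemory();
episodic.add({ role: "user", content: "Our startup TokenFlow builds billing tools" });
episodic.add({ role: "user", content: "What is the weather like today" });
episodic.add({ role: "user", content: "Please use dark mode in the dashboard" });

const recalledMessages = episodic.search("Which startup did I mention?", 2);
```

Only the TokenFlow message shares a word with the query, so it is the single result. Messages with zero similarity are filtered out rather than padded into the context.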
This principle is the foundation of **MemGPT** (now Letta) - a system where the LLM manages its own memory: deciding what to write to long-term storage, what to retrieve, and what to forget. Entity extraction on top of the vector layer is another tier: named entities (names, companies, tasks) are automatically pulled from conversations and stored separately. A user mentions 'our startup TokenFlow' - the next session already has the context.
| Pros | Cons |
|---|---|
| Finds relevant context from any point in history | Embedding call for every message (+latency, +cost) |
| Scales to 1000+ messages | Loses sequentiality - messages are pulled out of order |
| Works like RAG over conversation history | More complex implementation - needs a vector DB |
| Fixed context size | Not suitable for step-by-step instructions (order matters) |
When processing a new message, Vector Memory...
Hybrid Memory and PostgreSQL Persistence
Every memory type covers one angle and misses others. The production solution is **hybrid**: Window provides immediate context, Summary holds compressed history, Vector pulls semantically relevant content from any point in time. Three layers, one request. This is exactly how enterprise AI assistants work - and why they don't forget what a client said three months ago.
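A minimal sketch of how the three layers could be merged into one request. The `HybridSources` shape and `buildHybridContext` are illustrative names; the sketch assumes the summary, the recalled messages, and the recent window have already been produced by their respective layers.

```typescript
type ChatRole = "system" | "user" | "assistant";
interface ChatMessage {
  role: ChatRole;
  content: string;
}

interface HybridSources {
  systemPrompt: string;
  summary: string;         // Summary layer: compressed older history
  recalled: ChatMessage[]; // Vector layer: semantically relevant older messages
  recent: ChatMessage[];   // Window layer: last N messages, verbatim
}

function buildHybridContext(sources: HybridSources, userMessage: string): ChatMessage[] {
  // Skip recalled messages already inside the window to avoid paying for them twice.
  const recentContents = sources.recent.map((m) => m.content);
  const uniqueRecalled = sources.recalled.filter(
    (m) => recentContents.indexOf(m.content) === -1,
  );

  const sections: string[] = [sources.systemPrompt];
  if (sources.summary) {
    sections.push(`Summary of earlier conversation:\n${sources.summary}`);
  }
  if (uniqueRecalled.length > 0) {
    const lines = uniqueRecalled.map((m) => `- ${m.role}: ${m.content}`).join("\n");
    sections.push(`Relevant earlier messages:\n${lines}`);
  }

  return [
    { role: "system", content: sections.join("\n\n") },
    ...sources.recent,
    { role: "user", content: userMessage },
  ];
}

const hybridPayload = buildHybridContext(
  {
    systemPrompt: "You are a helpful assistant.",
    summary: "User is planning a product launch in Q3.",
    recalled: [
      { role: "user", content: "Our budget is USD 50K, or 70K if marketing is included." },
      { role: "user", content: "Can you draft the email?" },
    ],
    recent: [
      { role: "user", content: "Can you draft the email?" },
      { role: "assistant", content: "Sure, here is a draft." },
    ],
  },
  "Does the draft fit the budget?",
);
```

Summary and recalled messages go into the system prompt because they are background rather than live turns. The window stays as real messages so the model still sees the immediate flow of the dialogue.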
PostgreSQL Schema for Production
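A possible shape for such a schema, kept as SQL strings in TypeScript so a backend could pass them to its PostgreSQL client. This is a sketch under stated assumptions, not the schema of any real product: all table, column, and index names are hypothetical, and it assumes the pgvector extension is available and PostgreSQL 13 or newer (for the built-in `gen_random_uuid()`). The 1536 dimensions match `text-embedding-3-small`.

```typescript
// Hypothetical schema: names are illustrative only.
const memorySchemaSql = `
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE conversations (
  id          uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  user_id     text NOT NULL,
  summary     text NOT NULL DEFAULT '',          -- Summary layer
  created_at  timestamptz NOT NULL DEFAULT now(),
  updated_at  timestamptz NOT NULL DEFAULT now()
);

CREATE TABLE messages (
  id               bigserial PRIMARY KEY,
  conversation_id  uuid NOT NULL REFERENCES conversations(id) ON DELETE CASCADE,
  role             text NOT NULL CHECK (role IN ('system', 'user', 'assistant')),
  content          text NOT NULL,
  token_count      integer NOT NULL,
  embedding        vector(1536),                 -- Vector layer, filled asynchronously
  created_at       timestamptz NOT NULL DEFAULT now()
);

-- Window layer: newest messages of one conversation.
CREATE INDEX messages_recent_idx ON messages (conversation_id, created_at DESC);

-- Vector layer: approximate nearest-neighbour search by cosine distance.
CREATE INDEX messages_embedding_idx ON messages USING hnsw (embedding vector_cosine_ops);
`;

// Window layer query: $1 = conversation id, $2 = window size.
const recentMessagesSql = `
SELECT role, content FROM messages
WHERE conversation_id = $1
ORDER BY created_at DESC
LIMIT $2;
`;

// Vector layer query: $2 = query embedding, <=> is pgvector's cosine distance operator.
const recallSql = `
SELECT role, content FROM messages
WHERE conversation_id = $1 AND embedding IS NOT NULL
ORDER BY embedding <=> $2
LIMIT $3;
`;
```

The `embedding` column is nullable on purpose in this sketch, so a message can be written immediately and embedded afterwards without blocking the response.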
Comparison of all memory strategies - when to use each:
| Strategy | Conversation length | Cost | Accuracy | Scenario |
|---|---|---|---|---|
| Buffer | < 20 messages | $$$ | 100% | Short tasks: code generation, translation |
| Window (last N) | Any | $ | 70-80% | Customer support, casual chat |
| Summary | 20-200 messages | $$ | 80-85% | Consulting, coaching, long sessions |
| Vector (episodic) | Any | $$ | 85-90% for relevant | Technical support, knowledge workers |
| Hybrid | Any | $$$ | 90-95% | Production chatbots, AI assistants |
Start with Window Memory (a simple, cheap default). If users complain about 'forgetfulness', add Summary. If precision on specific facts is needed, add Vector. Hybrid is for production AI assistants. Redis is ideal for session storage: O(1) access to recent messages, TTL for stale sessions, pub/sub for streaming.
Misconception: Chatbot memory is about preserving the full message history. The longer the buffer, the smarter the bot.
Reality: Memory is about managing the trade-off between context window, cost, and relevance. A hybrid setup (Window + Summary + Vector) almost always yields better quality at lower cost than simply growing the buffer.
The misconception comes from intuition about human memory: remember more, understand better. LLMs do not work that way - extra messages in context dilute attention, add noise to retrieval, and grow cost linearly. Past 20-30 turns a raw buffer starts losing to a summary strategy on answer accuracy, not just on price.
Hybrid Memory combines three context sources. Which one is responsible for 'the model remembers a fact from message #5 when processing message #150'?
Key Takeaways
- The LLM API is stateless - every call starts from scratch; 'memory' is the backend's responsibility
- Buffer Memory: full history, ideal for short tasks, expensive and slow for long conversations
- Window (sliding) Memory: last N messages - a sensible default for 80% of chatbots
- Summary Memory: gpt-4o-mini compresses history on a threshold - preserves the essence, loses the nuance
- Vector (episodic) Memory: text-embedding-3-small + HNSW = RAG over conversation history, entity extraction on top
- Hybrid (Window + Summary + Vector) - production standard; Redis for session storage, PostgreSQL with pgvector for persistence
Questions for reflection
- In what chatbot scenario is Buffer Memory justified even at 50+ messages - and what does that say about the task?
- How does entity extraction on top of vector memory change response quality compared to raw message vectors?
- MemGPT lets the LLM decide what to remember. What are the failure modes of that architecture?
What's Next
Memory lets the chatbot remember context. The next step is giving it the ability to act: call functions, access APIs, perform tasks.
- Tool Calling — How LLMs call functions - function calling, tool use, structured actions
- Agent Fundamentals — From chatbot to agent - planning, reasoning, tool use in a loop
- Caching & Optimization — How to cache responses and reduce costs for memory-heavy chatbots
Related lessons
- aie-04-tokens-context-window — Context window limit is why memory exists
- aie-09-embeddings — Vector memory uses similarity search over embeddings
- aie-16-tool-calling — Memory + tools = fully persistent agent
- aie-17-agent-fundamentals — Long-term memory is a key agent component
- aut-07-attention-memory — Working vs long-term memory - same architectural tradeoff
- aie-12-rag-fundamentals — Vector memory is RAG applied to conversation history
- prob-17
- db-19-redis