AI Engineering

Chatbot Memory: How LLMs Remember Context - Buffer, Summary, Vector Memory

Lesson objectives

Understand why LLMs lack built-in memory and how the backend solves this
Implement Buffer Memory and Window Memory for simple chatbots
Master Summary Memory for long conversations with summarization
Apply Vector (episodic) Memory as RAG over conversation history
Design Hybrid Memory with PostgreSQL persistence for production

Prerequisites

RAG pipeline and vector search
Chat Completions API

LLMs remember nothing. Every request is a blank slate. The illusion of memory in ChatGPT comes from resending the entire conversation context on each call. Character.AI does this for 20 billion messages per day. MemGPT (Letta) went further: the LLM manages its own memory, deciding what to record and what to forget. Chatbot retention without a proper memory architecture is about 15%; with hybrid memory it is about 60%.

ChatGPT uses window memory + server-side persistence in PostgreSQL - every chat is stored and can be resumed a month later
Character.AI processes 20B+ messages/day - personality memory across sessions is implemented via vector store + entity extraction
Intercom AI Support - hybrid memory with vector search over client ticket history, the agent knows all past interactions
MemGPT (Letta, 2023) - the LLM manages its own memory: main context + archival storage + recall storage, with entity extraction from conversation

Memory was born from the context-window limit

Tracking dialogue state is an old problem from task-oriented chatbots, but memory as an LLM concern arrived with the ChatGPT era. When ChatGPT launched on 30 November 2022, its context window was about 4,096 tokens - once a conversation outgrew it, the earliest turns were pushed out and the model simply forgot them. That hard limit is the reason every memory architecture exists. **LangChain** (Harrison Chase, October 2022) shipped the first widely used memory abstractions - buffer, window, and summary memory - giving developers ready patterns instead of hand-rolled history management. In 2023 **MemGPT** (Charles Packer and colleagues, UC Berkeley) pushed further: treat the context window like an operating system's RAM and let the LLM page facts in and out of external storage, deciding for itself what to keep. The project later became Letta.

Stateless LLM: Why the Model Forgets Everything

That blank slate is the starting point for every design in this lesson. What looks like memory in ChatGPT is the backend sending the conversation history again with each call. Not magic - engineering.

This isn't a bug or an implementation limitation. LLMs are **stateless by architecture** - like HTTP without cookies. Every `chat.completions.create()` is an independent mathematical operation: vectors in, vectors out, no state left behind. Memory is an engineering layer built on top of a stateless API, not a built-in feature of the model.

The problem isn't just the principle - it's the scale. Passing the entire history on every call means paying for every token of every past message. 50K tokens in history x 1000 requests/day x `USD 2.5/1M` input tokens = **`USD 125/day`** on context alone. That's where real memory engineering begins.
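The arithmetic above is easy to turn into a small helper for trying other history sizes and prices. This is a minimal sketch: the function name `dailyContextCostUsd` is hypothetical, and the numbers passed in are the ones quoted in the text, not live pricing.

```typescript
// Input-token cost of resending conversation history on every request.
// All three arguments are inputs; the values used below come from the text.
function dailyContextCostUsd(
  historyTokens: number,
  requestsPerDay: number,
  usdPerMillionInputTokens: number,
): number {
  const tokensPerDay = historyTokens * requestsPerDay;
  return (tokensPerDay / 1_000_000) * usdPerMillionInputTokens;
}

// One request carrying 50K tokens of history at USD 2.5 per 1M input tokens.
const perRequest = dailyContextCostUsd(50_000, 1, 2.5);

// The same history resent 1000 times a day.
const perDay = dailyContextCostUsd(50_000, 1_000, 2.5);
```

Here `perRequest` works out to USD 0.125 and `perDay` to USD 125, matching the figures in the paragraph and the table.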

Problem	Description	Consequence
Context window limit	GPT-4o: 128K tokens ~ 200 pages. A 500-message conversation can exceed the limit	API error or context truncation
Cost	Every token in messages is billed. 50K tokens x USD 2.5/1M = USD 0.125 per request	Costs grow linearly with conversation length
Latency	More tokens - longer processing. 100K tokens = 2-5 seconds of extra delay	UX degrades on long conversations
Lost in the middle	LLMs process information in the middle of long contexts worse	Early messages are 'forgotten' even if passed

"Lost in the middle" is the name of a 2023 study by Stanford researchers and collaborators (Liu et al.): LLMs recall the beginning and end of a context well but lose information in the middle. With 100+ messages, accuracy for facts from the middle of the conversation drops by 20-30%.

Myth: LLMs remember conversations between sessions, like a human after a meeting.

Reality: The model is stateless by architecture. Any memory between sessions is an engineering layer: a database, Redis, vectors. Without it, every session is a first introduction.

ChatGPT 'remembers' past conversations because OpenAI stores them in a database and injects relevant context into the system prompt. That's RAG over chat history, not built-in model memory. Remove the database, and GPT-4o is a fresh instance with no recollection whatsoever.

Why doesn't the LLM remember previous messages between API calls?

Buffer Memory: Full Message History

The simplest approach: store **all** messages and pass them with every request. This is **Buffer Memory** - the model sees the complete conversation history. No compression, no clever tricks - just a growing list.

Buffer Memory is honest: it loses nothing and hides nothing. That's exactly why it works well for short code-generation sessions - where every line of the dialogue matters. But the price of that honesty grows linearly. Message #100 drags messages #1-99 along like an anchor.

Pros	Cons
Complete context - nothing is lost	Cost grows linearly: 100 messages x USD 0.002 = USD 0.20 per request
Simplest implementation	Context window overflows on long conversations
Model sees the full conversation flow	Lost in the middle: early messages degrade
Works for short sessions (10-20 messages)	Not suitable for customer support (100+ messages per session)
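Buffer Memory as described above might look like the following. This is a minimal sketch, not a reference implementation: the `BufferMemory` class, its method names, and the conversation id are all hypothetical, and the actual model call is left out so that only the payload is shown.

```typescript
type ChatRole = "system" | "user" | "assistant";

interface ChatMessage {
  role: ChatRole;
  content: string;
}

// Buffer Memory: keep every message per conversation and replay all of them.
// The Map lives in process memory, so this is a prototype-only store.
class BufferMemory {
  private readonly conversations = new Map<string, ChatMessage[]>();
  private readonly systemPrompt: string;

  constructor(systemPrompt: string) {
    this.systemPrompt = systemPrompt;
  }

  add(conversationId: string, message: ChatMessage): void {
    const turns = this.conversations.get(conversationId) ?? [];
    turns.push(message);
    this.conversations.set(conversationId, turns);
  }

  // Payload for the next model call: system prompt plus the entire history.
  buildMessages(conversationId: string): ChatMessage[] {
    const turns = this.conversations.get(conversationId) ?? [];
    return [{ role: "system", content: this.systemPrompt }, ...turns];
  }
}

const bufferMemory = new BufferMemory("You are a helpful assistant.");
bufferMemory.add("chat-1", { role: "user", content: "My name is Ada." });
bufferMemory.add("chat-1", { role: "assistant", content: "Nice to meet you, Ada." });
bufferMemory.add("chat-1", { role: "user", content: "What is my name?" });

// Every request carries all earlier turns, so the payload grows with the chat.
const payload = bufferMemory.buildMessages("chat-1");
```

The array returned by `buildMessages` is what would be passed as `messages` to `chat.completions.create()`. Nothing is trimmed, which is both the strength and the cost of this strategy.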

A Map in Node.js memory is for prototypes only. In production, use PostgreSQL or Redis. When the server restarts, the Map is wiped, and all conversations are lost.

A conversation is 200 messages long. With Buffer Memory, the cost of each new request...

Window Memory: Sliding Window of the Last N Messages

Why pass the entire conversation when the last 10-20 messages contain 90% of the needed context? A **sliding window** is the same idea some long-context transformers use for attention: take a fixed-size window and move it through the text. **Window Memory** keeps only the last N messages (or K tokens) and drops everything outside the frame.
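A window limited by message count can be sketched in a few lines. The helper names `lastNMessages` and `buildWindowedPayload` are hypothetical, and the 50 synthetic turns exist only to make the cut-off visible.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Window Memory by message count: keep only the newest maxMessages turns.
function lastNMessages(turns: ChatMessage[], maxMessages: number): ChatMessage[] {
  if (maxMessages <= 0) return [];
  return turns.slice(-maxMessages);
}

// The system prompt is always sent; only the history is windowed.
function buildWindowedPayload(
  systemPrompt: string,
  turns: ChatMessage[],
  maxMessages: number,
): ChatMessage[] {
  return [{ role: "system", content: systemPrompt }, ...lastNMessages(turns, maxMessages)];
}

// 50 synthetic turns, alternating user and assistant.
const turns: ChatMessage[] = Array.from({ length: 50 }, (_, i): ChatMessage => ({
  role: i % 2 === 0 ? "user" : "assistant",
  content: `message #${i + 1}`,
}));

// With a window of 20, messages #1 to #30 are no longer sent to the model.
const windowed = buildWindowedPayload("You are a support bot.", turns, 20);
```

In this run the payload holds the system prompt followed by messages #31 through #50. Anything the user said in message #5 is simply not in the request.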

A more precise variant - limiting by tokens rather than message count:
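A token-budget window could look like the sketch below. Two assumptions are worth flagging: `estimateTokens` is a rough stand-in (about 4 characters per token for English text) rather than a real tokenizer, and the function names are hypothetical.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Rough estimate, assuming ~4 characters per token for English text.
// A real tokenizer for the target model should replace this in practice.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Walk backwards from the newest message and stop once the budget is spent.
// If the newest message alone exceeds the budget, the result is empty.
function windowByTokens(turns: ChatMessage[], maxTokens: number): ChatMessage[] {
  const kept: ChatMessage[] = [];
  let used = 0;
  for (const turn of [...turns].reverse()) {
    const cost = estimateTokens(turn.content);
    if (used + cost > maxTokens) break;
    kept.unshift(turn); // restore chronological order
    used += cost;
  }
  return kept;
}

const sample: ChatMessage[] = [
  { role: "user", content: "x".repeat(40) },      // ~10 tokens
  { role: "assistant", content: "y".repeat(40) }, // ~10 tokens
  { role: "user", content: "z".repeat(40) },      // ~10 tokens
];

// A 25-token budget fits the two newest messages but not the third.
const keptTurns = windowByTokens(sample, 25);

// History budget as a fraction of the context window: 25% of 128K tokens.
const historyBudget = Math.floor(128_000 * 0.25);
```

Counting tokens instead of messages keeps the window predictable when message sizes vary, for example when one turn contains a pasted log file.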

Window Memory is a good default for most chatbots. Recommendation: set maxTokens to 25% of the model's context window. For GPT-4o (128K), that is 32K tokens for history, leaving the rest for the system prompt, RAG context, and generation.

With Window Memory limited to 20 messages: at message #50, the user references a topic from message #5. What happens?

Summary Memory: LLM as History Compressor

Window Memory silently drops the past. **Summarization memory** takes a different angle: instead of deleting old messages, compress them with an LLM. A cheap model (gpt-4o-mini, `USD 0.15/1M` input tokens) compresses 50 messages into 2-3 paragraphs, preserving key facts. The expensive model gets the compact version and never knows the difference.

Summarization fires on a threshold - that's the key design decision. An extra LLM call after every message would kill latency. But at a 2000-token threshold, it's an infrequent operation, nearly invisible to the user. Summarization errors accumulate like in a game of telephone - which is exactly why critical data needs a different approach.
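The threshold logic can be sketched as follows. Several things here are assumptions made for illustration: the `Summarizer` signature is hypothetical (in practice it would wrap a call to a cheap model), `estimateTokens` is a rough character-based stand-in for a tokenizer, and the demo uses a fake summarizer and a tiny threshold so the behaviour is visible without any API call.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Hypothetical signature: folds older turns into the previous summary.
type Summarizer = (previousSummary: string, turns: ChatMessage[]) => Promise<string>;

// Rough estimate (~4 characters per token), standing in for a real tokenizer.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

class SummaryMemory {
  private summary = "";
  private recent: ChatMessage[] = [];
  private readonly summarize: Summarizer;
  private readonly thresholdTokens: number;
  private readonly keepRecent: number;

  constructor(summarize: Summarizer, thresholdTokens = 2000, keepRecent = 4) {
    this.summarize = summarize;
    this.thresholdTokens = thresholdTokens;
    this.keepRecent = keepRecent;
  }

  async add(message: ChatMessage): Promise<void> {
    this.recent.push(message);
    const total = this.recent.reduce((sum, m) => sum + estimateTokens(m.content), 0);
    if (total <= this.thresholdTokens) return; // below threshold: no extra LLM call
    const cut = this.recent.length - this.keepRecent;
    if (cut <= 0) return;
    // Threshold crossed: fold everything except the newest turns into the summary.
    this.summary = await this.summarize(this.summary, this.recent.slice(0, cut));
    this.recent = this.recent.slice(cut);
  }

  buildMessages(systemPrompt: string): ChatMessage[] {
    const head: ChatMessage[] = [{ role: "system", content: systemPrompt }];
    if (this.summary !== "") {
      head.push({ role: "system", content: `Summary of earlier conversation: ${this.summary}` });
    }
    return [...head, ...this.recent];
  }
}

// Demo with a fake summarizer that only records how many turns it received.
async function demoSummaryMemory(): Promise<{ payload: ChatMessage[]; summarizerCalls: number }> {
  let summarizerCalls = 0;
  const fakeSummarizer: Summarizer = async (previous, turns) => {
    summarizerCalls += 1;
    return `${previous}[${turns.length} turns]`;
  };
  const summaryMemory = new SummaryMemory(fakeSummarizer, 50, 2);
  for (let i = 0; i < 10; i++) {
    // Each message is 40 characters, roughly 10 tokens under the estimate above.
    await summaryMemory.add({ role: i % 2 === 0 ? "user" : "assistant", content: "x".repeat(40) });
  }
  return { payload: summaryMemory.buildMessages("You are a coach."), summarizerCalls };
}
```

With a 50-token threshold and ten 10-token messages, the summarizer runs twice rather than ten times. Note that each run receives the previous summary as input, which is exactly where the 'telephone game' drift comes from.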

Pros	Cons
Fixed cost - summary doesn't grow indefinitely	Loss of detail during summarization
Key facts are preserved	Extra LLM call for summarization (+latency, +cost)
Suitable for long conversations (100+ messages)	Summarization errors accumulate ('telephone game')
Controlled context size	Harder to debug - which facts were lost?

Summarization loses nuance. The user said 'budget is roughly `USD 50K`, but could be `USD 70K` if marketing is included.' Summary: 'budget `USD 50-70K`.' The context 'if marketing is included' is gone. For critical data, vector memory is a better fit.

Summary Memory summarizes old messages when...

Vector Memory: Semantic Search Over History

Buffer drowns in cost. Window loses the past. Summary loses the details. **Episodic (vector) memory** takes a fundamentally different approach: each message is stored as an embedding via `text-embedding-3-small` (1536 dims, `USD 0.02/1M` tokens), and with every request an HNSW index finds **semantically close** fragments from the entire history in a few milliseconds. RAG, but over the conversation itself.

This principle is the foundation of **MemGPT** (now Letta) - a system where the LLM manages its own memory: deciding what to write to long-term storage, what to retrieve, and what to forget. Entity extraction on top of the vector layer is another tier: named entities (names, companies, tasks) are automatically pulled from conversations and stored separately. A user mentions 'our startup TokenFlow' - the next session already has the context.
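The retrieval step can be illustrated without a vector database. The sketch below swaps the HNSW index for a brute-force cosine-similarity scan, which returns the same kind of result on small histories but does not scale the same way. The `VectorMemory` class is hypothetical, and the 3-dimensional embeddings are hand-made toy values; real ones would come from an embedding model.

```typescript
interface StoredTurn {
  text: string;
  embedding: number[];
}

interface SearchHit {
  text: string;
  score: number;
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  const length = Math.min(a.length, b.length);
  for (let i = 0; i < length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  const denominator = Math.sqrt(normA) * Math.sqrt(normB);
  return denominator === 0 ? 0 : dot / denominator;
}

// Brute-force scan over all stored turns; an HNSW index would replace this at scale.
class VectorMemory {
  private readonly stored: StoredTurn[] = [];

  // The caller computes the embedding (for example with an embedding model).
  add(text: string, embedding: number[]): void {
    this.stored.push({ text, embedding });
  }

  // Returns the k most similar turns, best match first.
  search(queryEmbedding: number[], k: number): SearchHit[] {
    return this.stored
      .map((turn) => ({ text: turn.text, score: cosineSimilarity(queryEmbedding, turn.embedding) }))
      .sort((x, y) => y.score - x.score)
      .slice(0, k);
  }
}

const vectorMemory = new VectorMemory();
vectorMemory.add("Our startup is called TokenFlow.", [0.9, 0.1, 0.0]);
vectorMemory.add("I prefer dark mode in the dashboard.", [0.0, 0.9, 0.1]);
vectorMemory.add("The budget is roughly USD 50K.", [0.1, 0.0, 0.9]);

// Toy query vector pointing mostly along the first axis ("company" in this toy space).
const hits = vectorMemory.search([0.8, 0.2, 0.1], 1);
```

The single hit is the TokenFlow message, regardless of how long ago it was stored. Results come back ranked by similarity rather than by time, which is the loss of sequentiality noted in the table below.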

Pros	Cons
Finds relevant context from any point in history	Embedding call for every message (+latency, +cost)
Scales to 1000+ messages	Loses sequentiality - messages are pulled out of order
Works like RAG over conversation history	More complex implementation - needs a vector DB
Fixed context size	Not suitable for step-by-step instructions (order matters)

When processing a new message, Vector Memory...

Hybrid Memory and PostgreSQL Persistence

Every memory type covers one angle and misses others. The production solution is **hybrid**: Window provides immediate context, Summary holds compressed history, Vector pulls semantically relevant content from any point in time. Three layers, one request. This is exactly how enterprise AI assistants work - and why they don't forget what a client said three months ago.
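One way to picture 'three layers, one request' is a function that merges the layers into a single payload. This is a sketch under assumptions: the `HybridInputs` shape, the `buildHybridPayload` name, and the wording of the injected system messages are all hypothetical, and each layer is assumed to have been computed already.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface HybridInputs {
  systemPrompt: string;
  summary: string;       // Summary layer: compressed older history
  retrieved: string[];   // Vector layer: snippets relevant to the new message
  recent: ChatMessage[]; // Window layer: last N turns, in order
  userMessage: string;
}

function buildHybridPayload(inputs: HybridInputs): ChatMessage[] {
  const payload: ChatMessage[] = [{ role: "system", content: inputs.systemPrompt }];

  if (inputs.summary.trim() !== "") {
    payload.push({ role: "system", content: `Conversation summary:\n${inputs.summary}` });
  }

  // Skip retrieved snippets that the window already contains verbatim.
  const novel = inputs.retrieved.filter(
    (snippet) => !inputs.recent.some((turn) => turn.content === snippet),
  );
  if (novel.length > 0) {
    const bullets = novel.map((snippet) => `- ${snippet}`).join("\n");
    payload.push({ role: "system", content: `Relevant earlier messages:\n${bullets}` });
  }

  payload.push(...inputs.recent, { role: "user", content: inputs.userMessage });
  return payload;
}

const hybridPayload = buildHybridPayload({
  systemPrompt: "You are an account assistant.",
  summary: "The user runs a startup called TokenFlow.",
  retrieved: ["The budget is roughly USD 50K.", "Can we ship by Friday?"],
  recent: [
    { role: "user", content: "Can we ship by Friday?" },
    { role: "assistant", content: "Friday is feasible." },
  ],
  userMessage: "Remind me of the budget.",
});
```

The recent turns stay in chronological order at the end of the payload, while the summary and retrieved snippets sit near the top as background. Dropping duplicates between the vector and window layers avoids paying for the same tokens twice.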

PostgreSQL Schema for Production
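A schema for the three layers might look like the sketch below, kept as TypeScript constants so a migration script could run the statements in order. Everything here is an assumption made for illustration: the table, column, and index names are hypothetical, the pgvector extension is assumed to be available, and the 1536-dimensional column matches `text-embedding-3-small`.

```typescript
// Hypothetical schema; names are illustrative and not taken from a real system.
// Assumes PostgreSQL with the pgvector extension installed.
const memorySchema: string[] = [
  `CREATE EXTENSION IF NOT EXISTS vector`,

  // One row per conversation; the rolling summary lives here.
  `CREATE TABLE IF NOT EXISTS conversations (
     id          UUID PRIMARY KEY,
     user_id     TEXT NOT NULL,
     summary     TEXT NOT NULL DEFAULT '',
     created_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
     updated_at  TIMESTAMPTZ NOT NULL DEFAULT now()
   )`,

  // One row per message; embedding is nullable so it can be filled in later.
  `CREATE TABLE IF NOT EXISTS messages (
     id               BIGSERIAL PRIMARY KEY,
     conversation_id  UUID NOT NULL REFERENCES conversations(id) ON DELETE CASCADE,
     role             TEXT NOT NULL CHECK (role IN ('system', 'user', 'assistant')),
     content          TEXT NOT NULL,
     token_count      INTEGER NOT NULL,
     embedding        vector(1536),
     created_at       TIMESTAMPTZ NOT NULL DEFAULT now()
   )`,

  // Window layer: fetch the newest messages of a conversation quickly.
  `CREATE INDEX IF NOT EXISTS messages_window_idx
     ON messages (conversation_id, created_at DESC)`,

  // Vector layer: approximate nearest-neighbour search by cosine distance.
  `CREATE INDEX IF NOT EXISTS messages_embedding_idx
     ON messages USING hnsw (embedding vector_cosine_ops)`,
];

// Window query: the last N messages of one conversation, newest first.
const windowQuery = `SELECT role, content FROM messages
  WHERE conversation_id = $1
  ORDER BY created_at DESC
  LIMIT $2`;

// Recall query: the K messages closest to a query embedding (cosine distance).
const recallQuery = `SELECT content FROM messages
  WHERE conversation_id = $1 AND embedding IS NOT NULL
  ORDER BY embedding <=> $2::vector
  LIMIT $3`;
```

Keeping the summary on the conversation row and the embeddings on the message rows lets one database serve all three layers. Whether that is preferable to a separate vector store depends on scale and is a design choice, not a rule.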

Comparison of all memory strategies - when to use each:

Strategy	Conversation length	Cost	Accuracy	Scenario
Buffer	< 20 messages	$$$	100%	Short tasks: code generation, translation
Window (last N)	Any	$	70-80%	Customer support, casual chat
Summary	20-200 messages	$$	80-85%	Consulting, coaching, long sessions
Vector (episodic)	Any	$$	85-90% for relevant	Technical support, knowledge workers
Hybrid	Any	$$$	90-95%	Production chatbots, AI assistants

Start with Window Memory (simple and cheap). If users complain about 'forgetfulness', add Summary. If precision for specific facts is needed, add Vector. Hybrid is for production AI assistants. Redis is ideal for session storage: O(1) access to recent messages, TTL for stale sessions, pub/sub for streaming.

Myth: Chatbot memory is about preserving the full message history. The longer the buffer, the smarter the bot.

Reality: Memory is about managing the trade-off between context window, cost, and relevance. A hybrid setup (Window + Summary + Vector) almost always yields better quality at lower cost than simply growing the buffer.

The myth borrows its intuition from human memory: remember more, understand better. LLMs do not work that way: extra messages in context dilute attention, add noise to retrieval, and grow cost linearly. Past 20-30 turns a raw buffer starts losing to a summary strategy on answer accuracy, not just on price.

Hybrid Memory combines three context sources. Which one is responsible for 'the model remembers a fact from message #5 when processing message #150'?

Key Takeaways

LLM API is stateless - every call starts from scratch, 'memory' is the backend's responsibility
Buffer Memory: full history, ideal for short tasks, expensive and slow for long conversations
Window (sliding) Memory: last N messages - a sensible default for 80% of chatbots
Summary Memory: gpt-4o-mini compresses history on a threshold - preserves the essence, loses the nuance
Vector (episodic) Memory: text-embedding-3-small + HNSW = RAG over conversation history, entity extraction on top
Hybrid (Window + Summary + Vector) - production standard; Redis for session storage, PostgreSQL with pgvector for persistence

Questions for reflection

In what chatbot scenario is Buffer Memory justified even at 50+ messages - and what does that say about the task?
How does entity extraction on top of vector memory change response quality compared to raw message vectors?
MemGPT lets the LLM decide what to remember. What are the failure modes of that architecture?

What's Next

Memory lets the chatbot remember context. The next step is giving it the ability to act: call functions, access APIs, perform tasks.

Tool Calling — How LLMs call functions - function calling, tool use, structured actions
Agent Fundamentals — From chatbot to agent - planning, reasoning, tool use in a loop
Caching & Optimization — How to cache responses and reduce costs for memory-heavy chatbots

Related lessons

aie-04-tokens-context-window — Context window limit is why memory exists
aie-09-embeddings — Vector memory uses similarity search over embeddings
aie-16-tool-calling — Memory + tools = fully persistent agent
aie-17-agent-fundamentals — Long-term memory is a key agent component
aut-07-attention-memory — Working vs long-term memory - same architectural tradeoff
aie-12-rag-fundamentals — Vector memory is RAG applied to conversation history
prob-17
db-19-redis
