AI Engineering
Chunking: How to Properly Split Documents for RAG
Lesson Objectives
- Understand the impact of chunking on RAG quality and choose the right strategy
- Implement fixed-size and recursive character splitting
- Master semantic chunking based on embeddings
- Apply document-aware chunking for Markdown, code, and FAQs
- Set up an evaluation pipeline to compare chunking strategies
Prerequisites
- RAG pipeline, embeddings, vector search
Chunking is the most boring name for the most important decision in RAG. Chunk size determines precision, recall, and the cost of indexing, and changing it later means re-indexing the entire knowledge base from scratch. Consider two otherwise identical RAG pipelines with the same LLM and the same pgvector store: the first reaches 52% accuracy, the second 84%. The only difference is the chunking strategy, and it is a free optimization.
- LangChain RecursiveCharacterTextSplitter - the most widely used chunker in production (100K+ projects): splits by \n\n, \n, periods - hierarchically
- Pinecone: semantic chunking delivers +30% retrieval precision for enterprise RAG with long heterogeneous documents
- GitHub Copilot chunks source code by functions and classes - whole semantic units, not character counts
- RAPTOR (Stanford, 2024): hierarchical chunking - small chunks first, then summarized into larger ones - best recall on long documents
Chunking emerged with RAG
Splitting text into smaller pieces for retrieval is an old idea in information retrieval, but chunking as a named practice grew out of the RAG era after 2020, when retrieval was bolted onto generative models. The tooling that made it routine arrived in late 2022: Harrison Chase released **LangChain** in October 2022, and its `RecursiveCharacterTextSplitter` became the de facto default, splitting on paragraph, line, and sentence boundaries in order. Around the same time, in November 2022, Jerry Liu released **GPT Index**, later renamed **LlamaIndex**, which popularized node-based and document-aware splitting. Semantic chunking (cutting where embedding similarity drops) and hierarchical methods like RAPTOR (Stanford, 2024) came later, as teams learned that chunk boundaries quietly cap retrieval quality.
Why Chunking Defines RAG Quality
An embedding model takes text and returns a single vector. Feed it an entire book - the vector averages across all topics and becomes useless for retrieval. Feed it a single sentence - context evaporates. The goal of chunking is to find the **right fragment size** that an embedding model can represent accurately.
text-embedding-3-small (USD 0.02 per 1M tokens) is optimal for 256-512 tokens. Give it 1500 tokens and vector quality degrades. That is why chunk size is an engineering decision, not a default setting.
| Chunk size | Pros | Cons |
|---|---|---|
| Small (100-200 tokens) | Precise retrieval, each chunk = one idea | Loss of context, fragment is ripped from surroundings |
| Medium (300-600 tokens) | Balance of precision and context | May cut a thought in the middle |
| Large (800-1500 tokens) | Full context, self-contained fragment | Noise - chunk contains a lot of irrelevant text |
Rule of thumb: chunk size = 1/5 of the model's context window, but no more than 1000 tokens. For GPT-4o (128K context) that is 512-1000 tokens. For models with 4K context - 200-400 tokens.
Three factors that determine optimal chunk size:
- **Content type** - code requires larger chunks (entire function), FAQs require smaller ones (single Q&A pair)
- **Embedding model** - text-embedding-3-small is optimal for 256-512 tokens, larger chunks degrade quality
- **Query type** - specific questions call for small chunks, analytical queries for large ones
Chunk size of 100 tokens for technical documentation. A user asks 'how to set up Redis caching'. What is the most likely problem?
Fixed-size and Recursive Character Splitting
Fixed-size Chunking
The simplest approach: split text into fixed-size fragments with overlap. 512 tokens, 10-20% overlap - these are standard starting parameters. Overlap prevents meaning from getting lost at chunk boundaries.
Fixed-size chunking cuts sentences in half. "The server runs on port" | "3000 and accepts WebSocket connections" - two meaningless chunks. For production, a smarter approach is needed.
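A minimal sketch of fixed-size chunking with overlap, to make the mechanics concrete. It assumes whitespace-separated words as a rough stand-in for real tokens; the function name `fixed_size_chunks` and its parameters are illustrative, not taken from any library.

```python
def fixed_size_chunks(tokens: list[str], chunk_size: int = 512, overlap: int = 64) -> list[list[str]]:
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


# Assumption: whitespace "tokens" approximate what a real tokenizer would produce.
text = "The server runs on port 3000 and accepts WebSocket connections"
for chunk in fixed_size_chunks(text.split(), chunk_size=5, overlap=1):
    print(" ".join(chunk))
```

With a one-token overlap the word "port" appears at the end of the first chunk and the start of the second, which is how overlap keeps a boundary fact attached to at least one neighbor. The cut itself still ignores sentence boundaries.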
Recursive Character Splitting
The industry standard is LangChain's `RecursiveCharacterTextSplitter`, used in 100K+ production projects. The idea: first try splitting on the largest boundary, the paragraph break (\n\n); if a chunk is still too big, split on line breaks (\n), then on sentences (. ), then on words. Each next level is a less clean cut, but still better than slicing through a phrase.
Recommended starting parameters: **chunkSize = 512**, **overlap = 50-100 tokens** (10-20% of chunk size). For code - overlap by whole functions, not by characters.
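The separator hierarchy can be sketched in plain Python. This is an illustration of the idea only, not LangChain's implementation: it measures size in characters rather than tokens, adds no overlap, and the name `recursive_split` and its defaults are assumptions made for the example.

```python
def recursive_split(text: str, max_chars: int = 200,
                    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " ")) -> list[str]:
    """Split on the coarsest separator first, falling back to finer ones for oversized pieces."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # No separator left: hard cut every max_chars characters.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    sep, finer = separators[0], separators[1:]
    chunks: list[str] = []
    buffer = ""
    for piece in text.split(sep):
        candidate = buffer + sep + piece if buffer else piece
        if len(candidate) <= max_chars:
            buffer = candidate  # keep merging small pieces into one chunk
            continue
        if buffer:
            chunks.append(buffer)
            buffer = ""
        if len(piece) <= max_chars:
            buffer = piece
        else:
            # Still too big: retry this piece with the next, finer separator.
            chunks.extend(recursive_split(piece, max_chars, finer))
    if buffer:
        chunks.append(buffer)
    return [c for c in chunks if c.strip()]


sample = "Para one sentence A. Para one sentence B.\n\nPara two is here.\n\n" + "word " * 100
for chunk in recursive_split(sample, max_chars=50):
    print(repr(chunk))
```

Short paragraphs survive as whole chunks, and only the oversized third paragraph is broken down further, in this case at word level because it contains no line or sentence breaks. Note that the separator is dropped wherever a cut lands.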
Recursive character splitting tries separators in order: \n\n → \n → . → space. Why this hierarchy?
Semantic Chunking: Splitting by Meaning
Recursive splitting is honest about one thing: it does not understand meaning. It just respects syntax. But a paragraph can contain two different topics - and conversely, two paragraphs can be about the same thing. **Semantic chunking** finds boundaries where meaning shifts, not where line breaks happen to appear.
The algorithm: split the text into sentences, compute an embedding for each via text-embedding-3-small, then compare adjacent sentences and look for sharp drops in cosine similarity, which is where the topic switches. Those drops mark the semantic boundaries for new chunks.
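The algorithm can be sketched as follows. To stay runnable without an API key, the example uses a placeholder embedding (`toy_embed`, plain word counts) instead of text-embedding-3-small, and a fixed similarity threshold of 0.2; both are assumptions for illustration, and a real pipeline would pass its own `embed` function and tune the threshold on its data.

```python
import math
import re
from collections import Counter
from typing import Callable

Vector = dict[str, float]


def toy_embed(sentence: str) -> Vector:
    """Placeholder embedding: lowercase word counts. Swap in a real embedding model."""
    return dict(Counter(re.findall(r"[a-z0-9]+", sentence.lower())))


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_chunks(text: str, embed: Callable[[str], Vector] = toy_embed,
                    threshold: float = 0.2) -> list[str]:
    """Start a new chunk wherever similarity between adjacent sentences falls below threshold."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    groups = [[sentences[0]]]
    for prev, cur, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            groups.append([sentence])   # similarity dropped: topic boundary
        else:
            groups[-1].append(sentence)
    return [" ".join(group) for group in groups]


doc = ("Redis caches hot keys in memory. Redis evicts keys when memory is full. "
       "Vacation requests go to HR. HR approves vacation requests within a week.")
for chunk in semantic_chunks(doc):
    print(chunk)
```

The two Redis sentences share vocabulary and stay together, while the switch to the vacation topic produces a similarity of zero under the toy embedding and opens a new chunk.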
| Method | Cost | Boundary quality | When to use |
|---|---|---|---|
| Fixed-size | Free | Poor - cuts in the middle | Prototypes, very uniform texts |
| Recursive | Free | Good - by syntactic boundaries | Production default, 80% of cases |
| Semantic | USD 0.01-0.1 per document (embedding calls) | Excellent - by semantic boundaries | Long texts with topic switches |
Semantic chunking needs an embedding for every sentence: a document with 1000 sentences means 1000 embeddings at USD 0.02 per 1M tokens. For large corpora, batch processing and caching are a must.
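One way to combine batching and caching is sketched below. The `embed_batch` callable is hypothetical: it stands for whatever client function sends a list of texts and returns one vector per text. The cache here is an in-memory dict keyed by a content hash; a persistent store would follow the same shape.

```python
import hashlib
from typing import Callable, Sequence


def embed_with_cache(sentences: Sequence[str],
                     embed_batch: Callable[[list[str]], list[list[float]]],
                     cache: dict[str, list[float]],
                     batch_size: int = 100) -> list[list[float]]:
    """Embed sentences in batches, skipping any text whose vector is already cached."""
    keys = [hashlib.sha256(s.encode("utf-8")).hexdigest() for s in sentences]

    # Collect each uncached sentence once, even if it appears several times.
    missing: dict[str, str] = {}
    for key, sentence in zip(keys, sentences):
        if key not in cache and key not in missing:
            missing[key] = sentence

    items = list(missing.items())
    for i in range(0, len(items), batch_size):
        batch = items[i:i + batch_size]
        vectors = embed_batch([sentence for _, sentence in batch])  # one request per batch
        for (key, _), vector in zip(batch, vectors):
            cache[key] = vector

    return [cache[key] for key in keys]


# Stand-in embedder for the demo: records batch sizes and returns 1-d vectors.
calls: list[int] = []

def fake_embed_batch(batch: list[str]) -> list[list[float]]:
    calls.append(len(batch))
    return [[float(len(text))] for text in batch]

cache: dict[str, list[float]] = {}
embed_with_cache(["a", "bb", "a", "ccc"], fake_embed_batch, cache, batch_size=2)
embed_with_cache(["a", "bb"], fake_embed_batch, cache, batch_size=2)  # served from cache
print(calls)
```

Re-indexing after a small document edit then only pays for the sentences that actually changed.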
Semantic chunking determines chunk boundaries by...
Document-aware Chunking and Metadata Enrichment
Documents are not plain text. Markdown has headings, code has functions, HTML has sections. A recursive splitter is blind to all of that - it just counts characters. **Document-aware chunking** reads document structure and respects it.
GitHub Copilot does not slice source code into 512-character blocks. It chunks by functions and classes - whole semantic units. Notion AI uses the block structure of a document as natural boundaries. That is the document-aware approach in production.
Markdown-aware Chunking
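A minimal sketch of Markdown-aware chunking, assuming ATX-style headings (`#` to `######`) and backtick code fences. Each chunk keeps the trail of headings above it, and lines inside a fenced code block are never treated as headings. The function name and the output shape are illustrative choices; this sketch does not enforce a maximum chunk size.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # built indirectly so this example can sit inside a fenced block


def markdown_chunks(markdown: str) -> list[dict]:
    """Split Markdown at headings; each chunk records the heading trail it sits under."""
    chunks: list[dict] = []
    path: list[tuple[int, str]] = []  # (level, title) of the currently open headings
    body: list[str] = []
    in_fence = False

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            chunks.append({"headings": [title for _, title in path], "text": text})
        body.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()  # close the previous section under its own heading trail
            level = len(match.group(1))
            while path and path[-1][0] >= level:
                path.pop()  # leave sibling or deeper sections
            path.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return chunks


doc = "# Guide\nIntro text.\n## Caching\nUse Redis.\n## Auth\nUse tokens."
for chunk in markdown_chunks(doc):
    print(" > ".join(chunk["headings"]), "|", chunk["text"])
```

A section that exceeds the target size can then be passed through a recursive splitter, so heading boundaries are respected first and size limits second.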
Metadata Enrichment
A chunk without metadata is like a book without a table of contents. The embedding of "Every employee is entitled to 28 days" will be far from the query "vacation policy". Prepend "Document: HR Policy, Section: Vacation" before embedding - and the vector becomes accurate. A free optimization that gets skipped in 90% of projects.
**Prepending metadata** to text before embedding is a proven technique. The embedding for "Document: HR Policy, Section: Vacation. Every employee has..." will be more relevant to the query "vacation policy" than the embedding of bare text.
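A small sketch of the prepending step, using the "Document: ..., Section: ..." format from the example above. The helper name `with_metadata` and the record layout are assumptions for illustration.

```python
from typing import Optional


def with_metadata(chunk_text: str, document: str, section: Optional[str] = None) -> str:
    """Build the string that gets embedded: metadata prefix followed by the chunk text."""
    parts = [f"Document: {document}"]
    if section:
        parts.append(f"Section: {section}")
    return ", ".join(parts) + ". " + chunk_text


raw = "Every employee is entitled to 28 days"
record = {
    "text": raw,                                                   # shown to the LLM and the user
    "embedding_input": with_metadata(raw, "HR Policy", "Vacation"),  # sent to the embedding model
}
print(record["embedding_input"])
```

Keeping the raw text and the embedding input as separate fields is a design choice: the prefix improves the vector, while the stored text stays clean for display and generation.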
Why add metadata (document title, section) to the chunk text BEFORE embedding?
Tuning and Evaluation: How to Find the Optimal Strategy
There is no perfect chunk size. The optimal strategy depends on data, queries, and model. The only way to find the optimum is **empirical testing** with metrics.
A/B Testing Chunking Strategies
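A minimal sketch of comparing two chunking strategies on a golden dataset with recall@k and MRR. Several things here are simplifying assumptions: the retriever ranks chunks by word overlap as a stand-in for vector search, a chunk counts as a hit when it contains the expected fragment verbatim, and MRR is computed only over the top k results.

```python
import re


def words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, chunks: list[str], k: int) -> list[str]:
    """Toy retriever: rank chunks by word overlap with the question."""
    q = words(question)
    return sorted(chunks, key=lambda c: len(q & words(c)), reverse=True)[:k]


def evaluate(chunks: list[str], golden: list[tuple[str, str]], k: int = 3) -> dict[str, float]:
    """Return recall@k and MRR for (question, expected fragment) pairs."""
    if not golden:
        return {"recall@k": 0.0, "mrr": 0.0}
    hits, reciprocal_ranks = 0, 0.0
    for question, expected in golden:
        results = retrieve(question, chunks, k)
        rank = next((i for i, c in enumerate(results, start=1) if expected in c), None)
        if rank is not None:
            hits += 1
            reciprocal_ranks += 1.0 / rank
    return {"recall@k": hits / len(golden), "mrr": reciprocal_ranks / len(golden)}


golden = [
    ("How is Redis caching configured?", "cache.yaml"),
    ("Which port does the server use?", "port 3000"),
]
by_sentence = [
    "Redis caching is configured in cache.yaml.",
    "The server runs on port 3000 and accepts WebSocket connections.",
]
blind_cut = [
    "Redis caching is configured in cache.yaml. The server runs on port",
    "3000 and accepts WebSocket connections.",
]
for name, chunks in [("by_sentence", by_sentence), ("blind_cut", blind_cut)]:
    print(name, evaluate(chunks, golden, k=1))
```

The blind cut separates "port" from "3000", so no single chunk can answer the second question and its score falls. Verbatim matching favors very large chunks, so a real evaluation would usually score against annotated fragment IDs and also track precision.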
Recommendations by Content Type
| Content type | Strategy | Chunk size | Overlap |
|---|---|---|---|
| Technical documentation | Markdown-aware | 500-800 tokens | 50-100 tokens |
| FAQ / Knowledge Base | By Q&A pair (each Q&A = chunk) | Varies | None |
| Source code | By functions / classes | Entire function | Signature + docstring |
| Legal documents | By paragraph + semantic | 300-500 tokens | 100 tokens |
| Chat logs / tickets | By message or thread | Entire thread | None |
| Scientific papers | By section (Abstract, Methods...) | 600-1000 tokens | 100 tokens |
Overlap: How Much Is Enough?
Overlap solves the problem of information loss at chunk boundaries. Rule: **10-20% of chunk size**. Too much overlap creates duplication (one idea in three chunks), too little loses context.
A golden dataset for evaluating chunking consists of 50-100 questions with correct answers and references to specific documentation fragments. Creating a golden dataset is an investment that pays off with every iteration of RAG improvement.
Myth: there is a universally best chunk size, 512 tokens with 64 overlap, and those numbers can be lifted from a tutorial and never touched again.
Reality: optimal chunk size depends on content type, query type, and embedding model. The 512/64 figures are a reasonable starting point, not a final answer. Measurement on a golden dataset of 50-100 question-answer pairs typically shifts the optimum by 30-50%.
Tutorials pin a single config because they have to pick one. Students treat it as the optimum because they lack a metric to challenge it. In practice recall@k and MRR drop by 15-30 points when the same 512-token preset is applied to source code or legal documents.
A document contains 10,000 tokens. Fixed-size chunking is used: chunk size = 500 tokens, overlap = 100 tokens. How many chunks will be created?
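The count can be worked out from the window step. The sketch below assumes the window advances by chunk size minus overlap and that a final partial chunk is kept; `count_chunks` is a name chosen for the example.

```python
import math


def count_chunks(total_tokens: int, chunk_size: int, overlap: int) -> int:
    """Number of fixed-size chunks when each window starts (chunk_size - overlap) after the last."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if total_tokens <= 0:
        return 0
    if total_tokens <= chunk_size:
        return 1
    step = chunk_size - overlap
    # One chunk covers the first chunk_size tokens; each further chunk adds `step` new tokens.
    return math.ceil((total_tokens - chunk_size) / step) + 1


print(count_chunks(10_000, 500, 100))  # 25
```

With a step of 400 tokens, windows start at 0, 400, ..., 9600, and the last one holds the remaining 400 tokens. Without overlap the same document would yield 20 chunks, so 100 tokens of overlap costs 25% more chunks to embed and store.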
Myth: larger chunk size = more context = better RAG results.
Reality: a large chunk reduces precision: the embedding vector averages across all topics, and the relevant fragment drowns in noise.
text-embedding-3-small is optimal for 256-512 tokens - confirmed by MTEB benchmarks. At 1500 tokens, embedding quality degrades: the model cannot accurately represent all content in a single vector. Retrieval finds the chunk, but the LLM receives 1400 irrelevant tokens alongside the 100 it actually needed - precision drops, generation cost rises.
Key Takeaways
- Chunk size is the most impactful and cheapest decision in RAG - it drives precision, recall, and cost
- Fixed-size (512 tokens, 10-20% overlap) is the prototype starting point. Recursive character splitting is the production default
- Semantic chunking: boundaries at drops in cosine similarity between embeddings of adjacent sentences
- Document-aware: Markdown by headings, code by functions - exactly how GitHub Copilot and Notion AI work
- Metadata prepending before embedding is a free retrieval accuracy boost skipped in 90% of projects
- Bigger chunk does not mean better: embedding degrades, precision drops, LLM cost increases
Questions for Reflection
- Which chunking strategy fits a corpus of 10,000 PDF contracts of 200 pages each, and why would a fixed 512-token split give low recall here?
- What happens to retrieval quality when overlap is raised from 20% to 50% - does boundary fact precision improve, or does duplication hurt precision overall?
- How is a golden dataset for chunking evaluation built without pre-existing correct answers: which properties should the questions have, and who annotates them?
What's Next
Documents are chunked, the RAG pipeline is working. But a real chatbot needs one more piece - memory of previous messages.
- Conversation Memory — Buffer, summary, vector memory - how LLMs remember conversation context
- Advanced RAG — Hybrid search and re-ranking to improve retrieval on top of chunking
- Document Processing — How to extract text from PDF, HTML, DOCX before chunking
Related Lessons
- aie-11-document-processing — Document parsing precedes chunking
- aie-12-rag-fundamentals — Chunking is the first step of any RAG pipeline
- aie-13-advanced-rag — Advanced RAG breaks without proper chunking
- aie-09-embeddings — Chunk size directly affects embedding information density
- calc-16-taylor — Both methods seek local approximations of a large object
- db-26-caching