AI Engineering

Chunking: How to Properly Split Documents for RAG

Lesson Objectives

Understand the impact of chunking on RAG quality and choose the right strategy
Implement fixed-size and recursive character splitting
Master semantic chunking based on embeddings
Apply document-aware chunking for Markdown, code, and FAQs
Set up an evaluation pipeline to compare chunking strategies

Prerequisites

RAG pipeline, embeddings, vector search

RAG Fundamentals

Chunking is the most boring name for the most important decision in RAG. Chunk size determines precision, recall, and the cost of indexing. A wrong chunk size means the entire knowledge base gets re-indexed from scratch. Two identical RAG pipelines, same LLM, same pgvector. First one - 52% accuracy. Second - 84%. The only difference: chunking strategy. And it is a free optimization.

LangChain RecursiveCharacterTextSplitter - the most widely used chunker in production (100K+ projects): splits by \n\n, \n, periods - hierarchically
Pinecone: semantic chunking delivers +30% retrieval precision for enterprise RAG with long heterogeneous documents
GitHub Copilot chunks source code by functions and classes - whole semantic units, not character counts
RAPTOR (Stanford, 2024): hierarchical chunking - small chunks first, then summarized into larger ones - best recall on long documents

Chunking emerged with RAG

Splitting text into smaller pieces for retrieval is an old idea in information retrieval, but chunking as a named practice grew out of the RAG era after 2020, when retrieval was bolted onto generative models. The tooling that made it routine arrived in late 2022: Harrison Chase released **LangChain** in October 2022, and its `RecursiveCharacterTextSplitter` became the de facto default, splitting on progressively finer boundaries (paragraphs first, then lines, then smaller units). Jerry Liu released **GPT Index** in November 2022; it was later renamed **LlamaIndex** and popularized node-based and document-aware splitting. Semantic chunking - cutting where embedding similarity drops - and hierarchical methods like RAPTOR (Stanford, 2024) came later, as teams learned that chunk boundaries quietly cap retrieval quality.

Why Chunking Defines RAG Quality

An embedding model takes text and returns a single vector. Feed it an entire book - the vector averages across all topics and becomes useless for retrieval. Feed it a single sentence - context evaporates. The goal of chunking is to find the **right fragment size** that an embedding model can represent accurately.

text-embedding-3-small (`USD 0.02/1M` tokens) is optimal for 256-512 tokens. Give it 1500 tokens and vector quality degrades. That is why chunk size is an engineering decision, not a default setting.

Chunk size	Pros	Cons
Small (100-200 tokens)	Precise retrieval, each chunk = one idea	Loss of context, fragment is ripped from surroundings
Medium (300-600 tokens)	Balance of precision and context	May cut a thought in the middle
Large (800-1500 tokens)	Full context, self-contained fragment	Noise - chunk contains a lot of irrelevant text

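Rule of thumb: chunk size = 1/5 of the model's context window, but no more than 1000 tokens. For GPT-4o (128K context) the 1000-token cap is what binds, so the practical range is 512-1000 tokens. For models with a 4K context, 1/5 gives an upper bound of about 800 tokens; 200-400 tokens is the safer range, because several retrieved chunks have to share the window with the prompt and the answer.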

Three factors that determine optimal chunk size:

**Content type** - code requires larger chunks (entire function), FAQs require smaller ones (single Q&A pair)
**Embedding model** - text-embedding-3-small is optimal for 256-512 tokens, larger chunks degrade quality
**Query type** - specific questions call for small chunks, analytical queries for large ones

Chunk size of 100 tokens for technical documentation. A user asks 'how to set up Redis caching'. What is the most likely problem?

Fixed-size and Recursive Character Splitting

Fixed-size Chunking

The simplest approach: split text into fixed-size fragments with overlap. 512 tokens, 10-20% overlap - these are standard starting parameters. Overlap prevents meaning from getting lost at chunk boundaries.
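
The sliding window described above can be sketched in a few lines. This is a minimal illustration, not a library implementation: the function name `fixed_size_chunks` is hypothetical, and whitespace-separated words stand in for tokens (a real tokenizer would produce different counts).

```python
def fixed_size_chunks(tokens, chunk_size=512, overlap=64):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


text = "The server runs on port 3000 and accepts WebSocket connections"
words = text.split()  # assumption: words approximate tokens
chunks = fixed_size_chunks(words, chunk_size=5, overlap=1)
for chunk in chunks:
    print(" ".join(chunk))
```

With a window of 5 words and an overlap of 1, the word "port" ends the first chunk and starts the second, which is exactly what overlap is for: the boundary token appears on both sides of the cut.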

Fixed-size chunking cuts sentences in half. "The server runs on port" | "3000 and accepts WebSocket connections" - two meaningless chunks. For production, a smarter approach is needed.

Recursive Character Splitting

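The industry standard - LangChain RecursiveCharacterTextSplitter, used in 100K+ production projects. The idea: first try splitting at the largest boundaries (\n\n - paragraph boundary); if a piece is still too big, split it by line breaks (\n), then by sentences (. ), then by words. Each next level is a less clean cut, but still better than slicing through a phrase. Note that the library's default separator list is \n\n, \n, space, and the empty string; the sentence separator (. ) has to be added to the list explicitly.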

Recommended starting parameters: **chunkSize = 512**, **overlap = 50-100 tokens** (10-20% of chunk size). For code - overlap by whole functions, not by characters.
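
The separator hierarchy can be illustrated with a simplified pure-Python sketch. It is not LangChain's implementation: `recursive_split` and `SEPARATORS` are hypothetical names, sizes are measured in characters rather than tokens, overlap is left out, and the separator is dropped wherever a cut is made.

```python
SEPARATORS = ["\n\n", "\n", ". ", " "]  # coarse to fine


def recursive_split(text, chunk_size, separators=SEPARATORS):
    """Return pieces of at most chunk_size characters, preferring coarse separators."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # Nothing left to split on: fall back to a hard cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    parts = text.split(sep)
    if len(parts) == 1:
        return recursive_split(text, chunk_size, finer)
    chunks, current = [], ""
    for part in parts:
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging neighbours while they fit
            continue
        if current.strip():
            chunks.append(current)
        if len(part) <= chunk_size:
            current = part
        else:
            # The part alone is too big: retry it with finer separators.
            chunks.extend(recursive_split(part, chunk_size, finer))
            current = ""
    if current.strip():
        chunks.append(current)
    return chunks


doc = (
    "Redis caching\n\n"
    "Install Redis and start the server. Set a TTL for every key.\n\n"
    "Monitoring\n\n"
    "Track hit rate and memory usage."
)
chunks = recursive_split(doc, chunk_size=40)
for chunk in chunks:
    print(repr(chunk))
```

In this example only the second paragraph exceeds 40 characters, so it alone falls through to the sentence separator; the other paragraphs stay whole.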

Recursive character splitting tries separators in order: \n\n → \n → . → space. Why this hierarchy?

Semantic Chunking: Splitting by Meaning

Recursive splitting is honest about one thing: it does not understand meaning. It just respects syntax. But a paragraph can contain two different topics - and conversely, two paragraphs can be about the same thing. **Semantic chunking** finds boundaries where meaning shifts, not where line breaks happen to appear.

The algorithm: split text into sentences, compute an embedding for each via text-embedding-3-small, find points of sharp cosine similarity drop - that is where the topic switches. Those drops mark the semantic boundaries for new chunks.
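
The algorithm above can be sketched as follows. To keep the example runnable offline, `embed` is a toy bag-of-words stand-in for a real embedding call such as text-embedding-3-small, and a fixed similarity threshold replaces the "sharp drop" detection; both are assumptions made for illustration, as are the function names.

```python
import math
import re
from collections import Counter


def embed(sentence):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(sentences, threshold=0.2):
    """Start a new chunk wherever adjacent sentences fall below the similarity threshold."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev, cur, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            chunks.append(" ".join(current))  # topic switch: close the chunk
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


sentences = [
    "Redis stores cache entries in memory.",
    "Each cache entry in Redis has a TTL.",
    "Vacation requests go to the HR portal.",
    "The HR portal approves vacation requests within two days.",
]
chunks = semantic_chunks(sentences)
for chunk in chunks:
    print(chunk)
```

The two Redis sentences share vocabulary and stay together, while the jump to the HR topic has zero similarity and opens a new chunk. With real embeddings the threshold is usually derived from the distribution of similarities in the document rather than hard-coded.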

Method	Cost	Boundary quality	When to use
Fixed-size	Free	Poor - cuts in the middle	Prototypes, very uniform texts
Recursive	Free	Good - by syntactic boundaries	Production default, 80% of cases
Semantic	USD 0.01-0.1 per document (embedding calls)	Excellent - by semantic boundaries	Long texts with topic switches

Semantic chunking requires an embedding call for every sentence. A document with 1000 sentences = 1000 embedding calls at `USD 0.02/1M` tokens. For large corpora - batch processing and caching are a must.

Semantic chunking determines chunk boundaries by...

Document-aware Chunking and Metadata Enrichment

Documents are not plain text. Markdown has headings, code has functions, HTML has sections. A recursive splitter is blind to all of that - it just counts characters. **Document-aware chunking** reads document structure and respects it.

GitHub Copilot does not slice source code into 512-character blocks. It chunks by functions and classes - whole semantic units. Notion AI uses the block structure of a document as natural boundaries. That is the document-aware approach in production.

Markdown-aware Chunking
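
A minimal sketch of heading-based splitting, assuming ATX-style headings (`#` to `######`). The function name `markdown_chunks` and the record layout are hypothetical, and the sketch does not protect `#` lines inside fenced code blocks, which a production splitter would need to handle.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def markdown_chunks(markdown):
    """Split Markdown at headings and record the heading path of every chunk."""
    chunks, path, body = [], [], []

    def flush():
        text = "\n".join(body).strip()
        if text:
            chunks.append({"section": " > ".join(path), "text": text})
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # close the chunk that belongs to the previous heading
            level = len(match.group(1))
            del path[level - 1:]  # drop headings at the same or deeper level
            path.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return chunks


doc = """# Caching
Intro to caching.
## Redis
Set a TTL for every key.
## Memcached
No persistence.
"""
chunks = markdown_chunks(doc)
for chunk in chunks:
    print(chunk["section"], "|", chunk["text"])
```

Each chunk keeps its heading path (for example "Caching > Redis"), so the section context survives even when the chunk body never repeats it. Sections that are still too long can be passed through a recursive splitter afterwards.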

Metadata Enrichment

A chunk without metadata is like a book without a table of contents. The embedding of "Every employee is entitled to 28 days" will be far from the query "vacation policy". Prepend "Document: HR Policy, Section: Vacation" before embedding - and the vector becomes accurate. A free optimization that gets skipped in 90% of projects.

**Prepending metadata** to text before embedding is a proven technique. The embedding for "Document: HR Policy, Section: Vacation. Every employee has..." will be more relevant to the query "vacation policy" than the embedding of bare text.
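
The prepending step is small enough to show in full. In this sketch the function name `enrich_chunk` and the record fields are hypothetical; the only idea taken from the text is that the string sent to the embedding model carries the document title and section.

```python
def enrich_chunk(text, document, section):
    """Build a record whose embedding input carries document and section context."""
    return {
        # This string is what gets embedded.
        "embed_text": f"Document: {document}, Section: {section}. {text}",
        # The bare text is kept for display or for the LLM prompt.
        "text": text,
        "metadata": {"document": document, "section": section},
    }


record = enrich_chunk("Every employee is entitled to 28 days", "HR Policy", "Vacation")
print(record["embed_text"])
```

Keeping the bare text next to the enriched string is a design choice: it lets the prompt template decide whether to show the metadata prefix to the LLM, and the structured `metadata` field remains available for filtering.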

Why add metadata (document title, section) to the chunk text BEFORE embedding?

Tuning and Evaluation: How to Find the Optimal Strategy

There is no perfect chunk size. The optimal strategy depends on data, queries, and model. The only way to find the optimum is **empirical testing** with metrics.

A/B Testing Chunking Strategies
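
A comparison of two chunking strategies can be sketched as a small harness that computes recall@k and MRR over a golden dataset. Everything here is illustrative: `retrieve` is a toy word-overlap ranker standing in for vector search, and a chunk counts as relevant when it contains the golden item's `evidence` string, which keeps the check independent of how each strategy numbers its chunks.

```python
import re


def tokenize(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, chunks, k):
    """Toy retriever: rank chunks by word overlap with the question."""
    q = tokenize(question)
    ranked = sorted(chunks, key=lambda c: len(q & tokenize(c)), reverse=True)
    return ranked[:k]


def evaluate(chunks, golden, k=3):
    """Return recall@k and MRR (truncated at k) for one chunking strategy."""
    hits, reciprocal_ranks = 0, 0.0
    for item in golden:
        results = retrieve(item["question"], chunks, k)
        rank = next(
            (i for i, c in enumerate(results, 1) if item["evidence"] in c), None
        )
        if rank is not None:
            hits += 1
            reciprocal_ranks += 1.0 / rank
    return {"recall@k": hits / len(golden), "mrr": reciprocal_ranks / len(golden)}


golden = [
    {"question": "How do I expire Redis keys?", "evidence": "TTL for every key"},
    {"question": "How many vacation days do employees get?", "evidence": "28 days of vacation"},
]
strategies = {
    "tiny fixed-size": ["Set a TTL for every", "key in Redis.",
                        "Employees get 28 days", "of vacation per year."],
    "sentence-level": ["Set a TTL for every key in Redis.",
                       "Employees get 28 days of vacation per year."],
}
scores = {name: evaluate(chunks, golden, k=1) for name, chunks in strategies.items()}
for name, score in scores.items():
    print(name, score)
```

The tiny fixed-size strategy cuts both evidence fragments in half, so neither question can be answered from a single chunk, while the sentence-level strategy retrieves both at rank 1. In a real pipeline the same harness would wrap the actual embedding model and vector store, with a golden dataset of 50-100 questions.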

Recommendations by Content Type

Content type	Strategy	Chunk size	Overlap
Technical documentation	Markdown-aware	500-800 tokens	50-100 tokens
FAQ / Knowledge Base	By Q&A pair (each Q&A = chunk)	Varies	None
Source code	By functions / classes	Entire function	Signature + docstring
Legal documents	By paragraph + semantic	300-500 tokens	100 tokens
Chat logs / tickets	By message or thread	Entire thread	None
Scientific papers	By section (Abstract, Methods...)	600-1000 tokens	100 tokens

Overlap: How Much Is Enough?

Overlap solves the problem of information loss at chunk boundaries. Rule: **10-20% of chunk size**. Too much overlap creates duplication (one idea in three chunks), too little loses context.

A golden dataset for evaluating chunking consists of 50-100 questions with correct answers and references to specific documentation fragments. Creating a golden dataset is an investment that pays off with every iteration of RAG improvement.

Myth: There is a universally best chunk size - 512 tokens with 64 overlap. Those numbers can be lifted from a tutorial and never touched again.

Reality: Optimal chunk size depends on content type, query type, and embedding model. The 512/64 figures are a reasonable starting point, not a final answer. Measurement on a golden dataset of 50-100 question-answer pairs typically shifts the optimum by 30-50%.

Tutorials pin a single config because they have to pick one. Students treat it as the optimum because they lack a metric to challenge it. In practice recall@k and MRR drop by 15-30 points when the same 512-token preset is applied to source code or legal documents.

A document contains 10,000 tokens. Fixed-size chunking is used: chunk size = 500 tokens, overlap = 100 tokens. How many chunks will be created?
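
The count follows from the window step, which is chunk size minus overlap. A short sketch of the arithmetic, with `count_chunks` as a hypothetical helper name:

```python
import math


def count_chunks(total_tokens, chunk_size, overlap):
    """Number of sliding windows needed to cover total_tokens."""
    if total_tokens <= chunk_size:
        return 1 if total_tokens > 0 else 0
    step = chunk_size - overlap  # each new chunk advances by this many tokens
    return math.ceil((total_tokens - chunk_size) / step) + 1


n = count_chunks(10_000, 500, 100)
print(n)  # 25
```

The first chunk covers 500 tokens and each later chunk adds 400 new ones, so the remaining 9,500 tokens need ceil(9500 / 400) = 24 more chunks, 25 in total. Dividing 10,000 by 500 gives 20, which ignores the overlap and undercounts the index size by a quarter.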

Myth: Larger chunk size = more context = better RAG results

Reality: A large chunk reduces precision: the embedding vector averages across all topics, and the relevant fragment drowns in noise

In practice, text-embedding-3-small represents chunks of 256-512 tokens most accurately. At 1500 tokens, embedding quality degrades: the model cannot accurately represent all content in a single vector. Retrieval finds the chunk, but the LLM receives 1400 irrelevant tokens alongside the 100 it actually needed - precision drops, generation cost rises.

Key Takeaways

Chunk size is the most impactful and cheapest decision in RAG - it drives precision, recall, and cost
Fixed-size (512 tokens, 10-20% overlap) is the prototype starting point. Recursive character splitting is the production default
Semantic chunking: boundaries at drops in cosine similarity between embeddings of adjacent sentences
Document-aware: Markdown by headings, code by functions - exactly how GitHub Copilot and Notion AI work
Metadata prepending before embedding is a free retrieval accuracy boost skipped in 90% of projects
Bigger chunk does not mean better: embedding degrades, precision drops, LLM cost increases

Questions for Reflection

Which chunking strategy fits a corpus of 10,000 PDF contracts of 200 pages each, and why would a fixed 512-token split give low recall here?
What happens to retrieval quality when overlap is raised from 20% to 50% - does boundary fact precision improve, or does duplication hurt precision overall?
How is a golden dataset for chunking evaluation built without pre-existing correct answers: which properties should the questions have, and who annotates them?

What's Next

Documents are chunked, the RAG pipeline is working. But a real chatbot needs one more piece - memory of previous messages.

Conversation Memory — Buffer, summary, vector memory - how LLMs remember conversation context
Advanced RAG — Hybrid search and re-ranking to improve retrieval on top of chunking
Document Processing — How to extract text from PDF, HTML, DOCX before chunking

Related Lessons

aie-11-document-processing — Document parsing precedes chunking
aie-12-rag-fundamentals — Chunking is the first step of any RAG pipeline
aie-13-advanced-rag — Advanced RAG breaks without proper chunking
aie-09-embeddings — Chunk size directly affects embedding information density
calc-16-taylor — Both methods seek local approximations of a large object
db-26-caching