Chunking Strategies for RAG — Fixed-Size, Recursive, Semantic, and Document-Aware
Four ways to split documents. Each one is the right answer for some doc type, and the wrong answer for others. The mistake is using the same chunker for everything.
If you only do one thing for your RAG quality, do this: stop using a 500-token fixed-size splitter on everything. Different document types want different chunkers, and the difference shows up in precision@5 the moment you measure.
The four chunkers worth knowing:
Fixed-size with overlap. 500 tokens, 80 overlap. Dumb but predictable. Works on prose and transcripts. Cuts code, tables, and headings in half. Fast.
Recursive character splitting. Tries to split on paragraph breaks first, then sentences, then characters. Preserves boundaries better than fixed-size. The default in most libraries because it doesn't actively damage prose.
Semantic chunking. Embeds rolling sentence windows and starts a new chunk when the embedding distance jumps. Captures topic boundaries naturally. Slow at ingest time (you embed twice — once to chunk, once to store). Worth it for technical docs.
Document-aware. Parses the document structure (markdown headings, PDF sections, code blocks, tables) and chunks on natural boundaries. Most expensive to implement, best results on structured content.
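Semantic Kernel has no built-in semantic chunker, so here's a minimal sketch of the idea before the numbers. Everything tunable here is an assumption, not a recommendation: `embed` stands in for whatever embedding client you already use, and the 3-sentence window, 0.25 cosine-distance breakpoint, and naive sentence regex are starting points to tune.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Semantic chunking sketch: embed a rolling sentence window, start a new
// chunk when the cosine distance between consecutive windows jumps.
static List<string> SemanticChunk(string text, Func<string, float[]> embed,
    int window = 3, double breakpoint = 0.25)
{
    var sentences = Regex.Split(text, @"(?<=[.!?])\s+")
                         .Where(s => s.Trim().Length > 0)
                         .ToList();
    var chunks = new List<string>();
    var current = new List<string>();
    float[]? prev = null;

    for (int i = 0; i < sentences.Count; i++)
    {
        // Rolling window ending at sentence i: this is the "embed twice" cost.
        int start = Math.Max(0, i - window + 1);
        var vec = embed(string.Join(" ", sentences.Skip(start).Take(i - start + 1)));

        if (prev is not null && current.Count > 0 && CosineDistance(prev, vec) > breakpoint)
        {
            chunks.Add(string.Join(" ", current));  // distance jumped: topic boundary
            current.Clear();
        }
        current.Add(sentences[i]);
        prev = vec;
    }
    if (current.Count > 0) chunks.Add(string.Join(" ", current));
    return chunks;
}

static double CosineDistance(float[] a, float[] b)
{
    double dot = 0, na = 0, nb = 0;
    for (int i = 0; i < a.Length; i++) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]; }
    return 1.0 - dot / (Math.Sqrt(na) * Math.Sqrt(nb) + 1e-12);
}
```

The slow ingest row in the table below falls straight out of this loop: one embedding call per window here, then one more per stored chunk later.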
A small benchmark on a 200-page mixed corpus (markdown docs + PDFs + chat transcripts), precision@5 against a 50-question eval set:
| Chunker | Precision@5 | Ingest time |
|---|---|---|
| Fixed 500/0 | 58% | 1.2s |
| Fixed 500/80 | 64% | 1.3s |
| Recursive 500/80 | 71% | 1.8s |
| Semantic | 78% | 14s |
| Document-aware | 81% | 6s |
The numbers move a lot: 23 points separate the worst and best chunker. That's a bigger gap than you'll see between embedding models.
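For anyone reproducing the measurement: precision@5 here is assumed to be the usual definition, the fraction of the top five retrieved chunks that are relevant, averaged over the questions. A sketch, where `EvalQuestion` and `retrieve` are hypothetical stand-ins for your labeled eval set and retrieval call:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// precision@5: fraction of the top 5 retrieved chunks that are relevant,
// averaged over all eval questions. EvalQuestion and retrieve are
// hypothetical stand-ins for your eval set and retrieval pipeline.
record EvalQuestion(string Query, HashSet<string> RelevantChunkIds);

static double PrecisionAt5(IReadOnlyList<EvalQuestion> evalSet,
    Func<string, IReadOnlyList<string>> retrieve)
{
    return evalSet.Average(q =>
        retrieve(q.Query).Take(5).Count(id => q.RelevantChunkIds.Contains(id)) / 5.0);
}
```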
In C# with Semantic Kernel:
```csharp
using System.Text.RegularExpressions;
using Microsoft.SemanticKernel.Text;

// Fixed/recursive — built in. SplitMarkdownParagraphs expects pre-split lines.
var lines = TextChunker.SplitMarkdownLines(text, maxTokensPerLine: 500);
var chunks = TextChunker.SplitMarkdownParagraphs(lines, maxTokensPerParagraph: 500, overlapTokens: 80);

// Document-aware — write your own splitter that respects headings
IEnumerable<string> SplitOnHeadings(string markdown)
{
    var sections = Regex.Split(markdown, @"\n(?=#{1,3} )");
    foreach (var chunk in sections.SelectMany(s => SplitIfLong(s, 500, 80)))
        yield return chunk;  // SplitIfLong: fixed-size fallback, sketched below
}
```
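That snippet leaves `SplitIfLong` undefined: it's the fixed-size fallback for sections that blow past the token budget. A minimal version, using whitespace words as a stand-in for real tokens (an approximation; wire in your actual tokenizer):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Fixed-size fallback: sliding windows of ~maxTokens words, with `overlap`
// words shared between neighbors. Whitespace words approximate tokens.
static IEnumerable<string> SplitIfLong(string section, int maxTokens, int overlap)
{
    var words = section.Split((char[]?)null, StringSplitOptions.RemoveEmptyEntries);
    if (words.Length <= maxTokens)
    {
        yield return section;  // short enough: keep the section intact
        yield break;
    }
    for (int start = 0; start < words.Length; start += maxTokens - overlap)
        yield return string.Join(" ", words.Skip(start).Take(maxTokens));
}
```

This loop is also the entire fixed 500/80 chunker from the table; the overlap is what keeps a sentence that straddles a boundary retrievable from either side.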
My "default to this" by document type, after enough hard lessons:
- Markdown docs: document-aware. Headings are free structure, use them.
- PDFs (policy, manuals): recursive with 600/100, or document-aware if the PDF has clean structure.
- Chat transcripts: speaker-turn chunking (see the sketch after this list). Time-window if the turns are short.
- Source code: language-aware splitter (one function per chunk). Never fixed-size — you'll cut classes in half.
- Tables / structured data: not in a vector store. Use SQL or a typed index.
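For the transcript case, speaker-turn chunking is mostly bookkeeping: accumulate lines, flush a chunk whenever the speaker changes. A sketch assuming `Name: utterance` transcript lines; the regex and the one-turn-per-chunk policy are assumptions to adapt:

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Speaker-turn chunking: one chunk per turn, where a turn is a maximal run
// of lines from the same speaker. Assumes "Name: utterance" formatting.
static IEnumerable<string> SplitBySpeakerTurns(string transcript)
{
    string? speaker = null;
    var turn = new List<string>();
    foreach (var line in transcript.Split('\n'))
    {
        var m = Regex.Match(line, @"^([\w .'-]+):\s");
        if (m.Success && m.Groups[1].Value != speaker)
        {
            if (turn.Count > 0) yield return string.Join("\n", turn);  // speaker changed
            turn.Clear();
            speaker = m.Groups[1].Value;
        }
        if (line.Trim().Length > 0) turn.Add(line);  // continuation lines join the turn
    }
    if (turn.Count > 0) yield return string.Join("\n", turn);
}
```

If turns run short (chat rather than meetings), merge consecutive turns into a time window instead of flushing on every speaker change.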
The trap is reaching for semantic chunking when document-aware would have worked. Semantic is slower, costs more at ingest, and rarely beats a decent document-aware splitter on docs that already have structure. Use semantic for unstructured prose where structure is hidden. Use document-aware for everything else.