
Why Your RAG Pipeline Returns Garbage (And It's Probably Your Chunking)

Your retriever pulls the right documents and the answers still come out wrong. Nine times out of ten the problem isn't where you're looking. It's upstream.

You ship the RAG demo. The stakeholders clap. A week later your support inbox is full of screenshots where the bot is quoting things the document doesn't actually say. You upgrade the embedding model. No change. You swap LLMs. Slight change. The real culprit was sitting in your preprocessing script the whole time.

It's almost always the chunking.

What chunking actually decides

People treat chunking like a preprocessing detail. The same energy as resizing an image before upload. It isn't. Chunking decides what a single "thought" looks like to your retriever. That decides what context the LLM sees. That decides whether the answer is grounded or confidently hallucinated.

Imagine you take a cookbook, cut it into strips with scissors, and shuffle the pile. Someone asks how to make carbonara. If you cut on page boundaries, half the recipe is on one strip and the other half is two metres away. The retriever finds a strip. The LLM helpfully invents the missing half. That's most production RAG pipelines I've seen.
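For contrast, here is roughly what that scissors-and-shuffle approach looks like in code. The 500-character cut point is arbitrary, which is exactly the problem: it lands wherever it lands, heading or not, mid-sentence or not.

def chunk_naive(text: str, size: int = 500) -> list[str]:
    # Cut every `size` characters, blind to headings, sentences, or recipes.
    # A step that starts on one strip and finishes on the next ends up split
    # across two chunks that may never be retrieved together.
    return [text[i : i + size] for i in range(0, len(text), size)]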

The fix: structure-aware, overlapping chunks

import re
from typing import Callable, Iterator

def chunk_markdown(
    text: str,
    target_tokens: int = 400,
    overlap_tokens: int = 80,
    tok: Callable[[str], list[str]] = lambda s: s.split(),  # swap for tiktoken in real code
) -> Iterator[dict]:
    # Split on semantic boundaries first, not character counts.
    sections = re.split(r"\n(?=#{1,3} )", text)
    for section in sections:
        words = tok(section)
        if len(words) <= target_tokens:
            yield {"text": section, "tokens": len(words)}
            continue
        # Slide a window with overlap so concepts that span a break survive.
        step = target_tokens - overlap_tokens
        for i in range(0, len(words), step):
            window = words[i : i + target_tokens]
            yield {"text": " ".join(window), "tokens": len(window)}
            if i + target_tokens >= len(words):
                break  # the last window already covers the tail
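A quick way to sanity-check it, with a toy document and deliberately tiny budgets (the numbers are just for illustration):

doc = """# Carbonara
Whisk the eggs and pecorino together.

## Method
Fry the guanciale, then toss everything with the hot pasta.
"""

for chunk in chunk_markdown(doc, target_tokens=50, overlap_tokens=10):
    print(chunk["tokens"], repr(chunk["text"][:60]))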

Why it works

Splitting on headers first respects the document's own idea of a topic. You stop slicing through the middle of a recipe. The overlap window then patches the seams between chunks so a sentence that straddles a break appears in both neighbours. Retrieval scores end up reflecting topical relevance instead of whatever keywords got lucky.
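You can see the seam-patching directly with a toy window over numbered words (the numbers are only there to make the overlap visible):

words = [f"w{i}" for i in range(12)]
target, overlap = 5, 2
for i in range(0, len(words), target - overlap):
    print(words[i : i + target])
# ['w0'..'w4'], ['w3'..'w7'], ['w6'..'w10'], ['w9'..'w11']:
# each window repeats the last two words of the previous one, so a sentence
# that crosses a cut survives intact in at least one chunk.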

When to use this

This should be your default for anything with detectable semantic boundaries. Markdown, HTML, well-formatted PDFs, source code, anything someone wrote with a structure in mind. The few hours you put into a decent splitter return more lift than swapping embedding models. I've watched teams burn three weeks evaluating models when the answer was 30 lines of Python.

When not to

Skip it for very short documents that fit in one chunk anyway. Also skip it for transcripts and chat logs where headers are meaningless. There you want speaker-turn chunking or fixed time-window chunking instead. Use the right knife for the document.
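For the transcript case, a speaker-turn splitter can be very small. A sketch, assuming lines look like "Name: utterance" and that six turns per chunk suits your data; both are assumptions, not fixed rules:

import re

def chunk_by_speaker(transcript: str, turns_per_chunk: int = 6) -> list[str]:
    # Split whenever a new "Speaker:" label starts a line, then group a few
    # consecutive turns so each chunk keeps enough conversational context.
    turns = re.split(r"\n(?=[A-Z][\w .'-]*:)", transcript.strip())
    return [
        "\n".join(turns[i : i + turns_per_chunk])
        for i in range(0, len(turns), turns_per_chunk)
    ]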

How chunks flow into retrieval

flowchart LR
    A[Raw document] --> B[Split on headers]
    B --> C{Section larger<br/>than target?}
    C -- No --> E[Emit chunk]
    C -- Yes --> D[Sliding window<br/>with overlap]
    D --> E
    E --> F[Embed]
    F --> G[(Vector store)]
    G --> H[Top-k retrieval]
    H --> I[LLM context]
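In code, the right half of that diagram is just embed, store, rank. A minimal sketch where embed() is a stand-in for whatever embedding model you actually call, assumed to return one unit-length vector per input string; it is not a real API:

import numpy as np

def top_k(query: str, chunks: list[dict], embed, k: int = 5) -> list[dict]:
    # embed() is a placeholder: list of strings in, one vector per string out.
    q = np.array(embed([query])[0])
    matrix = np.array(embed([c["text"] for c in chunks]))
    scores = matrix @ q  # cosine similarity when the vectors are normalised
    best = np.argsort(scores)[::-1][:k]
    return [chunks[int(i)] for i in best]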

Conclusion

Before you touch the embedding model or add a reranker, dump 20 chunks your current pipeline produces and read them out loud. If half of them are mid-sentence or missing the heading that gives them meaning, fix that first. Everything downstream gets cheaper, faster, and less embarrassing.
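That audit is a few lines. A sketch, assuming your chunks are already dicts like the ones chunk_markdown yields:

import random

def audit_chunks(chunks: list[dict], n: int = 20) -> None:
    # Read these out loud. Mid-sentence starts and missing headings are the tell.
    for chunk in random.sample(chunks, min(n, len(chunks))):
        print(f"--- {chunk['tokens']} tokens ---")
        print(chunk["text"][:300])
        print()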