
Why Your RAG Hallucinates — A Debugging Checklist

Ten reasons RAG systems lie, in order of how often I see them. Each one has a symptom you can spot and a fix that takes less than a day.

"The RAG is hallucinating" is a useless bug report. It's also the most common one. Here's the checklist I run when someone hands me a misbehaving RAG, ordered by how often each turns out to be the real culprit.

1. Bad chunking. Symptom: answers cite half-sentences or mix unrelated topics. Diagnostic: dump 20 random chunks. If they look bad to you, they look bad to the retriever. Fix: switch to document-aware or recursive chunking with sensible overlap.
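
A minimal sketch of both the diagnostic and the fix, assuming LangChain's RecursiveCharacterTextSplitter; the file name, chunk size, and overlap are placeholders to tune on your own corpus:

```python
import random

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Diagnostic: pull 20 random chunks and actually read them.
def dump_random_chunks(chunks, n=20):
    for chunk in random.sample(chunks, min(n, len(chunks))):
        print("-" * 60)
        print(chunk[:500])

# Fix: recursive, overlap-aware splitting instead of fixed-size cuts.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,       # characters; tune per corpus
    chunk_overlap=120,    # enough to keep sentences intact across boundaries
    separators=["\n\n", "\n", ". ", " "],
)
chunks = splitter.split_text(open("handbook.md").read())
dump_random_chunks(chunks)
```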

2. No reranker. Symptom: top-1 retrieved chunk is related but not the answer. Diagnostic: print top-5 chunks alongside the LLM answer. The answer is usually in chunk 3 or 4. Fix: add Cohere Rerank between retrieval and the prompt.
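
Here's a sketch of that step using Cohere's Python SDK; the model name and top_n are assumptions, so swap in whatever reranker you actually have access to:

```python
import cohere

co = cohere.Client("YOUR_API_KEY")

def rerank(query: str, chunks: list[str], keep: int = 5) -> list[str]:
    """Re-score retrieved chunks against the query and keep only the best few."""
    resp = co.rerank(
        model="rerank-english-v3.0",  # example model name; use what you have
        query=query,
        documents=chunks,
        top_n=keep,
    )
    # Each result points back into the original list via .index
    return [chunks[r.index] for r in resp.results]

# Retrieve wide (say k=20), rerank, keep 5, and only those 5 go into the prompt.
```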

3. Wrong embedding model for the domain. Symptom: legal/medical/code queries return generic-looking chunks. Diagnostic: run a 50-question eval against 2-3 embedding models. The winner usually beats the runners-up by 10% or more. Fix: switch models. Voyage-3 is today's boring, safe bet for English RAG.
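
A sketch of that eval. embed() is a placeholder for whichever embedding SDK you call, and the model names at the bottom are just examples of contenders:

```python
import numpy as np

def embed(model: str, texts: list[str]) -> list[list[float]]:
    """Placeholder: call whichever embedding API you're comparing (Voyage, OpenAI, Cohere, ...)."""
    raise NotImplementedError

def recall_at_k(model, questions, gold_ids, chunks, chunk_ids, k=5):
    """Fraction of questions whose known-good chunk lands in the top k."""
    doc_vecs = np.array(embed(model, chunks), dtype=float)
    doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    hits = 0
    for question, gold in zip(questions, gold_ids):
        q_vec = np.array(embed(model, [question])[0], dtype=float)
        q_vec /= np.linalg.norm(q_vec)
        top_k = np.argsort(doc_vecs @ q_vec)[::-1][:k]
        hits += gold in {chunk_ids[i] for i in top_k}
    return hits / len(questions)

# Usage: ~50 hand-written (question, gold chunk id) pairs, then
#   for model in ["voyage-3", "text-embedding-3-large", "embed-english-v3.0"]:
#       print(model, recall_at_k(model, questions, gold_ids, chunks, chunk_ids))
```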

4. Context window stuffing. Symptom: longer context makes answers worse. Diagnostic: try top-3 vs top-10. If top-3 wins, you're feeding the LLM noise. Fix: trim aggressively, or add a small relevance filter before the LLM.
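
One way to do that trimming: a small cosine-similarity floor in plain numpy. The 0.3 threshold and the cap of 3 chunks are knobs to tune on your own data, not recommendations:

```python
import numpy as np

def filter_context(q_vec, chunk_vecs, chunks, min_sim=0.3, max_chunks=3):
    """Drop chunks below a similarity floor and cap how many reach the prompt."""
    q = np.asarray(q_vec, dtype=float)
    q /= np.linalg.norm(q)
    vecs = np.asarray(chunk_vecs, dtype=float)
    sims = (vecs / np.linalg.norm(vecs, axis=1, keepdims=True)) @ q
    order = np.argsort(sims)[::-1]
    return [chunks[i] for i in order if sims[i] >= min_sim][:max_chunks]
```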

5. Stale index. Symptom: answers cite information the user knows was changed last week. Diagnostic: check updated_at on a few chunks. Fix: a re-indexing job, plus version tracking on chunks so you know what's stale.
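
A sketch of the version-tracking side, assuming you store a content hash and an updated_at next to every chunk; the field names are made up:

```python
import hashlib
from datetime import datetime

def fingerprint(text: str) -> str:
    """Content hash stored alongside each chunk at index time."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def needs_reindex(stored_hash: str, stored_updated_at: datetime,
                  source_text: str, source_modified_at: datetime) -> bool:
    """Re-embed only documents that actually changed since we last indexed them."""
    return (source_modified_at > stored_updated_at
            or fingerprint(source_text) != stored_hash)
```

Run it from a nightly job or your ingestion pipeline; the hash check keeps re-indexing cheap.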

6. Top-K too low. Symptom: known-answerable questions return "I don't know." Diagnostic: increase K to 20 and check if the answer chunk now appears. Fix: pull more candidates, then rerank, then keep top 5.
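
The diagnostic in code, with search() standing in for your existing retrieval call:

```python
def answer_rank(query: str, known_good_phrase: str, search, k: int = 20):
    """Return the rank at which a chunk containing a known-correct phrase appears,
    or None if it never shows up in the top k. search(query, k) is a placeholder."""
    for rank, chunk in enumerate(search(query, k), start=1):
        if known_good_phrase.lower() in chunk.lower():
            return rank
    return None

# If this returns 8 and your pipeline only pulls k=5, the problem is K, not the model.
```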

7. Prompt template leaking instructions. Symptom: the answer parrots your system prompt. Diagnostic: read the actual final prompt being sent. Yes, you have to read it. Fix: clearly separate context, instructions, and the user question.
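
One common way to separate the three parts is explicit delimiters. The tag names here are arbitrary; any markers the model won't confuse with content will do:

```python
PROMPT_TEMPLATE = """\
<instructions>
Answer the question using only the context below.
If the context does not contain the answer, say you don't know.
</instructions>

<context>
{context}
</context>

<question>
{question}
</question>
"""

def build_prompt(context: str, question: str) -> str:
    prompt = PROMPT_TEMPLATE.format(context=context, question=question)
    print(prompt)  # the diagnostic: read the exact thing you send to the model
    return prompt
```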

8. No grounding instruction. Symptom: confident, well-written, wrong. Diagnostic: check the system prompt for "answer only from the provided context." Fix: add it. Yes, just that line. It measurably moves your eval numbers.

9. Hallucination-prone base model. Symptom: even with perfect context, the answer wanders. Diagnostic: paste the exact retrieved context + question into the model directly. If it still hallucinates, the model is the issue. Fix: switch models. Smaller models hallucinate more on long contexts.
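
A quick way to run that isolation test, assuming an OpenAI-compatible client; swap in whatever model and SDK you're actually evaluating:

```python
from openai import OpenAI

client = OpenAI()

def model_only_test(context: str, question: str, model: str = "gpt-4o-mini") -> str:
    """Bypass retrieval entirely and feed the exact chunks the pipeline already found.
    If the answer still wanders, the model is the problem, not retrieval."""
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Answer only from the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content
```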

10. No citation enforcement. Symptom: users can't verify what's true. Diagnostic: do answers include chunk IDs or quotes? Fix: require the model to cite the chunk it used. "For each claim, cite the source ID." This also catches hallucinations: a made-up claim either carries no citation or cites a chunk that doesn't support it, and both are easy to check automatically.
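
A sketch of the enforcement side. The [chunk:ID] convention is just one choice; anything you can regex works:

```python
import re

def invalid_citations(answer: str, retrieved_ids: set[str]) -> list[str]:
    """Return any cited chunk ID that was not actually in the retrieved set."""
    cited = set(re.findall(r"\[chunk:([\w-]+)\]", answer))
    return sorted(cited - retrieved_ids)

# Prompt side: "For each claim, cite its source as [chunk:ID]."
# Anything this returns is a fabricated citation or an indexing bug; fail loudly either way.
```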

Run this list top to bottom. Don't skip ahead. The number of times "switch to GPT-5" was the answer is much smaller than the number of times "fix the chunker" was the answer.