From Naive RAG to Agentic RAG: A Migration Story in 5 Steps
Naive RAG works fine until your users start asking compound questions. Here's how we turned a single-shot retriever into something that plans, retrieves, and verifies. Without a rewrite.
Our first internal docs bot handled single-fact questions well. Then product asked it to compare two services across four dimensions and the wheels came off. The pipeline kept retrieving the right docs for part of the question and confidently making up the rest. Classic.
The fix wasn't a rewrite. It was five small migrations, each shippable on its own.
What "agentic RAG" actually changes
Naive RAG is a function. Query in, answer out. Agentic RAG is a loop. The LLM decides what to ask the retriever, reads the result, decides whether it has enough, and either asks again or commits to an answer. The retriever stops being a search engine. It becomes a tool the model calls.
That shift sounds bigger than it is. The code is small. The mindset isn't.
The five steps, one example
async def agentic_answer(question: str, max_steps: int = 4) -> str:
    """Answer by letting the LLM plan, retrieve, and self-check in a loop."""
    history = [{"role": "user", "content": question}]
    for step in range(max_steps):
        # Step 3: retrieval is a tool the model calls on demand.
        plan = await llm.chat(
            history,
            tools=[search_tool, lookup_tool, finalize_tool],
        )
        if plan.tool_name == "finalize":
            return plan.tool_args["answer"]
        result = await TOOLS[plan.tool_name](**plan.tool_args)
        history.append({"role": "tool", "name": plan.tool_name, "content": result})
        # Step 4: self-check before the next loop.
        if step >= 1 and await is_sufficient(question, history):
            return await llm.synthesize(history)
    return await llm.synthesize(history)  # fall back if the step budget is exhausted
Why it works
The model writes the retrieval queries instead of the user. And the model knows what it still needs to know. Compound questions decompose naturally because each loop iteration handles one sub-question. The is_sufficient check is what separates an agent that ships from an agent that wanders for ten loops and times out.
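For reference, a minimal is_sufficient could look like the sketch below. It assumes the same llm client as the loop above and that llm.chat, when called without tools, returns an object whose reply text is in .content; the prompt wording and the strict YES/NO protocol are placeholders to tune against your own eval set.

async def is_sufficient(question: str, history: list[dict]) -> bool:
    # Gather every tool result retrieved so far into one context block.
    context = "\n\n".join(m["content"] for m in history if m["role"] == "tool")
    reply = await llm.chat([{
        "role": "user",
        "content": (
            "Question:\n" + question
            + "\n\nRetrieved context:\n" + context
            + "\n\nCan the question be answered fully from this context alone? "
            "Reply with exactly YES or NO."
        ),
    }])
    return reply.content.strip().upper().startswith("YES")

Keeping it a plain YES/NO call keeps it cheap, which matters because it runs once per loop iteration.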
The five steps, named
- Query rewriting. Before retrieval, ask the LLM to rephrase the user query into something a vector store will actually like. Cheap. High impact. Do this first (sketched right after this list).
- A router. Classify whether the question even needs retrieval. Small talk and meta questions should not hit the vector store (also in the sketch below).
- Retrieval as a tool. The model calls it on demand instead of always-first. This is the conceptual unlock.
- A sufficiency check. After each retrieval, ask "do we have enough to answer?" Loop if not. This is your guard against runaway costs; it's the is_sufficient call in the loop above.
- Verification. Before returning, ask the model to cite spans from retrieved context. If it can't, retry. This catches the last 10% of hallucinations (second sketch below).
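Steps 1 and 2 are small enough to show together. This is a sketch under the same assumptions as above (a shared llm client whose chat reply text is in .content); the rewrite_query and needs_retrieval names and the prompt wording are illustrative, not a fixed API.

async def rewrite_query(user_query: str) -> str:
    # Step 1: turn a conversational question into a retrieval-friendly query.
    reply = await llm.chat([{
        "role": "user",
        "content": (
            "Rewrite this question as a short, keyword-dense search query "
            "for a vector store. Return only the query.\n\n" + user_query
        ),
    }])
    return reply.content.strip()

async def needs_retrieval(user_query: str) -> bool:
    # Step 2: route small talk and meta questions away from the vector store.
    reply = await llm.chat([{
        "role": "user",
        "content": (
            "Does answering this question require looking up documents? "
            "Reply with exactly YES or NO.\n\n" + user_query
        ),
    }])
    return reply.content.strip().upper().startswith("YES")

Both sit in front of whatever retrieval path you already have; neither one touches the index, which is why they ship so easily.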
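Step 5 in the same spirit. Another hedged sketch: the UNSUPPORTED sentinel, the single retry, and the verify_or_retry name are assumptions, and you would call it on the draft that llm.synthesize produces before returning it to the user.

async def verify_or_retry(question: str, history: list[dict], draft: str) -> str:
    # Step 5: demand that every claim in the draft is backed by a retrieved span.
    context = "\n\n".join(m["content"] for m in history if m["role"] == "tool")
    check = await llm.chat([{
        "role": "user",
        "content": (
            "For each claim in the answer, quote the span of the context that "
            "supports it. If any claim has no supporting span, reply only with "
            "UNSUPPORTED.\n\nContext:\n" + context
            + "\n\nQuestion:\n" + question
            + "\n\nAnswer:\n" + draft
        ),
    }])
    if "UNSUPPORTED" in check.content:
        # Retry once with the verifier's complaint added to the history.
        history = history + [{"role": "tool", "name": "verifier", "content": check.content}]
        return await llm.synthesize(history)
    return draft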
When to use it
When your eval set has multi-part questions, comparison questions, or "find X and use it to look up Y" patterns. Any time users keep saying the bot answered "half" of what they asked. Also when retrieval quality is already decent. Agentic RAG amplifies a working retriever. It does not fix a broken one.
When not to
If 90% of your queries are single-fact lookups, the extra latency and cost are not worth it. Also avoid it in domains where wrong-but-fast beats right-but-slow. Autocomplete, instant search, anything in front of a live keystroke. And don't start here on a greenfield project. Ship naive RAG first. Evolve.
The control flow
sequenceDiagram
    participant U as User
    participant A as Agent (LLM)
    participant R as Retriever
    U->>A: Compare services X and Y on cost
    A->>R: search("service X pricing")
    R-->>A: docs about X
    A->>A: sufficient? no
    A->>R: search("service Y pricing")
    R-->>A: docs about Y
    A->>A: sufficient? yes
    A->>A: verify citations
    A-->>U: grounded comparison
Conclusion
Pick step 1, query rewriting, and ship it this week. It's 30 lines of code, no new infrastructure, and you get a baseline lift you can measure before you commit to the full agent loop. That's the bit I wish someone had told me before I rewrote the retriever twice.