Building an Agentic RAG with Microsoft Semantic Kernel
When single-shot retrieval isn't enough, an agent that decides whether to retrieve — and critiques its own answer — earns the extra latency. Here's how it looks in SK.
Naive RAG fails on compound questions ("compare the refund policies for A and B") because one retrieval call can't cover both halves. Agentic RAG fixes this by letting the LLM plan, retrieve, critique, and re-retrieve. Microsoft Semantic Kernel's ChatCompletionAgent is now mature enough to do this cleanly.
The loop:
flowchart LR
Q[Question] --> Plan["Agent: do I need retrieval?"]
Plan -- No --> Direct[Answer directly]
Plan -- Yes --> Rewrite[Rewrite query]
Rewrite --> Search[Search tool]
Search --> Critique{Confidence high?}
Critique -- Yes --> Answer[Synthesize answer]
Critique -- No --> Rewrite
Answer --> Done
Direct --> DoneA compact .NET implementation:
var agent = new ChatCompletionAgent
{
Name = "RagAgent",
Instructions = """
Answer using retrieved context only.
Steps: (1) decide if retrieval is needed.
(2) if yes, call Search; (3) if confidence < 0.7, refine and search again.
Max 3 retrieval rounds. Cite source IDs in every claim.
""",
Kernel = kernel,
Arguments = new(new OpenAIPromptExecutionSettings
{
FunctionChoiceBehavior = FunctionChoiceBehavior.Auto()
})
};
kernel.Plugins.AddFromObject(new SearchPlugin(vectorStore));
await foreach (var msg in agent.InvokeAsync(question))
Console.WriteLine(msg.Content);
The SearchPlugin exposes one function the agent can call:
public class SearchPlugin(IVectorStoreCollection store)
{
[KernelFunction("Search")]
[Description("Search the knowledge base. Returns chunks with IDs.")]
public async Task<string> Search(
[Description("Reformulated search query")] string query,
int top = 5)
{
var hits = await store.SearchAsync(query, top);
return JsonSerializer.Serialize(hits.Select(h => new { h.Id, h.Text }));
}
}
The agent now decides when to retrieve, what to retrieve, and whether to retrieve again. The critique step ("confidence < 0.7") happens in the same model — it inspects what it found and decides whether to loop.
Where this earns its complexity:
| Metric | Naive RAG | Agentic RAG |
|---|---|---|
| Single-fact accuracy | 78% | 79% |
| Compound question accuracy | 41% | 73% |
| Median latency | 1.6s | 4.2s |
| P95 latency | 2.4s | 9.8s |
| Cost/query | $0.006 | $0.041 |
The compound-question accuracy jump is real. The latency and cost increase are also real. This is a trade-off, not a free upgrade.
Two things that will save you a week:
Pin a max-step counter. Three retrieval rounds is the sweet spot from my experience. Six is the maximum you'd ever want. Without a hard cap, the agent can loop until the timeout on edge cases.
Stream the intermediate steps to the UI. A 9-second p95 with no feedback feels broken. Stream "thinking…" "searching for X…" "found 5 results, evaluating…" via SSE. Users tolerate slow when they can see progress.
Build naive first. Move to agentic only when your eval set shows you need it. If 90% of your questions are single-fact lookups, agentic is over-engineering you'll regret on the latency budget.