September 8, 2025·RAG·2 min read·Senior developers

Building an Agentic RAG with Microsoft Semantic Kernel

When single-shot retrieval isn't enough, an agent that decides whether to retrieve — and critiques its own answer — earns the extra latency. Here's how it looks in SK.

Naive RAG fails on compound questions ("compare the refund policies for A and B") because one retrieval call can't cover both halves. Agentic RAG fixes this by letting the LLM plan, retrieve, critique, and re-retrieve. Microsoft Semantic Kernel's ChatCompletionAgent is now mature enough to do this cleanly.

The loop:

flowchart LR
    Q[Question] --> Plan["Agent: do I need retrieval?"]
    Plan -- No --> Direct[Answer directly]
    Plan -- Yes --> Rewrite[Rewrite query]
    Rewrite --> Search[Search tool]
    Search --> Critique{Confidence high?}
    Critique -- Yes --> Answer[Synthesize answer]
    Critique -- No --> Rewrite
    Answer --> Done
    Direct --> Done

A compact .NET implementation:

var agent = new ChatCompletionAgent
{
    Name = "RagAgent",
    Instructions = """
        Answer using retrieved context only.
        Steps: (1) decide if retrieval is needed.
        (2) if yes, call Search; (3) if confidence < 0.7, refine and search again.
        Max 3 retrieval rounds. Cite source IDs in every claim.
        """,
    Kernel = kernel,
    Arguments = new(new OpenAIPromptExecutionSettings
    {
        FunctionChoiceBehavior = FunctionChoiceBehavior.Auto()
    })
};

kernel.Plugins.AddFromObject(new SearchPlugin(vectorStore));

await foreach (var msg in agent.InvokeAsync(question))
    Console.WriteLine(msg.Content);

The SearchPlugin exposes one function the agent can call:

public class SearchPlugin(IVectorStoreCollection store)
{
    [KernelFunction("Search")]
    [Description("Search the knowledge base. Returns chunks with IDs.")]
    public async Task<string> Search(
        [Description("Reformulated search query")] string query,
        int top = 5)
    {
        var hits = await store.SearchAsync(query, top);
        return JsonSerializer.Serialize(hits.Select(h => new { h.Id, h.Text }));
    }
}

The agent now decides when to retrieve, what to retrieve, and whether to retrieve again. The critique step ("confidence < 0.7") happens in the same model — it inspects what it found and decides whether to loop.

Where this earns its complexity:

Metric	Naive RAG	Agentic RAG
Single-fact accuracy	78%	79%
Compound question accuracy	41%	73%
Median latency	1.6s	4.2s
P95 latency	2.4s	9.8s
Cost/query	$0.006	$0.041

The compound-question accuracy jump is real. The latency and cost increase are also real. This is a trade-off, not a free upgrade.

Two things that will save you a week:

Pin a max-step counter. Three retrieval rounds is the sweet spot from my experience. Six is the maximum you'd ever want. Without a hard cap, the agent can loop until the timeout on edge cases.

Stream the intermediate steps to the UI. A 9-second p95 with no feedback feels broken. Stream "thinking…" "searching for X…" "found 5 results, evaluating…" via SSE. Users tolerate slow when they can see progress.

Build naive first. Move to agentic only when your eval set shows you need it. If 90% of your questions are single-fact lookups, agentic is over-engineering you'll regret on the latency budget.