Building Your First RAG Pipeline in .NET with Semantic Kernel and Qdrant
A document Q&A app in C# that doesn't depend on Python and doesn't take a week. Semantic Kernel, Qdrant in Docker, Azure OpenAI (or Ollama if you're cheap).
Most RAG tutorials assume you're in Python. You don't have to be. The .NET story has caught up enough that you can build a working document Q&A endpoint in an afternoon with Semantic Kernel, Qdrant, and a Minimal API.
The shape: ingest PDFs, embed and store the chunks in Qdrant, then expose an /ask endpoint that retrieves the nearest chunks and augments the prompt with them.
Start Qdrant locally. Map both ports: the REST API lives on 6333, but the .NET QdrantClient talks gRPC on 6334 by default:
docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant
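A quick sanity check that the container is up; hitting the REST port should return a small JSON blob with the running version:
curl http://localhost:6333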
Wire up the kernel and embeddings:
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.Connectors.Qdrant;
using Microsoft.SemanticKernel.Embeddings;
using Qdrant.Client;

// On Azure OpenAI the first argument is your deployment name, not the model name.
var builder = Kernel.CreateBuilder()
    .AddAzureOpenAIChatCompletion("gpt-4o-mini", endpoint, key)
    .AddAzureOpenAITextEmbeddingGeneration("text-embedding-3-small", endpoint, key);
var kernel = builder.Build();

var embeddings = kernel.GetRequiredService<ITextEmbeddingGenerationService>();
var qdrant = new QdrantVectorStore(new QdrantClient("localhost")); // gRPC, port 6334
var collection = qdrant.GetCollection<Guid, DocChunk>("docs");
await collection.CreateCollectionIfNotExistsAsync();
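DocChunk is just a POCO with vector-store attributes. A minimal sketch, assuming the Microsoft.Extensions.VectorData preview attributes that match the calls above (the names have shifted between releases, so check your package version); the EmbeddingModel field pays off in the last section:

using Microsoft.Extensions.VectorData;

public class DocChunk
{
    [VectorStoreRecordKey]
    public Guid Id { get; set; }

    [VectorStoreRecordData]
    public string Text { get; set; } = "";

    // Tag every record with the model that embedded it (see the versioning gotcha below).
    [VectorStoreRecordData]
    public string EmbeddingModel { get; set; } = "text-embedding-3-small";

    // text-embedding-3-small produces 1536-dimensional vectors.
    [VectorStoreRecordVector(1536)]
    public ReadOnlyMemory<float> Embedding { get; set; }
}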
Ingestion is just: split the doc, embed, upsert.
using Microsoft.SemanticKernel.Text;

foreach (var chunk in TextChunker.SplitPlainTextLines(text, maxTokensPerLine: 500))
{
    var vec = await embeddings.GenerateEmbeddingAsync(chunk);
    await collection.UpsertAsync(new DocChunk
    {
        Id = Guid.NewGuid(),
        Text = chunk,
        Embedding = vec,
        EmbeddingModel = "text-embedding-3-small" // see the advice at the end
    });
}
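One gap the loop glosses over: text has to come from somewhere. Extraction isn't part of the pipeline above, so here's one option, a sketch using the PdfPig library (my pick, any extractor works):

using System.Text;
using UglyToad.PdfPig;

static string ExtractText(string path)
{
    var sb = new StringBuilder();
    using var pdf = PdfDocument.Open(path);
    foreach (var page in pdf.GetPages())
        sb.AppendLine(page.Text); // naive extraction: loses layout, fine for a first pass
    return sb.ToString();
}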
And /ask is barely longer:
app.MapPost("/ask", async (AskRequest req) =>
{
    var qVec = await embeddings.GenerateEmbeddingAsync(req.Question);
    var results = collection.SearchAsync(qVec, top: 5); // IAsyncEnumerable of scored hits

    // ToListAsync over IAsyncEnumerable comes from the System.Linq.Async package.
    var context = string.Join("\n---\n",
        await results.Select(r => r.Record.Text).ToListAsync());

    var prompt = $"Use only the context to answer.\n\nContext:\n{context}\n\nQ: {req.Question}";
    var answer = await kernel.InvokePromptAsync(prompt);
    return Results.Ok(new { answer = answer.ToString() });
});
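AskRequest and app aren't shown above; the scaffolding is the usual Minimal API boilerplate, sketched here with obvious names of my choosing:

var app = WebApplication.CreateBuilder(args).Build();

// ... kernel and collection wiring, then the MapPost("/ask", ...) from above ...

app.Run();

record AskRequest(string Question);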
That's a working RAG pipeline. It will demo beautifully. It will also hide three problems that bite in the second week:
Cold-start latency. First call after a quiet hour is slow because the embedding model and the chat model both warm up separately. Either keep them warm with a ping job or accept it and tell your frontend.
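A minimal sketch of that ping job as a hosted service, reusing the embeddings service from above (the five-minute interval and the probe text are arbitrary, and the same trick applies to the chat model):

using Microsoft.Extensions.Hosting;
using Microsoft.SemanticKernel.Embeddings;

public class KeepWarmService(ITextEmbeddingGenerationService embeddings) : BackgroundService
{
    protected override async Task ExecuteAsync(CancellationToken ct)
    {
        while (!ct.IsCancellationRequested)
        {
            // A throwaway embedding keeps the deployment warm.
            try { await embeddings.GenerateEmbeddingAsync("ping", cancellationToken: ct); }
            catch { /* a transient failure shouldn't kill the loop */ }
            await Task.Delay(TimeSpan.FromMinutes(5), ct);
        }
    }
}

Register it with builder.Services.AddHostedService<KeepWarmService>() (which assumes the embeddings service is in the container) and the first real request stops paying the warm-up tax.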
Embedding versioning. The day you switch from text-embedding-3-small to a bigger model, every existing vector becomes garbage. Tag chunks with the embedding model name (the EmbeddingModel field in the DocChunk sketch above) and rebuild lazily. Don't find out in production.
Chunk overlap tuning. 500 tokens with zero overlap will cut sentences in half. Start with 500/80 (size/overlap) and adjust by doc type: policy-document PDFs want more overlap than chat transcripts. Getting overlap out of Semantic Kernel means switching to the paragraph splitter, as in the sketch below.
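In Semantic Kernel's TextChunker the overlap knob lives on the paragraph splitter, which takes lines as input, so the 500/80 starting point is a two-step dance:

using Microsoft.SemanticKernel.Text;

// Split to short lines first, then regroup into ~500-token chunks with 80 tokens of overlap.
var lines = TextChunker.SplitPlainTextLines(text, maxTokensPerLine: 100);
var chunks = TextChunker.SplitPlainTextParagraphs(lines, maxTokensPerParagraph: 500, overlapTokens: 80);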
If you only ship one thing from this post: store the embedding model name on every record. The first migration is painful enough without also having to guess what shape your vectors were.