
Evaluating RAG Systems — RAGAS, Faithfulness, and Setting Up an Eval Harness in .NET

"Looks right to me" is not evaluation. Four metrics that catch regressions before users do, and how to wire them into your test suite.

If your RAG system has no eval suite, every change you make is a guess. The standard "asked it three questions, looked good" approach is how silent regressions ship. RAGAS gives you the metrics; the rest is wiring them into a test harness you actually run.

The four metrics that matter:

  • Faithfulness: does the answer only use facts from the retrieved context? Catches hallucinations.
  • Answer relevancy: does the answer actually address the question? Catches the "verbose but off-topic" failure.
  • Context precision: did the retriever rank the right chunks at the top? Catches reranker regressions.
  • Context recall: did the retriever surface the needed information at all? Catches retrieval gaps.
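In code, these reduce to four ratios between 0 and 1. A minimal record to fix the shape the rest of this post assumes (my own sketch, not a type from RAGAS or any library):

public record EvalScores(
    double Faithfulness,      // supported claims / total claims in the answer
    double AnswerRelevancy,   // does the answer actually address the question
    double ContextPrecision,  // right chunks ranked at the top
    double ContextRecall);    // needed facts present somewhere in the context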

You don't need a separate eval library to compute these in .NET. The "LLM as a judge" pattern works well — use a strong model (GPT-4 class) to score outputs against a reference. The wiring in xUnit:

using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Xunit;

public class RagEvalTests
{
    static readonly EvalCase[] Cases =
    {
        new("What's the refund window for digital goods?", "14 days from purchase"),
        new("Who approves expenses over $5000?", "Director-level approver"),
        // ... 30-50 golden questions
    };

    // [MemberData] needs a public static member that yields object[] rows.
    public static IEnumerable<object[]> AllCases => Cases.Select(c => new object[] { c });

    [Theory]
    [MemberData(nameof(AllCases))]
    public async Task RagPipeline_MeetsQualityBar(EvalCase c)
    {
        var (answer, context) = await Pipeline.RunAsync(c.Question);
        var scores = await Judge.ScoreAsync(c.Question, c.Expected, answer, context);

        Assert.True(scores.Faithfulness >= 0.85, $"Faithfulness {scores.Faithfulness} on '{c.Question}'");
        Assert.True(scores.ContextRecall >= 0.80, $"Recall {scores.ContextRecall} on '{c.Question}'");
    }
}
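EvalCase is just a question paired with a reference answer, and Pipeline and Judge stand in for your own retrieval pipeline and judge wrapper. A minimal sketch of the record:

public record EvalCase(string Question, string Expected)
{
    // xUnit shows this in test runner output, so keep it readable.
    public override string ToString() => Question;
}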

The judge prompt for faithfulness:

You are scoring a RAG answer. The user asked:
{question}
The retrieved context was:
{context}
The system answered:
{answer}

For each claim in the answer, decide if it is supported by the context.
Reply as JSON: {"supported": N, "total": N, "unsupported_claims": ["..."] }

Compute faithfulness as supported / total. The unsupported claims list is the actual debugging payload — that's where the hallucinations are.
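Here's a sketch of the faithfulness half of Judge.ScoreAsync. The completeAsync parameter is a stand-in for whatever call sends a prompt to your judge model and returns its text reply; the verdict record and prompt constant are my own names, and the parsing assumes the judge replies with bare JSON:

using System;
using System.Text.Json;
using System.Threading.Tasks;

public static class Judge
{
    // Shape of the judge's JSON reply; property names match the JSON keys exactly.
    private sealed record FaithfulnessVerdict(int supported, int total, string[] unsupported_claims);

    // The faithfulness prompt from above, checked into version control,
    // with {question}/{context}/{answer} placeholders.
    private const string FaithfulnessPrompt = """
        You are scoring a RAG answer. The user asked:
        {question}
        The retrieved context was:
        {context}
        The system answered:
        {answer}

        For each claim in the answer, decide if it is supported by the context.
        Reply as JSON: {"supported": N, "total": N, "unsupported_claims": ["..."]}
        """;

    public static async Task<(double Faithfulness, string[] UnsupportedClaims)> ScoreFaithfulnessAsync(
        string question, string context, string answer,
        Func<string, Task<string>> completeAsync)
    {
        var prompt = FaithfulnessPrompt
            .Replace("{question}", question)
            .Replace("{context}", context)
            .Replace("{answer}", answer);

        var reply = await completeAsync(prompt);
        var verdict = JsonSerializer.Deserialize<FaithfulnessVerdict>(reply)!;

        var score = verdict.total == 0 ? 1.0 : (double)verdict.supported / verdict.total;
        return (score, verdict.unsupported_claims);
    }
}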

How I'd integrate this:

  • Build a 30-50 question golden set from real user questions; don't generate them with an LLM.
  • Run the eval suite on every PR that touches the RAG pipeline. Cache scores so unchanged questions don't re-run.
  • Fail CI on regression, not on absolute thresholds. The right gate is "faithfulness dropped by more than 5% vs main" (sketched after this list); a fixed absolute bar has to be retuned every time the corpus changes.
  • Track scores over time. A simple CSV per commit, plotted, is enough. You'll catch the slow drift that a single threshold never sees.
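A sketch of that regression gate, assuming each eval run writes its scores to a CSV and the baseline CSV from main is available to the CI job (the file layout and the RegressionGate name are made up for illustration):

using System.Globalization;
using System.IO;
using System.Linq;

static class RegressionGate
{
    // Assumed CSV layout: faithfulness,context_recall,question (header row first).
    // Score columns come first so commas inside questions don't break the naive split.
    static double MeanFaithfulness(string csvPath) =>
        File.ReadLines(csvPath)
            .Skip(1)
            .Select(line => double.Parse(line.Split(',')[0], CultureInfo.InvariantCulture))
            .Average();

    // True if mean faithfulness dropped more than `tolerance` relative to main.
    public static bool Regressed(string mainCsv, string prCsv, double tolerance = 0.05) =>
        MeanFaithfulness(prCsv) < MeanFaithfulness(mainCsv) * (1 - tolerance);
}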

The thing nobody tells you: the eval is the most expensive part of your RAG bill. A 50-question suite with full RAGAS metrics costs around $1-2 per run. That's negligible per PR, painful if you run it on every commit. Cache aggressively.
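Caching here just means keying the judge's verdict on everything that could change it: the question, the retrieved context, the answer, and the judge prompt itself. A minimal sketch using a content hash as the key (the JudgeCache name and one-file-per-verdict layout are my own):

using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

static class JudgeCache
{
    public static string KeyFor(string question, string context, string answer, string judgePrompt)
    {
        var bytes = Encoding.UTF8.GetBytes(string.Join("\n", question, context, answer, judgePrompt));
        return Convert.ToHexString(SHA256.HashData(bytes));
    }

    // Returns the cached judge reply, or null if this exact combination hasn't been scored yet.
    public static string? TryGet(string cacheDir, string key)
    {
        var path = Path.Combine(cacheDir, key + ".json");
        return File.Exists(path) ? File.ReadAllText(path) : null;
    }

    public static void Put(string cacheDir, string key, string judgeReplyJson)
    {
        Directory.CreateDirectory(cacheDir);
        File.WriteAllText(Path.Combine(cacheDir, key + ".json"), judgeReplyJson);
    }
}

If retrieval changes and different chunks come back, the key changes and the case re-runs; if nothing changed, the PR run costs nothing for that question.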

The pattern: golden set in version control, judge prompts in version control, scores tracked over commits. The first time the suite catches a regression before it ships to users, the eval pays for itself in one afternoon.