
Reranking in RAG — When Top-K Vector Search Isn't Enough

Vector search has a precision ceiling. A cross-encoder reranker breaks through it for the cost of one extra API call. Worth the 200ms more often than you'd guess.

Your top 5 retrieved chunks are not your top 5 relevant chunks. They're the closest in cosine distance, which is a different thing. Cross-encoder rerankers fix this by re-scoring the candidates with a model that sees the query and chunk together. The cost is one more API call. The benefit is the top of your list actually answering the question.
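The distinction in one sketch — `Embed` and `ScorePair` are hypothetical stand-ins for an embedding model and a cross-encoder, not real APIs:

```csharp
// Bi-encoder: query and chunk are embedded independently, so the chunk's
// vector is computed with zero knowledge of the question being asked.
static double Cosine(float[] a, float[] b)
{
    double dot = 0, na = 0, nb = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    return dot / (Math.Sqrt(na) * Math.Sqrt(nb));
}

// Retrieval score: similarity of two independently computed embeddings.
double biScore = Cosine(Embed(query), Embed(chunk));

// Cross-encoder score: the model reads query and chunk together, so every
// token of the chunk can attend to every token of the question.
double crossScore = ScorePair(query, chunk);
```

The joint scoring is why you can't use a cross-encoder at retrieval time over the whole corpus: it's one model call per candidate pair. That's also why it works so well as a second stage over a small candidate set.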

A real example. User asks "what's the refund window for digital goods?" My bi-encoder retriever returns:

  1. "Returns and refunds policy" (general)
  2. "Refund timelines for international orders"
  3. "Digital downloads: how to redeem"
  4. "Refund window for digital goods is 14 days from purchase" ← the actual answer
  5. "Subscription cancellation policy"

The correct chunk is 4th. The LLM will read 1-5, get distracted by the general refund policy chunk in position 1, and answer wrong about half the time.

Plug Cohere Rerank between retrieval and the LLM:

// SearchAsync streams candidates as IAsyncEnumerable; ToListAsync comes
// from System.Linq.Async. Pull more candidates than you'll keep (20 here)
// so the reranker has room to promote the right chunk.
var hits = collection.SearchAsync(qVec, top: 20);
var docs = await hits.Select(h => h.Record.Text).ToListAsync();

var cohere = new CohereClient(apiKey);
var reranked = await cohere.RerankAsync(new RerankRequest
{
    Model = "rerank-v3.5",
    Query = question,
    Documents = docs,
    TopN = 5
});

// Results come back sorted by relevance score; Index points back into docs.
var topChunks = reranked.Results.Select(r => docs[r.Index]);

The "Refund window for digital goods is 14 days" chunk now sits at position 1. The LLM gets the right context.
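From there it's just string assembly. A minimal sketch — the prompt wording is illustrative, not prescriptive:

```csharp
// Join the top reranked chunks into a context block, best match first.
var context = string.Join("\n\n---\n\n", topChunks);

var prompt =
    $"Answer the question using only the context below.\n\n" +
    $"Context:\n{context}\n\n" +
    $"Question: {question}";
```

Order matters here: models weight the start of the context more heavily, so putting the reranker's top pick first is part of the win.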

Cost trade-off is real but smaller than people expect: ~150-250ms added per query depending on the candidate count, and a per-1k-search fee. The recall and precision numbers I see consistently:

Setup                             Recall@5   Precision@5
Vector only, top 5                   71%        64%
Vector top 20 → rerank → top 5       92%        86%

That's not a small jump. It's the difference between "this RAG kinda works" and "this RAG ships."

Two caveats worth shouting about:

The reranker doesn't help if your retriever didn't surface the right chunk in the first place. Recall@20 has to be high before reranking can do anything. If recall@20 is 50%, the reranker rearranges deck chairs.
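So measure recall@20 before bolting on a reranker. A hedged sketch, assuming you have a small eval set mapping each test query to the id of the chunk that answers it — the dictionary shapes are my assumption, not a library API:

```csharp
// Recall@k: fraction of eval queries whose gold chunk id appears anywhere
// in the top-k retrieved ids.
static double RecallAtK(
    IReadOnlyDictionary<string, IReadOnlyList<string>> retrievedIds,
    IReadOnlyDictionary<string, string> goldId,
    int k)
{
    int hits = 0;
    foreach (var (query, ids) in retrievedIds)
        if (ids.Take(k).Contains(goldId[query]))
            hits++;
    return (double)hits / retrievedIds.Count;
}
```

If this number is low at k=20, fix your chunking or embedding model first. Reranking can only reorder what's already there.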

Open-source rerankers (BGE-reranker-v2 and friends) are catching up but still trail Cohere slightly on English. If you're privacy-constrained or cost-constrained, self-host one of those. Otherwise just use Cohere and move on. The ~200ms it adds is the lowest-effort precision win in RAG today.