Reranking in RAG — When Top-K Vector Search Isn't Enough
Vector search has a precision ceiling. A cross-encoder reranker breaks through it for the cost of one extra API call. Worth the 200ms more often than you'd guess.
Your top 5 retrieved chunks are not your top 5 relevant chunks. They're the closest in cosine distance, which is a different thing. Cross-encoder rerankers fix this by re-scoring the candidates with a model that sees the query and chunk together. The cost is one more API call. The benefit is that the top of your list actually answers the question.
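The distinction is easy to see in code. A bi-encoder embedded the query and chunk separately, so retrieval can only compare two vectors; a cross-encoder reads the raw (query, chunk) pair and scores relevance directly. A minimal sketch — `CrossEncoderScore`'s `model` delegate is a stand-in for whatever reranking model you actually call, not a real API:

```csharp
using System;

public static class ScoringSketch
{
    // Bi-encoder scoring: query and chunk were embedded separately,
    // so all retrieval can do is compare the two precomputed vectors.
    public static double Cosine(float[] a, float[] b)
    {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.Sqrt(na) * Math.Sqrt(nb));
    }

    // Cross-encoder scoring: the model attends across the query and the
    // chunk together and returns a relevance score for the pair.
    public static double CrossEncoderScore(
        string query, string chunk, Func<string, string, double> model)
        => model(query, chunk);
}
```

The bi-encoder's score is fixed once the chunk is embedded; the cross-encoder gets to re-read the chunk in light of this specific query, which is why it catches the "digital goods" detail that cosine distance blurs.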
A real example. User asks "what's the refund window for digital goods?" My bi-encoder retriever returns:
- "Returns and refunds policy" (general)
- "Refund timelines for international orders"
- "Digital downloads: how to redeem"
- "Refund window for digital goods is 14 days from purchase" ← the actual answer
- "Subscription cancellation policy"
The correct chunk is 4th. The LLM will read 1-5, get distracted by the general refund policy chunk in position 1, and answer wrong about half the time.
Plug Cohere Rerank between retrieval and the LLM:
```csharp
// Over-retrieve: pull 20 candidates instead of 5 so the reranker
// has something to work with.
var hits = await collection.SearchAsync(qVec, top: 20);
var docs = await hits.Select(h => h.Record.Text).ToListAsync();

// Re-score all 20 with the cross-encoder; keep the best 5.
var cohere = new CohereClient(apiKey);
var reranked = await cohere.RerankAsync(new RerankRequest
{
    Model = "rerank-v3.5",
    Query = question,
    Documents = docs,
    TopN = 5
});

// Results come back sorted by relevance; Index points back into docs.
var topChunks = reranked.Results.Select(r => docs[r.Index]);
```
The "Refund window for digital goods is 14 days" chunk now sits at position 1. The LLM gets the right context.
The cost trade-off is real but smaller than people expect: ~150-250 ms of added latency per query depending on the candidate count, plus a per-1k-searches fee. The recall and precision numbers I see consistently:
| Setup | Recall@5 | Precision@5 |
|---|---|---|
| Vector only, top 5 | 71% | 64% |
| Vector top 20 → rerank → top 5 | 92% | 86% |
That's not a small jump. It's the difference between "this RAG kinda works" and "this RAG ships."
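For reference, those metrics are computed the standard way: recall@k asks what fraction of the relevant chunks made it into the top k, precision@k asks what fraction of the top k is relevant. A quick sketch of the per-query math (the names are mine, not from any library):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RagMetrics
{
    // Recall@k: fraction of all relevant chunks that appear in the top k.
    public static double RecallAtK(IList<string> ranked, ISet<string> relevant, int k)
        => relevant.Count == 0
            ? 0.0
            : ranked.Take(k).Count(relevant.Contains) / (double)relevant.Count;

    // Precision@k: fraction of the top k that is actually relevant.
    public static double PrecisionAtK(IList<string> ranked, ISet<string> relevant, int k)
        => ranked.Take(k).Count(relevant.Contains) / (double)k;
}
```

In the refund example above, with one relevant chunk sitting at position 4, recall@5 is already 1.0 but precision@5 is 0.2 — the LLM's context is 80% distraction. Reranking raises precision by pushing that one chunk to the top and letting you pass fewer, better chunks.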
Two caveats worth shouting about:
The reranker doesn't help if your retriever didn't surface the right chunk in the first place. Recall@20 has to be high before reranking can do anything. If recall@20 is 50%, the reranker rearranges deck chairs.
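You can convince yourself of this cap with an oracle: even a reranker with perfect knowledge of relevance can only reorder the candidates it's handed, so reranked recall@5 can never exceed the retriever's recall@20. A toy illustration (the `OracleRerank` name and shape are hypothetical, for demonstration only):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RerankCeiling
{
    // An "oracle" reranker that knows exactly which chunks are relevant.
    // Even it cannot surface a chunk the retriever never returned.
    public static List<string> OracleRerank(
        IList<string> candidates, ISet<string> relevant, int topN)
        => candidates
            .OrderByDescending(c => relevant.Contains(c) ? 1 : 0)
            .Take(topN)
            .ToList();
}
```

If the gold chunk isn't among the 20 candidates, the oracle's top 5 can't contain it either. So before paying for a reranker, fix retrieval itself — chunking, hybrid search, query rewriting — until recall@20 is high.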
Alternative rerankers — the open-source BGE-reranker-v2 family, Voyage's rerank-2 — are catching up but still slightly behind Cohere on English. If you're privacy-constrained or cost-constrained, self-host an open-source model. Otherwise just use Cohere and move on. The ~200ms it costs is the lowest-effort precision win in RAG today.