
The Hidden Cost of Re-Ranking: Benchmarking Cross-Encoders in Production RAG

Cross-encoder rerankers boost ranking quality on paper. In production, they doubled our p95 latency and the lift didn't show up in user metrics. The benchmark we wish we'd run first.

The retrieval team rolled out a cross-encoder reranker after their offline eval showed +12% NDCG@5. Two weeks later support filed a ticket: the assistant felt slow. p95 had crept from 1.4s up to 2.9s. The "free" quality win had quietly eaten a quarter of our latency budget.

This is the post I wish someone had handed me before I approved that deploy.

What a reranker actually does

A bi-encoder embeds query and document independently. That's what makes vector search fast. A cross-encoder takes the query and a candidate document together and scores them as a pair. The pair-aware scoring is genuinely more accurate. The catch is that you cannot precompute it. Every candidate is a fresh forward pass at request time.
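
Concretely, here's the difference in code. This is a minimal sketch with sentence-transformers; the bi-encoder model name is just an illustrative choice, and the cross-encoder is the same one used in the harness below.

from sentence_transformers import SentenceTransformer, CrossEncoder
from sentence_transformers.util import cos_sim

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
cross_encoder = CrossEncoder("BAAI/bge-reranker-base", max_length=512)

query = "how do I rotate an API key?"
docs = ["Rotating keys in the dashboard...", "Billing FAQ...", "SSO setup guide..."]

# Bi-encoder: document embeddings can be computed offline and indexed;
# only the query is embedded at request time, then it's cheap dot products.
doc_vecs = bi_encoder.encode(docs)
query_vec = bi_encoder.encode(query)
similarities = cos_sim(query_vec, doc_vecs)

# Cross-encoder: every (query, document) pair is a fresh forward pass,
# so none of this work can happen ahead of the request.
pair_scores = cross_encoder.predict([(query, d) for d in docs])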

If retrieval hands you 50 candidates, the cross-encoder runs the model 50 times per query. That's the hidden cost no one's blog post leads with.

A realistic benchmark harness

import time, statistics
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-base", max_length=512)

def benchmark(queries, candidates_per_query):
    latencies_ms, lifts = [], []
    for q, cands in zip(queries, candidates_per_query):
        pairs = [(q, c["text"]) for c in cands]
        t0 = time.perf_counter()
        scores = reranker.predict(pairs, batch_size=32)
        latencies_ms.append((time.perf_counter() - t0) * 1000)  # ms per query

        # Compare rerank order vs original retrieval order on labels.
        # Sort by score alone; comparing on the candidate dicts would raise on ties.
        baseline_top1 = cands[0]["label"]
        reranked = sorted(zip(scores, cands), key=lambda sc: sc[0], reverse=True)
        reranked_top1 = reranked[0][1]["label"]
        lifts.append(int(reranked_top1) - int(baseline_top1))

    return {
        "p50_ms": statistics.median(latencies_ms),
        "p95_ms": statistics.quantiles(latencies_ms, n=20)[18],  # 95th percentile
        "lift_top1": statistics.mean(lifts),
    }
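
Calling it needs queries paired with labelled candidates in the retriever's original order. The shape below is an assumption for illustration, not a fixed schema:

queries = [
    "how do I rotate an API key?",
    "can I export my audit logs?",
]
candidates_per_query = [
    [
        {"text": "Billing FAQ ...", "label": 0},
        {"text": "Rotating keys in the dashboard ...", "label": 1},
    ],
    [
        {"text": "Exporting audit logs via the CLI ...", "label": 1},
        {"text": "SSO setup guide ...", "label": 0},
    ],
]

print(benchmark(queries, candidates_per_query))
# e.g. {'p50_ms': ..., 'p95_ms': ..., 'lift_top1': ...}

Run it with enough queries that the p95 means something; two are shown here only to keep the example short.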

Why this matters

Offline metrics flatter rerankers because they're computed on labelled pairs curated during eval prep. The production question is different: does the model also rerank correctly on the noisy candidates real traffic produces? The answer is usually a smaller lift than the eval claimed. The latency cost, by contrast, is deterministic: you pay it on every request, regardless of what the eval said.

When to use it

When retrieval recall@50 is high but precision@5 is the bottleneck. That is, the right answer is in your top 50 but rarely top 3. Also when the cost of a wrong top result is high. Medical, legal, anything where a user might act on the first citation without reading the rest.
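
A quick way to check whether you're in that regime, assuming the same labelled-candidate shape as the harness above:

def retrieval_diagnostic(candidates_per_query):
    # Is the right answer usually somewhere in the top 50 but rarely in the top 3?
    in_top50 = in_top3 = 0
    for cands in candidates_per_query:
        labels = [c["label"] for c in cands]
        in_top50 += int(any(labels[:50]))
        in_top3 += int(any(labels[:3]))
    n = len(candidates_per_query)
    return {"recall@50": in_top50 / n, "hit@3": in_top3 / n}

High recall@50 with low hit@3 is the profile where a reranker can earn its latency. If recall@50 is already low, fix retrieval first; no reranker can surface a document that was never retrieved.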

When not to

If your p95 budget is under 1.5s and you can't batch effectively, do not ship a reranker. Try query rewriting and hybrid search first. They're cheaper. Also skip it when your candidate set is already small (k=5 retrieval). The reranker has nothing to work with.
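
Hybrid search doesn't have to be elaborate. One common approach is reciprocal rank fusion over your BM25 and vector result lists; how you produce those lists depends on your stack, but the fusion itself is a few lines:

def reciprocal_rank_fusion(result_lists, k=60):
    # Merge ranked lists of doc ids, each ordered best-first.
    # k=60 is the conventional damping constant; it keeps top ranks from dominating.
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# fused = reciprocal_rank_fusion([bm25_ids, vector_ids])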

Where the latency lands

flowchart LR
    Q[Query] -->|2ms| EM[Embed query]
    EM -->|15ms| VS["Vector search<br/>k=50"]
    VS -->|600ms| RR[Cross-encoder<br/>rerank 50]
    RR -->|2ms| TOP[Top 5]
    TOP -->|800ms| LLM[LLM call]
    LLM -->|stream| U[User]

    style RR fill:#fee2e2,stroke:#dc2626

Conclusion

Before you flip the reranker on, run a two-day shadow benchmark. Log the rerank scores and the latency in production but keep serving the original order. Two days of that data tells you whether the lift is worth the budget. And you won't have to roll back in front of users, which is how I learned this in the first place.
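
A shadow run can be as small as the sketch below. The retrieve and log callables are placeholders for your own stack, and in a real service you'd run the rerank call off the request path (a background task, or on a sample of traffic) so users don't pay the latency you're measuring.

import time

def retrieve_with_shadow_rerank(query, retrieve, reranker, log):
    cands = retrieve(query)  # original order; this is what the user gets
    t0 = time.perf_counter()
    scores = reranker.predict([(query, c["text"]) for c in cands])
    log({
        "query": query,
        "rerank_ms": (time.perf_counter() - t0) * 1000,
        "rerank_scores": [float(s) for s in scores],
    })
    return cands  # serving order unchanged: no user-facing risk while you collect data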