
Hybrid Search in RAG: When Vector Similarity Alone Isn't Enough

Pure vector search misses exact-match queries. Product SKUs, error codes, function names. Hybrid search fixes that without giving up the semantic recall you actually like.

Your support bot answers "how do I reset my password" beautifully and then completely whiffs when a user types "what does error E_AUTH_4012 mean?" The retriever is doing its job. It's just doing the wrong job. Embeddings are bad at exact tokens.

I learned this the hard way when a customer demoed our beautiful semantic search by typing in a SKU. Crickets.

The mismatch

Embeddings smear meaning across a vector space. That's a feature when the user phrases a question differently to the docs. It's a bug when the user types something that only works if you find that exact string. Error codes, SKU numbers, function names, legal citations. Those are needles, and your embeddings keep handing you haystacks.

Think of it like asking a librarian for a book. Semantic search is "I want something about grief." Keyword search is "give me ISBN 978-0345804327." You need both, and you need them in the same answer.
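
To see the lexical half of that bargain in isolation, here is a minimal sketch with the rank_bm25 library. The three-document corpus is invented for illustration; the point is that BM25 only knows tokens, so the exact identifier scores high while a paraphrase with no shared tokens scores nothing.

from rank_bm25 import BM25Okapi

corpus = [
    "E_AUTH_4012 token signature validation failed",
    "How to reset your password from the login screen",
    "Billing FAQ invoices refunds and receipts",
]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

# The exact identifier lights up document 0 and nothing else.
print(bm25.get_scores("E_AUTH_4012".lower().split()))
# A paraphrase sharing no tokens with the corpus scores zero across the board.
print(bm25.get_scores("my sign-in keeps failing".lower().split()))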

Hybrid search in 30 lines

from rank_bm25 import BM25Okapi

class HybridRetriever:
    def __init__(self, docs, embedder, vector_index, alpha=0.5):
        self.docs = docs
        # Naive lower-cased whitespace tokenisation; swap in a real tokenizer for production.
        self.bm25 = BM25Okapi([d["text"].lower().split() for d in docs])
        self.embedder = embedder
        self.vector_index = vector_index
        self.alpha = alpha  # 0=pure BM25, 1=pure vector

    def search(self, query: str, k: int = 5):
        bm25_scores = self.bm25.get_scores(query.lower().split())
        bm25_norm = bm25_scores / (bm25_scores.max() + 1e-9)

        # Normalise the vector scores to 0-1 as well, so alpha compares like with like.
        q_vec = self.embedder.encode(query)
        vec_hits = self.vector_index.search(q_vec, k=len(self.docs))
        max_vec = max((h.score for h in vec_hits), default=0.0) + 1e-9
        vec_norm = {h.id: h.score / max_vec for h in vec_hits}

        merged = []
        for i, doc in enumerate(self.docs):
            score = self.alpha * vec_norm.get(doc["id"], 0) + \
                    (1 - self.alpha) * bm25_norm[i]
            merged.append((score, doc))
        # Sort on the score alone; tuples containing dicts aren't comparable.
        merged.sort(key=lambda pair: pair[0], reverse=True)
        return [doc for _, doc in merged[:k]]
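
Wiring it up looks something like this. A hypothetical sketch: SentenceTransformer and the model name are real, but Hit, ToyVectorIndex, and the docs are invented stand-ins for whatever you already run. The only contract HybridRetriever actually assumes is an encode() method plus a search() that returns hits with .id and .score attributes.

from dataclasses import dataclass

import numpy as np
from sentence_transformers import SentenceTransformer

@dataclass
class Hit:
    id: str
    score: float

class ToyVectorIndex:
    """Brute-force in-memory stand-in for FAISS/pgvector/your real index."""
    def __init__(self, docs, embedder):
        self.ids = [d["id"] for d in docs]
        # Unit-normalised rows make the dot product below a cosine similarity.
        self.vecs = embedder.encode([d["text"] for d in docs], normalize_embeddings=True)

    def search(self, q_vec, k):
        q = q_vec / (np.linalg.norm(q_vec) + 1e-9)
        sims = self.vecs @ q
        order = np.argsort(-sims)[:k]
        return [Hit(self.ids[i], float(sims[i])) for i in order]

docs = [
    {"id": "doc-1", "text": "E_AUTH_4012 token signature validation failed"},
    {"id": "doc-2", "text": "How to reset your password from the login screen"},
]
embedder = SentenceTransformer("all-MiniLM-L6-v2")
retriever = HybridRetriever(docs, embedder, ToyVectorIndex(docs, embedder), alpha=0.5)

for doc in retriever.search("what does error E_AUTH_4012 mean?", k=2):
    print(doc["id"], doc["text"])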

Why it works

BM25 rewards exact token matches and rare terms. Exactly what error codes and identifiers are. The vector side rewards meaning. Normalising both to a 0-1 range lets alpha express one simple idea: how much do I trust paraphrase versus exact match in this domain? That single knob is what tuning hybrid search looks like in real life.
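
To make the knob concrete, here is the merge arithmetic on made-up normalised scores for a single document:

alpha = 0.5
vec_score, bm25_score = 0.9, 0.1  # strong paraphrase match, weak token overlap
print(round(alpha * vec_score + (1 - alpha) * bm25_score, 2))  # 0.5

# In an identifier-heavy domain you'd trust exact match more:
alpha = 0.2
print(round(alpha * vec_score + (1 - alpha) * bm25_score, 2))  # 0.26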

When to use it

Any domain with mixed query types. Product catalogues, API docs, support tickets, legal corpora. If your eval set has queries that look like both questions and identifiers, you need hybrid. Start with alpha=0.5 and tune from there. The numbers usually settle after a couple of afternoons with real traffic.
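
The tuning loop itself is small. A sketch under two assumptions: an eval_set of (query, relevant doc id) pairs, and the retriever wired up earlier. Conveniently, alpha=0.0 and alpha=1.0 double as your pure-BM25 and pure-vector baselines.

def recall_at_k(retriever, eval_set, k=5):
    # Fraction of queries where the relevant doc appears in the top k.
    hits = sum(
        any(doc["id"] == relevant_id for doc in retriever.search(query, k=k))
        for query, relevant_id in eval_set
    )
    return hits / len(eval_set)

eval_set = [
    ("what does error E_AUTH_4012 mean?", "doc-1"),  # identifier-style
    ("how do I reset my password", "doc-2"),         # question-style
]
for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
    retriever.alpha = alpha
    print(f"alpha={alpha:.2f}  recall@5={recall_at_k(retriever, eval_set):.2f}")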

When not to

Pure prose corpora where users never type identifiers. Blogs, journalism archives, recipe sites. The BM25 side adds operational cost for no real lift. Also skip it if your corpus is under about 10k documents. At that size, the LLM can re-rank a smaller candidate set itself and you don't need the extra moving parts.

Hybrid flow

flowchart TD
    Q[User query] --> B[BM25 search]
    Q --> V[Vector search]
    B --> N1[Normalize scores]
    V --> N2[Normalize scores]
    N1 --> M["Weighted merge<br/>alpha · vec + 1-alpha · bm25"]
    N2 --> M
    M --> R[Top-k results]
    R --> L[LLM context]

Conclusion

Build a 50-query eval set today. Half phrased like real questions, half containing exact identifiers from your corpus. Run pure-vector, pure-BM25, and hybrid against it. The right alpha falls out of the data inside an afternoon. That beats the week I spent reading benchmark papers so you wouldn't have to.