Embeddings Are Not Created Equal — Choosing the Right Model for Your RAG Domain
A legal-tech RAG needs different embeddings to a customer-support one. The MTEB leaderboard hides this. Three things to check before you commit.
I keep meeting teams who picked their embedding model the same way they pick a coffee order: whatever was at the top of a list. The list in question is usually MTEB, and the leaderboard hides the only question that matters: how does this model perform on your corpus?
The realistic options today, roughly in the order I'd consider them:
- OpenAI text-embedding-3-small (1536d): cheap, fast, good general baseline. Boring choice. Often the right one.
- OpenAI text-embedding-3-large (3072d): better on long text and technical content. Costs more, and the indexes are bigger.
- Cohere embed-v3: strong on retrieval, especially for multilingual corpora.
- Voyage voyage-3: currently leads on English retrieval benchmarks. Worth testing if you're English-only and quality-sensitive.
- BGE bge-large-en-v1.5 and Nomic nomic-embed-text-v1.5: open-source, good, self-hostable. Worth it if you're privacy-constrained.
The three things to actually check:
Does it know your jargon? Embed 50 sample queries and 50 sample document chunks from your domain. Eyeball the top-5 nearest-neighbour pairs. If the model thinks "policy" and "Polish" are close because it's confused, you found the problem. Legal, medical, scientific, and code-heavy corpora consistently expose this. The fix is sometimes "use a domain-tuned model" and sometimes "fine-tune the embeddings on your corpus."
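Here's what that check looks like in practice. A minimal sketch with Semantic Kernel, assuming you've already resolved an ITextEmbeddingGenerationService (the `embedder` below) from your kernel; the loader helpers and the cosine function are mine, not SK APIs:

```csharp
// Sketch of the jargon check. "embedder" is an ITextEmbeddingGenerationService
// assumed to be resolved already; LoadSampleQueries, LoadSampleChunks, and
// Cosine are illustrative helpers, not library calls.
string[] queries = LoadSampleQueries(); // ~50 real user queries from your logs
string[] chunks  = LoadSampleChunks();  // ~50 real chunks from your corpus

IList<ReadOnlyMemory<float>> queryVecs = await embedder.GenerateEmbeddingsAsync(queries);
IList<ReadOnlyMemory<float>> chunkVecs = await embedder.GenerateEmbeddingsAsync(chunks);

for (int q = 0; q < queries.Length; q++)
{
    // Rank every chunk against this query and keep the five nearest.
    var top5 = Enumerable.Range(0, chunks.Length)
        .Select(i => (Chunk: chunks[i], Score: Cosine(queryVecs[q], chunkVecs[i])))
        .OrderByDescending(p => p.Score)
        .Take(5);

    Console.WriteLine($"QUERY: {queries[q]}");
    foreach (var (chunk, score) in top5)
        Console.WriteLine($"  {score:F3}  {chunk}");
}

static float Cosine(ReadOnlyMemory<float> a, ReadOnlyMemory<float> b)
{
    float dot = 0, na = 0, nb = 0;
    ReadOnlySpan<float> sa = a.Span, sb = b.Span;
    for (int i = 0; i < sa.Length; i++)
    {
        dot += sa[i] * sb[i];
        na  += sa[i] * sa[i];
        nb  += sb[i] * sb[i];
    }
    return dot / (MathF.Sqrt(na) * MathF.Sqrt(nb));
}
```

Twenty minutes of reading that output against your own corpus tells you more than any leaderboard position.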
What's the cost at 10M chunks? OpenAI 3-large at 3072d is 2x the storage of 3-small at 1536d. That's real money on your vector store. Not enough teams do this math before committing.
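To make that concrete, raw float32 vectors only; index structures, replicas, and metadata all come on top:

```csharp
// Back-of-envelope storage for raw float32 vectors at 10M chunks.
const long chunkCount = 10_000_000;
double smallGb = chunkCount * 1536L * sizeof(float) / 1e9; // ~61 GB  (3-small)
double largeGb = chunkCount * 3072L * sizeof(float) / 1e9; // ~123 GB (3-large)
Console.WriteLine($"3-small: {smallGb:F1} GB, 3-large: {largeGb:F1} GB");
```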
What's the embedding versioning story? The day you change models, every vector in your store is incompatible with new queries. Tag every chunk with the model name and version. Have a re-embedding job ready. The first migration is painful enough without also having to guess what shape your old vectors were.
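One workable shape, as a sketch; the record fields, the `store` object, and the version constants are assumptions about your pipeline, not any particular vector store's API:

```csharp
// Sketch: tag every chunk with its embedding lineage so a migration is a
// query, not archaeology. ChunkRecord, store, CurrentModel, and
// CurrentVersion are my stand-ins, not a vector-store API.
public record ChunkRecord(
    string Id,
    string Text,
    ReadOnlyMemory<float> Vector,
    string EmbeddingModel,     // e.g. "text-embedding-3-small"
    string EmbeddingVersion);  // bump on any model, dimension, or preprocessing change

// Re-embedding job: anything not on the current version gets refreshed.
foreach (var chunk in store.Where(c => c.EmbeddingVersion != CurrentVersion))
{
    ReadOnlyMemory<float> vec = await embedder.GenerateEmbeddingAsync(chunk.Text);
    await store.UpsertAsync(chunk with
    {
        Vector = vec,
        EmbeddingModel = CurrentModel,
        EmbeddingVersion = CurrentVersion,
    });
}
```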
Swapping providers behind Semantic Kernel is mostly painless:
```csharp
// Azure OpenAI: the first argument is the deployment name.
builder.AddAzureOpenAITextEmbeddingGeneration("text-embedding-3-small", endpoint, key);

// vs a local model via Ollama (the endpoint overload takes a Uri, not a raw string):
builder.AddOllamaTextEmbeddingGeneration("nomic-embed-text", new Uri("http://localhost:11434"));
```
ITextEmbeddingGenerationService is the abstraction. Pick the implementation by config, not by code, and you save a lot of pain later.
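What "by config" means in practice, sketched below; the Embeddings:* config keys are my naming, not a Semantic Kernel convention, and `config` is assumed to be an IConfiguration:

```csharp
// Sketch of config-driven provider selection. Only the two Add* calls are
// real connector methods; the config keys are invented for this example.
#pragma warning disable SKEXP0010, SKEXP0070 // embedding connectors are experimental
var builder = Kernel.CreateBuilder();
switch (config["Embeddings:Provider"])
{
    case "azure-openai":
        builder.AddAzureOpenAITextEmbeddingGeneration(
            config["Embeddings:Deployment"]!,
            config["Embeddings:Endpoint"]!,
            config["Embeddings:ApiKey"]!);
        break;
    case "ollama":
        builder.AddOllamaTextEmbeddingGeneration(
            config["Embeddings:Model"]!,
            new Uri(config["Embeddings:Endpoint"]!));
        break;
}
// Everything downstream depends only on the abstraction:
var embedder = builder.Build()
    .GetRequiredService<ITextEmbeddingGenerationService>();
```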
My current default: start with OpenAI text-embedding-3-small, evaluate against your real corpus with a 50-question test set, and only move if the eval says you should. Don't switch because a benchmark moved. Switch because your eval moved.
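The eval itself doesn't need to be clever. A sketch, assuming you've hand-labelled which chunk should answer each question; `testSet` and `Search` stand in for your own retrieval path:

```csharp
// Sketch of the 50-question eval as recall@5. testSet pairs each question
// with the chunk id that should answer it; Search is your retrieval path.
int hits = 0;
foreach (var (question, goldChunkId) in testSet)
{
    var top5 = await Search(question, k: 5);
    if (top5.Any(r => r.ChunkId == goldChunkId)) hits++;
}
Console.WriteLine($"recall@5 = {(double)hits / testSet.Count:P1}");
// Re-run this per candidate model; switch only when this number moves.
```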