RAG Caching Strategies — Semantic Caching, Embedding Reuse, and the Cost Math
Three layers of caching that cut 60-80% of your LLM bill in a busy RAG system. Plus the one cache that will silently break your UX.
A typical RAG query hits the embedding model, the vector store, optionally a reranker, and then the LLM. That's three to four external calls per request. At any real volume, caching them properly is the difference between an affordable RAG system and a quarterly bill that triggers a finance meeting.
Three caching layers, in the order I'd build them.
1. Embedding cache. Never re-embed the same chunk twice. Hash the chunk text, store the vector keyed by hash. New chunk arrives during ingest, check the hash first. Saves you on re-ingestion runs.
```csharp
// Key the cached vector by a SHA-256 hash of the chunk text.
var hash = SHA256.HashData(Encoding.UTF8.GetBytes(text));
var key = $"emb:{Convert.ToHexString(hash)}";

// redis here is an IDistributedCache backed by Redis; vectors are stored as raw float bytes.
var cachedBytes = await redis.GetAsync(key);
if (cachedBytes is not null)
    return MemoryMarshal.Cast<byte, float>(cachedBytes).ToArray();

var vec = await embeddings.GenerateEmbeddingAsync(text);
await redis.SetAsync(key, MemoryMarshal.AsBytes(vec.Span).ToArray(),
    new DistributedCacheEntryOptions { AbsoluteExpirationRelativeToNow = TimeSpan.FromDays(30) });
return vec;
```
Cost saved: 100% on duplicate chunks. It isn't free in storage, but a float32 vector is only 6-12 KB (1,536 dimensions × 4 bytes ≈ 6 KB), which stays trivial even at scale.
2. Exact-match response cache. Hash the full request (question + relevant config), store the answer. Same question, return cached answer. Use .NET 9's HybridCache if you want local + Redis with one API.
var key = $"q:{HashRequest(req)}";
return await hybridCache.GetOrCreateAsync(key, async ct =>
await RunRagPipeline(req, ct),
new HybridCacheEntryOptions { Expiration = TimeSpan.FromMinutes(15) });
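HashRequest is your own helper, not part of HybridCache. A minimal sketch, assuming the request carries the question plus whatever settings change the answer (the RagRequest shape and its fields are illustrative):

```csharp
// Hypothetical request shape; include every field that changes the answer.
public sealed record RagRequest(string Question, string Model, int TopK, string PromptVersion);

static string HashRequest(RagRequest req)
{
    // Normalise the question so trivial whitespace/case differences still hit the cache.
    var canonical = $"{req.Question.Trim().ToLowerInvariant()}|{req.Model}|{req.TopK}|{req.PromptVersion}";
    return Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(canonical)));
}
```

HybridCache itself is registered once at startup with builder.Services.AddHybridCache(), and it will use a registered distributed cache (AddStackExchangeRedisCache) as the L2 behind the in-process one.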
This kills duplicate traffic — FAQs, popular searches, the same question asked across multiple users.
3. Semantic cache. This is the one that earns the most savings, and the one most likely to break UX if you're not careful. Embed the incoming question, search a cache of recent question-answer pairs by vector similarity. If similarity > 0.95, return the cached answer.
```csharp
// Embed the incoming question and fetch the single closest cached question.
var qVec = await embeddings.GenerateEmbeddingAsync(question);
var hits = await cacheCollection.SearchAsync(qVec, top: 1);

// Only reuse the answer when the match is near-exact.
if (hits.FirstOrDefault() is { Score: > 0.95 } hit)
    return hit.Record.CachedAnswer;
```
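The miss path also has to write back, or the cache never warms up. A minimal sketch under the same assumptions as the lookup above, reusing req and ct from the exact-match example; the CachedQuestion record and the UpsertAsync call are illustrative stand-ins for whatever your vector store client exposes:

```csharp
// Miss: run the full pipeline, then store the Q/A pair keyed by the question's embedding.
var answer = await RunRagPipeline(req, ct);

await cacheCollection.UpsertAsync(new CachedQuestion   // hypothetical record type
{
    Id = Guid.NewGuid().ToString(),
    QuestionEmbedding = qVec,      // the vector already computed for the lookup
    OriginalQuestion = question,   // kept so semantic hits can be audited later
    CachedAnswer = answer
});

return answer;
```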
The cost math, on a real B2B SaaS where I tracked this for a quarter:
| Layer | % of bill saved | Monthly $ saved (at 10k req/day) |
|---|---|---|
| No caching | 0% | $0 (baseline ~$2,800/mo) |
| Embedding cache only | 8% | $220 |
| + Exact-match response cache | 32% | $900 |
| + Semantic cache (0.95 threshold) | 71% | $2,000 |
That's a 70%+ bill reduction. Real money.
The UX trap with semantic caching: a threshold that's too lax returns "almost-right" answers to slightly different questions. 0.95 is conservative and usually safe. At 0.90 you'll silently merge "What's the refund policy?" with "What's the return policy?", which sound similar but are not the same question. Always log semantic cache hits with the original question pair so you can audit them. If you see a pair that shouldn't have hit, tighten the threshold.
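A sketch of what that audit log looks like, assuming a standard ILogger is in scope at the point of the hit; the OriginalQuestion field comes from the write path above:

```csharp
// Log both questions and the score on every semantic hit, so bad merges
// show up in the logs rather than in support tickets.
if (hits.FirstOrDefault() is { Score: > 0.95 } hit)
{
    logger.LogInformation(
        "Semantic cache hit: incoming={Incoming} cached={Cached} score={Score:F3}",
        question, hit.Record.OriginalQuestion, hit.Score);
    return hit.Record.CachedAnswer;
}
```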
What never to cache: anything personalised, anything time-sensitive ("what's the latest…"), anything where the user expects variety. Tag those requests cacheable=false at the call site. The decision belongs to the caller, not the cache.
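One way to keep that decision with the caller, assuming a Cacheable flag added to the request type sketched earlier:

```csharp
// The caller sets req.Cacheable; the pipeline entry point just respects it.
if (!req.Cacheable)
    return await RunRagPipeline(req, ct);   // personalised or time-sensitive: always fresh

return await hybridCache.GetOrCreateAsync($"q:{HashRequest(req)}",
    async token => await RunRagPipeline(req, token),
    new HybridCacheEntryOptions { Expiration = TimeSpan.FromMinutes(15) });
```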
Build embedding cache first (cheap, no UX risk). Add exact-match response cache second. Add semantic cache only with the audit log wired up.