Caching LLM Responses: When It Saves Money and When It Silently Breaks UX
Caching LLM calls is irresistible. Same input, same output, free. Except outputs aren't always supposed to be the same, and a stale cache hit looks like a broken product.
The cost dashboard had a great week. The support inbox did not. Users kept clicking "regenerate" because the assistant kept giving them the same answer to slightly different questions. The cache was working perfectly. That was the problem.
This is the post I should have written before I shipped that cache.
Two kinds of caching, and why people confuse them
People mean two different things when they say "cache the LLM." The first is exact-match response caching. Hash the prompt, store the result, return it on repeat. The second is prompt prefix caching, whether provider-side or via your own KV cache, where the model skips recomputing the front of the prompt but still generates fresh tokens.
The first saves money and can break UX. The second saves latency and is nearly always safe. Most teams reach for the first when they wanted the second.
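To make the distinction concrete, here is a minimal sketch of the prompt structure that prefix caching rewards: keep the large, unchanging part byte-identical at the front of every request and put the per-request content last. Whether the prefix actually gets reused depends on your provider's prompt-caching support; STATIC_PREFIX and build_messages are illustrative names, not any particular API.

# Sketch: structure the prompt so the expensive, static part is an identical
# prefix on every request. Provider-side prefix caching can then reuse the
# computation for that prefix, while the final user turn still gets a fresh
# generation.
STATIC_PREFIX = (
    "You are a support assistant.\n"
    "Policy, tone guidelines, and few-shot examples go here, unchanged per request.\n"
)

def build_messages(user_query: str) -> list[dict]:
    return [
        {"role": "system", "content": STATIC_PREFIX},  # identical across requests
        {"role": "user", "content": user_query},       # varies per request
    ]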
A pragmatic response cache
import hashlib, json

def cache_key(messages, model, **opts):
    # Hash every input that can change the output: messages, model, sampling params.
    payload = json.dumps({"m": messages, "model": model, **opts}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

async def cached_chat(messages, model, *, ttl_seconds=900, cacheable=True, **opts):
    # `llm`, `redis`, and `metrics` are module-level clients wired up elsewhere.
    if not cacheable:
        return await llm.chat(messages, model=model, **opts)
    key = cache_key(messages, model, **opts)
    hit = await redis.get(f"llm:{key}")
    if hit:
        metrics.incr("llm.cache.hit")
        return json.loads(hit)
    result = await llm.chat(messages, model=model, **opts)
    await redis.setex(f"llm:{key}", ttl_seconds, json.dumps(result))
    metrics.incr("llm.cache.miss")
    return result
Why it works when it does
The hash covers every input that could change the output. Messages, model, sampling params. A TTL bounds staleness. The cacheable flag lets the caller declare whether a response is safe to reuse, instead of the cache assuming it always is. That single boolean prevents about 80% of "why is the bot repeating itself" complaints. Maybe more.
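Concretely, the declaration lives at the call site. A minimal sketch, assuming the cached_chat wrapper above; the topic variable and model name are stand-ins:

async def handle_tweet_request(topic: str):
    # The caller knows this response has to feel fresh every time, so it opts out.
    return await cached_chat(
        [{"role": "user", "content": f"Write a tweet about {topic}"}],
        model="small-fast-model",   # illustrative model name
        cacheable=False,            # never serve a repeat, whatever it costs
    )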
When caching saves money
Deterministic, idempotent, factual responses. Extracting fields from a fixed document, classifying with a fixed taxonomy, summarising static content. Also any traffic where the same query repeats across many users. Autocomplete, popular search intents, common support questions.
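A field-extraction call against a fixed document is the ideal case. A sketch, again assuming the cached_chat wrapper above; the schema and model name are illustrative:

async def extract_invoice_fields(document_text: str):
    # Fixed document, fixed schema, temperature 0: the answer should not change,
    # so a long TTL is safe and every repeated call is pure savings.
    return await cached_chat(
        [
            {"role": "system", "content": "Extract invoice_number and total as JSON."},
            {"role": "user", "content": document_text},
        ],
        model="small-fast-model",   # illustrative model name
        temperature=0,              # deterministic output, forwarded into the cache key
        ttl_seconds=86400,          # static input tolerates a day-long TTL
    )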
When caching silently breaks UX
Anything where the user expects variety or recency. Creative generation ("write me a tweet"), personalised responses, anything time-dependent ("what's the latest..."), or conversational follow-ups where context bleeds across turns. Cache those and users will assume the product is broken. They won't tell you. They'll just leave.
Decision flow
flowchart TD
Q[Incoming LLM call] --> C{Cacheable?<br/>caller-declared}
C -- No --> L[Call LLM directly]
C -- Yes --> H{Cache hit<br/>within TTL?}
H -- Yes --> R[Return cached]
H -- No --> N[Call LLM] --> S[Store with TTL] --> R2[Return]
R --> Done[Response]
L --> Done
R2 --> Done

Conclusion
Add one log line to every cache hit: the original query, the cached response, and a flag marking it as served from cache. Spend ten minutes a week reading the log. The first time you spot a "wait, that should not have been cached" entry, you'll know exactly which call site to mark cacheable=False. Cheap insurance, takes five minutes to wire up.
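A minimal sketch of that log line, slotted into the cache-hit branch of cached_chat above; the logger name and format are illustrative:

import logging

logger = logging.getLogger("llm.cache")

# Inside the cache-hit branch of cached_chat, just before returning:
logger.info(
    "cache_hit key=%s query=%r response=%r",
    key,
    messages[-1]["content"],   # the user turn that triggered the hit
    hit,                       # the cached response being served again
)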