

Caching LLM Responses: When It Helps, When It Hurts, and How to Implement It

LLM calls are slow and expensive. Caching them is the obvious move. But caching the wrong responses breaks the user experience in ways that are subtle and hard to debug. Here's a practical guide to doing it right.

Anurag Verma

7 min read

The first time a user asks your AI chatbot “What’s your return policy?” the model takes 1.2 seconds and costs $0.003. The ten-thousandth time a user asks the same question, you’ve spent $30 and 3.3 hours of total user waiting on a response that hasn’t changed.

Caching solves this. But LLM caching has failure modes that traditional caching doesn’t — and they show up in places users notice.

The Two Types of LLM Caching

Exact Match Caching

The simplest approach: hash the prompt and conversation state, look up the hash in a cache store, return the stored response if found.

import hashlib
import json
import redis

redis_client = redis.Redis(host='localhost', port=6379, db=0)

def build_cache_key(model: str, messages: list) -> str:
    content = json.dumps({
        "model": model,
        "messages": messages
    }, sort_keys=True)
    return f"llm:{hashlib.sha256(content.encode()).hexdigest()}"

async def cached_completion(model: str, messages: list, ttl: int = 86400) -> dict:
    key = build_cache_key(model, messages)

    cached = redis_client.get(key)
    if cached:
        return json.loads(cached)

    # Cache miss: call the provider (call_llm is your own wrapper, not shown)
    # and store the result with a TTL.
    response = await call_llm(model, messages)

    redis_client.setex(key, ttl, json.dumps(response))
    return response

This works well for:

  • FAQ bots where the same questions repeat
  • Documentation assistants with stable content
  • Classification tasks where the same input recurs (email categorization, product tagging)
  • Batch processing pipelines that may re-encounter the same items

The hit rate for exact match caching depends entirely on how varied your inputs are. A support bot answering questions from a defined FAQ set might see 60-80% cache hits. A general-purpose chatbot answering unique user questions might see less than 5%.

Semantic Caching

Exact match caching misses the case where users ask the same question different ways: “What’s the return policy?” and “How do I return something?” and “Can I get a refund?” are semantically equivalent but produce different hashes.

Semantic caching solves this by embedding the user query into a vector, searching for similar cached queries, and returning the cached response if similarity exceeds a threshold.

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')

def embed_query(text: str) -> np.ndarray:
    # Unit-normalized embeddings let a plain dot product serve as cosine similarity below.
    return model.encode(text, normalize_embeddings=True)

def find_semantic_cache_hit(
    query: str,
    cache: list[dict],
    threshold: float = 0.92
) -> dict | None:
    query_embedding = embed_query(query)

    for entry in cache:
        # Cosine similarity, since both vectors are normalized.
        similarity = np.dot(query_embedding, entry["embedding"])
        if similarity >= threshold:
            return entry["response"]

    return None

In production, you’d store embeddings in a vector database (Qdrant, Weaviate, pgvector) rather than in memory.
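The lookup translates directly to the database. A minimal sketch with pgvector, assuming an llm_cache table with embedding vector(384) and response columns (the table and column names are illustrative, and embed_query is the helper above):

import psycopg2
from pgvector.psycopg2 import register_vector

conn = psycopg2.connect("dbname=app")
register_vector(conn)

def find_semantic_cache_hit_pg(query: str, threshold: float = 0.92) -> str | None:
    query_embedding = embed_query(query)  # helper from the snippet above
    with conn.cursor() as cur:
        # <=> is pgvector's cosine-distance operator; similarity = 1 - distance.
        cur.execute(
            """
            SELECT response, 1 - (embedding <=> %s) AS similarity
            FROM llm_cache
            ORDER BY embedding <=> %s
            LIMIT 1
            """,
            (query_embedding, query_embedding),
        )
        row = cur.fetchone()
    if row and row[1] >= threshold:
        return row[0]
    return None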

The threshold matters significantly:

  • 0.95+: Very conservative. Only nearly-identical queries hit the cache.
  • 0.90-0.94: Most paraphrases match. Some false positives possible.
  • Below 0.90: High hit rate but responses may not actually answer the query asked.

Tune this threshold on real query data. Set it too low and users get responses that don’t quite match their question. Set it too high and you miss most of the potential savings.
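One way to run that tuning pass, as a sketch: take pairs of real queries labeled as "should share an answer" or not, and see what each candidate threshold would have done. The labeled_pairs data is something you assemble by hand from logs; embed_query is the helper above.

import numpy as np

def evaluate_thresholds(
    labeled_pairs: list[tuple[str, str, bool]],
    thresholds: tuple[float, ...] = (0.90, 0.92, 0.95),
) -> None:
    """labeled_pairs: (query_a, query_b, should_match) triples taken from real traffic."""
    for threshold in thresholds:
        true_hits = false_hits = missed = 0
        for query_a, query_b, should_match in labeled_pairs:
            similarity = float(np.dot(embed_query(query_a), embed_query(query_b)))
            if similarity >= threshold:
                if should_match:
                    true_hits += 1
                else:
                    false_hits += 1  # a user would have gotten a mismatched cached answer
            elif should_match:
                missed += 1  # savings left on the table
        print(f"threshold={threshold}: ok={true_hits} wrong={false_hits} missed={missed}")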

What Not to Cache

Knowing when not to cache is as important as knowing how.

User-specific context: Any query that implicitly depends on who’s asking should not be cached. “What’s the status of my order?” is dangerous to cache — the same string means different things for different users.

def should_cache(messages: list, user_context: dict | None) -> bool:
    last_message = messages[-1]["content"].lower()

    # Crude keyword heuristic: queries phrased around the user's own data
    # are almost certainly personalized and must never be shared across users.
    personal_indicators = [
        "my order", "my account", "my subscription",
        "i want", "i need", "my history"
    ]

    if any(indicator in last_message for indicator in personal_indicators):
        return False

    # An active session implies the model may be drawing on user-specific context.
    if user_context and user_context.get("has_active_session"):
        return False

    return True

Time-sensitive queries: Questions about current prices, availability, schedules, or news change constantly. A cached answer from two hours ago may be wrong.

Creative generation: If you’re generating unique content — marketing copy, code for a specific problem, tailored advice — caching defeats the purpose. Users expect unique responses.

Multi-turn conversations: Caching mid-conversation is particularly risky. The same user message means something different at turn 3 than it would as a standalone question. Cache the full conversation state or don’t cache mid-conversation at all.
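A conservative way to enforce that, building on build_cache_key from the first snippet: only produce a key when the request is a standalone question, and skip caching otherwise. A sketch:

def cache_key_for_turn(model: str, messages: list) -> str | None:
    """Return a cache key for standalone questions; None means 'do not cache this turn'."""
    has_assistant_turns = any(m["role"] == "assistant" for m in messages)
    user_turns = sum(1 for m in messages if m["role"] == "user")

    # Mid-conversation messages depend on everything said before them,
    # so either the whole history goes into the key or nothing gets cached.
    if has_assistant_turns or user_turns != 1:
        return None

    return build_cache_key(model, messages)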

TTL Strategy

How long to keep a cached response depends on how quickly the correct answer changes:

Content Type             Suggested TTL    Why
Return/refund policies   24 hours         Changes infrequently
Product documentation    6 hours          May update with releases
FAQ answers              12 hours         Reasonably stable
News/current events      Don’t cache      Changes constantly
Price information        Don’t cache      Changes frequently
Creative generation      Don’t cache      Should be unique
Code generation          1 hour           Language/library docs evolve

For any content where the underlying data can change (documentation pulled from an API, policy documents stored in a CMS), invalidate cache entries when the source data changes rather than relying solely on TTL:

def invalidate_topic_cache(topic_id: str):
    """Called when a documentation page is updated."""
    pattern = f"llm:topic:{topic_id}:*"
    # SCAN instead of KEYS: KEYS blocks Redis while it walks the whole keyspace.
    keys = list(redis_client.scan_iter(match=pattern))
    if keys:
        redis_client.delete(*keys)
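Note that pattern-based invalidation only works if the key carries the topic in the first place. The build_cache_key function from earlier produces opaque llm:<hash> keys, so a topic-scoped variant is needed when a response is tied to a source document. A sketch, where topic_id is whatever identifier your docs pipeline already uses:

def build_topic_cache_key(topic_id: str, model: str, messages: list) -> str:
    # Namespacing by topic lets invalidate_topic_cache() match these keys with a pattern.
    content = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return f"llm:topic:{topic_id}:{hashlib.sha256(content.encode()).hexdigest()}"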

Provider-Side Prompt Caching

Separate from application-level caching, several LLM providers now offer prompt caching at the API level. When you send a request with a long system prompt, the provider caches the KV (key-value) activations for that prompt. Subsequent requests with the same prefix pay a lower price and get lower latency.

This is distinct from what we’ve been discussing: provider-side caching applies to part of the input (your static system prompt), while application-level caching applies to the full response.

To take advantage of provider-side caching, structure your prompts so static content comes first:

messages = [
    {
        "role": "system",
        "content": system_prompt  # Static: cached by provider after first call
    },
    # Dynamic: varies per request, after the cached prefix
    *conversation_history,
    {
        "role": "user", 
        "content": user_message
    }
]

If your system prompt is 2,000 tokens, provider-side caching typically cuts the input cost of that cached prefix by 50-90%, depending on the provider.
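How you opt in differs by provider: OpenAI applies prefix caching automatically once a prompt is long enough, while Anthropic expects an explicit marker on the block you want cached. A hedged sketch of the latter (check the current API docs before relying on the exact field names):

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # illustrative; use whatever model you actually run
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": system_prompt,  # the static prefix from above
            "cache_control": {"type": "ephemeral"},  # ask the API to cache this block
        }
    ],
    messages=[*conversation_history, {"role": "user", "content": user_message}],
)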

Monitoring Cache Performance

Track these to understand if your caching strategy is working:

class CacheMetrics:
    """Wraps a StatsD/Datadog-style `metrics` client (increment + histogram), assumed to exist globally."""

    def record_hit(self, cache_type: str, latency_saved_ms: float):
        metrics.increment(f"llm_cache.hit.{cache_type}")
        metrics.histogram("llm_cache.latency_saved_ms", latency_saved_ms)

    def record_miss(self, cache_type: str):
        metrics.increment(f"llm_cache.miss.{cache_type}")

    def record_invalidation(self, reason: str):
        metrics.increment(f"llm_cache.invalidation.{reason}")

Monitor:

  • Cache hit rate by feature and endpoint
  • Latency distribution for cache hits vs. misses
  • Cost per request before and after caching
  • Cache invalidation frequency (high rates suggest TTL is too aggressive or data changes faster than you thought)

If hit rate is below 20% on a feature that seems cacheable, check whether your cache key includes unnecessary uniqueness — user session IDs, timestamps, or request IDs that should be excluded.
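A common fix is to normalize the request before hashing so volatile per-request fields never reach the key. A sketch reusing hashlib and json from the first snippet (the field names stripped here are examples, not an exhaustive list):

VOLATILE_FIELDS = {"session_id", "request_id", "timestamp", "trace_id"}

def normalized_cache_key(model: str, messages: list, metadata: dict) -> str:
    # Drop fields that are unique per request; they would make every key a miss.
    stable_metadata = {k: v for k, v in metadata.items() if k not in VOLATILE_FIELDS}
    content = json.dumps(
        {"model": model, "messages": messages, "metadata": stable_metadata},
        sort_keys=True,
    )
    return f"llm:{hashlib.sha256(content.encode()).hexdigest()}"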

A Simple Starting Point

If you’re adding caching to an existing AI feature for the first time:

  1. Log every LLM request with the full prompt hash for one week. Count how often identical hashes appear. This tells you your theoretical exact-match cache hit rate (a sketch of the counting step follows this list).
  2. If it’s above 15%, add exact-match caching with a conservative TTL. Measure latency and cost impact.
  3. If it’s below 15% but the feature has natural clusters of similar questions, prototype semantic caching with a high similarity threshold (0.93+) on a sample of real queries. If the precision looks good, ship it.
  4. For any cached response, add a cache hit indicator to your logs so you can separately analyze quality of cached vs. live responses.
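For step 1, the counting can be a few lines over the logged hashes. A minimal sketch, assuming each log record carries the hash produced by build_cache_key:

from collections import Counter

def theoretical_hit_rate(prompt_hashes: list[str]) -> float:
    """Fraction of requests that an unbounded, never-expiring exact-match cache would have served."""
    total = len(prompt_hashes)
    if not total:
        return 0.0
    # Every occurrence after the first of a given hash would have been a hit.
    unique = len(Counter(prompt_hashes))
    return (total - unique) / total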

Point 4 matters more than it sounds. A wrong cached response is harder to catch than a wrong live response — it recurs on every cache hit until TTL expires. Knowing which responses came from cache helps you investigate when users report issues.

Caching LLM responses is worth doing in most production AI applications. The savings in latency and cost are real. The failure modes are avoidable if you think carefully about what’s actually repeating and what’s actually changing.
