Prompt Caching: Cutting LLM Costs and Latency by Reusing Computation

Your RAG application sends a 2000-token system prompt, 1500 tokens of few-shot examples, and 3000 tokens of retrieved context with every single request. The system prompt and examples are identical across all requests. You are paying to process 3500 tokens of static content on every one of your 100,000 daily requests. That is 350 million tokens of redundant computation per day. At $3 per million input tokens, you are burning $1,050/day processing the exact same prefix over and over.

Prompt caching fixes this by computing the KV cache for your static prefix once and reusing it across requests. The first request pays full price (plus a write surcharge on some providers). Every subsequent request that arrives with the same prefix while the cache entry is still alive skips the expensive prefill computation for that prefix and jumps straight to processing the unique portion. On a provider that bills cache reads at 10% of the input price, your 3500-token static prefix goes from costing $3/M tokens to $0.30/M tokens - a 90% reduction.
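The arithmetic above can be checked in a few lines. This is a rough sketch, not a billing model: the function name is made up for illustration, and it assumes every request is billed at a single rate (all full price, or all at the cache-read price), ignoring cache misses and write surcharges.

```python
def daily_prefix_cost(requests_per_day: int, static_tokens: int,
                      price_per_million: float) -> float:
    """Dollars spent per day on the static prefix at a given per-token price."""
    total_tokens = requests_per_day * static_tokens
    return total_tokens / 1_000_000 * price_per_million


# Figures from the example: 100,000 requests/day, 3,500 static tokens each.
uncached = daily_prefix_cost(100_000, 3_500, 3.00)  # regular input price
cached = daily_prefix_cost(100_000, 3_500, 0.30)    # assumed cache-read price

print(f"uncached: ${uncached:,.2f}/day, cached: ${cached:,.2f}/day")
```

The gap between the two numbers is the upper bound on savings; real savings are somewhat lower because some requests miss the cache.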

This is not application-level response caching (storing final answers). This is infrastructure-level computation caching - the model still generates fresh, unique responses for each request, but it does not redundantly process the parts of the prompt that have not changed.

What prompt caching actually is

When an LLM processes a prompt, it computes Key and Value vectors for every token in every attention layer. This is the “prefill” phase - it is computationally expensive and largely determines time-to-first-token (TTFT). For a 128-layer model with 96 attention heads, each input token produces one Key and one Value vector per head per layer, which adds up to millions of floating-point values per token stored in the KV cache.

Prompt caching stores this computed KV cache so that when the same token prefix appears in a subsequent request, the computation is skipped. The model loads the pre-computed KV cache and only computes new KV values for the tokens that differ.
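To get a feel for how much data the KV cache holds, the sketch below counts cached values per token. It assumes standard multi-head attention with no grouped-query sharing, a head dimension of 128, and 2-byte (fp16) values. None of those figures come from a specific model; they are illustrative assumptions, as are the function names.

```python
def kv_values_per_token(num_layers: int, num_kv_heads: int, head_dim: int) -> int:
    """One key vector and one value vector per head, per layer."""
    return 2 * num_layers * num_kv_heads * head_dim


def kv_cache_bytes(num_tokens: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Total KV cache size for a prefix of num_tokens tokens."""
    per_token = kv_values_per_token(num_layers, num_kv_heads, head_dim)
    return num_tokens * per_token * bytes_per_value


# Hypothetical configuration: 128 layers, 96 heads, head_dim=128 (assumed).
values = kv_values_per_token(128, 96, 128)
prefix_bytes = kv_cache_bytes(3_500, 128, 96, 128)

print(f"{values:,} values per token")
print(f"{prefix_bytes / 1e9:.1f} GB for a 3,500-token prefix")
```

Models that use grouped-query attention have far fewer KV heads than query heads, which shrinks these numbers considerably.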

graph TD
  subgraph nocache["Without Caching - Every Request"]
      N1["System Prompt (2000 tokens)
Compute KV cache $$"]
      N2["Few-shot Examples (1500 tokens)
Compute KV cache $$"]
      N3["Retrieved Context (3000 tokens)
Compute KV cache $$"]
      N4["User Query (100 tokens)
Compute KV cache"]
      N5["Generate Response"]
  end
  subgraph cache["With Caching - Subsequent Requests"]
      C1["System Prompt (2000 tokens)
Load from cache ✓"]
      C2["Few-shot Examples (1500 tokens)
Load from cache ✓"]
      C3["Retrieved Context (3000 tokens)
Compute KV cache $$"]
      C4["User Query (100 tokens)
Compute KV cache"]
      C5["Generate Response"]
  end

  style N1 fill:#FCEBEB,stroke:#A32D2D,color:#791F1F
  style N2 fill:#FCEBEB,stroke:#A32D2D,color:#791F1F
  style C1 fill:#E1F5EE,stroke:#0F6E56,color:#085041
  style C2 fill:#E1F5EE,stroke:#0F6E56,color:#085041
  style C3 fill:#FAEEDA,stroke:#854F0B,color:#633806
  style N3 fill:#FAEEDA,stroke:#854F0B,color:#633806

How it works technically

Prefix matching

The cache works on exact prefix matches. If your prompt starts with the same sequence of tokens as a previously processed prompt, the cached KV values for that prefix are reused. The match must be exact - even a single different token in the middle breaks the cache for all subsequent tokens.

Request 1: [System Prompt][Examples][Context A][Query 1]
            ^^^^^^^^^^^^^^^^^^^^^^^^^ cached after this request

Request 2: [System Prompt][Examples][Context B][Query 2]
            ^^^^^^^^^^^^^^^^^^^^^^^^^ cache hit! Skip computation
                                     ^^^^^^^^^^^^^^^^^ compute fresh
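The matching rule in the diagram can be sketched as a longest-common-prefix check over token IDs. The token values below are toy numbers chosen for illustration, and `shared_prefix_len` is a hypothetical helper, not part of any provider SDK.

```python
from typing import Sequence


def shared_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of leading tokens that are identical in both sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Toy token IDs standing in for each prompt section.
system, examples = [1, 2, 3], [4, 5]
request_1 = system + examples + [10, 11] + [20]  # Context A, Query 1
request_2 = system + examples + [12, 13] + [21]  # Context B, Query 2

reusable = shared_prefix_len(request_1, request_2)
print(reusable, "tokens reusable,", len(request_2) - reusable, "computed fresh")
```

Only the leading run counts: tokens that happen to match after the first difference are not reusable, because their KV values depend on everything before them.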

Cache key mechanism

The cache key is typically a hash of the token sequence. Providers implement this differently:

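Anthropic: Explicit, opt-in caching via cache_control breakpoints, with a minimum cacheable prefix of 1024 tokens on most models. By default, entries expire after ~5 minutes without a hit, and each hit refreshes the timer.
OpenAI: Automatic caching for prompts of 1024 tokens or more that share a prefix. Cache lifetime of ~5-10 minutes of inactivity.
Self-hosted (vLLM): Configurable automatic prefix caching that hashes fixed-size blocks of tokens to look up reusable KV blocks. (Radix-tree lookup is the approach SGLang uses.)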
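One way to build such keys is to hash fixed-size blocks of tokens and chain each block's hash to the one before it, so that a block's key depends on the entire prefix. The sketch below illustrates that idea in simplified form. It is not any provider's actual implementation, and the block size of 4 and the function names are arbitrary choices for the example.

```python
import hashlib
from typing import List, Sequence


def block_keys(tokens: Sequence[int], block_size: int = 4) -> List[str]:
    """Return one chained hash per full block of tokens."""
    keys: List[str] = []
    parent = ""  # hash of the previous block; empty for the first block
    for start in range(0, len(tokens) - block_size + 1, block_size):
        block = tokens[start:start + block_size]
        payload = parent + "|" + ",".join(str(t) for t in block)
        parent = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        keys.append(parent)
    return keys


def matching_blocks(a_keys: Sequence[str], b_keys: Sequence[str]) -> int:
    """Count leading blocks whose chained hashes agree."""
    count = 0
    for x, y in zip(a_keys, b_keys):
        if x != y:
            break
        count += 1
    return count


base = block_keys([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
late_change = block_keys([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 99, 12])
early_change = block_keys([0, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])

print(matching_blocks(base, late_change))   # first two blocks still match
print(matching_blocks(base, early_change))  # nothing matches
```

Because of the chaining, changing the first token changes every later key, even for blocks whose own tokens are unchanged. That is the hashing counterpart of the exact-prefix rule.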

Minimum prefix length

Most providers require a minimum prefix length (typically 1024-2048 tokens) before caching kicks in. Short prompts are not cached because the computation savings do not justify the memory overhead of storing the KV cache.

Cache eviction

KV caches are large: their size grows with sequence length × number of layers × number of KV heads × head dimension. Providers evict cached prefixes after inactivity periods (5-10 minutes typically). High-traffic applications keep their caches warm naturally. Low-traffic applications may need explicit keep-alive strategies.
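Inactivity-based eviction can be modeled with a small class that remembers when each prefix was last used. This is a toy model under stated assumptions: a 300-second lifetime, a timer that resets on every use, and an injectable clock so the behavior can be stepped through without waiting. The class and key names are hypothetical.

```python
import time
from typing import Callable, Dict


class InactivityCache:
    """Tracks which prefix keys are still warm under an inactivity TTL."""

    def __init__(self, ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_seconds
        self.clock = clock
        self._last_used: Dict[str, float] = {}

    def lookup(self, key: str) -> bool:
        """Return True on a hit. Any lookup (re)writes the entry and resets its timer."""
        now = self.clock()
        last = self._last_used.get(key)
        hit = last is not None and (now - last) <= self.ttl
        self._last_used[key] = now
        return hit


# Drive the cache with a fake clock: requests at t = 0s, 240s, 480s, 900s.
fake_now = [0.0]
cache = InactivityCache(ttl_seconds=300, clock=lambda: fake_now[0])
results = []
for t in (0, 240, 480, 900):
    fake_now[0] = t
    results.append(cache.lookup("system-prompt-v1"))

print(results)
```

Requests spaced 240 seconds apart keep the entry alive, while the 420-second gap before the last request lets it expire.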

Provider implementations

Anthropic Claude

Anthropic offers explicit cache control via cache_control markers:

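import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,  # required by the Messages API
    system=[
        {
            "type": "text",
            "text": "You are a technical support agent...",
            # Everything up to and including this block is the cacheable prefix
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[{"role": "user", "content": query}]  # query: the user's question
)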

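Pricing: cache reads are charged at 10% of the regular input token price. Cache writes cost 25% more than regular input tokens, and that premium is paid every time the prefix has to be written again (after expiry or a change), not only on the very first request.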

OpenAI

OpenAI implements automatic prefix caching - no code changes needed:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Same system prompt across requests = automatic cache hit
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": long_system_prompt},  # Cached automatically
        {"role": "user", "content": unique_query}
    ]
)

Pricing: for gpt-4o, cached tokens are charged at 50% of the regular input token price; the discount differs on other models. No extra charge for cache writes.

Self-hosted (vLLM)

# Enable prefix caching in vLLM
from vllm import LLM

llm = LLM(model="meta-llama/Meta-Llama-3-70B-Instruct", enable_prefix_caching=True)

vLLM hashes fixed-size blocks of tokens to match shared prefixes across requests. There is no per-token price difference (you own the hardware), but latency and throughput improve significantly when requests share prefixes.

graph LR
  subgraph pricing["Cost Comparison (per 1M tokens)"]
      P1["Regular Input
$3.00"]
      P2["Cache Write
$3.75 (first time)"]
      P3["Cache Read
$0.30 (subsequent)"]
  end
  subgraph savings["Savings Example"]
      S1["100K requests/day
3500 static tokens each
= 350M redundant tokens"]
      S2["Without cache: $1,050/day"]
      S3["With cache: $105/day"]
      S4["Saving: $945/day"]
  end

  style P1 fill:#FCEBEB,stroke:#A32D2D,color:#791F1F
  style P2 fill:#FAEEDA,stroke:#854F0B,color:#633806
  style P3 fill:#E1F5EE,stroke:#0F6E56,color:#085041
  style S4 fill:#E1F5EE,stroke:#0F6E56,color:#085041

Where prompt caching breaks or gets interesting

Ordering sensitivity

The cache only matches prefixes. If you put dynamic content (user name, timestamp) before your static content, nothing gets cached:

# BAD: Dynamic content first breaks cache
[Hello {user_name}!][System Prompt][Examples][Query]
 ^^^ different every time = no cache hit for anything after

# GOOD: Static content first enables cache
[System Prompt][Examples][Hello {user_name}!][Query]
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ cached prefix

Design your prompts with static content at the beginning and dynamic content at the end.
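The effect of ordering is easy to see by measuring how much of two users' prompts is identical from the start. The sketch below uses character counts as a stand-in for tokens (real caches match on tokens) and relies on the standard library's `os.path.commonprefix`, which works on arbitrary strings. The prompt texts and builder functions are invented for the example.

```python
import os

SYSTEM_PROMPT = "You are a support agent. Follow the policies below.\n"
EXAMPLES = "Q: How do I reset my password?\nA: Use the account settings page.\n"


def build_dynamic_first(user_name: str, query: str) -> str:
    """BAD: the greeting differs per user, so the shared prefix ends almost immediately."""
    return f"Hello {user_name}!\n" + SYSTEM_PROMPT + EXAMPLES + query


def build_static_first(user_name: str, query: str) -> str:
    """GOOD: static sections lead, per-user text trails."""
    return SYSTEM_PROMPT + EXAMPLES + f"Hello {user_name}!\n" + query


def shared_chars(prompt_a: str, prompt_b: str) -> int:
    """Length of the longest common leading substring."""
    return len(os.path.commonprefix([prompt_a, prompt_b]))


bad = shared_chars(build_dynamic_first("Ada", "Where is my order?"),
                   build_dynamic_first("Bob", "Cancel my plan."))
good = shared_chars(build_static_first("Ada", "Where is my order?"),
                    build_static_first("Bob", "Cancel my plan."))

print(f"dynamic first: {bad} shared chars, static first: {good} shared chars")
```

With the greeting first, the two prompts diverge right after "Hello ". With static content first, the entire system prompt and examples are shared.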

Cache thrashing

If you have many slightly different system prompt variants (A/B testing, per-customer configurations), each variant creates a separate cache entry. With 100 variants and limited cache memory, entries get evicted before they are reused. Consolidate prompt variants where possible.
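A small simulation makes the thrashing effect concrete. It assumes a least-recently-used cache that holds a fixed number of prefixes and round-robin traffic across variants. Both are simplifications: real providers do not publish their eviction policy, and real traffic is usually skewed toward a few variants. Round-robin is the worst case for LRU, which is why it makes a clear illustration.

```python
from collections import OrderedDict
from itertools import cycle, islice


def lru_hit_rate(num_variants: int, capacity: int, num_requests: int) -> float:
    """Hit rate of an LRU prefix cache under round-robin traffic."""
    cache: "OrderedDict[int, bool]" = OrderedDict()
    hits = 0
    for variant in islice(cycle(range(num_variants)), num_requests):
        if variant in cache:
            hits += 1
            cache.move_to_end(variant)      # mark as most recently used
        else:
            cache[variant] = True
            if len(cache) > capacity:
                cache.popitem(last=False)   # evict least recently used
    return hits / num_requests


many_variants = lru_hit_rate(num_variants=100, capacity=20, num_requests=10_000)
few_variants = lru_hit_rate(num_variants=10, capacity=20, num_requests=10_000)

print(f"100 variants: {many_variants:.1%} hits, 10 variants: {few_variants:.1%} hits")
```

When the variants outnumber the cache slots, every entry is evicted before its next use. Once the variants fit, only the initial cold requests miss.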

Multi-turn conversations

In multi-turn chats, each new message changes the full prompt. But the prefix (system prompt + earlier turns) stays the same:

Turn 1: [System][User1]                     → cache system
Turn 2: [System][User1][Asst1][User2]       → cache hit on [System][User1]
Turn 3: [System][User1][Asst1][User2][Asst2][User3] → cache hit on longer prefix

Longer conversations have progressively longer cached prefixes, so each new turn pays the full input price only for the messages added since the previous turn.
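The per-turn accounting can be sketched as follows. The sketch assumes the entire previous prompt is served from cache on the next turn, which ignores minimum prefix lengths, block granularity, and expiry between turns. The token counts and the function name are made up for illustration.

```python
from typing import List, Tuple


def turn_breakdown(system_tokens: int, user_tokens: List[int],
                   assistant_tokens: List[int]) -> List[Tuple[int, int]]:
    """Return (cached_tokens, new_tokens) for each turn's request."""
    rows: List[Tuple[int, int]] = []
    cached = 0                  # nothing is cached before the first request
    prompt_len = system_tokens
    for i, user in enumerate(user_tokens):
        if i > 0:
            prompt_len += assistant_tokens[i - 1]  # previous reply joins the prompt
        prompt_len += user
        rows.append((cached, prompt_len - cached))
        cached = prompt_len     # the whole prompt becomes the next turn's prefix
    return rows


rows = turn_breakdown(system_tokens=2000,
                      user_tokens=[50, 60, 70],
                      assistant_tokens=[300, 400])
for turn, (cached, new) in enumerate(rows, start=1):
    print(f"turn {turn}: {cached} cached, {new} new")
```

After the first turn, the uncached portion is just the previous assistant reply plus the new user message, even though the total prompt keeps growing.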

RAG context positioning

Retrieved documents change with every query, so they and everything after them fall in the uncached portion of the prompt. If your prompt also has a fixed knowledge base section, place it before the retrieved documents so that it stays in the cacheable prefix:

# Partially cacheable
[System Prompt][Fixed KB Section][Retrieved Docs][Query]
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ cacheable if KB is stable

Cache warmup

The first request with a new prefix pays full price (plus a small write overhead on some providers). For latency-sensitive applications, send a “warmup” request at startup to populate the cache before real traffic arrives.

Real-world impact

Anthropic Claude - advertises up to 90% lower cost and up to 85% lower latency for long cached prompts; workloads with stable system prompts, such as code generation and document analysis pipelines, are the natural fit
OpenAI - advertises up to 80% lower latency for long prompts that hit the automatic cache
vLLM deployments - prefix caching increases throughput by 2-4x for multi-tenant serving (many users share the same system prompt)
Cursor IDE - benefits heavily from caching because every code completion request shares the same large system prompt and similar code context
Enterprise chatbots - fixed knowledge bases and compliance instructions in system prompts get cached across all user sessions

How to apply in practice

Audit your prompt structure. Identify which parts are static (same across all requests) and which are dynamic (change per request). Move all static content to the front.

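Measure your cache hit rate. Providers report cached token counts in the usage section of each response: Anthropic returns cache_creation_input_tokens and cache_read_input_tokens, and OpenAI returns cached_tokens under usage.prompt_tokens_details. As a rule of thumb, if your hit rate on a supposedly static prefix is below 80%, you have ordering or variation issues to fix.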
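A token-level hit rate can be computed from logged usage records. The sketch below assumes each record is a plain dict carrying the Anthropic-style usage fields `input_tokens`, `cache_creation_input_tokens`, and `cache_read_input_tokens`. The sample records are invented, and other providers use different field names, so the keys would need adapting.

```python
from typing import Iterable, Mapping


def cache_hit_rate(usages: Iterable[Mapping[str, int]]) -> float:
    """Share of all input tokens that were served from cache."""
    read = written = uncached = 0
    for usage in usages:
        read += usage.get("cache_read_input_tokens", 0)
        written += usage.get("cache_creation_input_tokens", 0)
        uncached += usage.get("input_tokens", 0)
    total = read + written + uncached
    return read / total if total else 0.0


# Invented log: one cold request that writes the cache, then two warm ones.
logged_usage = [
    {"input_tokens": 100, "cache_creation_input_tokens": 3500, "cache_read_input_tokens": 0},
    {"input_tokens": 100, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 3500},
    {"input_tokens": 100, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 3500},
]

rate = cache_hit_rate(logged_usage)
print(f"{rate:.1%} of input tokens came from cache")
```

Tracking this number over time is more useful than a single reading: a sudden drop usually means something dynamic has crept into the prefix.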

Consolidate prompt variants. If you have per-customer system prompts that differ by 5%, consider a single shared prompt with a small dynamic section rather than N completely separate prompts.

Budget the first-request cost. Cache writes cost more than regular tokens on some providers. For very low-traffic endpoints (< 1 request per 5 minutes), the cache will keep expiring and you pay the write premium repeatedly without getting read benefits.
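The break-even point follows from simple arithmetic. Assuming the pricing described earlier (writes at 1.25× the input price, reads at 0.10×) and that the first request in each cache lifetime writes while the rest read, the relative cost of the static prefix is:

```python
def cached_cost_ratio(requests_per_window: int,
                      write_multiplier: float = 1.25,
                      read_multiplier: float = 0.10) -> float:
    """Cost with caching divided by cost without, for one cache lifetime.

    Assumes the first request writes the cache and the rest read from it.
    """
    if requests_per_window < 1:
        raise ValueError("need at least one request per window")
    with_cache = write_multiplier + read_multiplier * (requests_per_window - 1)
    without_cache = float(requests_per_window)
    return with_cache / without_cache


for n in (1, 2, 10):
    print(f"{n} request(s) per window: {cached_cost_ratio(n):.3f}x the uncached cost")
```

A ratio above 1.0 means caching costs more than it saves. Under these assumed multipliers, a single request per window pays the premium for nothing, while two requests already come out ahead.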

Consider cache TTL in your architecture. If your cache expires every 5 minutes and you have bursty traffic (high volume for 2 minutes, then quiet for 10 minutes), you will miss the cache on every burst start. Implement keep-alive pings for critical caches during quiet periods.

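# Keep-alive to prevent cache eviction
import asyncio
import anthropic

client = anthropic.AsyncAnthropic()  # async client; reads ANTHROPIC_API_KEY from the environment

async def keep_cache_warm(system_prompt: str) -> None:
    while True:
        await asyncio.sleep(240)  # Every 4 minutes (before 5-min TTL)
        # Each ping is a cache hit that resets the TTL; it is still billed as a cache read
        await client.messages.create(
            model="claude-sonnet-4-20250514",
            system=[{"type": "text", "text": system_prompt, "cache_control": {"type": "ephemeral"}}],
            messages=[{"role": "user", "content": "ping"}],
            max_tokens=1,
        )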

FAQ

Q: Is prompt caching the same as caching the model’s response?

No. Response caching (memoization) stores the final answer and returns it for identical queries - you get zero freshness. Prompt caching stores intermediate computation (the KV cache for the prefix) but the model still generates a unique response for each request. The output is fresh; only the redundant input processing is skipped. Both are useful, but they solve different problems.

Q: What if my system prompt changes? Do I lose the cache?

Yes. Any change to the cached prefix invalidates the cache for that prefix. If you update your system prompt, all subsequent requests compute fresh KV values. This is why you should minimize system prompt changes in production - batch updates during low-traffic periods rather than making frequent small edits.

Q: Does prompt caching affect output quality?

No. The model produces mathematically identical outputs whether the KV cache was freshly computed or loaded from cache. Caching does not change the computation - it just avoids redoing computation that has already been done. The model does not know whether its KV cache came from a fresh prefill or a cache load.

Interview questions

Q: Your LLM application processes 50,000 requests/day with a 3000-token system prompt. Calculate the cost savings from prompt caching and identify any architectural changes needed.

Without caching: 50,000 × 3,000 = 150M input tokens/day just for the system prompt. At $3/M that is $450/day. With caching (90% hit rate, cached tokens at 10% of the input price): the 5,000 cold requests cost 15M tokens × $3/M = $45, and the 45,000 cached requests cost 135M tokens × $0.30/M = $40.50, for a total of $85.50/day. Saving: ~$365/day or about $133K/year. On a provider that charges a 25% cache-write premium, the cold portion rises to $56.25 and the saving drops to roughly $353/day. Architectural changes: ensure the system prompt is positioned first in every request (before any dynamic content), use consistent prompt templates without per-request variations, and implement cache warmup on deployment to avoid cold-start costs during traffic spikes.
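The calculation generalizes to any hit rate and pricing scheme. The sketch below is one way to parameterize it. The function and parameter names are illustrative, and the default multipliers (reads at 10%, no write premium) mirror the assumptions in the answer rather than any particular provider's price list.

```python
def daily_prompt_cost(requests: int, prefix_tokens: int, price_per_million: float,
                      hit_rate: float, read_multiplier: float = 0.10,
                      write_multiplier: float = 1.0) -> float:
    """Daily cost in dollars of the static prefix at a given cache hit rate."""
    million_tokens = requests * prefix_tokens / 1_000_000
    miss_cost = million_tokens * (1 - hit_rate) * price_per_million * write_multiplier
    hit_cost = million_tokens * hit_rate * price_per_million * read_multiplier
    return miss_cost + hit_cost


baseline = daily_prompt_cost(50_000, 3_000, 3.00, hit_rate=0.0)
with_cache = daily_prompt_cost(50_000, 3_000, 3.00, hit_rate=0.9)
with_write_premium = daily_prompt_cost(50_000, 3_000, 3.00, hit_rate=0.9,
                                       write_multiplier=1.25)

print(f"no cache: ${baseline:.2f}, cached: ${with_cache:.2f}, "
      f"cached with write premium: ${with_write_premium:.2f}")
```

Varying `hit_rate` shows how sensitive the saving is to cache misses, which is a useful follow-up point in an interview.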

Q: You are serving 100 different customers, each with slightly different system prompts (company name, product details, tone preferences). How do you maximize cache hit rates?

Restructure prompts into shared and customer-specific sections: [Large shared system prompt with common instructions][Small customer-specific section][User query]. The large shared section (90% of tokens) gets cached across all customers. Alternatively, templatize: keep a single shared system prompt free of customer-specific text, and move the per-customer values (filled from a config) into a separate message after the system prompt. If the provider supports it, use explicit cache breakpoints to cache the shared portion independently. Evaluate whether the per-customer differences truly need to be in the system prompt or can be in the user message.
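The shared-then-specific layout can be sketched as a helper that builds the system blocks for a request. It follows the block shape from the Anthropic example earlier in this article (a list of text blocks, with `cache_control` marking the end of the cacheable prefix). The helper name and the sample texts are hypothetical.

```python
from typing import Any, Dict, List

SHARED_INSTRUCTIONS = "Common support policies and tone guidelines for all customers."


def build_system_blocks(customer_section: str) -> List[Dict[str, Any]]:
    """Shared block first (with a cache breakpoint), customer block second."""
    return [
        {
            "type": "text",
            "text": SHARED_INSTRUCTIONS,
            # Breakpoint: the prefix up to here is identical for every customer.
            "cache_control": {"type": "ephemeral"},
        },
        {
            "type": "text",
            "text": customer_section,  # small, varies per customer, not cached
        },
    ]


acme = build_system_blocks("Company: Acme. Tone: formal.")
globex = build_system_blocks("Company: Globex. Tone: friendly.")

print(acme[0] == globex[0])  # shared block is identical across customers
print(acme[1] == globex[1])  # customer block differs
```

Because the breakpoint sits on the shared block, all customers can reuse the same cached prefix, and only the short customer section is processed at full price.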

Q: Compare prompt caching to fine-tuning as cost optimization strategies. When would you choose each?

Prompt caching: reduces per-request cost of processing long static prompts. Best when you have a large system prompt or many few-shot examples that repeat across requests. Maintains full flexibility - change the prompt instantly. Fine-tuning: encodes knowledge and behavior into model weights, eliminating the need for long prompts entirely. Best when you have stable, well-understood behavior that rarely changes and high volume (thousands of daily requests). Fine-tuning reduces both input tokens (shorter prompts needed) and output tokens (model learns concise response patterns). Choose caching when you need flexibility and fast iteration. Choose fine-tuning when behavior is stable and you need maximum cost/latency reduction at scale. Many production systems use both: fine-tuned model + cached residual system prompt.