Latency Optimization for LLMs: Making AI Applications Feel Fast


Your AI chatbot takes 4.5 seconds to respond. Users on mobile start typing their next message, assuming the bot is broken. You lose 30% of users who abandon before seeing the response. You need the experience to feel instant - first token in under 500ms, full response streaming smoothly. But the model itself takes 800ms just to start generating after a 3000-token prompt.

LLM latency is not one number. It decomposes into: retrieval time + prompt assembly + prefill (processing input tokens) + time-to-first-token + generation (output tokens × time-per-token). Each component can be optimized independently.

Latency breakdown

graph LR
  subgraph breakdown["Request Latency Components"]
      R["Retrieval
50-200ms"]
      PA["Prompt Assembly
10-50ms"]
      PF["Prefill
200-2000ms
(scales with input length)"]
      TTFT["Time to First Token
= R + PA + PF"]
      GEN["Generation
20-50ms per token
× output length"]
  end

  R --> PA --> PF --> TTFT
  TTFT --> GEN

  style R fill:#E1F5EE,stroke:#0F6E56,color:#085041
  style PF fill:#FCEBEB,stroke:#A32D2D,color:#791F1F
  style GEN fill:#FAEEDA,stroke:#854F0B,color:#633806

Optimization techniques

1. Reduce input tokens (biggest impact on TTFT)

Prefill time scales linearly with input token count. Cutting your prompt from 5000 to 2000 tokens can halve TTFT:

  • Compress system prompts (remove redundancy)
  • Retrieve fewer, more relevant chunks (3 instead of 10)
  • Summarize conversation history instead of including verbatim
  • Use prompt caching for static prefix portions

2. Use streaming (biggest impact on perceived latency)

Streaming does not reduce actual generation time but makes the first token visible in ~200ms instead of waiting for the full response:

# User sees first word almost immediately
stream = await client.chat.completions.create(messages=msgs, stream=True)

3. Choose the right model size

Smaller models are faster:

  • GPT-4o: ~20ms/token generation
  • GPT-4o-mini: ~10ms/token generation
  • Claude Haiku: ~8ms/token generation
  • Local 7B model on GPU: ~5ms/token generation

Route simple queries to smaller, faster models.

4. Prompt caching

Reuse computed KV cache for repeated prompt prefixes. Reduces TTFT by 60-80% for subsequent requests:

First request: 800ms TTFT (cold, full computation)
Cached requests: 150ms TTFT (cache hit, only new tokens computed)

5. Speculative decoding

Use a small “draft” model to generate candidate tokens, then verify with the large model in parallel. Can achieve 2-3x speedup for generation:

Draft model generates 5 tokens quickly
Large model verifies all 5 in one forward pass
If correct: accept all 5 (5 tokens at cost of 1 step)
If wrong: accept up to the first wrong token, discard rest

6. Parallel retrieval

Do not retrieve sequentially. Run vector search, BM25, and metadata queries in parallel:

vector_results, keyword_results, metadata = await asyncio.gather(
    vector_search(query),
    bm25_search(query),
    fetch_user_metadata(user_id)
)

7. Pre-computation

For predictable queries (common questions, scheduled reports), generate responses ahead of time and cache them.

8. Reduce output tokens

Set appropriate max_tokens per task type. Classification: 5 tokens. Short answer: 100. Full explanation: 500. Never leave it at the default (which can generate thousands of unnecessary tokens).

Real-world latency targets

ApplicationTTFT targetTotal target
Chat/conversational<500msStreaming (continuous)
Autocomplete/suggestions<200ms<500ms total
Search/RAG<1s<3s total
Code generation<1s<5s total
Batch processingN/AThroughput matters more

How to apply in practice

Measure per-component. Instrument retrieval, prompt assembly, TTFT, and generation separately. Optimize the biggest bottleneck first.

Prompt caching is often the single biggest win. If your system prompt is >1000 tokens and consistent, enabling prompt caching can cut TTFT by 50%+ with zero code changes.

Always stream for user-facing applications. The perception of speed matters more than actual speed. Streaming turns a 4-second wait into a 200ms wait + reading time.

Monitor p99, not average. Average latency might be 1.2s but p99 is 8s. Those 1% of users with 8-second waits are your most frustrated users. Optimize tails.

FAQ

Q: What is more important - TTFT or total generation time?

TTFT for interactive applications (users need to see something happening). Total time for batch/automated workflows. For chat, TTFT under 500ms with streaming makes even a 5-second total generation feel responsive.

Q: Can I make the model generate faster tokens?

You cannot change the model’s per-token speed (that is hardware-dependent). You can: use a faster/smaller model, reduce output length, use speculative decoding, or pre-generate common responses. The tokens/second rate is fixed for a given model on given hardware.

Interview questions

Q: Your AI chatbot has 3.2s average TTFT. Users are complaining about slowness. Decompose the latency and propose optimizations to get under 800ms.

Decomposition: measure each component. Likely breakdown: RAG retrieval (400ms) + prompt assembly (50ms) + network to API (100ms) + prefill of 4000 tokens (2200ms) + network back (50ms) = ~2800ms + overhead ≈ 3.2s. Optimizations by impact: (1) Prompt caching: prefix is 2500 tokens static → cache reduces prefill to ~800ms. New TTFT: ~1.4s. (2) Reduce prompt: trim from 4000 to 2000 tokens by retrieving 3 chunks instead of 8 and compressing system prompt. Prefill drops further. (3) Parallel retrieval: run vector search and metadata fetch concurrently. Saves 100-200ms. (4) Use a faster model for simple queries (route 60% to mini). Combined result: <800ms TTFT for cached requests, <1.2s for uncached.