Latency Optimization for LLMs: Making AI Applications Feel Fast

Your AI chatbot takes 4.5 seconds to respond. Users on mobile start typing their next message, assuming the bot is broken, and 30% of them abandon before seeing the response. You need the experience to feel instant - first token in under 500ms, full response streaming smoothly. But the model itself takes 800ms just to start generating after a 3000-token prompt.

LLM latency is not one number. Time to first token (TTFT) is the sum of retrieval time, prompt assembly, and prefill (processing input tokens). Total latency adds generation time (output tokens × time-per-token) on top of TTFT. Each component can be optimized independently.

Latency breakdown

graph LR
  subgraph breakdown["Request Latency Components"]
      R["Retrieval<br/>50-200ms"]
      PA["Prompt Assembly<br/>10-50ms"]
      PF["Prefill<br/>200-2000ms<br/>(scales with input length)"]
      TTFT["Time to First Token<br/>= R + PA + PF"]
      GEN["Generation<br/>20-50ms per token<br/>× output length"]
  end

  R --> PA --> PF --> TTFT
  TTFT --> GEN

  style R fill:#E1F5EE,stroke:#0F6E56,color:#085041
  style PF fill:#FCEBEB,stroke:#A32D2D,color:#791F1F
  style GEN fill:#FAEEDA,stroke:#854F0B,color:#633806
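The breakdown is plain arithmetic, and it can help to see it written out. This is a minimal sketch: the function name and the sample values are illustrative assumptions picked from the ranges in the diagram, not measurements.

```python
def estimate_latency_ms(retrieval, assembly, prefill, per_token, output_tokens):
    """Return (ttft_ms, total_ms) from per-component latencies in milliseconds."""
    ttft = retrieval + assembly + prefill      # time until the first token appears
    total = ttft + per_token * output_tokens   # generation adds time per output token
    return ttft, total


# Illustrative values chosen from the ranges in the diagram (assumptions)
ttft, total = estimate_latency_ms(
    retrieval=100, assembly=20, prefill=800, per_token=30, output_tokens=200
)
print(f"TTFT: {ttft} ms, total: {total} ms")  # TTFT: 920 ms, total: 6920 ms
```

With these numbers generation dominates total time while prefill dominates TTFT, which is why the two are optimized with different techniques.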

Optimization techniques

1. Reduce input tokens (biggest impact on TTFT)

Prefill time scales roughly linearly with input token count. Cutting your prompt from 5000 to 2000 tokens removes about 60% of the prefill work, which can halve TTFT once retrieval and network time are counted:

Compress system prompts (remove redundancy)
Retrieve fewer, more relevant chunks (3 instead of 10)
Summarize conversation history instead of including verbatim
Use prompt caching for static prefix portions
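One way to enforce "fewer, more relevant chunks" is a hard cap on both chunk count and tokens. The sketch below is an assumption-laden illustration: `select_chunks` is a hypothetical helper, and token counts are approximated by whitespace word counts, which a real tokenizer would not match exactly.

```python
def select_chunks(chunks, max_chunks=3, token_budget=1500):
    """Pick the highest-scoring chunks that fit within a token budget.

    `chunks` is a list of (score, text) pairs. Token counts are approximated
    by whitespace word counts (an assumption; use a real tokenizer in practice).
    """
    selected, used = [], 0
    for score, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        if len(selected) == max_chunks:
            break
        cost = len(text.split())  # crude token estimate
        if used + cost > token_budget:
            continue  # skip chunks that would exceed the budget
        selected.append(text)
        used += cost
    return selected, used


chunks = [(0.9, "a b c"), (0.2, "d e"), (0.8, "f g h i"), (0.5, "j")]
print(select_chunks(chunks, max_chunks=2, token_budget=10))
```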

2. Use streaming (biggest impact on perceived latency)

Streaming does not reduce actual generation time, but the user sees the first token as soon as prefill finishes (the TTFT, often a few hundred milliseconds) instead of waiting for the full response:

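# User sees the first word as soon as prefill finishes, not after the full response
stream = await client.chat.completions.create(model=model, messages=msgs, stream=True)
async for chunk in stream:  # print each text delta as it arrives
    print(chunk.choices[0].delta.content or "", end="", flush=True)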
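To check that streaming actually improves perceived latency, it helps to record TTFT and total time separately. The sketch below does that for any async iterator of text chunks. `fake_stream` is a hypothetical stand-in for a model stream (one prefill delay, then a delay per token), so the example runs without an API key.

```python
import asyncio
import time


async def measure_stream(stream):
    """Consume an async stream of text chunks; return (ttft_s, total_s, text)."""
    start = time.perf_counter()
    ttft = None
    parts = []
    async for chunk in stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # first visible output
        parts.append(chunk)
    total = time.perf_counter() - start
    return ttft, total, "".join(parts)


async def fake_stream(prefill_s=0.05, per_token_s=0.01, tokens=("Hello", ",", " world")):
    """Stand-in for a model stream (assumption): prefill delay, then per-token delay."""
    await asyncio.sleep(prefill_s)
    for token in tokens:
        yield token
        await asyncio.sleep(per_token_s)


ttft, total, text = asyncio.run(measure_stream(fake_stream()))
print(f"TTFT {ttft * 1000:.0f} ms, total {total * 1000:.0f} ms, text={text!r}")
```

The same `measure_stream` function could wrap a real provider stream, as long as that stream is adapted to yield plain text chunks.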

3. Choose the right model size

Smaller models are generally faster. Illustrative per-token figures (actual speed varies with provider, load, and hardware):

GPT-4o: ~20ms/token generation
GPT-4o-mini: ~10ms/token generation
Claude Haiku: ~8ms/token generation
Local 7B model on GPU: ~5ms/token generation

Route simple queries to smaller, faster models.
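A router can start as a simple heuristic. The sketch below is only an illustration: the model names `small-model` and `large-model` are placeholders, and the keyword list and word limit are assumptions. Production routers often use a trained classifier instead.

```python
def choose_model(query, small="small-model", large="large-model", max_simple_words=20):
    """Route short queries without 'hard' markers to the small model (heuristic)."""
    hard_markers = ("explain", "compare", "analyze", "step by step", "write code")
    text = query.lower()
    is_short = len(text.split()) <= max_simple_words
    looks_hard = any(marker in text for marker in hard_markers)
    return small if is_short and not looks_hard else large


print(choose_model("What are your opening hours?"))
print(choose_model("Compare these two contracts and explain the differences"))
```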

4. Prompt caching

Reuse the computed KV cache for repeated prompt prefixes. On a cache hit only the new tokens are processed, which can reduce TTFT by roughly 60-80% for subsequent requests:

First request: 800ms TTFT (cold, full computation)
Cached requests: 150ms TTFT (cache hit, only new tokens computed)
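Prefix caches only help when consecutive prompts share a leading run of tokens, so prompt layout matters: static content first, per-request content last. The sketch below estimates how much of a prompt is reusable. It approximates tokens with whitespace-separated words, and `build_prompt` is a hypothetical helper. Real caches operate on model tokens and may impose minimum prefix lengths.

```python
def shared_prefix_len(prev_tokens, new_tokens):
    """Number of leading tokens two prompts have in common."""
    n = 0
    for a, b in zip(prev_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n


def build_prompt(system_prompt, context, question):
    """Static content first, per-request content last, so the prefix stays stable."""
    return f"{system_prompt}\n\n{context}\n\nQuestion: {question}"


system = "You are a support assistant. Follow the policy below. " * 20
p1 = build_prompt(system, "Order #1 shipped Monday.", "Where is my order?").split()
p2 = build_prompt(system, "Order #2 is delayed.", "Can I get a refund?").split()

reused = shared_prefix_len(p1, p2)
print(f"{reused} of {len(p2)} tokens could be served from a prefix cache")
```

If the per-request context were placed before the system prompt, the shared prefix would shrink to almost nothing and every request would pay the full prefill cost.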

5. Speculative decoding

Use a small “draft” model to generate candidate tokens, then verify with the large model in parallel. Can achieve 2-3x speedup for generation:

Draft model generates 5 tokens quickly
Large model verifies all 5 in one forward pass
If all 5 match: accept all 5 (5 tokens at the cost of 1 large-model step)
If one is wrong: keep the tokens before the first mismatch, take the large model's own token at that position, and discard the rest
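The accept/reject step can be sketched with a toy greedy verifier. Everything here is illustrative: `target_next_token` stands in for the large model and simply reads from a fixed sentence, and the checks run as sequential calls, whereas a real system scores all draft positions in one batched forward pass.

```python
def verify_draft(draft_tokens, target_next_token, context):
    """Greedy speculative step: accept draft tokens while the target model agrees.

    `target_next_token(seq)` returns the target model's greedy next token for `seq`.
    """
    accepted = []
    for token in draft_tokens:
        expected = target_next_token(context + accepted)
        if token != expected:
            accepted.append(expected)  # target's correction replaces the bad draft token
            return accepted
        accepted.append(token)
    return accepted  # every draft token matched


# Toy "large model" (assumption): always continues this fixed sentence
TARGET = ["the", "cat", "sat", "on", "the", "mat", "."]


def target_next_token(seq):
    return TARGET[len(seq)]


print(verify_draft(["cat", "sat", "in", "a", "box"], target_next_token, ["the"]))
# ['cat', 'sat', 'on']
```

Because the mismatch position is filled with the target model's own token, every step makes progress of at least one token, and with greedy decoding the output matches what the large model would have produced alone.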

6. Parallel retrieval

Do not retrieve sequentially. Run vector search, BM25, and metadata queries in parallel:

vector_results, keyword_results, metadata = await asyncio.gather(
    vector_search(query),
    bm25_search(query),
    fetch_user_metadata(user_id)
)
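The effect is easy to verify with simulated backends. In this sketch the three coroutines are stand-ins that just sleep for a fixed time (the durations are assumptions). Wall-clock time ends up close to the slowest call rather than the sum of all three.

```python
import asyncio
import time


async def vector_search(query):
    await asyncio.sleep(0.20)  # simulated 200 ms vector DB call
    return ["vec-hit"]


async def bm25_search(query):
    await asyncio.sleep(0.15)  # simulated 150 ms keyword search
    return ["kw-hit"]


async def fetch_user_metadata(user_id):
    await asyncio.sleep(0.10)  # simulated 100 ms metadata lookup
    return {"user_id": user_id}


async def retrieve(query, user_id):
    start = time.perf_counter()
    results = await asyncio.gather(
        vector_search(query), bm25_search(query), fetch_user_metadata(user_id)
    )
    return results, time.perf_counter() - start


(vec, kw, meta), elapsed = asyncio.run(retrieve("refund policy", user_id=42))
print(f"parallel retrieval took {elapsed * 1000:.0f} ms (sequential sum would be 450 ms)")
```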

7. Pre-computation

For predictable queries (common questions, scheduled reports), generate responses ahead of time and cache them.
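A minimal sketch of this idea, assuming exact-match lookup on a normalized query string. `PrecomputedAnswers` and its normalization rule are hypothetical, and `generate` is whatever slow function calls the model.

```python
class PrecomputedAnswers:
    """Serve pre-generated responses for known queries; fall back to the model."""

    def __init__(self, generate):
        self._generate = generate  # slow path: callable that calls the model
        self._cache = {}

    @staticmethod
    def _key(query):
        # Normalize case, whitespace, and trailing punctuation (assumed rule)
        return " ".join(query.lower().split()).rstrip("?!. ")

    def warm(self, queries):
        """Generate and store answers ahead of time, e.g. from a scheduled job."""
        for q in queries:
            self._cache[self._key(q)] = self._generate(q)

    def answer(self, query):
        """Return (response, was_cached)."""
        key = self._key(query)
        if key in self._cache:
            return self._cache[key], True  # cache hit: no model call
        return self._generate(query), False


store = PrecomputedAnswers(lambda q: f"answer to: {q}")
store.warm(["What is your refund policy?"])
print(store.answer("what is your REFUND policy"))
```

Exact matching only covers repeated phrasings. Matching paraphrases would need something like embedding similarity, which is outside this sketch.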

8. Reduce output tokens

Set appropriate max_tokens per task type. Classification: 5 tokens. Short answer: 100. Full explanation: 500. Never leave it at the default (which can generate thousands of unnecessary tokens).
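The per-task limits can live in a small lookup table so no request goes out with the default. The task names and the fallback value below are assumptions for illustration.

```python
# Output caps per task type, taken from the guidance above
MAX_TOKENS_BY_TASK = {
    "classification": 5,
    "short_answer": 100,
    "explanation": 500,
}


def max_tokens_for(task_type, fallback=100):
    """Look up the output cap for a task; unknown tasks get a conservative fallback."""
    return MAX_TOKENS_BY_TASK.get(task_type, fallback)


print(max_tokens_for("classification"), max_tokens_for("unknown_task"))
```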

Real-world latency targets

Application	TTFT target	Total target
Chat/conversational	<500ms	Streaming (continuous)
Autocomplete/suggestions	<200ms	<500ms total
Search/RAG	<1s	<3s total
Code generation	<1s	<5s total
Batch processing	N/A	Throughput matters more

How to apply in practice

Measure per-component. Instrument retrieval, prompt assembly, TTFT, and generation separately. Optimize the biggest bottleneck first.
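Per-component measurement can be as simple as a timing context manager around each stage. This is a minimal sketch: `LatencyTrace` is a hypothetical helper, and the `time.sleep` calls stand in for real work.

```python
import time
from contextlib import contextmanager


class LatencyTrace:
    """Record wall-clock duration per named stage, in milliseconds."""

    def __init__(self):
        self.stages = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.stages[name] = (time.perf_counter() - start) * 1000

    def bottleneck(self):
        """Name of the slowest recorded stage."""
        return max(self.stages, key=self.stages.get)


trace = LatencyTrace()
with trace.stage("retrieval"):
    time.sleep(0.02)   # placeholder for vector search
with trace.stage("prompt_assembly"):
    time.sleep(0.005)  # placeholder for building the prompt
with trace.stage("prefill"):
    time.sleep(0.05)   # placeholder for waiting on the first token
print(trace.stages, "-> optimize first:", trace.bottleneck())
```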

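Prompt caching is often the single biggest win. If your system prompt is >1000 tokens and consistent, enabling prompt caching can cut TTFT by 50% or more, often with little or no code change (some providers cache long prefixes automatically, others require you to mark the cacheable prefix).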

Always stream for user-facing applications. The perception of speed matters more than actual speed. Streaming turns a 4-second wait into a wait of a few hundred milliseconds plus reading time.

Monitor p99, not average. Average latency might be 1.2s while p99 is 8s. The 1% of requests that take 8 seconds produce your most frustrated users. Optimize the tail.
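The gap between mean and tail is easy to demonstrate. The sketch below uses a nearest-rank percentile and made-up sample latencies (assumptions, not measurements): most requests are fast, a few are very slow.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up distribution: 95 fast requests, 3 slow, 2 very slow
latencies_ms = [900] * 95 + [2500] * 3 + [8000] * 2
mean = sum(latencies_ms) / len(latencies_ms)
print(f"mean={mean:.0f} ms  p50={percentile(latencies_ms, 50)} ms  "
      f"p99={percentile(latencies_ms, 99)} ms")
# mean=1090 ms  p50=900 ms  p99=8000 ms
```

The mean looks acceptable here, yet the p99 is more than seven times higher, which is the pattern an average-only dashboard hides.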

FAQ

Q: What is more important - TTFT or total generation time?

TTFT for interactive applications (users need to see something happening). Total time for batch/automated workflows. For chat, TTFT under 500ms with streaming makes even a 5-second total generation feel responsive.

Q: Can I make the model generate faster tokens?

With a hosted API you cannot change the model's raw per-token speed, which depends on the provider's hardware and serving stack. What you can do: use a faster/smaller model, reduce output length, use speculative decoding where you control serving (it raises effective tokens per second without changing the model), or pre-generate common responses.

Interview questions

Q: Your AI chatbot has 3.2s average TTFT. Users are complaining about slowness. Decompose the latency and propose optimizations to get under 800ms.

Decomposition: measure each component. A likely breakdown: RAG retrieval (400ms) + prompt assembly (50ms) + network to API (100ms) + prefill of 4000 tokens (2200ms) + network back (50ms) = 2800ms, plus ~400ms of unattributed overhead ≈ 3.2s. Optimizations by impact: (1) Prompt caching: 2500 of the 4000 tokens are a static prefix, so a cache hit cuts prefill from 2200ms to ~800ms. New TTFT: ~1.8s (~1.4s for the measured components plus the ~400ms overhead). (2) Reduce prompt: trim from 4000 to 2000 tokens by retrieving 3 chunks instead of 8 and compressing the system prompt. Prefill drops further. (3) Parallel retrieval: run vector search and metadata fetch concurrently. Saves 100-200ms. (4) Use a faster model for simple queries (route 60% to mini). Combined target: <800ms TTFT for cached requests, <1.2s for uncached. These are estimates, so re-measure after each change.
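The caching step in this answer can be checked with a few lines of arithmetic. The sketch assumes prefill cost is linear in uncached tokens and that cached tokens cost nothing, both simplifications. Under those assumptions the measured components sum to about 1.4s after caching, or about 1.8s once the unattributed overhead is included.

```python
components_ms = {
    "retrieval": 400,
    "assembly": 50,
    "network_out": 100,
    "prefill": 2200,
    "network_back": 50,
}
measured_ttft_ms = 3200
overhead_ms = measured_ttft_ms - sum(components_ms.values())  # unattributed time

prompt_tokens, cached_prefix_tokens = 4000, 2500
prefill_per_token = components_ms["prefill"] / prompt_tokens  # ms per input token
# Assumption: only uncached tokens cost prefill time on a cache hit
cached_prefill_ms = (prompt_tokens - cached_prefix_tokens) * prefill_per_token

after_cache_ms = measured_ttft_ms - components_ms["prefill"] + cached_prefill_ms
print(f"overhead: {overhead_ms} ms, cached prefill: {cached_prefill_ms:.0f} ms, "
      f"TTFT after caching: {after_cache_ms:.0f} ms")
```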