Latency Optimization for LLMs: Making AI Applications Feel Fast
Your AI chatbot takes 4.5 seconds to respond. Users on mobile start typing their next message, assuming the bot is broken. You lose 30% of users who abandon before seeing the response. You need the experience to feel instant - first token in under 500ms, full response streaming smoothly. But the model itself takes 800ms just to start generating after a 3000-token prompt.
LLM latency is not one number. It decomposes into: retrieval time + prompt assembly + prefill (processing input tokens) + time-to-first-token + generation (output tokens × time-per-token). Each component can be optimized independently.
Latency breakdown
graph LR
subgraph breakdown["Request Latency Components"]
R["Retrieval
50-200ms"]
PA["Prompt Assembly
10-50ms"]
PF["Prefill
200-2000ms
(scales with input length)"]
TTFT["Time to First Token
= R + PA + PF"]
GEN["Generation
20-50ms per token
× output length"]
end
R --> PA --> PF --> TTFT
TTFT --> GEN
style R fill:#E1F5EE,stroke:#0F6E56,color:#085041
style PF fill:#FCEBEB,stroke:#A32D2D,color:#791F1F
style GEN fill:#FAEEDA,stroke:#854F0B,color:#633806
Optimization techniques
1. Reduce input tokens (biggest impact on TTFT)
Prefill time scales linearly with input token count. Cutting your prompt from 5000 to 2000 tokens can halve TTFT:
- Compress system prompts (remove redundancy)
- Retrieve fewer, more relevant chunks (3 instead of 10)
- Summarize conversation history instead of including verbatim
- Use prompt caching for static prefix portions
2. Use streaming (biggest impact on perceived latency)
Streaming does not reduce actual generation time but makes the first token visible in ~200ms instead of waiting for the full response:
# User sees first word almost immediately
stream = await client.chat.completions.create(messages=msgs, stream=True)
3. Choose the right model size
Smaller models are faster:
- GPT-4o: ~20ms/token generation
- GPT-4o-mini: ~10ms/token generation
- Claude Haiku: ~8ms/token generation
- Local 7B model on GPU: ~5ms/token generation
Route simple queries to smaller, faster models.
4. Prompt caching
Reuse computed KV cache for repeated prompt prefixes. Reduces TTFT by 60-80% for subsequent requests:
First request: 800ms TTFT (cold, full computation)
Cached requests: 150ms TTFT (cache hit, only new tokens computed)
5. Speculative decoding
Use a small “draft” model to generate candidate tokens, then verify with the large model in parallel. Can achieve 2-3x speedup for generation:
Draft model generates 5 tokens quickly
Large model verifies all 5 in one forward pass
If correct: accept all 5 (5 tokens at cost of 1 step)
If wrong: accept up to the first wrong token, discard rest
6. Parallel retrieval
Do not retrieve sequentially. Run vector search, BM25, and metadata queries in parallel:
vector_results, keyword_results, metadata = await asyncio.gather(
vector_search(query),
bm25_search(query),
fetch_user_metadata(user_id)
)
7. Pre-computation
For predictable queries (common questions, scheduled reports), generate responses ahead of time and cache them.
8. Reduce output tokens
Set appropriate max_tokens per task type. Classification: 5 tokens. Short answer: 100. Full explanation: 500. Never leave it at the default (which can generate thousands of unnecessary tokens).
Real-world latency targets
| Application | TTFT target | Total target |
|---|---|---|
| Chat/conversational | <500ms | Streaming (continuous) |
| Autocomplete/suggestions | <200ms | <500ms total |
| Search/RAG | <1s | <3s total |
| Code generation | <1s | <5s total |
| Batch processing | N/A | Throughput matters more |
How to apply in practice
Measure per-component. Instrument retrieval, prompt assembly, TTFT, and generation separately. Optimize the biggest bottleneck first.
Prompt caching is often the single biggest win. If your system prompt is >1000 tokens and consistent, enabling prompt caching can cut TTFT by 50%+ with zero code changes.
Always stream for user-facing applications. The perception of speed matters more than actual speed. Streaming turns a 4-second wait into a 200ms wait + reading time.
Monitor p99, not average. Average latency might be 1.2s but p99 is 8s. Those 1% of users with 8-second waits are your most frustrated users. Optimize tails.
FAQ
Q: What is more important - TTFT or total generation time?
TTFT for interactive applications (users need to see something happening). Total time for batch/automated workflows. For chat, TTFT under 500ms with streaming makes even a 5-second total generation feel responsive.
Q: Can I make the model generate faster tokens?
You cannot change the model’s per-token speed (that is hardware-dependent). You can: use a faster/smaller model, reduce output length, use speculative decoding, or pre-generate common responses. The tokens/second rate is fixed for a given model on given hardware.
Interview questions
Q: Your AI chatbot has 3.2s average TTFT. Users are complaining about slowness. Decompose the latency and propose optimizations to get under 800ms.
Decomposition: measure each component. Likely breakdown: RAG retrieval (400ms) + prompt assembly (50ms) + network to API (100ms) + prefill of 4000 tokens (2200ms) + network back (50ms) = ~2800ms + overhead ≈ 3.2s. Optimizations by impact: (1) Prompt caching: prefix is 2500 tokens static → cache reduces prefill to ~800ms. New TTFT: ~1.4s. (2) Reduce prompt: trim from 4000 to 2000 tokens by retrieving 3 chunks instead of 8 and compressing system prompt. Prefill drops further. (3) Parallel retrieval: run vector search and metadata fetch concurrently. Saves 100-200ms. (4) Use a faster model for simple queries (route 60% to mini). Combined result: <800ms TTFT for cached requests, <1.2s for uncached.