Latency vs Throughput: The Two Clocks Every Engineer Confuses
Your API feels slow. You profile it, find a database query taking 80ms, add an index, and it drops to 5ms. You deploy. Users still complain. You check your dashboards and see p99 latency is 2 seconds. The index helped individual queries, but the system is still slow under load.
What you optimized was the latency of a single request in isolation. What users experience is latency under concurrency - which is a completely different problem. Understanding the difference between latency and throughput, and how they interact, is one of the most practically useful things you can internalize as an engineer.
What they actually mean
Latency is the time it takes to complete a single operation. The duration from when a request is sent to when the response is received. It is measured per request: p50, p95, p99, p999.
Throughput is the number of operations completed per unit of time. Requests per second, transactions per second, messages per second. It is a rate, not a duration.
They sound like they should move together - faster requests should mean more requests per second. But they don’t, and understanding why is the key.
How they interact - Little’s Law
The relationship between latency, throughput, and concurrency is captured by Little’s Law:
L = λ × W
Where:
- L = average number of requests in the system (concurrency)
- λ = average arrival rate (throughput)
- W = average time a request spends in the system (latency)
Rearranged: Throughput = Concurrency / Latency
If your system handles 100 concurrent requests and each takes 100ms, your throughput is 1000 req/s. If latency doubles to 200ms with the same concurrency, throughput halves to 500 req/s.
This is why adding more servers (increasing concurrency) can increase throughput even if individual request latency stays the same.
graph LR subgraph littles["Little's Law: L = λ × W"] L["L = Concurrency<br/>(requests in flight)"] LA["λ = Throughput<br/>(req/sec)"] W["W = Latency<br/>(sec/request)"] L -->|"L / W"| LA LA -->|"λ × W"| L end subgraph example["Example"] E1["100 concurrent requests<br/>x 100ms latency<br/>= 1000 req/s throughput"] E2["100 concurrent requests<br/>x 200ms latency<br/>= 500 req/s throughput"] end style L fill:#EEEDFE,stroke:#534AB7,color:#3C3489 style LA fill:#E1F5EE,stroke:#0F6E56,color:#085041 style W fill:#FAEEDA,stroke:#854F0B,color:#633806 style E1 fill:#E1F5EE,stroke:#0F6E56,color:#085041 style E2 fill:#FCEBEB,stroke:#A32D2D,color:#791F1F
The latency-throughput curve
As you push more load into a system, something interesting happens. Up to a point, throughput increases linearly with load. Then it starts to flatten. Then, past a saturation point, throughput actually decreases while latency explodes.
This is the USL (Universal Scalability Law) in action. The culprit is queueing. When requests arrive faster than they can be processed, they queue. Queued requests add to latency. More latency means more requests in flight at any moment. More requests in flight means more contention for shared resources (locks, connections, CPU). More contention means even higher latency. It is a feedback loop.
The practical implication: a system running at 70% utilization has very different latency characteristics than one at 90% utilization, even though the throughput difference is only 20%.
graph TB subgraph curve["Throughput vs Load"] L1["Low load<br/>Latency: 10ms<br/>Throughput: 500 req/s"] L2["Medium load<br/>Latency: 15ms<br/>Throughput: 900 req/s"] L3["High load 70pct<br/>Latency: 30ms<br/>Throughput: 1100 req/s"] L4["Saturation 90pct<br/>Latency: 200ms<br/>Throughput: 1050 req/s"] L5["Overload<br/>Latency: 2000ms<br/>Throughput: 800 req/s"] L1 --> L2 --> L3 --> L4 --> L5 end style L1 fill:#E1F5EE,stroke:#0F6E56,color:#085041 style L2 fill:#E1F5EE,stroke:#0F6E56,color:#085041 style L3 fill:#FAEEDA,stroke:#854F0B,color:#633806 style L4 fill:#FCEBEB,stroke:#A32D2D,color:#791F1F style L5 fill:#FCEBEB,stroke:#A32D2D,color:#791F1F
Where it breaks or gets interesting
Optimizing latency can hurt throughput
Batching is a classic example. Writing 1000 records individually to a database: each write is fast (low latency), but you pay the round-trip overhead 1000 times. Batching them into one write: higher latency for the batch operation, but dramatically higher throughput because you amortize the overhead.
Kafka producers use this explicitly. The linger.ms setting tells the producer to wait up to N milliseconds before sending a batch, deliberately adding latency to individual messages in exchange for higher throughput.
Optimizing throughput can hurt latency
Connection pooling increases throughput by reusing connections. But if the pool is exhausted, new requests queue waiting for a connection. The p99 latency spikes even though average throughput is high.
Tail latency is the real enemy
Average latency is almost meaningless for user experience. If your p50 is 10ms but your p99 is 2 seconds, 1 in 100 users has a terrible experience. At scale, “1 in 100” is a lot of users.
Tail latency is often caused by:
- Garbage collection pauses (JVM, Go)
- Lock contention under high concurrency
- Slow disk I/O on a shared host
- Network retransmits
The “hedged request” pattern addresses this: send the same request to two replicas simultaneously, use whichever responds first, cancel the other. Google uses this in Bigtable. It trades throughput (double the requests) for dramatically lower tail latency.
The database connection pool trap
A common mistake: you have a web server with 100 threads, each needing a database connection. You set the pool size to 100. Under load, all 100 threads are waiting for database responses. New requests queue. Latency spikes.
The fix is counterintuitive: reduce the pool size. With fewer connections, the database is less contended. Each query runs faster. Throughput goes up even though you have fewer connections. HikariCP’s documentation recommends (core_count * 2) + effective_spindle_count as a starting point - often much smaller than people expect.
Real-world systems and their tradeoffs
Latency-optimized:
- Redis - Single-threaded, in-memory. Sub-millisecond reads. Throughput is limited by single-thread CPU, but latency is extremely predictable.
- Memcached - Multi-threaded in-memory cache. Optimized for low-latency reads at the cost of no persistence.
- CDNs - Serve content from edge nodes close to users. Latency drops from 100ms to 5ms. Throughput is handled by horizontal scaling.
Throughput-optimized:
- Kafka - Sequential disk writes, batching, compression. Millions of messages per second. Individual message latency is higher than a direct write.
- ClickHouse - Columnar storage, vectorized execution. Scans billions of rows per second. Not designed for low-latency point lookups.
- Hadoop MapReduce - Batch processing. High throughput for large datasets. Latency is minutes to hours.
Balanced:
- PostgreSQL - Tunable. OLTP workloads optimize for latency. Analytical queries optimize for throughput via parallel query execution.
- Cassandra - Tunable consistency lets you trade latency for durability. Write throughput is very high; read latency depends on consistency level.
How to apply it in practice
Measure the right thing
For user-facing APIs: measure p50, p95, p99, p999. Set SLOs on p99. Average latency hides the tail.
For batch jobs: measure throughput (records/second) and total job duration.
For message processing: measure both - consumer lag (throughput indicator) and processing time per message (latency indicator).
The four optimization levers
- Reduce work per request - Better algorithms, caching, avoiding N+1 queries. Directly reduces latency.
- Increase parallelism - More threads, more servers, async I/O. Increases throughput without changing per-request latency.
- Reduce contention - Smaller transactions, optimistic locking, sharding. Reduces tail latency under load.
- Batch work - Group small operations. Increases throughput, increases individual operation latency.
Load testing correctly
Don’t test with a single client sending requests sequentially. That measures latency but not throughput. Use a tool like wrk, k6, or Gatling that sends concurrent requests. Ramp up concurrency until you find the saturation point. That is your system’s actual capacity.
FAQ
Q: My p50 latency is 5ms but p99 is 500ms. Where do I start?
The gap between p50 and p99 almost always points to contention or queueing, not algorithmic slowness. Start by looking at lock wait times in your database, thread pool queue depths, and GC pause logs. A 100x difference between median and tail is a classic sign that some requests are waiting for a shared resource that is occasionally saturated.
Q: We need both low latency and high throughput. Is that possible?
Yes, but it requires careful design. The key insight is that latency and throughput are in tension only when a shared resource is the bottleneck. If you can eliminate shared bottlenecks - through sharding, caching, or async processing - you can achieve both. Disruptor (LMAX’s ring buffer library) achieves millions of operations per second with microsecond latency by eliminating locks entirely through careful memory layout and single-writer principle.
Q: How does async/await help with throughput?
Async I/O lets a single thread handle many concurrent requests. Instead of blocking a thread while waiting for a database response, the thread is released to handle other requests. When the database responds, the thread picks up where it left off. This dramatically increases throughput (more requests per thread) without changing the latency of any individual request. Node.js and async Python/Go use this model. The tradeoff is complexity - async code is harder to reason about and debug.
Interview questions
Q1: Your API has p50 latency of 20ms and p99 of 3 seconds. You need to get p99 under 200ms. Walk through your investigation.
Strong answer: Start with the data. Pull latency histograms broken down by endpoint, then by downstream dependency (database, cache, external APIs). The 3-second p99 is almost certainly caused by one specific slow path, not a general slowness. Common culprits: a database query that occasionally does a full table scan (check slow query logs), a connection pool that occasionally exhausts (check pool wait time metrics), a GC pause (check GC logs for stop-the-world events), or a downstream service with its own tail latency problem (check per-dependency latency). Once you find the culprit, the fix is specific to the cause - add an index, increase pool size, tune GC, add a timeout and fallback.
Q2: You’re designing a payment processing system. It needs to handle 10,000 transactions per second with p99 latency under 100ms. How do you approach this?
Strong answer: First, establish the baseline. 10k TPS at 100ms p99 means up to 1000 transactions in flight at any moment (Little’s Law). Each transaction needs a database write, so the database must handle 10k writes/second with sub-100ms latency. A single PostgreSQL instance can handle roughly 5-10k simple writes/second on good hardware. So you’re at the edge of single-instance capacity. Options: shard by account ID (each shard handles a subset of transactions), use a write-optimized database like CockroachDB or TiDB for horizontal scaling, or use an async pattern where the API acknowledges immediately and processes asynchronously (but then you need to handle the case where processing fails after acknowledgment). For payments, the async approach requires careful idempotency design.
Q3: Explain why adding more servers sometimes doesn’t improve latency.
Strong answer: Adding servers increases throughput (more capacity to handle concurrent requests) but doesn’t reduce the latency of any individual request. If your p99 latency is 2 seconds because of a slow database query, adding more web servers doesn’t help - they all hit the same slow database. Latency is determined by the critical path of a single request: the sum of all sequential operations it must perform. To reduce latency, you must shorten that critical path - faster queries, fewer round trips, caching, or parallelizing sequential operations. More servers help when the bottleneck is CPU or I/O capacity (throughput), not when the bottleneck is the speed of individual operations (latency).