Load Balancing: Distributing Traffic Without Dropping Requests

Black Friday. Your e-commerce site is handling 50,000 requests per second. You have 20 application servers. The load balancer is routing all traffic to servers 1 through 5 because they responded fastest during the last health check. Servers 6 through 20 are sitting at 10% CPU. Servers 1 through 5 are at 100% and starting to time out. Requests are failing.

You have a load balancer. It is not load balancing. This is what happens when you pick the wrong algorithm for your workload.

What a load balancer actually does

A load balancer sits between clients and your backend servers. It accepts incoming connections, decides which backend server should handle each request, and forwards the request there. From the client’s perspective, there is one server. From the backend’s perspective, each server handles a fraction of the total traffic.

Beyond distribution, load balancers also handle:

Health checking - Removing unhealthy backends from rotation
SSL/TLS termination - Decrypting HTTPS so backends handle plain HTTP
Connection pooling - Reusing backend connections across client requests
Request routing - Sending different paths to different backend pools
Rate limiting - Rejecting requests above a threshold

Layer 4 vs Layer 7

Load balancers operate at different layers of the network stack.

Layer 4 (transport layer) - Routes based on IP address and TCP/UDP port. Does not inspect the request content. Fast and low overhead. Cannot make routing decisions based on URL path, headers, or cookies. Examples: AWS NLB, HAProxy in TCP mode.

Layer 7 (application layer) - Inspects the full HTTP request. Can route based on URL path (/api to one pool, /static to another), headers, cookies, or request body. Supports SSL termination, HTTP/2, WebSocket upgrades. Higher overhead than L4 but far more flexible. Examples: AWS ALB, nginx, Envoy.

graph TB
subgraph l4["Layer 4 Load Balancer"]
  C1["Client"] -->|"TCP connection
IP:port only"| L4["L4 LB"]
  L4 -->|"forward TCP stream"| B1["Backend 1"]
  L4 -->|"forward TCP stream"| B2["Backend 2"]
end

subgraph l7["Layer 7 Load Balancer"]
  C2["Client"] -->|"HTTPS request"| L7["L7 LB
Terminates TLS
Reads HTTP headers"]
  L7 -->|"/api requests"| API["API servers"]
  L7 -->|"/static requests"| CDN["Static servers"]
  L7 -->|"/ws WebSocket"| WS["WebSocket servers"]
end

style L4 fill:#F1EFE8,stroke:#888780,color:#444441
style L7 fill:#EEEDFE,stroke:#534AB7,color:#3C3489
style API fill:#E1F5EE,stroke:#0F6E56,color:#085041
style CDN fill:#E1F5EE,stroke:#0F6E56,color:#085041
style WS fill:#E1F5EE,stroke:#0F6E56,color:#085041
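
The path-based routing in the Layer 7 diagram boils down to a longest-prefix match on the request path. Here is a minimal sketch of that idea; the pool names and the route function are hypothetical, and real L7 load balancers offer much richer matching (headers, cookies, regexes).

```python
# Hypothetical route table: path prefix -> backend pool name.
ROUTES = {"/api": "api_pool", "/static": "static_pool", "/ws": "websocket_pool"}

def route(path, default="default_pool"):
    """Return the pool for the longest matching path prefix."""
    # Match whole path segments only, so "/apiary" does not match "/api".
    matches = [p for p in ROUTES if path == p or path.startswith(p + "/")]
    return ROUTES[max(matches, key=len)] if matches else default

print(route("/static/app.js"))
print(route("/apiary"))
```

An L4 load balancer cannot do this at all: it never sees the path, only the IP address and port.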

Load balancing algorithms

Round robin

Requests go to each server in turn: 1, 2, 3, 1, 2, 3. It is simple, and the distribution is even as long as all requests take about the same time and all servers have equal capacity.

Problem: If some requests are much slower than others (a database query vs a static file), some servers accumulate slow requests while others are idle. The server that got three slow requests is overwhelmed while the server that got three fast requests is waiting.
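
To make the problem concrete, here is a small sketch with a made-up workload. The server names and request costs are assumptions chosen so the slow requests happen to line up with one server.

```python
from itertools import cycle

def round_robin(servers):
    """Return an iterator that yields servers in a fixed rotation, ignoring load."""
    return cycle(servers)

# Hypothetical workload: cost of each request in milliseconds.
costs_ms = [500, 5, 5, 500, 5, 5]
picker = round_robin(["s1", "s2", "s3"])

busy_ms = {"s1": 0, "s2": 0, "s3": 0}
for cost in costs_ms:
    busy_ms[next(picker)] += cost

print(busy_ms)  # {'s1': 1000, 's2': 10, 's3': 10}
```

Every server received exactly two requests, yet s1 did 100x the work of the others. Round robin balances request counts, not load.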

Weighted round robin

Like round robin, but servers get different weights. A server with weight 3 gets 3x as many requests as a server with weight 1. Useful when servers have different capacities (different instance sizes).
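
One common way to implement this is the "smooth" variant, which interleaves picks instead of sending a burst to the heavy server. The sketch below is an illustration of that technique with hypothetical server names, not the code of any particular load balancer.

```python
def smooth_weighted_round_robin(weights, n):
    """Return n picks; each server is chosen in proportion to its weight."""
    current = {name: 0 for name in weights}
    total = sum(weights.values())
    picks = []
    for _ in range(n):
        # Every server earns credit equal to its weight each round.
        for name, weight in weights.items():
            current[name] += weight
        # The server with the most credit wins and pays back the total.
        chosen = max(current, key=current.get)
        current[chosen] -= total
        picks.append(chosen)
    return picks

picks = smooth_weighted_round_robin({"big": 3, "small": 1}, 8)
print(picks)
```

Over 8 picks, "big" is chosen 6 times and "small" twice, which matches the 3:1 weights.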

Least connections

Route each new request to the server with the fewest active connections. This naturally handles variable request duration - slow requests keep a connection open, so that server gets fewer new requests until it catches up.

Best for: Long-lived connections, variable request duration, WebSockets.
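
The core of the algorithm is a single min over a counter per backend. A minimal sketch, assuming the load balancer tracks in-flight connections in a dict (the names and counts here are made up):

```python
def pick_least_connections(active):
    """Return the server with the fewest in-flight connections."""
    return min(active, key=active.get)

# Hypothetical snapshot of in-flight connections per backend.
active = {"s1": 12, "s2": 3, "s3": 7}

chosen = pick_least_connections(active)
active[chosen] += 1  # connection opened
# When the request finishes, the balancer would do: active[chosen] -= 1
print(chosen, active)
```

The counter is what makes slow requests self-limiting: a server stuck on slow work keeps a high count and stops winning the min.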

Least response time

Route to the server with the lowest combination of active connections and response time. More sophisticated than least connections - a server with 10 connections but 1ms response time is better than a server with 5 connections but 500ms response time.
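
How connections and response time are combined differs between products. The sketch below assumes one plausible scoring rule, (active connections + 1) x average response time, purely to illustrate the comparison in the paragraph above.

```python
def score(active_connections, avg_response_ms):
    # Assumed cost model: estimated wait if one more request joins this server.
    return (active_connections + 1) * avg_response_ms

# The two servers from the example above.
servers = {
    "a": {"active": 10, "avg_ms": 1.0},
    "b": {"active": 5, "avg_ms": 500.0},
}

best = min(
    servers,
    key=lambda name: score(servers[name]["active"], servers[name]["avg_ms"]),
)
print(best)  # a
```

Server a scores 11 and server b scores 3000, so a wins despite having twice as many connections.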

IP hash / consistent hashing

Hash the client’s IP address to determine which server handles the request. The same client always goes to the same server. Provides session affinity without cookies.

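Problem: With a plain hash-modulo scheme, adding or removing a server changes the modulus, so most clients get remapped, not only the ones on the failed server. Consistent hashing limits the remapping to the clients that were on the affected server (roughly 1/N of them). Either way, if traffic is unevenly distributed by IP (a corporate NAT sends thousands of users from one IP), one server gets overloaded.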
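
Here is a minimal consistent-hashing sketch, assuming a hash ring with virtual nodes. The HashRing class, the vnodes count, and the client IPs are all hypothetical; MD5 is used only as a convenient stable hash, not for security.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, servers, vnodes=100):
        # Each server owns many points on the ring to even out the spread.
        self._ring = sorted(
            (_hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    def lookup(self, client_ip: str) -> str:
        # Walk clockwise to the first point at or after the client's hash.
        idx = bisect.bisect(self._keys, _hash(client_ip)) % len(self._keys)
        return self._ring[idx][1]

clients = [f"10.1.{i // 256}.{i % 256}" for i in range(1000)]
before = HashRing(["s1", "s2", "s3", "s4"])
after = HashRing(["s1", "s2", "s3"])  # s4 has failed

on_s4 = [ip for ip in clients if before.lookup(ip) == "s4"]
moved = [ip for ip in clients if before.lookup(ip) != after.lookup(ip)]
print(len(on_s4), len(moved))
```

The two counts are equal: the only clients that change servers are the ones that were on s4. Everyone else keeps their backend, which is the property a hash-modulo scheme lacks.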

Random with two choices (power of two choices)

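Pick two servers at random, route to the one with fewer connections. Surprisingly effective - it achieves near-optimal load distribution with minimal overhead, because the balancer never has to scan or lock the full server list. Supported by nginx (random two), HAProxy (balance random), and many modern load balancers.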
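
A small simulation shows how little work this takes. The sketch assumes 10 hypothetical servers and connections that never close, which is a simplification; the seed is fixed only to make the run repeatable.

```python
import random

def pick_two_choices(active, rng=random):
    """Sample two distinct servers and return the less loaded one."""
    a, b = rng.sample(list(active), 2)
    return a if active[a] <= active[b] else b

rng = random.Random(42)
active = {f"s{i}": 0 for i in range(10)}
for _ in range(10_000):
    active[pick_two_choices(active, rng)] += 1  # connections never close here

spread = max(active.values()) - min(active.values())
print(active, spread)
```

Each decision looks at only two counters, yet the gap between the busiest and the idlest server stays small compared with picking a single server at random.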

graph LR
subgraph algorithms["Algorithm Comparison"]
  RR["Round Robin
Simple, ignores load"]
  LC["Least Connections
Handles variable duration"]
  LR["Least Response Time
Best for heterogeneous load"]
  IH["IP Hash
Session affinity"]
  P2["Power of Two
Fast, near-optimal"]
end

subgraph use["Best Used For"]
  U1["Equal requests, equal servers"]
  U2["Long-lived or variable requests"]
  U3["Mixed fast and slow backends"]
  U4["Stateful apps without shared session store"]
  U5["General purpose, high throughput"]
end

RR --- U1
LC --- U2
LR --- U3
IH --- U4
P2 --- U5

style RR fill:#F1EFE8,stroke:#888780,color:#444441
style LC fill:#EEEDFE,stroke:#534AB7,color:#3C3489
style LR fill:#EEEDFE,stroke:#534AB7,color:#3C3489
style IH fill:#FAEEDA,stroke:#854F0B,color:#633806
style P2 fill:#E1F5EE,stroke:#0F6E56,color:#085041

Health checks

A load balancer is only useful if it routes to healthy backends. Health checks are how it knows which backends are healthy.

Passive health checks - The load balancer monitors responses. If a backend returns 5xx errors or times out, it is marked unhealthy. Simple but reactive - you only detect failure after it has already affected real requests.

Active health checks - The load balancer periodically sends requests to a health endpoint (GET /health). If the response is not 200 OK within a timeout, the backend is marked unhealthy. Proactive - catches failures before they affect real traffic.

A good health endpoint checks actual dependencies: can the server connect to the database? Is the cache reachable? A health endpoint that just returns 200 OK unconditionally is useless.
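
As a sketch of what "checks actual dependencies" can look like, the function below runs a set of dependency checks and maps the result to an HTTP status. The health_status helper and the two ping functions are hypothetical stand-ins; in a real service they would run something like a trivial database query and a cache ping.

```python
def health_status(checks):
    """Run each dependency check; return (http_status, details)."""
    details = {}
    for name, check in checks.items():
        try:
            check()
            details[name] = "ok"
        except Exception as exc:
            details[name] = f"failed: {exc}"
    healthy = all(value == "ok" for value in details.values())
    return (200 if healthy else 503), details

def ping_database():
    pass  # hypothetical: succeeds, e.g. a trivial query returned

def ping_cache():
    raise ConnectionError("cache unreachable")  # hypothetical failure

status, details = health_status({"database": ping_database, "cache": ping_cache})
print(status, details)
```

With the cache down, the endpoint answers 503 and the load balancer takes the backend out of rotation instead of routing real traffic into the failure.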

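# nginx example (open source): passive health checks
# A server is skipped for fail_timeout after max_fails failed attempts.
# Active checks (the health_check directive) require NGINX Plus.
upstream backend {
  server 10.0.0.1:8080 max_fails=3 fail_timeout=30s;
  server 10.0.0.2:8080 max_fails=3 fail_timeout=30s;
  server 10.0.0.3:8080 max_fails=3 fail_timeout=30s;
}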

# HAProxy example
backend web_servers
  balance roundrobin
  option httpchk GET /health
  http-check expect status 200
  server web1 10.0.0.1:8080 check inter 5s fall 3 rise 2
  server web2 10.0.0.2:8080 check inter 5s fall 3 rise 2

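inter 5s means run the check every 5 seconds. fall 3 means mark unhealthy after 3 consecutive failures. rise 2 means mark healthy again after 2 consecutive successes. Requiring consecutive results prevents flapping, at the cost of detection time: with these settings a dead backend can keep receiving traffic for roughly 15 seconds (3 checks x 5 seconds).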
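
The fall/rise logic is a small state machine. The sketch below illustrates the idea only; the HealthState class is hypothetical and not HAProxy's implementation.

```python
class HealthState:
    def __init__(self, fall=3, rise=2):
        self.fall, self.rise = fall, rise
        self.healthy = True
        self._streak = 0  # consecutive results that contradict the current state

    def record(self, check_passed: bool) -> bool:
        if check_passed == self.healthy:
            self._streak = 0  # result agrees with current state, reset
        else:
            self._streak += 1
            threshold = self.rise if check_passed else self.fall
            if self._streak >= threshold:
                self.healthy = check_passed
                self._streak = 0
        return self.healthy

state = HealthState(fall=3, rise=2)
results = [False, True, False, False, False, True, True]
history = [state.record(ok) for ok in results]
print(history)  # [True, True, True, True, False, False, True]
```

The single failure at the start does not remove the backend, because the success that follows resets the streak. Only three failures in a row flip it to unhealthy, and two successes in a row bring it back.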

Where it breaks or gets interesting

The thundering herd on backend recovery

A backend goes down. The load balancer removes it from rotation. The other backends absorb its traffic. The failed backend recovers and is added back. The load balancer immediately sends it a full share of traffic. The backend, which just recovered, gets slammed and goes down again.

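Fix: slow start. When a backend is added back, gradually ramp up its traffic share over 30-60 seconds. HAProxy supports this with the slowstart server parameter; nginx offers the equivalent slow_start parameter only in the commercial NGINX Plus.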
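
Conceptually, slow start scales a backend's weight by how long it has been back. A minimal sketch assuming a linear ramp (the slow_start_weight function and its parameters are hypothetical; real implementations may use a different curve):

```python
def slow_start_weight(full_weight, seconds_since_recovery, ramp_seconds=30):
    """Linearly ramp from a small share up to full_weight over ramp_seconds."""
    if seconds_since_recovery >= ramp_seconds:
        return full_weight
    # Treat the first second as t=1 so the backend gets a little traffic at once.
    fraction = max(seconds_since_recovery, 1) / ramp_seconds
    return max(1, round(full_weight * fraction))

ramp = [slow_start_weight(100, t) for t in (0, 15, 30)]
print(ramp)  # [3, 50, 100]
```

The recovered backend starts at about 3% of its normal share, reaches half after 15 seconds, and is back to full weight at 30 seconds, which gives caches and connection pools time to warm up.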

Connection draining

When you deploy a new version, you need to take backends out of rotation. If you remove them immediately, in-flight requests are dropped. Connection draining (or graceful shutdown) tells the load balancer to stop sending new requests to a backend but let existing connections finish. A typical drain timeout is 30-60 seconds, after which any remaining connections are closed.
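
The rule a load balancer follows while draining can be sketched in a few lines. The Backend class and the drain_timeout default are hypothetical, chosen only to illustrate the two conditions.

```python
class Backend:
    def __init__(self, name):
        self.name = name
        self.draining = False
        self.in_flight = 0  # requests currently being served

    def accepts_new_requests(self):
        return not self.draining

    def safe_to_remove(self, seconds_draining, drain_timeout=30):
        # Remove once idle, or once the drain timeout has expired.
        idle = self.in_flight == 0
        return self.draining and (idle or seconds_draining >= drain_timeout)

web1 = Backend("web1")
web1.in_flight = 2
web1.draining = True  # deploy begins: stop routing new requests here

early = web1.safe_to_remove(seconds_draining=5)      # still serving 2 requests
timed_out = web1.safe_to_remove(seconds_draining=30) # timeout reached
web1.in_flight = 0
idle = web1.safe_to_remove(seconds_draining=5)       # finished early
print(early, timed_out, idle)  # False True True
```

The timeout matters because a single stuck or very long-lived connection would otherwise block the deploy forever.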

Sticky sessions and their problems

Some applications store session state in memory. For these, you need the same client to always hit the same backend (sticky sessions). The load balancer uses a cookie to track which backend a client is assigned to.

Problems: if that backend goes down, the session is lost. Sticky sessions prevent even load distribution - a client doing heavy work keeps hitting the same server. The right fix is to make the application stateless, not to use sticky sessions.

The load balancer as a single point of failure

Once all traffic flows through one load balancer, that load balancer is a single point of failure for your entire system. Fix: run two load balancers in active-passive or active-active configuration. Use a floating IP (virtual IP) that moves between them on failover. Cloud load balancers (AWS ALB, GCP Load Balancing) handle this for you - they are distributed by design.

Real-world systems

AWS Application Load Balancer (ALB) - L7, supports path-based routing, WebSockets, HTTP/2, gRPC. Integrates with Auto Scaling Groups. Health checks built in. The default choice for HTTP APIs on AWS.

AWS Network Load Balancer (NLB) - L4, extremely high throughput (millions of requests per second), ultra-low latency. Used for non-HTTP protocols or when ALB overhead is too high.

nginx - Widely used as both a reverse proxy and load balancer. Supports round robin, least connections, IP hash, and consistent hashing. Highly configurable.

HAProxy - Purpose-built load balancer. Excellent observability (stats page, metrics). Used by GitHub, Reddit, and many high-traffic sites.

Envoy - Modern L7 proxy used as the data plane in service meshes (Istio). Supports advanced features: circuit breaking, retries, distributed tracing, gRPC load balancing.

Cloudflare Load Balancing - Global load balancing with health checks across regions. Routes users to the nearest healthy origin. Handles DNS-based failover.

How to apply it in practice

Choosing an algorithm

Stateless API with uniform request size: round robin or power of two choices
Long-running requests or WebSockets: least connections
Mixed workloads with variable response times: least response time
Need session affinity without shared session store: IP hash at the load balancer, or consistent hashing by user ID at the application layer (but fix the root cause)

Multi-tier load balancing

Large systems use multiple layers of load balancing:

DNS load balancing - Route users to the nearest region (GeoDNS)
Global load balancer - Route between regions based on health and latency
Regional load balancer - Distribute within a region across availability zones
Service mesh - Load balance between microservices within a cluster

FAQ

Q: Should I use a hardware load balancer or a software load balancer?

For most teams, software load balancers (nginx, HAProxy, cloud-managed) are the right choice. They are cheaper, more flexible, and easier to configure as code. Hardware load balancers (F5, Citrix ADC) are used in enterprises with specific compliance requirements or when you need to handle millions of connections per second with dedicated hardware. Cloud-managed load balancers (AWS ALB/NLB) are the easiest option - no infrastructure to manage.

Q: How does a load balancer handle WebSocket connections?

WebSockets start as an HTTP upgrade request. An L7 load balancer detects the Upgrade: websocket header and keeps the TCP connection open, forwarding all subsequent frames to the same backend. The connection is long-lived - it stays open until the client or server closes it. This means WebSocket connections are not distributed per-request but per-connection. A backend handling many WebSocket connections will accumulate them over time. Least connections is the right algorithm here.

Q: What is the difference between a load balancer and an API gateway?

A load balancer distributes traffic across backends. An API gateway does that plus: authentication, rate limiting, request transformation, response caching, API versioning, and developer portal features. An API gateway is a load balancer with application-level features. In practice, many systems use both: an API gateway at the edge for auth and rate limiting, and a load balancer internally for service-to-service traffic.

Interview questions

Q1: You have 10 backend servers behind a load balancer. One server is consistently slower than the others. What happens with round robin and how do you fix it?

Strong answer: With round robin, the slow server gets the same number of requests as the fast ones. But because it processes them slower, it accumulates a backlog. Its response times increase further. Clients waiting for responses from that server experience high latency. The fix depends on the cause. If the server is underpowered, either replace it with a same-spec server or use weighted round robin to send it fewer requests. If the slowness is intermittent (GC pauses, noisy neighbor), switch to least connections or least response time - these algorithms naturally send fewer requests to slow servers. Also investigate why one server is slower: is it a hardware issue, a hot partition in the data it is serving, or a software bug?

Q2: Design the load balancing strategy for a chat application with 1 million concurrent WebSocket connections.

Strong answer: WebSocket connections are long-lived, so you need least connections to distribute them evenly. Each backend server can handle maybe 50,000-100,000 concurrent WebSocket connections (limited by file descriptors and memory). So you need 10-20 backend servers for 1 million connections, plus headroom for failures and deploys. Use an L7 load balancer that supports WebSocket (nginx, HAProxy, AWS ALB). The key challenge is that messages between users on different servers need to be routed correctly - use a pub/sub system (Redis pub/sub or Kafka) so any server can publish a message and the server holding the recipient’s connection receives it. For horizontal scaling, partition connections across server groups using consistent hashing by conversation (room) ID rather than user ID, so users in the same conversation tend to land on the same server, reducing cross-server message routing. Hashing by user ID spreads load evenly but does nothing to co-locate the members of a conversation.

Q3: A load balancer health check passes but users are still getting errors. How is this possible?

Strong answer: Several scenarios. The health check endpoint is too shallow - it returns 200 OK but does not actually verify that the application can serve real requests (no database check, no dependency check). The health check passes but the application is in a degraded state that only affects certain request types. There is a race condition: the health check passes, the backend is added to rotation, but it takes a few seconds to fully warm up (JVM JIT compilation, connection pool initialization) and the first real requests fail. The load balancer’s health check interval is too long - the backend failed between checks and is serving errors until the next check detects the failure. Fix: make health checks deep (check all critical dependencies), use shorter check intervals for faster detection, and implement slow start to avoid hammering newly added backends.