Rate Limiting Algorithms: Controlling Traffic Without Dropping the Wrong Requests


Your API is getting hammered. A single client is making 10,000 requests per second. Your servers are at 100% CPU. Legitimate users are getting timeouts. You add rate limiting: 100 requests per minute per API key. The abusive client is blocked. But now a legitimate client that sends 100 requests in the first second of each minute is also blocked for the remaining 59 seconds, even though they are within the limit.

You chose the wrong algorithm. Fixed window rate limiting has a burst problem. Token bucket would have handled this correctly.

Why rate limiting exists

Rate limiting protects your system from:

  • Abuse - Malicious clients hammering your API
  • Bugs - A client with a retry loop gone wrong
  • Overload - Legitimate traffic that exceeds your capacity
  • Cost control - Expensive operations (AI inference, email sending) that should be metered

Rate limiting is also a product feature: tiered pricing (free tier: 100 req/min, paid tier: 10,000 req/min).

The four main algorithms

Fixed window counter

Divide time into fixed windows (e.g., 1-minute windows). Count requests per window. Reject requests that exceed the limit.

Implementation: Store a counter in Redis with a key like rate:user:123:2024-01-15-14:30. Increment on each request. Expire the key after the window ends.

Problem: The boundary between windows allows a burst of 2x the limit. A client can send 100 requests at 14:30:59 (end of window 1) and 100 requests at 14:31:00 (start of window 2). In a 2-second window, they sent 200 requests - double the limit.

graph LR
subgraph fixed["Fixed Window - Burst at Boundary"]
  W1["Window 1
14:30:00 - 14:31:00
100 requests allowed"]
  W2["Window 2
14:31:00 - 14:32:00
100 requests allowed"]
  BURST["Client sends 100 at 14:30:59
and 100 at 14:31:00
= 200 requests in 2 seconds
Both windows allow it"]
  W1 --> BURST
  W2 --> BURST
end

style BURST fill:#FCEBEB,stroke:#A32D2D,color:#791F1F
style W1 fill:#FAEEDA,stroke:#854F0B,color:#633806
style W2 fill:#FAEEDA,stroke:#854F0B,color:#633806

Sliding window log

Keep a log of timestamps for each request. On each new request, remove timestamps older than the window, count remaining timestamps, reject if over the limit.

Pros: Precise. No boundary burst problem.

Cons: Memory-intensive. Storing a timestamp per request for every client is expensive at scale.

Implementation: Redis sorted set with timestamps as scores. ZREMRANGEBYSCORE to remove old entries, ZCARD to count.

Sliding window counter

A hybrid approach. Keep counters for the current and previous windows. Estimate the count for the sliding window using a weighted average.

current_count = prev_window_count * (1 - elapsed_fraction) + current_window_count

If the previous window had 80 requests and the current window (40% elapsed) has 30 requests: current_count = 80 * 0.6 + 30 = 78

Pros: Memory-efficient (only two counters per client). More accurate than fixed window.

Cons: Approximate (not exact). The estimate can be slightly off.

Best for: Most production rate limiting. Good balance of accuracy and efficiency.

Token bucket

Each client has a bucket that holds tokens. Tokens are added at a fixed rate (e.g., 10 tokens per second). Each request consumes one token. If the bucket is empty, the request is rejected.

The bucket has a maximum capacity. Tokens accumulate up to the maximum, then stop. This allows bursting: a client that has not made requests for a while has a full bucket and can send a burst of requests.

Pros: Allows controlled bursting. Smooth rate limiting. Intuitive.

Cons: Slightly more complex to implement. Requires storing the token count and last refill time.

Implementation: Store {tokens: 95, last_refill: 1705312800} in Redis. On each request: calculate tokens added since last refill, add them (up to max), subtract 1 for the request, store back.

Leaky bucket

Requests go into a queue (the bucket). The queue drains at a fixed rate. If the queue is full, new requests are rejected.

Pros: Smooths out bursts. Output rate is constant regardless of input rate.

Cons: Adds latency (requests wait in the queue). Does not allow any bursting.

Best for: When you need a constant output rate (e.g., sending emails at exactly 100/second to avoid spam filters).

graph TB
subgraph token["Token Bucket"]
  TB1["Bucket capacity: 100 tokens"]
  TB2["Refill rate: 10 tokens/second"]
  TB3["Client idle for 5s: 50 tokens accumulated"]
  TB4["Client sends 50 requests instantly: allowed"]
  TB5["Client sends 51st request: rejected"]
  TB1 --> TB2 --> TB3 --> TB4 --> TB5
end

subgraph leaky["Leaky Bucket"]
  LB1["Queue capacity: 100 requests"]
  LB2["Drain rate: 10 requests/second"]
  LB3["Client sends 50 requests instantly"]
  LB4["All 50 queued, processed at 10/s"]
  LB5["Output: smooth 10 req/s"]
  LB1 --> LB2 --> LB3 --> LB4 --> LB5
end

style TB4 fill:#E1F5EE,stroke:#0F6E56,color:#085041
style TB5 fill:#FCEBEB,stroke:#A32D2D,color:#791F1F
style LB5 fill:#EEEDFE,stroke:#534AB7,color:#3C3489

Where it breaks or gets interesting

Distributed rate limiting

A single Redis instance is a single point of failure. With multiple Redis nodes, you need to decide: use a centralized counter (accurate but single point of failure) or use local counters per node (fast but approximate).

The Sliding Window Counter algorithm works well with local counters: each node tracks its own count, and you accept that the global count might be slightly over the limit. For most use cases, being 10-20% over the limit occasionally is acceptable.

For strict limits, use Redis Cluster with a consistent hash to route each client’s requests to the same Redis node.

Rate limiting at the right layer

Rate limiting can happen at multiple layers:

  • CDN/edge - Block abusive IPs before they reach your servers
  • API gateway - Rate limit by API key, user, or IP
  • Application - Rate limit specific expensive operations
  • Database - Connection pool limits are a form of rate limiting

Each layer has different granularity and cost. CDN rate limiting is cheapest (no application code involved). Application-level rate limiting is most flexible.

The retry storm

A client hits the rate limit. It retries immediately. It hits the rate limit again. It retries again. This creates a retry storm that makes the rate limiting worse.

Fix: return a Retry-After header with the time until the client can retry. Implement exponential backoff with jitter on the client side. Rate limit retries separately from initial requests.

Rate limiting by different dimensions

You might want to rate limit by:

  • IP address - Protects against unauthenticated abuse
  • API key - Per-client limits for authenticated clients
  • User ID - Per-user limits for logged-in users
  • Endpoint - Different limits for different operations (expensive endpoints get lower limits)
  • Global - Total requests across all clients (protects your infrastructure)

Combine multiple dimensions: a client might be limited to 100 req/min globally, but the /search endpoint is limited to 10 req/min because it is expensive.

Real-world systems

Stripe - Token bucket rate limiting. Different limits per API key tier. Returns 429 Too Many Requests with Retry-After header.

GitHub - Fixed window rate limiting. 5,000 requests per hour for authenticated requests. 60 requests per hour for unauthenticated. Returns X-RateLimit-Remaining and X-RateLimit-Reset headers.

Twitter - Sliding window rate limiting. Different limits per endpoint. Returns rate limit headers on every response.

Cloudflare - Rate limiting at the edge. Supports fixed window, sliding window, and token bucket. Can rate limit by IP, cookie, header, or query parameter.

AWS API Gateway - Token bucket rate limiting. Configurable burst limit (token bucket capacity) and steady-state rate (token refill rate).

nginx - limit_req_zone implements leaky bucket. limit_req applies the limit. Widely used for simple rate limiting without a separate service.

How to apply it in practice

Choosing an algorithm

  • Fixed window - Simple, good enough for most cases. Be aware of the boundary burst.
  • Sliding window counter - Better accuracy than fixed window, similar memory usage. Good default.
  • Token bucket - When you want to allow controlled bursting. Good for APIs where clients occasionally need to send a burst of requests.
  • Leaky bucket - When you need a constant output rate. Good for outbound rate limiting (sending emails, calling third-party APIs).

Response headers

Always return rate limit information in response headers:

X-RateLimit-Limit: 100
X-RateLimit-Remaining: 45
X-RateLimit-Reset: 1705312860
Retry-After: 30

This lets clients implement proper backoff without guessing.

Graceful degradation

When a client is rate limited, return 429 Too Many Requests with a Retry-After header. Do not return 503 Service Unavailable (that implies your service is down). Do not silently drop requests (the client will retry immediately).

For internal services, consider returning a degraded response instead of an error: serve cached data, return a subset of results, or queue the request for later processing.

FAQ

Q: What is the difference between rate limiting and throttling?

Rate limiting rejects requests that exceed a threshold. Throttling slows down requests (adds delay) rather than rejecting them. Rate limiting is more common for APIs. Throttling is used when you want to allow all requests but control the pace (e.g., a background job that should not overwhelm a downstream service).

Q: How do you rate limit WebSocket connections?

WebSocket connections are long-lived, so per-request rate limiting does not apply directly. Rate limit at the connection level (max connections per IP or user) and at the message level (max messages per second per connection). Implement the token bucket algorithm per connection: each connection has a bucket, messages consume tokens, the bucket refills over time.

Q: Should rate limits be per user or per IP?

Both, for different purposes. Per-IP limits protect against unauthenticated abuse (before the user logs in). Per-user limits enforce fair usage for authenticated clients. Per-API-key limits are used for developer APIs where each integration has its own limit. Use per-IP as a first line of defense, per-user/API-key for authenticated rate limiting.

Interview questions

Q1: Design a rate limiter for an API that allows 100 requests per minute per user. It must work across multiple API servers.

Strong answer: Use a sliding window counter stored in Redis. Key: rate:{user_id}:{window_start}. On each request: get the current window start (floor of current time to the minute), get the previous window start, calculate the weighted count using the sliding window formula, if under the limit increment the current window counter and allow the request, otherwise return 429. Use Redis atomic operations (MULTI/EXEC or Lua script) to prevent race conditions. For high availability, use Redis Cluster. For the sliding window calculation: count = prev_count * (1 - elapsed_seconds/60) + current_count. This is approximate but accurate enough for rate limiting. Return X-RateLimit-Remaining and X-RateLimit-Reset headers on every response.

Q2: Your rate limiter is causing legitimate users to be blocked. Investigation shows they send 100 requests in the first second of each minute. How do you fix this?

Strong answer: Switch from fixed window to token bucket. Token bucket allows bursting: a user who has not made requests recently has accumulated tokens and can send a burst. Configure the bucket with capacity 100 (max burst) and refill rate 100/60 per second (100 per minute steady state). A user who sends 100 requests in the first second uses all their tokens. They can then send requests at the refill rate (about 1.67 per second) for the rest of the minute. This is more fair than fixed window, which would block them for 59 seconds after the burst. The key insight: token bucket allows the burst but enforces the average rate over time.

Q3: How would you implement rate limiting for an expensive AI inference endpoint that costs 10x more than a regular API call?

Strong answer: Use a cost-based token bucket instead of a request-based one. Each user has a bucket of “cost units.” A regular API call costs 1 unit. An AI inference call costs 10 units. The bucket refills at a rate that corresponds to the user’s tier (e.g., 100 units per minute for free tier, 1000 units per minute for paid tier). This way, a free tier user can make 100 regular calls per minute OR 10 AI calls per minute OR any combination that totals 100 units. Return the cost of each operation in the response headers so clients can plan their usage. This is how OpenAI’s token-based rate limiting works - you are limited by tokens consumed, not by request count.