Rate Limiting: Friend or Foe?



A bot hammers your signup endpoint 10,000 times a minute. Your DB is now a bot playground.


It’s 2:47 AM on a Tuesday when PagerDuty fires. The alert title says “PostgreSQL - connection pool exhausted.” The body says active_connections: 100/100. You open the dashboard. Traffic looks normal - around 200 requests per minute across the whole app - except for one endpoint. POST /api/signup is fielding 10,000 requests per minute. From a single IP range in eastern Europe.

The requests are syntactically valid. Real-looking names. Email addresses with plausible domains. Passwords that pass your complexity rules. Each one triggers a SELECT for username uniqueness, another SELECT for email uniqueness, a bcrypt.hash that burns 100ms of CPU, and then an INSERT. One hundred database connections occupied. All of them processing garbage. Real users trying to sign up get a 502 Bad Gateway because the connection pool has nothing left to give.

You block the IP. Requests shift to a different one. You block that. Five more appear. By 3:15 AM you are manually playing whack-a-mole against a rotating proxy network, your only weapon a block list, while the database slowly drowns under the load. You are losing.

The post-mortem will note - without irony - that the fix took eleven minutes to deploy. The monitoring failure took three weeks to discover. The absence of rate limiting had existed for four years.

This is the rate limiting problem. Five algorithms exist to solve it. Most teams pick the wrong one.

Why This Happens

Open APIs are, by default, unconditional. Your server has no concept of “this IP is sending too much traffic” unless you explicitly encode that rule somewhere. HTTP is stateless. Each request arrives as if it were the first.

The failure chain is mechanical:

Bot sends 10,000 signup requests/min
  → API server accepts all (no guard)
    → Each request: SELECT + SELECT + bcrypt(100ms) + INSERT
      → Connection pool: 100 slots fill instantly
        → Legitimate requests queue behind bot traffic
          → Queue depth grows: 50, 500, 5,000
            → Requests time out before a connection is granted
              → HTTP 502 / 503 for real users
                → Business: incident, refunds, churn

The deeper problem is that bot traffic looks like real traffic at the HTTP layer. You cannot distinguish intent from a TCP packet. What you can do is enforce a policy: no single identity - IP, user ID, API key, or device fingerprint - gets unlimited access to a resource. The mechanism for enforcing that policy is rate limiting.

The Naive Solution (and Where It Breaks)

The first thing most engineers reach for is fixed window counter rate limiting. Track how many requests an IP has made in the current time window. Reject requests that exceed the limit.

Figure: No rate limiting: bot floods API, DB overwhelmed

The implementation is simple:

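import time

from redis import Redis

redis = Redis()  # assumes a Redis server reachable on localhost:6379

def check_rate_limit(ip: str, window_seconds: int = 60, max_requests: int = 100) -> bool:
    # Key rotates every window_seconds
    key = f"ratelimit:{ip}:{int(time.time() // window_seconds)}"
    count = redis.incr(key)
    if count == 1:
        # First hit in this window: let the counter expire along with the window
        redis.expire(key, window_seconds)
    return count <= max_requests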

At small scale this works fine:

100 users, 5 req/min each: peak count = 5  →  well under limit  →  no problem

At real bot scale, it breaks in a specific way:

Small scale:  1 IP, 5 req/min   →  count stays at 5  →  fine
Large scale:  1 IP, 10K req/min →  count hits 100 in first second, rest rejected
              bot adapts: stays idle until just before the window resets at :00
              sends 100 req at :59.999, then 100 req at :00.001
              = 200 requests in 2 milliseconds
              your "100 requests per minute" limit admits 200 across the boundary

This is the boundary burst problem, and it is not theoretical. Any rate-limited endpoint using fixed windows can be exploited by bursting traffic precisely at the window boundary - doubling throughput while technically respecting the per-window limit. A patient bot will do this automatically.
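The boundary burst is easy to reproduce without Redis. The sketch below is an in-memory stand-in for the counter above; the class name FixedWindowLimiter and the explicit now argument are assumptions made so the timing can be controlled in a test.

```python
from collections import defaultdict


class FixedWindowLimiter:
    """In-memory fixed window counter (illustrative stand-in for the Redis version)."""

    def __init__(self, max_requests: int = 100, window_seconds: int = 60) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.counts: dict[int, int] = defaultdict(int)

    def allow(self, now: float) -> bool:
        # Same bucketing rule as the Redis key: one counter per window index
        window = int(now // self.window_seconds)
        self.counts[window] += 1
        return self.counts[window] <= self.max_requests


limiter = FixedWindowLimiter(max_requests=100, window_seconds=60)

# 100 requests just before the boundary, 100 just after it
before = sum(limiter.allow(59.999) for _ in range(100))
after = sum(limiter.allow(60.001) for _ in range(100))
accepted_in_burst = before + after
print(accepted_in_burst)  # 200 accepted within about 2 ms
```

Both batches pass because they land in different window indexes, even though they are only 2 ms apart.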

The Better Solution

There is no single correct rate limiting algorithm. Each solves a different failure mode. The question is which failure mode you actually care about.

Token Bucket (Smooth Sustained Traffic with Burst Headroom)

The token bucket algorithm stores tokens in a “bucket” per identity. Tokens refill at a fixed rate. Each request consumes one token. If the bucket is empty, the request is rejected with HTTP 429.

Figure: Token bucket: tokens refill at fixed rate, requests consume tokens

The key insight is that token bucket decouples burst capacity from sustained rate. A bucket with capacity=100 and refill_rate=10/sec lets a user send 100 requests instantly (burst), then enforces a steady 10 requests per second afterward. This is exactly the right model for an API: legitimate users occasionally need to do batch operations; bots sustain high volume indefinitely.

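def token_bucket(key: str, capacity: int = 100, refill_rate: float = 10.0) -> tuple[bool, float]:
    # Note: this read-modify-write is not atomic, so concurrent callers can race.
    # Wrap it in a Lua script or WATCH/MULTI/EXEC before relying on it under load.
    now = time.time()
    data = redis.hgetall(f"rl:tb:{key}")  # redis-py returns bytes keys by default

    # A missing hash means a new identity: start with a full bucket
    tokens = float(data.get(b"tokens", capacity))
    last_refill = float(data.get(b"last_refill", now))

    # Credit the tokens earned since the last accepted request, capped at capacity
    elapsed = now - last_refill
    tokens = min(capacity, tokens + elapsed * refill_rate)

    if tokens < 1.0:
        return False, 0.0  # denied - bucket empty

    tokens -= 1.0
    pipe = redis.pipeline()
    pipe.hset(f"rl:tb:{key}", mapping={"tokens": tokens, "last_refill": now})
    pipe.expire(f"rl:tb:{key}", 3600)
    pipe.execute()
    return True, tokens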

The response to a rejected request should include a Retry-After header. Calculate it from the refill rate: if the bucket needs n tokens and refills at r/sec, the client should wait n/r seconds. This prevents retry storms where clients immediately hammer the endpoint again.
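A minimal in-memory sketch of that Retry-After calculation follows. The TokenBucket dataclass, its take method, and the explicit now argument are illustrative assumptions rather than part of the Redis implementation; the point is only the n/r arithmetic, rounded up to whole seconds.

```python
import math
from dataclasses import dataclass


@dataclass
class TokenBucket:
    capacity: float = 100.0
    refill_rate: float = 10.0  # tokens per second
    tokens: float = 100.0
    last_refill: float = 0.0

    def take(self, now: float, cost: float = 1.0) -> tuple[bool, int]:
        """Return (allowed, retry_after_seconds)."""
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True, 0
        # Missing tokens divided by the refill rate, rounded up to whole seconds
        missing = cost - self.tokens
        return False, math.ceil(missing / self.refill_rate)


bucket = TokenBucket(capacity=5, refill_rate=0.5, tokens=5)
burst = [bucket.take(now=0.0) for _ in range(5)]   # drains the bucket
denied = bucket.take(now=0.0)                      # (False, 2): one token at 0.5/sec
recovered = bucket.take(now=2.0)                   # allowed again after waiting
```

Rounding up matters: telling a client to wait 1 second when the token arrives after 2 just produces another 429.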

Leaky Bucket (Predictable Output Rate)

The leaky bucket algorithm inverts the model. Incoming requests fill a fixed-size queue. The queue drains at a constant rate - the “leak.” If the queue is full, new requests are dropped immediately rather than queued.

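def leaky_bucket(key: str, rate: float = 10.0, capacity: int = 100) -> bool:
    now = time.time()
    data = redis.hgetall(f"rl:lb:{key}")

    queue_size = int(data.get(b"queue", 0))
    last_leak = float(data.get(b"last_leak", now))

    # Drain whole requests only, and advance last_leak by exactly the time those
    # leaks account for. Resetting it to now on every call would discard the
    # fractional progress, and a steady stream of requests would never drain.
    leaked = int((now - last_leak) * rate)
    queue_size = max(0, queue_size - leaked)
    last_leak = last_leak + leaked / rate if queue_size > 0 else now

    if queue_size >= capacity:
        return False  # queue full - drop request

    queue_size += 1
    redis.hset(f"rl:lb:{key}", mapping={"queue": queue_size, "last_leak": last_leak})
    redis.expire(f"rl:lb:{key}", 3600)
    return True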

Leaky bucket is the right tool when the downstream system needs smooth, predictable input - a payment provider’s API, a third-party webhook receiver, or a rate-limited SMS gateway. It trades burst flexibility for output regularity. A token bucket lets a user spend accumulated tokens in a burst; a leaky bucket always smooths the output regardless of burst.
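To make the smoothing concrete, here is a minimal in-memory sketch of the queueing form of the leaky bucket. The function name leaky_bucket_schedule and its parameters are illustrative assumptions; it returns when each request would be released downstream, or None if the queue was full on arrival.

```python
from typing import Optional


def leaky_bucket_schedule(
    arrivals: list[float], rate: float, capacity: int
) -> list[Optional[float]]:
    """Return a release time per arrival, or None where the request is dropped."""
    interval = 1.0 / rate           # constant spacing between released requests
    departures: list[float] = []    # release times of accepted requests
    schedule: list[Optional[float]] = []
    for arrived_at in arrivals:
        # Requests accepted earlier but not yet released are still in the queue
        queued = sum(1 for released_at in departures if released_at > arrived_at)
        if queued >= capacity:
            schedule.append(None)   # queue full: drop immediately
            continue
        previous = departures[-1] if departures else float("-inf")
        released_at = max(arrived_at, previous + interval)
        departures.append(released_at)
        schedule.append(released_at)
    return schedule


# Five requests arrive at once; the bucket leaks 2 per second and queues up to 3
schedule = leaky_bucket_schedule([0.0] * 5, rate=2.0, capacity=3)
print(schedule)  # [0.0, 0.5, 1.0, 1.5, None]
```

The burst arrives in one instant but leaves at a steady 2 per second, which is the property a downstream payment or SMS provider cares about.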

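The downside: the bucket only tracks what is still queued. If a legitimate user sends 50 rapid requests while the bucket is near capacity, most of them are dropped, regardless of how quiet that user was earlier. Token bucket handles this better, because idle time accumulates as burst credit.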

Sliding Window Counter (Production Default)

The sliding window counter is the algorithm most production systems actually use, because it eliminates the boundary burst problem with O(1) memory per user.

Figure: Fixed window boundary burst vs sliding window accurate counting

The pure sliding window approach stores every request timestamp in a sorted set:

# Add current request with millisecond timestamp as score
ZADD rate:signup:192.168.1.1 1716812867000 "a3f2-uuid-here"

# Evict expired entries (older than 60 seconds)
ZREMRANGEBYSCORE rate:signup:192.168.1.1 0 1716812807000

# Count remaining in-window requests
ZCARD rate:signup:192.168.1.1

This is perfectly accurate but O(N) memory per user. The hybrid approximation uses only two counters per user - one for the current window, one for the previous - and computes a weighted sum:

current_weight = requests_in_current_window
previous_weight = requests_in_previous_window * (1 - elapsed_fraction_of_current_window)
estimated_rate = previous_weight + current_weight

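If the previous window had 80 requests, and the current window is 30% complete, the estimated rate is 80 * 0.70 + current = 56 + current. The approximation assumes the previous window's requests were evenly spread across it. For typical traffic the error is small, roughly 3%, and memory stays constant; traffic that clustered at one end of the previous window will be over- or under-counted by more.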
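The weighted sum can be sketched in a few lines of in-memory Python. The class name SlidingWindowCounter and the explicit now argument are assumptions for illustration; a Redis version would keep the two counters under per-window keys instead of instance attributes.

```python
class SlidingWindowCounter:
    """Two-counter approximation of a sliding window (illustrative sketch)."""

    def __init__(self, limit: int, window_seconds: float = 60.0) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self.current_window = 0
        self.current_count = 0
        self.previous_count = 0

    def _roll(self, now: float) -> None:
        window = int(now // self.window_seconds)
        if window == self.current_window:
            return
        # One step forward keeps the old count; a longer gap means the
        # previous window saw no traffic at all
        self.previous_count = self.current_count if window == self.current_window + 1 else 0
        self.current_count = 0
        self.current_window = window

    def estimate(self, now: float) -> float:
        self._roll(now)
        elapsed_fraction = (now % self.window_seconds) / self.window_seconds
        return self.previous_count * (1.0 - elapsed_fraction) + self.current_count

    def allow(self, now: float) -> bool:
        if self.estimate(now) >= self.limit:
            return False
        self.current_count += 1
        return True


limiter = SlidingWindowCounter(limit=100, window_seconds=60.0)
for _ in range(80):
    limiter.allow(now=30.0)            # 80 requests in the first window
estimate = limiter.estimate(now=78.0)  # next window, 30% complete: about 56

# The boundary burst that doubled throughput under fixed windows
burst = SlidingWindowCounter(limit=100, window_seconds=60.0)
before = sum(burst.allow(59.999) for _ in range(100))
after = sum(burst.allow(60.001) for _ in range(100))
```

Just past the boundary the previous window still carries almost its full weight, so the second batch is nearly all rejected.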

For high-throughput endpoints using the sorted-set version, make the check atomic with a Lua script - otherwise ZREMRANGEBYSCORE, ZADD, and ZCARD issued by different processes can interleave and admit more requests than the limit:

local key = KEYS[1]
local now = tonumber(ARGV[1])
local window_ms = tonumber(ARGV[2])
local limit = tonumber(ARGV[3])
local request_id = ARGV[4]

redis.call("ZREMRANGEBYSCORE", key, 0, now - window_ms)
local count = tonumber(redis.call("ZCARD", key))

if count < limit then
  redis.call("ZADD", key, now, request_id)
  redis.call("EXPIRE", key, math.ceil(window_ms / 1000) + 1)
  return 1
end
return 0

Distributed Rate Limiting

A single Redis instance is a single point of failure. For multi-region or high-availability systems, distributed rate limiting requires a strategy for coordinating counters across nodes.

The naive approach - one global Redis counter - works but adds a network hop to every request and creates a bottleneck. The practical alternatives are:

Local + sync: each app server maintains a local counter and syncs to Redis every N milliseconds. The per-request overhead is near zero, but between syncs the fleet can overshoot the limit by up to roughly N * servers * rate requests in the worst case.

Token bucket with approximation: each app server holds a local token bucket. At a fixed interval (for example, every tenth of the window), it reconciles with Redis by exchanging a fraction of its tokens for the global state. Stripe has publicly described Redis-backed token buckets for its API rate limiters, though not necessarily this exact reconciliation scheme.

Redis Cluster with slot-based sharding: hash the rate limit key to a consistent Redis slot. All requests for a given user always land on the same shard. Provides O(1) counter operations with horizontal scale.
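The overshoot in the local + sync approach is easiest to see in a toy model. In the sketch below, SharedCounter is a hypothetical stand-in for a Redis counter updated with one increment per sync, and LocalSyncLimiter is an illustrative name; window resets are left out to keep the example short.

```python
class SharedCounter:
    """Stand-in for a global Redis counter (hypothetical helper)."""

    def __init__(self) -> None:
        self.value = 0

    def incr_by(self, amount: int) -> int:
        self.value += amount
        return self.value


class LocalSyncLimiter:
    """Counts locally and reports to the shared counter only on sync()."""

    def __init__(self, shared: SharedCounter, limit: int) -> None:
        self.shared = shared
        self.limit = limit
        self.pending = 0        # accepted locally, not yet reported
        self.known_global = 0   # global count as of the last sync

    def allow(self) -> bool:
        if self.known_global + self.pending >= self.limit:
            return False
        self.pending += 1
        return True

    def sync(self) -> None:
        # Push local hits and pull the new global total in one step
        self.known_global = self.shared.incr_by(self.pending)
        self.pending = 0


shared = SharedCounter()
servers = [LocalSyncLimiter(shared, limit=100) for _ in range(3)]

# Worst case: every server receives a full quota of traffic before any sync
accepted = sum(server.allow() for server in servers for _ in range(100))

for server in servers:
    server.sync()
still_open = any(server.allow() for server in servers)
```

Three servers each believed they had the whole quota, so 300 requests passed a limit of 100. Shorter sync intervals shrink that gap at the cost of more Redis traffic.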

The Full Architecture

Figure: Full rate limiting architecture: Nginx edge, Redis, app middleware, PostgreSQL

The production answer is almost always two layers: edge rate limiting at Nginx or a CDN (cheap, no app server overhead, effective against volumetric attacks), and application-layer rate limiting in middleware (can enforce per-user limits that require auth context).

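# Nginx config - edge rate limiting by IP
limit_req_zone $binary_remote_addr zone=signup:10m rate=10r/m;

server {
    location /api/signup {
        limit_req zone=signup burst=5 nodelay;
        limit_req_status 429;
        # Route rejections to a dedicated block so only they carry Retry-After
        error_page 429 @signup_limited;
        proxy_pass http://api_backend;
    }

    location @signup_limited {
        # "always" is required: without it add_header skips error responses
        add_header Retry-After 60 always;
        return 429;
    }
}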

The burst=5 allows a client to briefly exceed the rate (up to 5 extra requests) without getting rejected, absorbing normal browser retry behavior. By default Nginx queues those burst requests and releases them at the configured rate; nodelay forwards them immediately instead, and anything beyond the burst is rejected with 429. Queuing under load makes things worse, not better.

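Application-layer middleware sits after Nginx and enforces limits against Redis. Note that express-rate-limit counts requests per window rather than running a token bucket, and the key below is the client IP because signup is unauthenticated; on authenticated routes, key on the user ID instead: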

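// Export shapes differ between major versions of both packages; check each README
const rateLimit = require("express-rate-limit");
const RedisStore = require("rate-limit-redis");

// Behind Nginx, req.ip is the proxy address unless Express trusts the proxy
app.set("trust proxy", 1);

const signupLimiter = rateLimit({
  windowMs: 60 * 1000,
  max: 10,
  store: new RedisStore({
    // redisClient: a connected node-redis client created elsewhere
    sendCommand: (...args) => redisClient.sendCommand(args),
  }),
  keyGenerator: (req) => req.ip,
  standardHeaders: true,    // sets RateLimit-* headers
  legacyHeaders: false,
  handler: (req, res) => {
    // resetTime is a Date: report the seconds remaining, not the epoch time
    const msLeft = req.rateLimit.resetTime - Date.now();
    res.status(429).json({
      error: "Too many requests",
      retryAfter: Math.max(1, Math.ceil(msLeft / 1000)),
    });
  },
});

app.post("/api/signup", signupLimiter, signupController);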

Component Deep Dives

Choosing the Rate Limit Key

IP address is the obvious key - and also the weakest. Bots rotate IPs. A better key hierarchy layers multiple signals:

Priority 1: authenticated user_id (post-login endpoints)
Priority 2: API key (third-party integrations)
Priority 3: device fingerprint (client-side token in header)
Priority 4: IP address (anonymous pre-auth endpoints like /signup)
Priority 5: IP + user agent hash (last resort against basic bots)

For a signup endpoint where the user is unauthenticated, you are stuck with IP and device fingerprint. This is why Nginx-level IP rate limiting is necessary but not sufficient - you also want client-side CAPTCHAs and email verification to raise the cost of bot account creation.
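One way to encode that hierarchy is a small function that picks the strongest identity available on the request. This is a sketch under assumptions: RequestContext, rate_limit_key, the rl: key prefix, and the mix_user_agent flag are all hypothetical names, and how device_id reaches the server is left open.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class RequestContext:
    ip: str
    user_agent: str = ""
    user_id: Optional[str] = None
    api_key: Optional[str] = None
    device_id: Optional[str] = None


def _digest(value: str) -> str:
    # Short hash so secrets and long strings never appear in key names
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]


def rate_limit_key(ctx: RequestContext, endpoint: str, mix_user_agent: bool = False) -> str:
    """Pick the strongest available identity, in the priority order listed above."""
    if ctx.user_id:
        return f"rl:{endpoint}:user:{ctx.user_id}"
    if ctx.api_key:
        return f"rl:{endpoint}:apikey:{_digest(ctx.api_key)}"
    if ctx.device_id:
        return f"rl:{endpoint}:device:{ctx.device_id}"
    if mix_user_agent:
        # Last resort: separates basic bots that share an IP but not a user agent
        return f"rl:{endpoint}:ipua:{ctx.ip}:{_digest(ctx.user_agent)}"
    return f"rl:{endpoint}:ip:{ctx.ip}"


anonymous = RequestContext(ip="203.0.113.7", user_agent="curl/8.0")
logged_in = RequestContext(ip="203.0.113.7", user_id="u42", api_key="secret")
signup_key = rate_limit_key(anonymous, "signup")
orders_key = rate_limit_key(logged_in, "orders")
```

Including the endpoint in the key keeps a noisy client on one route from consuming its quota on every other route.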

What to Return on 429

A well-formed rate limit rejection is not just {"error": "rate limited"}. It should include:

HTTP/1.1 429 Too Many Requests
Content-Type: application/json
Retry-After: 47
X-RateLimit-Limit: 10
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1716812920

{
  "error": "rate_limit_exceeded",
  "message": "Too many signup attempts. Try again in 47 seconds.",
  "retryAfter": 47
}

The Retry-After header is critical. Without it, legitimate clients implementing retry logic will retry immediately, creating a thundering herd that re-saturates the endpoint the moment the rate limit window resets.
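The header values are all derived from three numbers the limiter already has. A minimal sketch, assuming the limiter exposes the limit, the remaining quota, and the reset time as a Unix timestamp; rate_limit_headers is a hypothetical helper name.

```python
import math


def rate_limit_headers(limit: int, remaining: int, reset_epoch: float, now: float) -> dict[str, str]:
    """Build the headers for a rate limit response from the limiter's state."""
    # Round up so the client never retries a fraction of a second too early
    retry_after = max(0, math.ceil(reset_epoch - now))
    return {
        "Retry-After": str(retry_after),
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(int(reset_epoch)),
    }


# Reproduces the example response above: the window resets 47 seconds from now
headers = rate_limit_headers(limit=10, remaining=0, reset_epoch=1716812920, now=1716812873)
```

Computing Retry-After and X-RateLimit-Reset from the same reset time keeps the two values from contradicting each other.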

Monitoring Rate Limit Effectiveness

Rate limiting without observability is a policy without feedback. Track:

# Prometheus metrics to expose
rate_limit_requests_total{endpoint, key_type, result}  # allowed | denied
rate_limit_bucket_remaining{endpoint, percentile}       # p50, p95, p99 of remaining tokens
rate_limit_denial_rate{endpoint}                        # denials / total requests

A spike in rate_limit_denial_rate on POST /api/signup is your early warning for a bot attack - ideally hours before the database starts struggling.
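The denial rate itself is just two counters and a division. The sketch below keeps them in process with the standard library rather than a Prometheus client; RateLimitStats and its methods are hypothetical names, and in production the same counts would be exported as labelled counters.

```python
from collections import Counter


class RateLimitStats:
    """In-process tally of limiter decisions per endpoint (illustrative only)."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def record(self, endpoint: str, allowed: bool) -> None:
        result = "allowed" if allowed else "denied"
        self.counts[(endpoint, result)] += 1

    def denial_rate(self, endpoint: str) -> float:
        allowed = self.counts[(endpoint, "allowed")]
        denied = self.counts[(endpoint, "denied")]
        total = allowed + denied
        return denied / total if total else 0.0


stats = RateLimitStats()
for _ in range(90):
    stats.record("POST /api/signup", allowed=True)
for _ in range(10):
    stats.record("POST /api/signup", allowed=False)
signup_denial_rate = stats.denial_rate("POST /api/signup")  # 10 of 100 denied
```

Alerting on the ratio rather than the raw denial count avoids paging on ordinary traffic growth.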

Comparison Table

| Algorithm | Write Complexity | Memory | Boundary Burst | Burst Handling | Accuracy | Best Use Case |
|---|---|---|---|---|---|---|
| Fixed Window | O(1) | O(1) | Yes - 2x at edge | Hard cutoff | Low | Simple internal APIs |
| Token Bucket | O(1) | O(1) | No | Allows burst to capacity | High | Login, signup, general API |
| Leaky Bucket | O(1) | O(1) | No | Drops burst immediately | High | Payment APIs, SMS, webhooks |
| Sliding Window Log | O(log N) | O(N) | No | No burst allowed | Perfect | Strict billing quotas |
| Sliding Window Counter | O(1) | O(1) (two counters) | No | Approximated | Very High (~97%) | Production default |
| Distributed (local+sync) | O(1) local | O(1) per node | Possible during lag | Inherits base algo | Medium | Multi-region, high RPS |

Key Takeaways

  • Fixed window counter is fast and simple but the boundary burst vulnerability makes it unsuitable for bot-sensitive endpoints like signup or login.
  • Token bucket is the most versatile algorithm: it handles legitimate burst traffic gracefully while enforcing a sustainable long-term rate, making it the default choice for most APIs.
  • Leaky bucket is the right tool when you need to protect a downstream system from uneven input - it smooths output at the cost of delaying or dropping burst traffic.
  • Sliding window counter eliminates the boundary problem with O(1) memory by approximating the rate over the past N seconds using two counters and weighted interpolation.
  • Two-layer enforcement - Nginx at the edge for IP-level volumetric protection, Redis-backed middleware for per-user authenticated limits - is the pattern that survives real bot attacks.
  • Distributed rate limiting requires explicit coordination strategy. Naive global Redis is a bottleneck; local+sync or consistent-hash sharding are the production paths.
  • The 429 response matters: include Retry-After, X-RateLimit-Remaining, and X-RateLimit-Reset headers so legitimate clients back off correctly and retry storms are avoided.
  • Rate limiting is a detection surface: a spike in denial rate is your earliest signal of an ongoing bot attack, often hours before the database shows distress.

Without rate limiting, every public endpoint is an open invitation. The bot at 2:47 AM did not care that your connection pool was 100 slots deep. It just kept sending.

Frequently Asked Questions

Q: Should I rate limit by IP or by user? A: Both, at different layers. IP-based limits at the edge (Nginx, Cloudflare) stop volumetric attacks before they reach your app. User-based limits in your middleware enforce per-account quotas that are meaningful after authentication. For unauthenticated endpoints like /signup, IP is the only key the server can verify on its own - which is why you should complement it with bot detection signals like device fingerprints and behavioral analysis.

Q: What rate limits are reasonable for a signup endpoint? A: For a public signup flow: 5-10 attempts per IP per minute at the edge, with a burst of 2-3 for normal user behavior. If users can register multiple accounts, add a per-email limit of 3 per hour. For verified users hitting authenticated endpoints, limits can be 10-100x higher. The right number is derived from your 99th-percentile legitimate user behavior - not from what feels generous.

Q: Why does token bucket let you burst past the “limit”? A: Because the limit is on the sustained rate, not the instantaneous rate. A bucket with capacity=100 and refill_rate=10/sec says: “you can send 100 requests instantly (accumulated over 10 seconds of being idle), but after that you’re capped at 10/sec.” This matches real user behavior - someone opens your app, does a bunch of things quickly, then stops. If you need a strict cap on the instantaneous rate, use leaky bucket instead.

Q: Can I implement rate limiting without Redis? A: In a single-process app, yes - an in-memory map works fine. In a multi-process or multi-server deployment, you need a shared store. Redis is the dominant choice because it supports atomic operations (INCR, EXPIRE, Lua scripts) that make the algorithms race-condition-free. PostgreSQL can do it (advisory locks + counters), but the extra round-trip latency shows at scale.

Q: How do I handle rate limiting for a mobile app where many users share an IP (corporate networks, NAT)? A: Use IP as one signal among several. Pair it with a device fingerprint token (a UUID generated on first app install, sent in a custom header), and prefer user ID limits over IP limits wherever authentication exists. Cloudflare’s bot management tools can also classify traffic as “likely mobile NAT” and apply different policies. Pure IP limiting behind aggressive NAT will hurt legitimate users.

Q: What happens when Redis goes down and I can’t check rate limits? A: Choose a failure mode intentionally. Fail open (allow all traffic when Redis is unavailable) preserves availability but removes bot protection. Fail closed (reject all requests when Redis is unavailable) maintains security but can cause an outage if Redis hiccups. Most teams fail open with a circuit breaker and an alert. The key is to make the decision before 3 AM, not during.
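A fail-open wrapper with a simple circuit breaker might look like the sketch below. FailOpenLimiter, its parameters, and the injected check and clock callables are all assumptions for illustration; a real deployment would catch only the store's connection errors and fire an alert when the circuit trips.

```python
import time
from typing import Callable


class FailOpenLimiter:
    """Wraps a limiter check; allows traffic when the backing store is failing."""

    def __init__(
        self,
        check: Callable[[str], bool],
        failure_threshold: int = 3,
        cooldown_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.check = check
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self.failures = 0
        self.open_until = 0.0

    def allow(self, key: str) -> bool:
        now = self.clock()
        if now < self.open_until:
            return True  # circuit open: skip the store entirely and fail open
        try:
            allowed = self.check(key)
        except Exception:  # sketch only: narrow this to the store's error types
            self.failures += 1
            if self.failures >= self.failure_threshold:
                # Stop calling the store for a while (and alert here)
                self.open_until = now + self.cooldown_seconds
                self.failures = 0
            return True
        self.failures = 0
        return allowed


calls: list[str] = []
fake_time = [0.0]


def broken_store(key: str) -> bool:
    calls.append(key)
    raise ConnectionError("redis unavailable")


limiter = FailOpenLimiter(broken_store, failure_threshold=2, clock=lambda: fake_time[0])
results = [limiter.allow("ip:203.0.113.7") for _ in range(3)]
calls_while_open = len(calls)   # the third request never touched the store
fake_time[0] = 31.0
limiter.allow("ip:203.0.113.7")  # cooldown over: the store is probed again
```

Swapping the two `return True` lines in the failure paths for `return False` turns the same wrapper into a fail-closed policy, which makes the choice an explicit one-line decision.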

Interview Questions

Q: Walk me through the token bucket algorithm and explain when it fails.

Expected depth: Explain the mechanics (tokens, refill rate, capacity), implement check using Redis HGET/HSET/EXPIRE, and explain the edge cases: race conditions between GET and SET without atomic operations (use Lua or WATCH/MULTI/EXEC), the behavior when capacity >> sustained rate (allows burst), and why this is not the right algorithm when you need strict metering rather than burst-tolerant rate control.

Q: A competitor is reverse-engineering your API by walking sequential user IDs. Rate limiting doesn’t stop it because they stay under the limit. What do you do?

Expected depth: Rate limiting is not the right tool here - this is an authorization and enumeration problem. The interviewer wants to see you pivot to: UUIDs or ULIDs instead of sequential IDs, per-endpoint object-level authorization (GET /users/{id} should verify the caller has access), suspicious pattern detection (requests for IDs in arithmetic sequence), and optionally honeypot IDs that trigger alerts. Bonus: mention that rate limiting at the account level would still help if the attacker is authenticated.

Q: Design a rate limiting system for 1 million API clients making 1 billion requests per day across 3 geographic regions.

Expected depth: Capacity math first: 1B req/day = ~11,574 RPS global on average, with peaks above that. Discuss why global Redis is a bottleneck at this scale. Cover: Redis Cluster with consistent hash sharding per user ID, one cluster per region with cross-region replication for quota aggregation, local token bucket in the app server with Redis sync every 100ms to amortize latency, and the tradeoff (up to ~10% of a window's quota can slip past the limit during lag). Mention clock skew between nodes as a common source of rate limiting errors at this scale, since window boundaries and refill calculations depend on timestamps.
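The capacity arithmetic is short enough to check directly. The even split across regions is an assumption made for illustration; real traffic is rarely balanced, and peak load will sit above these averages.

```python
SECONDS_PER_DAY = 24 * 60 * 60

requests_per_day = 1_000_000_000
clients = 1_000_000
regions = 3

average_rps = requests_per_day / SECONDS_PER_DAY           # about 11,574 globally
per_region_rps = average_rps / regions                     # about 3,858 if evenly split
requests_per_client_per_day = requests_per_day / clients   # 1,000 per client

print(round(average_rps), round(per_region_rps), round(requests_per_client_per_day))
```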

Q: How does Nginx’s limit_req directive implement rate limiting, and what does burst do?

Expected depth: Nginx implements leaky bucket using the limit_req_zone shared memory zone keyed by a variable (typically $binary_remote_addr). The rate parameter defines the average allowed rate. burst allows temporary excess up to burst requests, which are queued and released at the configured rate rather than rejected; anything beyond burst is rejected. nodelay forwards those burst requests immediately instead of delaying them, while still counting them against the burst allowance. The candidate should note that burst without nodelay adds queuing delay (potentially seconds at high burst), which is often worse than rejection. Cover the memory sizing: per the Nginx documentation a state takes 64 bytes on 32-bit platforms and 128 bytes on 64-bit platforms, so 10m holds roughly 160K or 80K IPs respectively.

Q: Explain the difference between “rate limiting” and “throttling.”

Expected depth: Rate limiting is typically binary: you are allowed or you are not (HTTP 429). Throttling degrades service quality gracefully - a request might be delayed, served at lower priority, or return reduced data. Both are admission control mechanisms. In practice: rate limiting is appropriate for protecting against abuse (hard cutoff, clear 429 response), throttling is appropriate for fairness under load (degrade experience for high-volume tenants while maintaining availability for others). Mention that some systems combine both: rate limit the burst, throttle the sustained tail.
