The Webhook That Tried 11,000 Times


reliability distributed-systems

System Design Scenario

The Webhook Retry Storm

When your idempotency layer breaks, one webhook becomes eleven thousand duplicate orders

It’s 3:42 AM on a Saturday. Your phone vibrates with a PagerDuty alert: order_count_anomaly - 11,247 orders created in last 15 minutes. You blink at the screen. Your e-commerce platform averages 200 orders per hour on weekends. Something is profoundly wrong.

You SSH into the production box, tail the logs, and watch the same Stripe webhook event - evt_1NqQPbL7xK9 with type payment_intent.succeeded - arrive over and over. Every 30 seconds, another POST hits your /webhooks/stripe endpoint. Every single one creates a new order. The customer paid $89.99 once for a pair of running shoes. Your system has now promised to ship them 11,247 pairs.

The root cause takes you twenty minutes to find. A deploy three days ago introduced a subtle bug: the webhook handler validates the Stripe signature, processes the payment, creates the order, then attempts to write a “processed” record to your idempotency table. But the idempotency table write is happening after the order creation, and it’s failing silently because someone renamed a column in a migration. The handler catches the error, returns HTTP 500 to Stripe, and Stripe does exactly what it’s supposed to do - retry the delivery.

Think of it like a postal service that keeps re-delivering a package because nobody answers the door to sign for it. Except every time the package arrives, your system opens a new order for it instead of recognizing “we already have this one.” Stripe retries failed deliveries with exponential backoff for up to three days in live mode, and your endpoint has answered every attempt with a 500 since the deploy. Three days of retries. Eleven thousand deliveries of the same event.

This is the idempotency problem.

Why This Happens

Every payment provider - Stripe, Razorpay, Adyen, PayPal - operates on at-least-once delivery semantics. They guarantee your webhook will be delivered at least once, but they explicitly do not guarantee exactly once. The contract is simple: they POST the event to your endpoint, and if they don’t receive an HTTP 2xx response within their timeout window, they retry.

This design is intentional. In distributed systems, exactly-once delivery between two independent services is theoretically impossible without coordination overhead that would make payment processing unusably slow. The providers push the deduplication responsibility to you. Their documentation says so: Stripe’s webhook guide warns that an endpoint might occasionally receive the same event more than once, and recommends logging the event IDs you have processed so you can skip repeats.

The failure chain in a broken idempotency setup looks like this:

Provider sends webhook (event_id: evt_abc123)
  → Your endpoint receives POST
    → Signature verification passes
      → Business logic executes (order created!)
        → Idempotency record write FAILS (bug, timeout, schema mismatch)
          → Handler throws exception
            → HTTP 500 returned to provider
              → Provider schedules retry
                → Same event arrives again
                  → No record of prior processing exists
                    → Business logic executes AGAIN (duplicate order!)
                      → Loop continues for 72 hours

The critical flaw is ordering: if your side effects (order creation, fulfillment triggers, email sends) execute before your idempotency record is durably written, any failure in the idempotency write creates a window where the event appears unprocessed to the next delivery attempt.
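The failure chain above can be reproduced in a few lines. This is a minimal simulation, not the production handler: `record_processed`, `handle`, and `deliver` are hypothetical names, the "provider" is a plain loop, and the idempotency write is hard-coded to fail the way the renamed column makes it fail in the story.

```python
class IdempotencyWriteError(Exception):
    """Stands in for the failing write to the idempotency table."""


def record_processed(processed_ids, event_id):
    # Simulates the broken write from the incident: it always fails.
    raise IdempotencyWriteError("column does not exist")


def handle(event_id, orders, processed_ids):
    """Broken ordering: the side effect runs before the dedup record is written."""
    if event_id in processed_ids:
        return 200                     # known event: acknowledge and skip
    orders.append(event_id)            # order created first
    try:
        record_processed(processed_ids, event_id)
    except IdempotencyWriteError:
        return 500                     # provider sees a failure and retries
    return 200


def deliver(event_id, orders, processed_ids, max_attempts):
    """Mimic a provider that redelivers until it receives a 2xx response."""
    for attempt in range(1, max_attempts + 1):
        if 200 <= handle(event_id, orders, processed_ids) < 300:
            return attempt
    return max_attempts


orders, processed_ids = [], set()
attempts = deliver("evt_abc123", orders, processed_ids, max_attempts=5)
print(f"{attempts} deliveries, {len(orders)} orders")
```

Every delivery attempt adds an order because the dedup record never lands, so the order count grows in lockstep with the retry count.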

Core Insight

Idempotency isn’t about preventing retries - retries are a feature of reliable systems. Idempotency is about ensuring that processing the same event multiple times produces the same outcome as processing it once. The provider WILL retry. Your system must be ready.

The Naive Solution

The first thing most engineers reach for: check a database table before processing.

def handle_webhook(request):
    event = verify_stripe_signature(request)
    
    # Check if already processed
    existing = db.query("SELECT id FROM processed_events WHERE event_id = %s", event.id)
    if existing:
        return Response(status=200)
    
    # Process the event
    order = create_order(event.data)
    send_confirmation_email(order)
    trigger_fulfillment(order)
    
    # Mark as processed
    db.execute("INSERT INTO processed_events (event_id) VALUES (%s)", event.id)
    
    return Response(status=200)

[Figure: Broken idempotency flow showing webhook retry storm creating duplicate orders]

This breaks in three ways. First, the race condition: two concurrent deliveries of the same event both check the table simultaneously, both find no existing record, both proceed to create orders. Second, the partial failure: if create_order succeeds but the INSERT into processed_events fails (network blip, disk full, deadlock), you’ve created the order but lost the dedup record. Third, the response timing: if your processing takes longer than the provider’s timeout (typically 5-30 seconds), the provider retries while you’re still processing the first delivery.

The scale breakpoint is surprisingly low:

1 event/minute  → naive check works fine, race window is tiny
10 events/sec   → race conditions start appearing under load
100 events/sec  → guaranteed duplicates on hot events
burst (retries) → complete failure, every retry creates a duplicate

Warning

A SELECT-then-INSERT pattern without locking or unique constraints is never idempotent under concurrency. It’s the distributed systems equivalent of a check-then-act bug - the classic TOCTOU (time-of-check to time-of-use) vulnerability.
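The race is easy to see in isolation. The sketch below is an illustration under assumptions: an in-memory set and list stand in for the `processed_events` and `orders` tables, and a `threading.Barrier` forces the unlucky interleaving that real traffic only hits occasionally. `naive_handler`, `atomic_handler`, and `run_two_deliveries` are hypothetical names.

```python
import threading

both_checked = threading.Barrier(2)   # makes both deliveries check before either writes
claim_lock = threading.Lock()


def naive_handler(event_id, processed, orders):
    already_seen = event_id in processed   # time of check
    both_checked.wait(timeout=5)           # the other delivery performs its check too
    if already_seen:
        return
    orders.append(event_id)                # time of use: order created
    processed.add(event_id)


def atomic_handler(event_id, processed, orders):
    with claim_lock:                       # check and claim are one indivisible step
        if event_id in processed:
            return
        processed.add(event_id)
    orders.append(event_id)


def run_two_deliveries(handler):
    """Deliver the same event twice concurrently and count the resulting orders."""
    processed, orders = set(), []
    threads = [
        threading.Thread(target=handler, args=("evt_abc123", processed, orders))
        for _ in range(2)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(orders)


naive_orders = run_two_deliveries(naive_handler)
atomic_orders = run_two_deliveries(atomic_handler)
print(naive_orders, atomic_orders)
```

The naive handler produces two orders for one event, while the version that checks and claims in one step produces one. A process-local lock only works inside a single process; across workers the same role is played by an atomic store operation or a unique constraint.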

The Better Solution

The fix has three layers, each protecting against a different failure mode. Layer 1 stops the retry storm immediately. Layer 2 provides fast deduplication. Layer 3 guarantees correctness even if Layer 2 fails.

Layer 1: Acknowledge Fast, Process Later

The single most impactful change: decouple receiving the webhook from processing it. Return HTTP 200 the moment you’ve verified the signature and enqueued the event. The provider stops retrying instantly.

import json

def handle_webhook(request):
    event = verify_stripe_signature(request)

    # Enqueue for async processing - fast, reliable.
    # `queue` stands for your queue client. If this call raises, the handler
    # returns an error and the provider's retry redelivers the event.
    queue.send_message(
        body=json.dumps(event),
        deduplication_id=event["id"],  # SQS FIFO dedup
        group_id=event["data"]["object"]["id"]
    )

    # Return 200 IMMEDIATELY - stop the retry storm
    return Response(status=200)

This is the “sign for the package at the door” approach. You don’t process the payment right there in the doorway - you take the envelope inside and deal with it on your own timeline. The postal service marks it as delivered and moves on.
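To make the split between acknowledging and processing concrete, here is a small in-process sketch. It assumes a `queue.Queue` and a background thread in place of a durable queue such as SQS, and it skips signature verification; the names `event_queue`, `worker`, and `processed_orders` are illustrative only.

```python
import json
import queue
import threading

event_queue = queue.Queue()
processed_orders = []


def handle_webhook(raw_body):
    """Ingress: parse, enqueue, acknowledge. No business logic runs here."""
    event = json.loads(raw_body)       # signature check omitted in this sketch
    event_queue.put(event)
    return 200


def worker():
    """Background consumer: drains the queue at its own pace."""
    while True:
        event = event_queue.get()
        if event is None:              # sentinel: shut down
            event_queue.task_done()
            break
        processed_orders.append(event["id"])   # placeholder for real processing
        event_queue.task_done()


consumer = threading.Thread(target=worker, daemon=True)
consumer.start()
status = handle_webhook('{"id": "evt_abc123", "type": "payment_intent.succeeded"}')
event_queue.join()                     # wait until the worker has caught up
event_queue.put(None)
consumer.join(timeout=5)
print(status, processed_orders)
```

An in-memory queue loses its contents if the process dies, so this only demonstrates the control flow. The acknowledgment is safe to send early only when the enqueue itself is durable.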

Real World

Shopify, which sends webhooks to merchant apps, expects a 200-series response within five seconds and treats anything slower as a failed delivery. Its documentation recommends the same pattern: acknowledge quickly, then queue the payload and process it later.

Layer 2: Redis Deduplication (Fast Path)

For the async worker that processes queued events, use Redis as a high-speed dedup cache. The SET NX (set if not exists) command is atomic - no race condition possible.

import json

import redis

r = redis.Redis()

def process_event(message):
    # `message` is the queue message (exposing .body and .delete());
    # `metrics` is whatever metrics client your service already uses.
    event = json.loads(message.body)
    event_id = event["id"]
    key = f"idem:{event_id}"

    # Atomic check-and-set: returns True only if the key was new, None otherwise
    is_new = r.set(
        key,
        "processing",
        nx=True,      # only set if the key does not exist
        ex=86400      # expire after 24 hours
    )

    if not is_new:
        # Already seen this event - acknowledge and skip
        message.delete()
        metrics.increment("webhook.dedup.hit")
        return

    try:
        execute_business_logic(event)
        r.set(key, "completed", ex=86400)
    except Exception:
        # Release the claim so the queue's redelivery can reprocess
        r.delete(key)
        raise  # Let the queue retry

    # Acknowledge only after processing succeeded, so the message is not redelivered
    message.delete()

The Redis SET NX command is the equivalent of an atomic compare-and-swap. Two workers pulling the same event from the queue simultaneously will race to set the key, and exactly one will win. The loser sees is_new = False and skips processing.
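The "exactly one wins" property can be checked without a Redis server. The sketch below uses `FakeRedis`, a hypothetical in-memory stand-in that mimics only the `set(key, value, nx=..., ex=...)` behavior relied on above (returning `True` on a write and `None` when NX blocks it); it is not the real client and ignores everything else Redis does.

```python
import threading
import time


class FakeRedis:
    """In-memory stand-in for SET with NX/EX semantics; for illustration only."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def set(self, key, value, nx=False, ex=None):
        now = time.monotonic()
        with self._lock:                       # the whole check-and-set is one step
            current = self._data.get(key)
            if current is not None and current[1] is not None and current[1] <= now:
                current = None                 # treat an expired key as absent
            if nx and current is not None:
                return None                    # key exists: NX refuses to overwrite
            expires_at = now + ex if ex is not None else None
            self._data[key] = (value, expires_at)
            return True


def race(n_workers, event_id):
    """Start n workers that all try to claim the same event at once."""
    r = FakeRedis()
    start = threading.Barrier(n_workers)
    winners = []

    def worker(name):
        start.wait(timeout=5)
        if r.set(f"idem:{event_id}", "processing", nx=True, ex=86400):
            winners.append(name)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return winners


winners = race(8, "evt_abc123")
print(len(winners))
```

However the eight workers interleave, only one claim succeeds, which is the guarantee the worker code depends on.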

Real World

Stripe’s documentation recommends guarding against duplicate deliveries by logging the event IDs you have processed and skipping any event whose ID is already logged. A fast in-memory check backed by a durable store is one way to implement that log.

Layer 3: Database Unique Constraint (Durability Guarantee)

Redis is fast but volatile. If Redis restarts, your dedup keys vanish. The final safety net is a database-level unique constraint that makes duplicate processing physically impossible.

CREATE TABLE idempotency_keys (
    event_id VARCHAR(255) PRIMARY KEY,  -- the primary key already enforces uniqueness
    status VARCHAR(20) NOT NULL DEFAULT 'processing',
    result JSONB,
    created_at TIMESTAMP NOT NULL DEFAULT NOW(),
    completed_at TIMESTAMP
);

-- The processing function uses INSERT ... ON CONFLICT.
-- A row comes back only for the first insert of a given event_id.
INSERT INTO idempotency_keys (event_id, status)
VALUES ('evt_abc123', 'processing')
ON CONFLICT (event_id) DO NOTHING
RETURNING event_id;

The worker runs the same INSERT ... ON CONFLICT inside the transaction that creates the order:

def execute_business_logic(event):
    # db.transaction() is a placeholder for your database layer; tx.execute
    # is assumed to return a DB-API style cursor.
    payment = event["data"]["object"]  # the payment object carried by the event

    with db.transaction() as tx:
        # Attempt to claim the event - atomic with business logic
        result = tx.execute("""
            INSERT INTO idempotency_keys (event_id, status)
            VALUES (%s, 'processing')
            ON CONFLICT (event_id) DO NOTHING
            RETURNING event_id
        """, [event["id"]])

        if not result.rowcount:
            # Another worker already claimed this event
            return

        # Business logic inside the SAME transaction
        order_id = tx.execute("""
            INSERT INTO orders (payment_id, amount, customer_id, idempotency_key)
            VALUES (%s, %s, %s, %s)
            RETURNING id
        """, [payment["id"], payment["amount"],
              payment["customer"], event["id"]]).fetchone()[0]

        # Mark as completed
        tx.execute("""
            UPDATE idempotency_keys
            SET status = 'completed', completed_at = NOW()
            WHERE event_id = %s
        """, [event["id"]])

    # Side effects AFTER transaction commits
    trigger_fulfillment(order_id)
    send_confirmation_email(order_id)

The crucial detail: the idempotency key insert and the order creation happen in the same database transaction. They either both succeed or both roll back. There is no window where the order exists without the idempotency record.
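The all-or-nothing behavior can be demonstrated with SQLite from the standard library. This is a sketch under assumptions, not the PostgreSQL code above: the schema is trimmed to two columns, `INSERT OR IGNORE` plays the role of `ON CONFLICT DO NOTHING`, and `make_db`, `process`, and the `fail_after_claim` flag are hypothetical names added to simulate a crash between the claim and the order insert.

```python
import sqlite3


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE idempotency_keys (event_id TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, event_id TEXT NOT NULL)")
    conn.commit()
    return conn


def process(conn, event_id, fail_after_claim=False):
    """Claim the event and create the order in one transaction.

    Returns True if this call created the order, False for a duplicate.
    """
    with conn:  # commits on success, rolls back if an exception escapes
        claimed = conn.execute(
            "INSERT OR IGNORE INTO idempotency_keys (event_id) VALUES (?)",
            (event_id,),
        ).rowcount
        if claimed == 0:
            return False
        if fail_after_claim:
            raise RuntimeError("simulated crash before the order insert")
        conn.execute("INSERT INTO orders (event_id) VALUES (?)", (event_id,))
    return True


conn = make_db()
try:
    process(conn, "evt_abc123", fail_after_claim=True)
except RuntimeError:
    pass  # the claim was rolled back together with the failed transaction

first = process(conn, "evt_abc123")    # retry after the failure
second = process(conn, "evt_abc123")   # duplicate delivery
order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
key_count = conn.execute("SELECT COUNT(*) FROM idempotency_keys").fetchone()[0]
print(first, second, order_count, key_count)
```

The failed attempt leaves no trace, the retry creates the single order, and the later duplicate is rejected by the primary key.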

Real World

Stripe’s own API applies the same idea in its idempotency keys feature. When you pass an Idempotency-Key header, Stripe saves the status code and body of the first request made with that key. Replaying the same key returns the saved result without re-executing the operation.

The Full Architecture

[Figure: Complete idempotent webhook processing architecture with all layers]

The happy path flows through four layers. The provider sends a POST. Layer 1 (Ingress) verifies the signature, returns 200 immediately, and enqueues the raw event. Layer 2 (Deduplication) attempts a Redis SET NX - if the key already exists, the event is dropped silently. If the key is new, Layer 3 (Processing) executes the business logic inside a database transaction with the idempotency key insert. Layer 4 (Observability) tracks dedup hit rates, retry frequencies, and dead-letter events.

The retry path is boring by design: provider sends the same event again, Layer 1 accepts and enqueues it again (idempotent queue dedup may catch it here), Layer 2 sees the Redis key exists, drops the duplicate. No order created. No side effects triggered. The customer gets one pair of shoes.

Component Deep Dives

Signature Verification

Never process an unverified webhook. Providers include an HMAC signature in headers that proves the request originated from them.

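import hashlib
import hmac
import json
import os
import time

# Assumed to come from the environment; the variable name is illustrative
WEBHOOK_SECRET = os.environ.get("STRIPE_WEBHOOK_SECRET", "")

class SignatureVerificationError(Exception):
    """Raised when a webhook fails the signature or timestamp check."""

def verify_stripe_signature(request, webhook_secret=WEBHOOK_SECRET):
    payload = request.body  # raw bytes, exactly as received
    sig_header = request.headers.get("Stripe-Signature")
    if not sig_header:
        raise SignatureVerificationError("Missing Stripe-Signature header")

    # Parse timestamp and signature from the header ("t=...,v1=...")
    try:
        elements = dict(pair.split("=", 1) for pair in sig_header.split(","))
        timestamp = elements["t"]
        received_sig = elements["v1"]
    except (ValueError, KeyError):
        raise SignatureVerificationError("Malformed Stripe-Signature header")

    # Compute the signature we expect for this timestamp and payload
    signed_payload = f"{timestamp}.{payload.decode()}"
    computed_sig = hmac.new(
        webhook_secret.encode(),
        signed_payload.encode(),
        hashlib.sha256
    ).hexdigest()

    # Constant-time comparison avoids leaking how much of the signature matched
    if not hmac.compare_digest(computed_sig, received_sig):
        raise SignatureVerificationError("Invalid signature")

    # Prevent replay attacks - reject timestamps outside a 5-minute tolerance
    if abs(time.time() - int(timestamp)) > 300:
        raise SignatureVerificationError("Timestamp too old")

    return json.loads(payload)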

Why verify before enqueuing? Without verification, an attacker can flood your queue with fabricated events. The signature check is CPU-cheap (one HMAC computation) and prevents a denial-of-service vector on your processing pipeline.
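A quick way to exercise this kind of check in a test suite is to sign a payload yourself and confirm that tampering or a stale timestamp is rejected. The sketch below assumes the `t=...,v1=...` header layout and the `timestamp.payload` signing scheme described above; `sign_payload`, `is_valid`, and the secret value are hypothetical helpers for illustration, not part of any provider SDK.

```python
import hashlib
import hmac
import time


def sign_payload(payload: bytes, secret: str, timestamp: int) -> str:
    """Build a header value in the t=...,v1=... layout."""
    signed = f"{timestamp}.".encode() + payload
    digest = hmac.new(secret.encode(), signed, hashlib.sha256).hexdigest()
    return f"t={timestamp},v1={digest}"


def is_valid(payload: bytes, header: str, secret: str, tolerance=300, now=None) -> bool:
    """Return True only for a fresh timestamp and a matching signature."""
    parts = dict(pair.split("=", 1) for pair in header.split(","))
    timestamp = int(parts["t"])
    expected = sign_payload(payload, secret, timestamp).split("v1=", 1)[1]
    current = time.time() if now is None else now
    fresh = abs(current - timestamp) <= tolerance
    return fresh and hmac.compare_digest(expected, parts["v1"])


secret = "whsec_test_secret"            # made-up value for the example
body = b'{"id": "evt_abc123"}'
header = sign_payload(body, secret, timestamp=1_700_000_000)

genuine = is_valid(body, header, secret, now=1_700_000_010)
tampered = is_valid(body + b" ", header, secret, now=1_700_000_010)
stale = is_valid(body, header, secret, now=1_700_000_000 + 3600)
print(genuine, tampered, stale)
```

Passing `now` explicitly keeps the test deterministic; a handler in production would use the current clock.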

The Idempotency Key Selection

Choosing the right key determines whether your deduplication actually works. The event ID from the provider is the obvious choice, but there are subtleties.

def extract_idempotency_key(event):
    # Option 1: Provider's event ID (most common, safest)
    # Stripe: event.id = "evt_1NqQPbL7xK9..."
    # Razorpay: sent in the x-razorpay-event-id header rather than the body
    key = event["id"]
    
    # Option 2: Composite key for providers without stable event IDs
    # Some providers may resend with a different event ID for the same
    # underlying state change. In that case, derive from the payload:
    # key = f"{event['type']}:{event['data']['object']['id']}"
    
    return key

The event ID is ideal because the provider guarantees it’s stable across retries of the same event. If they send evt_abc123 and you return 500, the retry contains the same evt_abc123. This is your deduplication anchor.

Queue Configuration for Exactly-Once Processing

If you use AWS SQS FIFO queues, you get built-in deduplication as a bonus layer:

import json

import boto3

sqs = boto3.client('sqs')

def enqueue_webhook_event(event):
    sqs.send_message(
        QueueUrl="https://sqs.us-east-1.amazonaws.com/123/webhooks.fifo",
        MessageBody=json.dumps(event),
        MessageGroupId=event["data"]["object"]["id"],  # partition by entity
        MessageDeduplicationId=event["id"]  # SQS dedup window: 5 min
    )

SQS FIFO queues accept a send whose MessageDeduplicationId was already used within the last 5 minutes, but they do not deliver the duplicate to consumers. This catches the most common retry scenario (provider retrying within minutes) before your worker even sees the message. But it’s a 5-minute window only - you still need Redis and the DB constraint for longer-running retry storms.
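The limit of a time-bounded window shows up as soon as retries are spaced further apart than the window. The class below is a toy model written for this article, not SQS itself: `DedupWindow` and `accept` are hypothetical names, and time is passed in as plain seconds so the behavior is deterministic.

```python
class DedupWindow:
    """Toy model of a time-bounded dedup window; for illustration only."""

    def __init__(self, window_seconds=300):
        self.window_seconds = window_seconds
        self._first_seen = {}   # dedup_id -> time the accepted message arrived

    def accept(self, dedup_id, now):
        """Return True if the message would reach consumers."""
        first = self._first_seen.get(dedup_id)
        if first is not None and now - first < self.window_seconds:
            return False        # inside the window: duplicate is suppressed
        self._first_seen[dedup_id] = now
        return True


window = DedupWindow(window_seconds=300)
# Same event delivered at t=0, retried after one minute, then after one hour.
results = [window.accept("evt_abc123", t) for t in (0, 60, 3600)]
print(results)
```

The one-minute retry is suppressed, but the retry an hour later passes straight through, which is why the queue can only be a bonus layer in front of the longer-lived checks.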

[Figure: Deduplication mechanism showing Redis SET NX and database unique constraint working together]

Comparison Table

Approach | Throughput | Durability | Failure Mode | Complexity | Best For
DB SELECT before INSERT | Low | High | Race conditions under concurrency | Low | Never use this
DB UNIQUE constraint only | Medium | High | Slow under high load, connection pool pressure | Low | Low-volume webhooks (<10/s)
Redis SET NX only | Very High | Low (volatile) | Lost keys on Redis restart | Medium | Stateless events where replay is safe
Redis + DB constraint | Very High | High | Redis miss → DB catches it, slightly slower | Medium-High | Production payment webhooks
Queue dedup + Redis + DB | Very High | Very High | Overkill for simple cases | High | High-volume, critical financial events

[Figure: Decision flow for choosing the right idempotency strategy based on scale and requirements]

Key Takeaways

  • Return 200 immediately - decouple acknowledgment from processing. The fastest way to stop a retry storm is to tell the provider you received the event, then process it at your own pace.
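  • Idempotency keys must be written atomically with your state changes - if the order write and the dedup record aren’t in the same transaction, you have a gap where duplicates slip through. External side effects such as emails and fulfillment triggers should run only after that transaction commits.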
  • At-least-once delivery is the norm - every webhook provider, message queue, and event bus operates this way. Design your consumers to be idempotent from day one, not as a retrofit.
  • Layer your defenses - Redis for speed, database constraints for durability, queue deduplication for belt-and-suspenders. Any single layer can fail; the combination is what gives you exactly-once semantics in practice.
  • The event ID is your anchor - use the provider’s stable event identifier as your idempotency key. Don’t invent your own unless the provider explicitly lacks stable IDs.
  • Monitor your dedup hit rate - a sudden spike in duplicate detections means something upstream is broken. A zero dedup rate might mean your dedup layer itself is broken.
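  • Set TTLs on idempotency records - you don’t need to remember every event forever, but the durable record must outlive the provider’s retry window (up to 72 hours in the Stripe scenario above). A shorter TTL, such as the 24 hours used for the Redis keys, is acceptable only for the cache layer, because the database constraint still catches later retries. Clean up old records to keep your tables lean.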
  • Retry strategies on your side matter too - when your worker fails processing, use exponential backoff with jitter. Don’t create your own thundering herd on the consumer side.
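For the consumer-side retries mentioned in the last takeaway, one common variant is exponential backoff with "full jitter". The sketch below assumes illustrative defaults (`base` of 1 second, `cap` of 300 seconds) and a hypothetical function name, `backoff_delay`; tune the numbers to your own queue's visibility timeout.

```python
import random


def backoff_delay(attempt, base=1.0, cap=300.0, rng=random):
    """Full jitter: pick uniformly between 0 and the capped exponential bound."""
    bound = min(cap, base * (2 ** attempt))
    return rng.uniform(0, bound)


rng = random.Random(42)                     # seeded only to make the demo repeatable
delays = [backoff_delay(attempt, rng=rng) for attempt in range(10)]
for attempt, delay in enumerate(delays):
    print(f"attempt {attempt}: wait up to {min(300.0, 2 ** attempt):.0f}s, chose {delay:.2f}s")
```

Randomizing the whole interval, rather than adding a small offset to a fixed delay, spreads workers that failed at the same moment across the retry window instead of letting them collide again.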

The broader lesson: distributed systems are built on unreliable networks connecting unreliable machines. Deduplication isn’t a feature you add - it’s a fundamental property your system must have whenever money, physical goods, or irreversible actions are involved.

FAQ

Q: Why not just use a database UNIQUE constraint and skip Redis entirely?

You can, and for low-volume webhooks it’s perfectly fine. The Redis layer exists for performance at scale. A database round-trip costs 2-5ms; a Redis SET NX costs 0.1-0.3ms. At 1,000 events per second, that difference is the gap between your worker keeping up and falling behind. Redis also reduces connection pressure on your database - important when your DB is already handling order writes.

Q: What happens if Redis is down when a webhook arrives?

Your system degrades gracefully. The Redis check fails (timeout or connection error), and you fall through to the database unique constraint. Processing is slightly slower (every event hits the DB dedup check) but correctness is maintained. This is why layering matters - you’re never relying on a single deduplication mechanism.
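The fall-through has to be written explicitly, because a cache client normally raises on a timeout or refused connection instead of returning a value. The sketch below shows one way to structure it. `CacheUnavailable` is a hypothetical exception standing in for the client's connection and timeout errors, and `cache_claim` and `db_claim` are assumed callables that return True when the event is new.

```python
class CacheUnavailable(Exception):
    """Hypothetical error standing in for a cache timeout or connection failure."""


def should_process(event_id, cache_claim, db_claim):
    """Return True if this worker should run the business logic."""
    try:
        if not cache_claim(event_id):
            return False            # fast path: the cache already knows this event
    except CacheUnavailable:
        pass                        # cache is down: rely on the durable check alone
    return db_claim(event_id)       # the unique constraint has the final say


db_seen = set()


def db_claim(event_id):
    # Stand-in for INSERT ... ON CONFLICT DO NOTHING on the idempotency table.
    if event_id in db_seen:
        return False
    db_seen.add(event_id)
    return True


def broken_cache(event_id):
    raise CacheUnavailable("connection refused")


first = should_process("evt_abc123", broken_cache, db_claim)
second = should_process("evt_abc123", broken_cache, db_claim)
print(first, second)
```

With the cache unreachable, the first delivery is processed and the duplicate is still rejected, at the cost of one database round-trip per event.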

Q: Why not use the payment amount + customer ID as the idempotency key?

Because a customer can legitimately make two identical purchases. If someone buys the same $89.99 shoes twice (different colors, or a gift), those are distinct payments with distinct event IDs. Using amount + customer as the key would incorrectly deduplicate the second legitimate purchase. The provider’s event ID is unique per event, not per payment shape.

Q: Can the provider change the event ID across retries?

Major providers (Stripe, Razorpay, Adyen) guarantee stable event IDs across retries: every delivery attempt for a given event carries the same event ID. However, some providers distinguish between retries (same event ID) and new events for the same state change (new event ID). Read your provider’s documentation carefully - Stripe, for example, may send multiple different events for the same payment if the payment transitions through states.

Q: Why return 200 for already-processed events instead of 204 or 409?

Providers interpret non-2xx responses as “delivery failed, retry later.” If you return 409 (Conflict) for a duplicate, some providers will retry it. The semantics of webhook acknowledgment are simple: 2xx means “I got it, stop sending.” The specific 2xx code doesn’t matter. Return 200, always, for any event you’ve seen before - whether you processed it or deduplicated it.

Q: How do you handle webhook events that arrive out of order?

This is a separate concern from idempotency but related. If you receive payment.refunded before payment.succeeded (which can happen with aggressive retry schedules and network reordering), use event timestamps and state machine logic. Store the event regardless, but only execute state transitions that are valid from the current state. An order that doesn’t exist yet can’t be refunded - queue the refund event for later reprocessing.
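A minimal sketch of that state machine approach follows, assuming deduplication has already happened upstream. The transition table covers only the two event types named above, and `OrderTracker`, `apply`, and the `parked` buffer are hypothetical names; a real system would persist both the state and the parked events.

```python
VALID_TRANSITIONS = {
    (None, "payment.succeeded"): "paid",
    ("paid", "payment.refunded"): "refunded",
}


class OrderTracker:
    def __init__(self):
        self.state = {}     # payment_id -> current state
        self.parked = {}    # payment_id -> events that arrived too early

    def apply(self, payment_id, event_type):
        """Apply the event if it is valid from the current state, else park it."""
        current = self.state.get(payment_id)
        next_state = VALID_TRANSITIONS.get((current, event_type))
        if next_state is None:
            self.parked.setdefault(payment_id, []).append(event_type)
            return False
        self.state[payment_id] = next_state
        # A successful transition may unblock events that arrived out of order.
        for parked_type in self.parked.pop(payment_id, []):
            self.apply(payment_id, parked_type)
        return True


tracker = OrderTracker()
refund_applied = tracker.apply("pi_123", "payment.refunded")    # arrives first
payment_applied = tracker.apply("pi_123", "payment.succeeded")  # arrives second
print(refund_applied, payment_applied, tracker.state["pi_123"], tracker.parked)
```

The early refund is held back, then replayed automatically once the payment event makes it a valid transition, so the final state is the same as if the events had arrived in order.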

Interview Questions

1. “Design an idempotent webhook handler for a payment system that processes 10,000 events per second.”

Expected depth: Discuss the layered approach (fast ACK, queue, Redis dedup, DB constraint). Mention that at 10K/s, the DB-only approach creates connection pool bottlenecks. Explain why Redis SET NX is atomic and race-free. Discuss failure modes - what happens when Redis is down, when the DB is slow, when the queue has duplicates. Mention SQS FIFO deduplication windows.

2. “Explain the difference between at-least-once and exactly-once delivery. Can you achieve exactly-once in practice?”

Expected depth: Exactly-once delivery between two independent systems is impossible over an unreliable network; the Two Generals problem is the classic illustration, while the FLP impossibility result concerns consensus and is only tangentially related. Exactly-once processing (semantics) is achievable through idempotent consumers. Real systems achieve effectively-exactly-once through idempotency keys + atomic writes. Mention that Kafka’s “exactly-once” is actually idempotent producers + transactions, not true exactly-once delivery.

3. “A webhook handler is creating duplicate orders during peak traffic. Walk me through how you’d diagnose and fix this in production.”

Expected depth: Start with immediate mitigation (pause the webhook endpoint, or return 200 for all events while logging the raw payloads for later replay, to stop the storm). Diagnose by checking idempotency table write success rate, looking for race conditions in logs, checking if the dedup check is happening before or after side effects. Fix by moving to atomic transactions, adding Redis fast-path, and implementing the fast-ACK pattern. Discuss how to clean up existing duplicates without affecting legitimate orders.

4. “Your Redis dedup cache lost all keys after a restart. How does your system behave, and what do you do?”

Expected depth: With layered architecture, the DB unique constraint catches duplicates that Redis would have caught. Performance degrades (more DB hits) but correctness is maintained. Discuss whether to warm the Redis cache from the DB (yes, for hot keys) or let it rebuild organically (acceptable for most cases). Mention Redis persistence options (RDB snapshots, AOF) as preventive measures. Discuss Redis Cluster vs single-node tradeoffs for this use case.

5. “How would you design an idempotency system that works across multiple regions?”

Expected depth: Redis SET NX is atomic only within a single Redis deployment; replicating a cache across regions introduces latency and consistency tradeoffs. DynamoDB conditional writes (attribute_not_exists) are atomic within a region, but global tables replicate asynchronously by default, so either route all dedup writes for a given key to one home region or use a store that offers cross-region strong consistency. Discuss the CAP theorem implications - you need consistency for dedup, so you sacrifice availability during partitions. Mention that some systems accept eventual consistency with a “last writer wins” approach for non-critical events, reserving strong consistency for financial operations only.
