Build a Payment Processing System with Idempotency Guarantees


distributed-systems reliability databases

System Design Deep Dive

Payment Processing with Idempotency Guarantees

Building a system that safely retries failed transactions without double-charging users and maintains a consistent ledger under concurrent requests


Payment systems are among the most unforgiving environments in software engineering. A user clicks “Pay Now” and within 300 milliseconds, that request crosses your API gateway, bounces through an idempotency service, orchestrates calls to a third-party payment provider like Stripe or Braintree, writes to your ledger, and publishes an event to trigger downstream fulfillment. At any point in that chain, a network timeout, a pod restart, or a transient 503 from the payment provider can leave the transaction in an ambiguous state - the charge may or may not have been applied, but the API never returned a definitive response.

Think of a payment processor like a postal certified-mail system. Every letter gets a unique tracking number before it leaves your hands. If the delivery truck breaks down midway, the postal service doesn’t simply send another identical letter without checking whether the first arrived. It uses the tracking number to determine whether delivery was completed before dispatching a replacement. Idempotency keys are your tracking numbers: they let any component in the system - client, orchestrator, provider adapter - answer the question “did this specific operation already succeed?” before taking any irreversible action.

The scale of the problem amplifies the complexity. A mid-sized e-commerce platform processes 50,000 transactions per hour at peak, spread across 200 concurrent API nodes. Each node independently retries timed-out requests with exponential backoff. Without a coordinated idempotency layer, a payment that timed out at the provider returns 408 to the client, the client retries three times, and you submit the same charge four times. Stripe returns success for the first and duplicate-charge errors for the rest - if you are lucky. If the first returned a timeout too, all four attempts look like new charges.

The ledger consistency dimension adds another layer. Crediting a user’s wallet, debiting their bank account, and recording the transaction in your double-entry ledger are three separate writes across potentially three separate services or databases. If the wallet credit succeeds, the bank debit succeeds, but the ledger write fails due to a deadlock, your books are out of balance. A reconciliation job will catch this in the morning, but in the meantime your system reports a balance that doesn’t match reality. We need to solve for idempotent retries, distributed transaction atomicity, and ledger consistency simultaneously.

Requirements and Constraints

Functional Requirements

  • Accept payment requests with a client-supplied idempotency key and return identical responses for duplicate requests
  • Support payment providers Stripe, Braintree, and PayPal with a unified adapter interface
  • Maintain a double-entry ledger recording every debit and credit with transaction-level atomicity
  • Retry failed payment attempts with exponential backoff without issuing duplicate charges
  • Expose payment status endpoint returning the current state and all transition history
  • Support full and partial refunds that are themselves idempotent
  • Run a reconciliation job that detects ledger imbalances and provider-side discrepancies within 24 hours

Non-Functional Requirements

  • Process 50,000 transactions per hour sustained, with 200,000 peak for flash sale events
  • P99 payment latency under 2 seconds end-to-end including provider round-trip
  • Idempotency store lookups must complete in under 5ms at P99
  • Ledger writes must be durable - zero data loss under single-node failures
  • Payment state changes must be auditable for 7 years (regulatory requirement)
  • System must remain operational during a single availability zone outage

Constraints and Assumptions

  • Payment providers implement their own idempotency keys on their side; we pass ours through
  • Idempotency keys expire after 24 hours; re-using an expired key creates a new payment
  • Maximum concurrent in-flight payments per user: 5 (fraud prevention constraint)
  • Double-entry ledger is authoritative; provider records are reconciled against it
  • All monetary amounts stored as integers in the smallest currency unit (paise for INR, cents for USD)
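
A minimal sketch of the integer-amount rule in the last constraint, assuming amounts arrive as decimal strings. The exponent table and both function names are illustrative assumptions rather than part of the original design; a production system would load exponents from ISO 4217 data.

```python
from decimal import Decimal

# Assumed exponent table (digits after the decimal point); illustrative only.
MINOR_UNIT_EXPONENT = {"USD": 2, "INR": 2, "JPY": 0}


def to_minor_units(amount: str, currency: str) -> int:
    """Convert a decimal string such as "19.99" into integer minor units."""
    exponent = MINOR_UNIT_EXPONENT[currency]
    scaled = Decimal(amount).scaleb(exponent)
    # Reject amounts with more precision than the currency supports
    # instead of silently rounding money.
    if scaled != scaled.to_integral_value():
        raise ValueError(f"{amount} has more precision than {currency} allows")
    return int(scaled)


def from_minor_units(value: int, currency: str) -> str:
    """Render integer minor units back into a decimal string for display."""
    exponent = MINOR_UNIT_EXPONENT[currency]
    return str(Decimal(value).scaleb(-exponent))
```

Parsing through Decimal rather than float avoids binary rounding before the value ever becomes an integer, and rejecting excess precision keeps the conversion lossless in both directions.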

High-Level Architecture

Six components form the backbone of the system: a Client that supplies idempotency keys with every request, an API Gateway that handles authentication and rate limiting, an Idempotency Service that deduplicates requests before any downstream work begins, a Payment Orchestrator that drives the saga across provider and ledger, a set of Payment Provider Adapters (Stripe, Braintree, PayPal), and a Storage Layer consisting of the Ledger DB, idempotency key store, and Event Bus.

Payment processing system architecture showing client, API gateway, idempotency service, payment orchestrator, provider adapters, ledger, and event bus

A new payment request arrives at the API Gateway carrying an Idempotency-Key header. The gateway passes it to the Idempotency Service, which performs a Redis lookup using the hashed key. On a cache miss, it creates a new idempotency record with status IN_PROGRESS and a distributed lock, then allows the request to proceed to the Payment Orchestrator. The Orchestrator drives the saga: reserve the funds in the user’s account, call the selected payment provider adapter, write credit and debit entries to the Ledger DB in a single transaction, and emit a payment.succeeded event to the Event Bus. On saga completion, the Idempotency Service updates its record with the final response so future duplicates receive an instant cached reply.

On a cache hit, the Idempotency Service inspects the stored record’s status. If COMPLETED, it returns the cached response immediately without touching the Orchestrator. If IN_PROGRESS, it returns 202 Accepted and the client polls the status endpoint. This prevents the thundering-herd scenario where a retry storm during a slow provider response causes hundreds of duplicate saga executions for the same payment.
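
The hit/miss handling above can be sketched as a pure decision function. The function name and the shape of the returned dict are hypothetical, and replaying FAILED records the same way as COMPLETED ones is an assumption based on the record statuses used later in this article.

```python
import json
from typing import Optional


def decide(record: Optional[dict]) -> dict:
    """Map an idempotency record (None on cache miss) to the next action."""
    if record is None:
        # Cache miss: no prior request used this key, so run the saga.
        return {"action": "run_saga"}

    status = record.get("status")
    if status in ("COMPLETED", "FAILED"):
        # Terminal record: replay the stored body without touching the orchestrator.
        return {"action": "replay", "body": json.loads(record["response"])}
    if status == "IN_PROGRESS":
        # Original request is still running: tell the client to poll.
        return {
            "action": "poll",
            "http_status": 202,
            "body": {"payment_id": record["payment_id"], "status": "IN_PROGRESS"},
        }
    raise ValueError(f"unknown idempotency status: {status!r}")
```

Keeping the decision free of I/O makes each branch easy to unit test separately from the Redis lookup.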

Key Insight

The idempotency layer must sit upstream of all side effects - not just the payment provider call. If the idempotency record covers only the provider charge, a crash between the charge and the ledger write leaves you with a charged card, no ledger record, and a retry that cannot tell which steps already ran. The idempotency record must capture the complete outcome including ledger IDs, so any retry returns the full original response without re-executing any step.

The Idempotency Layer

The Idempotency Layer is the gatekeeper that ensures every client-supplied key maps to exactly one payment outcome. Its job is to answer three questions in under 5ms: have we seen this key before, is the corresponding payment still in progress, and what was the final response if it completed.

The design centers on a Redis hash per idempotency key with a 24-hour TTL. Each record stores the key status (IN_PROGRESS, COMPLETED, FAILED), the payment ID created for this key, and the serialized response body to return on cache hit. A Lua script handles the entire check-and-set atomically to prevent TOCTOU races between concurrent retries arriving simultaneously.

Payment state machine showing transitions from PENDING through PROCESSING to SUCCEEDED, FAILED, TIMED_OUT, and REFUNDED
import hashlib
import json
import time
import uuid
import redis

IDEMPOTENCY_TTL_SECONDS = 86400  # 24 hours
LOCK_TTL_SECONDS = 120           # 2 minutes max for in-progress payments

_lua_acquire = """
local key = KEYS[1]
local lock_key = KEYS[2]
local payment_id = ARGV[1]
local ttl = tonumber(ARGV[2])
local lock_ttl = tonumber(ARGV[3])

local existing = redis.call('HGETALL', key)
if #existing > 0 then
    return existing
end

redis.call('HSET', key,
    'status', 'IN_PROGRESS',
    'payment_id', payment_id,
    'created_at', ARGV[4],
    'response', ''
)
redis.call('EXPIRE', key, ttl)
redis.call('SET', lock_key, payment_id, 'EX', lock_ttl, 'NX')
return nil
"""

class IdempotencyService:
    def __init__(self, redis_client: redis.Redis):
        self.r = redis_client
        self._acquire_script = self.r.register_script(_lua_acquire)

    def _hash_key(self, raw_key: str, user_id: str) -> str:
        """Namespace key by user to prevent cross-user key collisions."""
        combined = f"{user_id}:{raw_key}"
        return "idem:" + hashlib.sha256(combined.encode()).hexdigest()

    def acquire_or_get(self, raw_key: str, user_id: str) -> dict:
        """
        Attempt to acquire a new idempotency slot.
        Returns {"status": "NEW", "payment_id": ...} if the slot was created
        (proceed with payment).
        Returns the existing record dict if the key was already seen.
        """
        hashed = self._hash_key(raw_key, user_id)
        lock_key = "idem_lock:" + hashed
        payment_id = str(uuid.uuid4())
        now_iso = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())

        result = self._acquire_script(
            keys=[hashed, lock_key],
            args=[payment_id, IDEMPOTENCY_TTL_SECONDS, LOCK_TTL_SECONDS, now_iso]
        )

        if result is None:
            # Slot freshly created - caller proceeds with payment_id
            return {"status": "NEW", "payment_id": payment_id}

        # Key already exists - parse and return existing record
        fields = iter(result)
        record = dict(zip(fields, fields))
        return {k.decode(): v.decode() for k, v in record.items()}

    def complete(self, raw_key: str, user_id: str, response_body: dict, status: str):
        """Seal the idempotency record with the final response."""
        hashed = self._hash_key(raw_key, user_id)
        lock_key = "idem_lock:" + hashed
        pipe = self.r.pipeline()
        pipe.hset(hashed, mapping={
            "status": status,
            "response": json.dumps(response_body),
            "completed_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        })
        pipe.delete(lock_key)
        pipe.execute()
Watch Out

Never use the raw client-supplied idempotency key as a Redis key without namespacing and hashing. A user who guesses another user’s idempotency key string can cause their payment response to be returned for the attacker’s request. Always namespace by authenticated user ID and hash the combined string to prevent enumeration. Also ensure the lock TTL is strictly longer than your maximum payment provider timeout. In the code above the 24-hour hash record is what blocks duplicates, so the lock matters to any recovery path that treats an expired lock as an abandoned payment: if the lock expires while the provider call is still in flight, that path can start a second attempt and you may double-charge.

The Payment State Machine

The Payment State Machine governs which transitions are legal and enforces them through conditional database writes. Its job is to ensure no payment can skip states, no terminal state can be overwritten by a late-arriving update, and every transition is recorded with a timestamp for the audit log.

States form a directed acyclic graph with one exception: TIMED_OUT payments can be retried and re-enter PROCESSING. FAILED and REFUNDED are terminal and immutable, and SUCCEEDED can only move forward to REFUNDED. Refund details are separate records linked to the original payment rather than mutations of the parent record.

from enum import Enum
from typing import Optional
import psycopg2
import psycopg2.extras  # the Json adapter lives in the extras submodule

class PaymentStatus(str, Enum):
    PENDING = "PENDING"
    PROCESSING = "PROCESSING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"
    TIMED_OUT = "TIMED_OUT"
    REFUNDED = "REFUNDED"

# Valid forward transitions
ALLOWED_TRANSITIONS: dict[PaymentStatus, set[PaymentStatus]] = {
    PaymentStatus.PENDING:     {PaymentStatus.PROCESSING},
    PaymentStatus.PROCESSING:  {PaymentStatus.SUCCEEDED, PaymentStatus.FAILED, PaymentStatus.TIMED_OUT},
    PaymentStatus.TIMED_OUT:   {PaymentStatus.PROCESSING},  # retry path
    PaymentStatus.SUCCEEDED:   {PaymentStatus.REFUNDED},
    PaymentStatus.FAILED:      set(),    # terminal
    PaymentStatus.REFUNDED:    set(),    # terminal
}

def transition_payment(
    conn,
    payment_id: str,
    from_status: PaymentStatus,
    to_status: PaymentStatus,
    metadata: Optional[dict] = None
) -> bool:
    """
    Atomically transition a payment to a new status.
    Returns True if transition succeeded, False if the row was already in a different state.
    Uses conditional UPDATE to prevent races.
    """
    if to_status not in ALLOWED_TRANSITIONS[from_status]:
        raise ValueError(f"Illegal transition {from_status} -> {to_status}")

    with conn.cursor() as cur:
        cur.execute("""
            UPDATE payments
            SET status = %s,
                updated_at = NOW(),
                provider_metadata = COALESCE(%s::jsonb, provider_metadata)
            WHERE id = %s
              AND status = %s
        """, (to_status.value, psycopg2.extras.Json(metadata) if metadata else None,
              payment_id, from_status.value))

        if cur.rowcount == 0:
            conn.rollback()
            return False

        # Append to audit log
        cur.execute("""
            INSERT INTO payment_events (payment_id, from_status, to_status, metadata, occurred_at)
            VALUES (%s, %s, %s, %s, NOW())
        """, (payment_id, from_status.value, to_status.value,
              psycopg2.extras.Json(metadata or {})))

        conn.commit()
        return True
Key Insight

The conditional UPDATE on status = from_status is the linchpin of the state machine. If two concurrent processes both try to transition the same payment from PROCESSING to SUCCEEDED (possible during a retry overlap), only one UPDATE will match the current row state and return rowcount=1. The second receives rowcount=0 and backs off. This optimistic approach needs no explicit application-level locks (the database holds only a brief row lock for the UPDATE itself) and scales to thousands of concurrent payments, because contention is limited to writers of the same payment row.
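
The compare-and-set behaviour can be reproduced with the standard library alone. This is a sketch under stated assumptions: it uses an in-memory SQLite table in place of PostgreSQL, a reduced two-column payments table, and a hypothetical helper name, and it runs the two writers one after the other rather than concurrently.

```python
import sqlite3


def try_transition(conn: sqlite3.Connection, payment_id: str,
                   from_status: str, to_status: str) -> bool:
    """Apply the transition only if the row is still in from_status."""
    cur = conn.execute(
        "UPDATE payments SET status = ? WHERE id = ? AND status = ?",
        (to_status, payment_id, from_status),
    )
    conn.commit()
    # rowcount is 1 when this writer won, 0 when the state had already moved on.
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (id TEXT PRIMARY KEY, status TEXT NOT NULL)")
conn.execute("INSERT INTO payments VALUES ('pay_1', 'PROCESSING')")
conn.commit()

# Two writers both believe the payment is still PROCESSING.
first = try_transition(conn, "pay_1", "PROCESSING", "SUCCEEDED")
second = try_transition(conn, "pay_1", "PROCESSING", "FAILED")
final_status = conn.execute(
    "SELECT status FROM payments WHERE id = 'pay_1'"
).fetchone()[0]
```

The first writer wins and the second matches no row, so the late FAILED update cannot overwrite SUCCEEDED.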

The Saga Pattern for Distributed Transactions

The Saga pattern replaces a single distributed ACID transaction with a sequence of local transactions, each with a defined compensating action if a later step fails. For payment processing, this means each step - reserve funds, charge provider, write ledger entries, emit event - can succeed or fail independently, and failures trigger compensating rollbacks in reverse order.

The orchestrator drives the saga via an explicit state log. Every step is written to a saga_steps table before execution and updated on completion. This log is the saga’s recovery mechanism: if the orchestrator crashes mid-saga, any process that resumes it can replay from the last incomplete step rather than restarting from scratch.

package saga

import (
    "context"
    "database/sql"
    "fmt"
)

type StepStatus string

const (
    StepPending     StepStatus = "PENDING"
    StepSucceeded   StepStatus = "SUCCEEDED"
    StepFailed      StepStatus = "FAILED"
    StepCompensated StepStatus = "COMPENSATED"
)

type SagaStep struct {
    Name       string
    Execute    func(ctx context.Context, state *PaymentState) error
    Compensate func(ctx context.Context, state *PaymentState) error
}

type PaymentState struct {
    PaymentID      string
    UserID         string
    AmountCents    int64
    Currency       string
    ProviderToken  string
    ProviderTxnID  string
    LedgerEntryIDs []string
}

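// ProviderAdapters is not defined in the original listing. This interface is an
// assumed shape, inferred from the method values wired into the saga steps below.
type ProviderAdapters interface {
    ReserveFunds(ctx context.Context, state *PaymentState) error
    ReleaseReservation(ctx context.Context, state *PaymentState) error
    ChargeProvider(ctx context.Context, state *PaymentState) error
    RefundProvider(ctx context.Context, state *PaymentState) error
    WriteLedgerEntries(ctx context.Context, state *PaymentState) error
    ReverseLedgerEntries(ctx context.Context, state *PaymentState) error
    EmitPaymentEvent(ctx context.Context, state *PaymentState) error
    EmitCompensationEvent(ctx context.Context, state *PaymentState) error
}

type PaymentOrchestrator struct {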
    db    *sql.DB
    steps []SagaStep
}

func NewPaymentOrchestrator(db *sql.DB, adapters ProviderAdapters) *PaymentOrchestrator {
    return &PaymentOrchestrator{
        db: db,
        steps: []SagaStep{
            {
                Name:       "reserve_funds",
                Execute:    adapters.ReserveFunds,
                Compensate: adapters.ReleaseReservation,
            },
            {
                Name:       "charge_provider",
                Execute:    adapters.ChargeProvider,
                Compensate: adapters.RefundProvider,
            },
            {
                Name:       "write_ledger",
                Execute:    adapters.WriteLedgerEntries,
                Compensate: adapters.ReverseLedgerEntries,
            },
            {
                Name:       "emit_event",
                Execute:    adapters.EmitPaymentEvent,
                Compensate: adapters.EmitCompensationEvent,
            },
        },
    }
}

func (o *PaymentOrchestrator) Execute(ctx context.Context, state *PaymentState) error {
    completedSteps := make([]int, 0, len(o.steps))

    for i, step := range o.steps {
        if err := o.recordStepStart(state.PaymentID, step.Name, i); err != nil {
            return fmt.Errorf("recording step start: %w", err)
        }

        if err := step.Execute(ctx, state); err != nil {
            // Step failed - compensate all previously completed steps in reverse
            o.recordStepFailed(state.PaymentID, step.Name, err)
            for j := len(completedSteps) - 1; j >= 0; j-- {
                idx := completedSteps[j]
                compensateErr := o.steps[idx].Compensate(ctx, state)
                o.recordStepCompensated(state.PaymentID, o.steps[idx].Name, compensateErr)
            }
            return fmt.Errorf("saga failed at step %s, compensation applied: %w", step.Name, err)
        }

        o.recordStepSucceeded(state.PaymentID, step.Name)
        completedSteps = append(completedSteps, i)
    }

    return nil
}

func (o *PaymentOrchestrator) recordStepStart(paymentID, stepName string, stepIndex int) error {
    _, err := o.db.Exec(`
        INSERT INTO saga_steps (payment_id, step_name, step_index, status, started_at)
        VALUES ($1, $2, $3, 'PENDING', NOW())
        ON CONFLICT (payment_id, step_name) DO UPDATE
            SET status = 'PENDING', started_at = NOW()
    `, paymentID, stepName, stepIndex)
    return err
}

func (o *PaymentOrchestrator) recordStepSucceeded(paymentID, stepName string) {
    o.db.Exec(`
        UPDATE saga_steps SET status = 'SUCCEEDED', completed_at = NOW()
        WHERE payment_id = $1 AND step_name = $2
    `, paymentID, stepName)
}

func (o *PaymentOrchestrator) recordStepFailed(paymentID, stepName string, err error) {
    errMsg := err.Error()
    o.db.Exec(`
        UPDATE saga_steps SET status = 'FAILED', error_message = $3, completed_at = NOW()
        WHERE payment_id = $1 AND step_name = $2
    `, paymentID, stepName, errMsg)
}

func (o *PaymentOrchestrator) recordStepCompensated(paymentID, stepName string, err error) {
    status := "COMPENSATED"
    if err != nil {
        status = "COMPENSATION_FAILED"
    }
    o.db.Exec(`
        UPDATE saga_steps SET status = $3, compensated_at = NOW()
        WHERE payment_id = $1 AND step_name = $2
    `, paymentID, stepName, status)
}
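
The recovery rule described before the Go listing (resume from the last incomplete step instead of restarting) can be sketched separately from the database. The function name, the dict input, and the needs_verification flag are assumptions for illustration; the flag reflects that a step logged as PENDING may already have taken effect externally.

```python
from typing import Dict, Optional

# Step order mirrors the saga definition in the Go listing above.
STEP_ORDER = ["reserve_funds", "charge_provider", "write_ledger", "emit_event"]


def plan_resume(logged: Dict[str, str]) -> dict:
    """
    Given step_name -> status rows read from saga_steps, find where to resume.
    Steps marked SUCCEEDED are skipped; the first other step is the resume point.
    """
    for name in STEP_ORDER:
        status: Optional[str] = logged.get(name)
        if status == "SUCCEEDED":
            continue
        return {
            "resume_from": name,
            # PENDING means the step started but its outcome was never recorded,
            # so the external system should be queried before re-executing it.
            "needs_verification": status == "PENDING",
        }
    return {"resume_from": None, "needs_verification": False}
```

For example, a log showing reserve_funds as SUCCEEDED and charge_provider as PENDING resumes at charge_provider with verification required, which is the case where blindly re-executing could double-charge.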
Real World

Uber’s money platform uses a saga-based orchestration pattern for their payment flows, logging each step to a durable store before execution. Their key insight - shared in engineering blog posts - is that compensation functions must themselves be idempotent. A refund that gets called twice due to an orchestrator retry must not issue two credits. Stripe’s idempotency key support on their refund API makes this straightforward: pass the original charge ID as the refund idempotency key to guarantee at-most-one refund per original transaction.
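
A minimal sketch of an idempotent compensation, assuming a provider that deduplicates refunds by idempotency key. FakeProvider is a hypothetical in-memory stand-in, not a real SDK, and the "refund:" key prefix is an illustrative choice.

```python
class FakeProvider:
    """Hypothetical in-memory stand-in for a provider refund API."""

    def __init__(self) -> None:
        self._refunds_by_key = {}
        self.refund_count = 0

    def refund(self, charge_id: str, amount_cents: int, idempotency_key: str) -> dict:
        # A repeated key returns the stored refund instead of issuing a new one.
        if idempotency_key in self._refunds_by_key:
            return self._refunds_by_key[idempotency_key]
        self.refund_count += 1
        refund = {
            "id": f"re_{self.refund_count}",
            "charge_id": charge_id,
            "amount_cents": amount_cents,
        }
        self._refunds_by_key[idempotency_key] = refund
        return refund


def compensate_charge(provider: FakeProvider, charge_id: str, amount_cents: int) -> dict:
    """Derive the refund key from the charge ID so orchestrator retries collapse into one refund."""
    return provider.refund(charge_id, amount_cents, idempotency_key=f"refund:{charge_id}")
```

Calling compensate_charge twice for the same charge returns the same refund and leaves refund_count at 1, which is the property a retried saga compensation relies on.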

Data Model

The data model is organized around four tables: payments for the canonical payment record and current state, payment_events as the immutable audit log of every state transition, saga_steps for orchestration recovery, and ledger_entries for the double-entry accounting records.

-- Core payment record
CREATE TABLE payments (
    id                  UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    idempotency_key     VARCHAR(255) NOT NULL,
    user_id             UUID NOT NULL,
    amount_cents        BIGINT NOT NULL CHECK (amount_cents > 0),
    currency            CHAR(3) NOT NULL,
    status              VARCHAR(20) NOT NULL DEFAULT 'PENDING',
    provider            VARCHAR(20),
    provider_txn_id     VARCHAR(255),
    provider_metadata   JSONB DEFAULT '{}',
    error_code          VARCHAR(64),
    error_message       TEXT,
    created_at          TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at          TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    idempotency_expires_at TIMESTAMPTZ NOT NULL,
    CONSTRAINT payment_status_check CHECK (
        status IN ('PENDING','PROCESSING','SUCCEEDED','FAILED','TIMED_OUT','REFUNDED')
    )
);

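-- Namespace idempotency key uniqueness per user.
-- PostgreSQL rejects NOW() in a partial-index predicate (predicate functions must
-- be IMMUTABLE), so the index covers all rows and expiry has to be enforced
-- separately, for example by a scheduled job that releases keys past
-- idempotency_expires_at.
CREATE UNIQUE INDEX idx_payments_idempotency ON payments (user_id, idempotency_key);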

CREATE INDEX idx_payments_user_status ON payments (user_id, status, created_at DESC);
CREATE INDEX idx_payments_provider_txn ON payments (provider, provider_txn_id)
    WHERE provider_txn_id IS NOT NULL;

-- Partition by created_at for archive management (7 year retention)
-- CREATE TABLE payments_2026_06 PARTITION OF payments FOR VALUES FROM ('2026-06-01') TO ('2026-07-01');

-- Immutable audit log - append only, never update
CREATE TABLE payment_events (
    id          UUID NOT NULL DEFAULT gen_random_uuid(),
    payment_id  UUID NOT NULL REFERENCES payments(id),
    from_status VARCHAR(20) NOT NULL,
    to_status   VARCHAR(20) NOT NULL,
    metadata    JSONB DEFAULT '{}',
    occurred_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    -- A primary key on a partitioned table must include the partition key
    PRIMARY KEY (id, occurred_at)
) PARTITION BY RANGE (occurred_at);

CREATE INDEX idx_events_payment_time ON payment_events (payment_id, occurred_at DESC);

-- Saga orchestration log for crash recovery
CREATE TABLE saga_steps (
    id               UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    payment_id       UUID NOT NULL REFERENCES payments(id),
    step_name        VARCHAR(64) NOT NULL,
    step_index       SMALLINT NOT NULL,
    status           VARCHAR(30) NOT NULL DEFAULT 'PENDING',
    error_message    TEXT,
    started_at       TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    completed_at     TIMESTAMPTZ,
    compensated_at   TIMESTAMPTZ,
    CONSTRAINT saga_step_unique UNIQUE (payment_id, step_name)
);

CREATE INDEX idx_saga_payment ON saga_steps (payment_id);

-- Double-entry ledger: every payment creates exactly two entries (debit + credit)
CREATE TABLE ledger_entries (
    id            UUID NOT NULL DEFAULT gen_random_uuid(),
    payment_id    UUID NOT NULL REFERENCES payments(id),
    account_id    UUID NOT NULL,
    entry_type    CHAR(2) NOT NULL CHECK (entry_type IN ('DR', 'CR')),
    amount_cents  BIGINT NOT NULL CHECK (amount_cents > 0),
    currency      CHAR(3) NOT NULL,
    balance_after BIGINT NOT NULL,
    description   TEXT,
    posted_at     TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    -- A primary key on a partitioned table must include the partition key
    PRIMARY KEY (id, posted_at)
) PARTITION BY RANGE (posted_at);

CREATE INDEX idx_ledger_account_time ON ledger_entries (account_id, posted_at DESC);
CREATE INDEX idx_ledger_payment ON ledger_entries (payment_id);
Payment data flow showing request path through idempotency check, saga steps, and response caching

Partitioning ledger_entries by posted_at lets you drop monthly partitions after the 7-year retention window expires, keeping the live table lean. The double-entry invariant is enforced at the application layer: every write_ledger saga step inserts exactly one DR and one CR entry in the same database transaction and verifies that the two amounts net to zero before committing. A row-level CHECK constraint cannot express this, because it sees only one row at a time; the per-row CHECKs in the schema guard entry_type and positive amounts only.
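
The zero-sum invariant can be checked with a few lines of application code. This is a sketch, assuming entries are available as (payment_id, entry_type, amount_cents) tuples; the function name is hypothetical, and the same logic could back either the pre-commit check or the reconciliation query.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def unbalanced_payments(entries: Iterable[Tuple[str, str, int]]) -> Dict[str, int]:
    """Return payment_id -> net amount for payments whose DR and CR totals differ."""
    net = defaultdict(int)
    for payment_id, entry_type, amount_cents in entries:
        if entry_type == "DR":
            net[payment_id] += amount_cents
        elif entry_type == "CR":
            net[payment_id] -= amount_cents
        else:
            raise ValueError(f"unknown entry_type: {entry_type!r}")
    # A balanced payment nets to exactly zero; anything else is reported.
    return {pid: amount for pid, amount in net.items() if amount != 0}
```

An empty result means every payment balances; a non-empty one names the payments that need a balancing entry or manual review.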

Key Algorithms and Protocols

Idempotency Key Hashing

Raw client keys must be namespaced and hashed before use as cache keys to prevent cross-user collision and key enumeration attacks.

import hashlib
import hmac
import os

_HASH_SECRET = os.environ["IDEMPOTENCY_HASH_SECRET"].encode()

def make_cache_key(user_id: str, raw_key: str) -> str:
    """
    HMAC-SHA256 the user-namespaced key with a server secret.
    This prevents clients from predicting cache key collisions
    and ensures different users' identical key strings never collide.
    """
    namespace = f"{user_id}:{raw_key}"
    mac = hmac.new(_HASH_SECRET, namespace.encode(), hashlib.sha256)
    return "idem:" + mac.hexdigest()
Key Insight

Using HMAC instead of plain SHA-256 for key hashing adds server-side secret rotation capability. If your hash secret leaks, you can rotate it and all existing keys immediately become unresolvable, forcing a clean slate. That clean slate also drops deduplication for payments still in flight, so a retry that arrives after rotation is treated as a new payment. With plain SHA-256 there is no secret to rotate - anyone who learns a user ID and raw key can compute the cache key offline.

Duplicate Detection Window

The duplicate detection window must handle the gap between a record entering IN_PROGRESS and the response being sealed into the idempotency store. During this window, a retry must receive a deterministic “in progress” response rather than being allowed to spawn a second saga.

import json
import time
import redis
from typing import Optional

def check_duplicate_window(
    r: redis.Redis,
    cache_key: str,
    retry_deadline_secs: int = 30
) -> Optional[dict]:
    """
    During the in-progress window, poll with bounded backoff rather than
    immediately allowing the retry to create a new saga.
    Returns the stored response once the original finishes (COMPLETED or FAILED).
    Returns None if the key has expired or the deadline passes; callers that
    need to tell these two cases apart must re-check the key themselves.
    """
    deadline = time.monotonic() + retry_deadline_secs
    backoff = 0.1  # start at 100ms

    while time.monotonic() < deadline:
        record = r.hgetall(cache_key)
        if not record:
            # Key expired - treat as new payment
            return None

        # Assumes a client created without decode_responses (bytes keys/values)
        status = record.get(b"status", b"").decode()
        if status in ("COMPLETED", "FAILED"):
            # Both terminal states replay the stored body; empty means "{}"
            response_raw = record.get(b"response", b"").decode() or "{}"
            return json.loads(response_raw)

        # Still IN_PROGRESS - back off, but never sleep past the deadline
        remaining = deadline - time.monotonic()
        time.sleep(max(0.0, min(backoff, 2.0, remaining)))
        backoff *= 1.5

    return None  # Deadline exceeded - escalate to timeout recovery
Key Insight

The 30-second duplicate detection window must be shorter than your idempotency key TTL but longer than your p99 payment provider latency. If the window is shorter than the provider round-trip time, retries will slip through and you will issue duplicate charges. Instrument your provider adapter latencies and set the window to at least 2 * p99_provider_latency. For most providers this is 8-12 seconds; set the window to 30 seconds for comfortable margin.
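
The sizing rule above (at least 2 * p99_provider_latency, with a 30-second floor and staying below the key TTL) can be sketched as follows. The function names, the nearest-rank percentile method, and the latency samples are assumptions for illustration.

```python
from typing import Sequence


def p99(samples: Sequence[float]) -> float:
    """Nearest-rank 99th percentile of a non-empty sample list."""
    ordered = sorted(samples)
    # Integer ceiling of 0.99 * n, avoiding floating-point rounding surprises.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def detection_window_secs(latencies: Sequence[float],
                          floor_secs: float = 30.0,
                          key_ttl_secs: float = 86400.0) -> float:
    """Window = max(floor, 2 * p99), and it must stay below the key TTL."""
    window = max(floor_secs, 2 * p99(latencies))
    if window >= key_ttl_secs:
        raise ValueError("detection window must be shorter than the idempotency key TTL")
    return window
```

With provider latencies that sit well under 15 seconds the 30-second floor wins; the 2 * p99 term only takes over when the provider is unusually slow.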

At-Least-Once Retry with Exponential Backoff

Payment clients must retry with backoff, but the retry window has a hard upper bound beyond which the payment should be treated as failed rather than retried indefinitely.

package retry

import (
    "context"
    "errors"
    "fmt"
    "math"
    "math/rand"
    "time"
)

type RetryConfig struct {
    MaxAttempts int
    BaseDelay   time.Duration
    MaxDelay    time.Duration
    Multiplier  float64
    JitterFrac  float64
}

var DefaultPaymentRetry = RetryConfig{
    MaxAttempts: 4,
    BaseDelay:   500 * time.Millisecond,
    MaxDelay:    10 * time.Second,
    Multiplier:  2.0,
    JitterFrac:  0.2,
}

// Sentinel errors that should NOT be retried (client errors)
var ErrInsufficientFunds = errors.New("insufficient_funds")
var ErrCardDeclined = errors.New("card_declined")
var ErrInvalidCard = errors.New("invalid_card")

func isRetryable(err error) bool {
    // Only retry transient provider errors, never hard declines
    if errors.Is(err, ErrInsufficientFunds) { return false }
    if errors.Is(err, ErrCardDeclined)      { return false }
    if errors.Is(err, ErrInvalidCard)       { return false }
    return true
}

func WithRetry(ctx context.Context, cfg RetryConfig, op func(ctx context.Context) error) error {
    var lastErr error
    for attempt := 0; attempt < cfg.MaxAttempts; attempt++ {
        lastErr = op(ctx)
        if lastErr == nil {
            return nil
        }
        if !isRetryable(lastErr) {
            return lastErr  // Hard failure - do not retry
        }

        if attempt == cfg.MaxAttempts-1 {
            break
        }

        delay := float64(cfg.BaseDelay) * math.Pow(cfg.Multiplier, float64(attempt))
        if delay > float64(cfg.MaxDelay) {
            delay = float64(cfg.MaxDelay)
        }
        // Add jitter to prevent synchronized retry storms across API nodes
        jitter := (rand.Float64()*2 - 1) * cfg.JitterFrac * delay
        sleep := time.Duration(delay + jitter)

        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-time.After(sleep):
        }
    }
    return fmt.Errorf("exhausted %d retry attempts: %w", cfg.MaxAttempts, lastErr)
}
Key Insight

Jitter is not optional for payment retries. Without it, every API node whose request failed in the same Stripe 5xx window retries on the same schedule (500ms after the failure, then 1s later, then 2s later), creating synchronized load spikes that slow the provider’s recovery. The Go code above applies proportional jitter of plus or minus 20% around each delay. Full jitter, which draws the delay uniformly from zero up to the current backoff ceiling, spreads retries further. AWS’s published analysis of backoff strategies found that jittered backoff substantially reduces contention and wasted calls compared to unjittered exponential backoff.
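
Full jitter, as opposed to the proportional jitter in the Go listing, can be sketched as below. The function names are hypothetical, and the defaults reuse the base delay, cap, and multiplier from DefaultPaymentRetry.

```python
import random


def backoff_ceiling(attempt: int, base: float = 0.5, cap: float = 10.0,
                    multiplier: float = 2.0) -> float:
    """Capped exponential ceiling in seconds: min(cap, base * multiplier ** attempt)."""
    return min(cap, base * multiplier ** attempt)


def full_jitter_delay(attempt: int, rng=random, **kwargs) -> float:
    """Draw the delay uniformly from zero up to the backoff ceiling."""
    return rng.uniform(0.0, backoff_ceiling(attempt, **kwargs))
```

Passing a seeded random.Random instance as rng makes the delays reproducible in tests while production code keeps the module-level generator.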

Scaling and Performance

The system scales on three independent axes: API and Idempotency nodes scale horizontally via stateless load balancing, Payment Orchestrators scale via consistent hashing on user_id to colocate concurrent requests for the same user, and the Ledger DB scales via primary-replica reads with partition-level sharding for high write throughput.

Horizontal scaling diagram showing consistent hashing across API nodes, sharded Redis idempotency store, and primary-replica ledger DB

The idempotency key store is the first scaling bottleneck. At 50,000 transactions per hour, each transaction generates 3-5 Redis operations (acquire, status check, complete), yielding up to 250,000 Redis ops/hour at 5 operations each, or roughly 70 ops/second sustained - well within a single Redis instance’s capacity. At peak flash-sale load of 200,000 transactions per hour, the same upper bound gives about 280 ops/second. Use Redis Cluster with at least 3 shards, partitioned by user_id hash slot, to handle this with headroom.

The Ledger DB is the second bottleneck. Double-entry writes require a transaction inserting two ledger rows plus updating the payments row, for 3 writes per transaction. At 50,000 transactions per hour, that’s 150,000 writes per hour or 42 writes/second - comfortably handled by a single PostgreSQL primary. At 10x peak, use pgBouncer for connection pooling and read replicas for the reconciliation job’s audit queries.

Capacity estimation:

  Sustained load:        50,000 txns/hour = 14 txns/second
  Peak (flash sale):    200,000 txns/hour = 56 txns/second

  Redis Cluster sizing:
    Ops per txn: 4 (acquire, lock, complete, expire)
    Peak Redis ops: 56 * 4 = 224 ops/second
    Redis cluster: 3 shards x 100,000 ops/sec capacity = 300,000 ops/sec headroom

  Ledger DB sizing (PostgreSQL):
    Writes per txn: 3 (2 ledger entries + 1 payment update)
    Audit events per txn: 2 (PENDING->PROCESSING, PROCESSING->SUCCEEDED)
    Peak write load: 56 * 5 = 280 writes/second
    Single primary handles up to ~5,000 writes/second with WAL

  Storage (Ledger - 7 year retention):
    ~400 bytes per ledger_entry row
    2 entries * 50,000 txns/hour * 8760 hours/year * 7 years = 6.1B rows
    Storage: 6.1B * 400 bytes = ~2.4 TB (~350 GB per year, ~29 GB per monthly partition)

  Idempotency store (Redis - 24 hour TTL):
    ~800 bytes per key (hash with response body)
    50,000 keys/hour * 24 hours = 1.2M live keys
    Memory: 1.2M * 800 bytes = ~960 MB -> fits in 3 x 2GB Redis shards easily
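
The estimate above can be reproduced as a short script so the inputs are easy to change. The variable names are illustrative, the per-row and per-key sizes are the ones assumed in the estimate, and the monthly partition size is derived here by dividing the 7-year total by 84 months.

```python
import math

SUSTAINED_TXNS_PER_HOUR = 50_000
PEAK_TXNS_PER_HOUR = 200_000


def per_second(per_hour: int) -> int:
    """Round up to a whole number, matching the rates used in the estimate."""
    return math.ceil(per_hour / 3600)


sustained_tps = per_second(SUSTAINED_TXNS_PER_HOUR)
peak_tps = per_second(PEAK_TXNS_PER_HOUR)

peak_redis_ops = peak_tps * 4          # 4 Redis ops per transaction
peak_db_writes = peak_tps * (3 + 2)    # 3 ledger/payment writes + 2 audit events

# Ledger storage over the 7-year retention window at ~400 bytes per row
ledger_rows = 2 * SUSTAINED_TXNS_PER_HOUR * 8760 * 7
ledger_bytes = ledger_rows * 400
ledger_tb = ledger_bytes / 1e12
monthly_partition_gb = ledger_bytes / (7 * 12) / 1e9

# Idempotency store with a 24-hour TTL at ~800 bytes per key
live_keys = SUSTAINED_TXNS_PER_HOUR * 24
redis_mb = live_keys * 800 / 1e6
```

Note that the storage line uses the sustained rate for every hour of the year, so it is an upper bound for a platform whose traffic dips overnight.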
Real World

Stripe’s engineering team published that their idempotency layer stores over 100 million active keys at peak, using a combination of a fast in-memory cache layer and a durable PostgreSQL backing store. Their key architectural decision was to make the idempotency store the source of truth for response replay rather than re-deriving the response from the payment record - which is exactly the design described here. They also noted that the most common production bug in payment idempotency is not checking the idempotency record status before executing side effects, which allows a crashed-and-restarted orchestrator to re-execute a saga whose first step already completed at the provider.

Failure Modes and Recovery

| Failure | Detection | Impact | Recovery |
|---|---|---|---|
| Idempotency Redis node crash | Health check fails, cache miss spike | New requests proceed without duplicate check for TTL window | Redis Cluster failover promotes replica in under 5s; in-flight IN_PROGRESS keys must be audited against payment DB |
| Payment provider timeout (no response) | HTTP 408 / connection timeout | Payment status unknown - may or may not have been charged | Query provider using provider_txn_id via their retrieve API; update payment status from provider response |
| Saga partial completion (ledger write fails after provider charge) | saga_steps shows charge_provider=SUCCEEDED, write_ledger=FAILED | Provider charged user but ledger not updated | Saga compensator calls provider refund API using original charge ID as idempotency key; alerts reconciliation job |
| Orchestrator crash mid-saga | Missing heartbeat or pod restart | Saga left in partially completed state | Resume saga from last completed step using saga_steps log; re-execute only PENDING and FAILED steps |
| Duplicate provider charge (provider idempotency failure) | Reconciliation job finds two provider charges for one payment ID | User double-charged | Reconciliation job triggers compensating refund for duplicate; alerts on-call engineer |
| Ledger imbalance (DR and CR entries do not sum to zero) | Nightly reconciliation query | Books out of balance | Alert triggers manual review; automated correction inserts balancing entry with audit note |

Watch Out

The most dangerous failure is a provider charge with no ledger record. This happens when the write_ledger saga step fails after charge_provider succeeds, and the compensation function also fails (for example, the provider refund API is down). You now have a charged user and no record of why. The defense is a nightly reconciliation job that fetches all charges from the provider API for the past 24 hours and cross-references them against your ledger. Any provider charge with no matching ledger entry triggers an immediate alert with the provider transaction ID so an engineer can manually post the entry and investigate the saga failure.

Comparison of Approaches

| Approach | Consistency Guarantee | Complexity | Failure Recovery | Best Fit |
|---|---|---|---|---|
| Saga with orchestrator (this design) | Eventually consistent with compensation | High | Resume from saga_steps log | Multi-provider payment flows requiring audit trail |
| Two-phase commit (2PC) | Atomically consistent | Very High | Coordinator recovery protocol | Single-database payment within one RDBMS cluster |
| Outbox pattern only | Eventually consistent, no compensation | Medium | Retry outbox messages | Simple single-provider flows with idempotent provider API |
| Process manager with state machine | Eventually consistent with replay | High | Event log replay from checkpoint | Event-driven architectures with async provider callbacks |
| Choreography-based saga | Eventually consistent | Medium-High | Hard to reason about failure paths | Microservices with well-defined domain events, no shared state |

The saga orchestrator wins for multi-provider payment systems because the centralized orchestrator provides a single place to observe saga state, implement compensations, and recover from partial failures. Choreography-based sagas distribute saga logic across services, making it difficult to answer “why did this payment fail?” during an incident. Two-phase commit provides stronger consistency guarantees but requires all participants to support the 2PC protocol - most third-party payment APIs do not. The outbox pattern is simpler but only works when your provider API is strictly idempotent and you never need to compensate partially completed flows.

Key Takeaways

  • Idempotency keys are the client’s tracking number: every mutable API call must carry one, and the server must store both the deduplication signal and the original response to replay on retry.
  • The idempotency layer must be the first check before any side effect - not after the provider call, not after the ledger write, but before any step that cannot be safely re-executed.
  • Saga pattern replaces distributed transactions with a sequence of local transactions plus compensating actions, making partial failures recoverable without requiring 2PC support from payment providers.
  • The payment state machine enforces immutability of terminal states via conditional SQL updates - the optimistic lock pattern prevents concurrent processes from overwriting completed payments.
  • At-least-once retry with jitter is safe precisely because idempotency guarantees the provider receives at most one new charge per idempotency key, regardless of how many retries reach the orchestrator.
  • Double-entry ledger as the source of truth means your internal books are authoritative - provider records are a reconciliation input, not the ground truth. This inverts the usual integration pattern but is essential for auditability.
  • Reconciliation jobs are not optional - they are the safety net that catches every failure mode that slips past your real-time defenses, including provider-side bugs that your system had no visibility into.
  • Compensating transactions must themselves be idempotent - a refund triggered by saga compensation that gets called twice due to an orchestrator retry must not issue two credits; always pass the original charge ID as the refund idempotency key.
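The conditional-update takeaway can be illustrated with a small sketch. It uses an in-memory SQLite database as a stand-in for PostgreSQL, and the `payments` columns and the `transition` helper are assumptions made for illustration, not the article's actual schema. The point is only that the `WHERE ... AND status = ?` clause lets exactly one writer move a payment out of a given state.

```python
import sqlite3

def transition(conn, payment_id, expected, new):
    """Move a payment from `expected` to `new`.

    Returns False when no row matched, meaning another process already
    moved the payment out of the `expected` state.
    """
    cur = conn.execute(
        "UPDATE payments SET status = ? WHERE id = ? AND status = ?",
        (new, payment_id, expected),
    )
    conn.commit()
    return cur.rowcount == 1

# Hypothetical minimal schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (id TEXT PRIMARY KEY, status TEXT NOT NULL)")
conn.execute("INSERT INTO payments VALUES ('pay_1', 'PROCESSING')")

# Two workers race to finish the same payment; only the first update applies.
first = transition(conn, "pay_1", "PROCESSING", "SUCCEEDED")
second = transition(conn, "pay_1", "PROCESSING", "FAILED")
final_status = conn.execute(
    "SELECT status FROM payments WHERE id = 'pay_1'"
).fetchone()[0]
```

The losing writer sees `rowcount == 0` and should re-read the payment rather than retry the write, since the row is already in a terminal state.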

The hardest conceptual shift in payment system design is accepting that “exactly once” is impossible at the network layer, and designing instead for “at most once at the provider, at least once at the orchestrator, with idempotency guaranteeing the user sees exactly one outcome.” That separation of concerns - retry freely at the infrastructure layer, deduplicate at the application layer - is what makes reliable payments at scale achievable.
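As a rough sketch of "retry freely, deduplicate at the application layer", the helper below retries a timed-out call with full-jitter exponential backoff while sending the same idempotency key on every attempt. The function names, the base and cap values, and the `flaky_send` stand-in for a provider call are all assumptions chosen for illustration.

```python
import random

def backoff_delays(attempts, base=0.2, cap=5.0, rng=None):
    """Full-jitter delays: each is uniform in [0, min(cap, base * 2**n)]."""
    rng = rng or random.Random()
    return [rng.uniform(0, min(cap, base * 2 ** n)) for n in range(attempts)]

def call_with_retries(send, idempotency_key, attempts=4, sleep=lambda seconds: None):
    """Call send(key) until it returns, reusing the same key on every attempt.

    `sleep` is a no-op by default so the sketch runs instantly; a real
    caller would pass time.sleep.
    """
    last_error = None
    for delay in backoff_delays(attempts):
        try:
            return send(idempotency_key)
        except TimeoutError as exc:
            last_error = exc
            sleep(delay)
    raise last_error

# Demo: a hypothetical sender that times out twice, then succeeds.
seen_keys = []

def flaky_send(key):
    seen_keys.append(key)
    if len(seen_keys) < 3:
        raise TimeoutError("provider timed out")
    return {"status": "SUCCEEDED", "key": key}

result = call_with_retries(flaky_send, "pay_123")
```

The key is created once, outside the loop, so all three attempts are the same operation from the idempotency layer's point of view.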

Frequently Asked Questions

Q: Why use a Redis-based idempotency store instead of just checking the payments database?

A: The payments database check adds a synchronous database round-trip on every request, including high-frequency duplicate retries that would otherwise be deflected at the cache layer. More importantly, the idempotency store keeps the serialized response body, so duplicates receive the exact original response - including the same payment ID, timestamps, and metadata. Re-deriving the response from the payments table requires joining multiple tables and applying business logic, which is slower and risks returning a slightly different response body if the serialization logic changes between the original request and the retry. Redis gives you the cached original response in under 1ms.

Q: Why not use two-phase commit across all payment steps instead of sagas?

A: Two-phase commit requires all participants to implement the 2PC protocol - specifically, a coordinator that can block transactions while waiting for participant votes. Stripe, Braintree, and PayPal do not implement 2PC. They are third-party HTTP APIs with their own transaction semantics. Any payment system that touches external providers must use a compensation-based approach because you cannot include a third-party API in a distributed transaction. Even within your own services, 2PC performance degrades significantly under coordinator failures and requires careful deadlock analysis across all participants.

Q: How do you handle a payment provider that has its own idempotency implementation differently from yours?

A: Each provider adapter generates a provider-specific idempotency key derived from your internal payment_id. Stripe accepts an Idempotency-Key header; Braintree uses a submitForSettlementIdempotencyKey; PayPal uses PayPal-Request-Id. The adapter maps your payment_id to the provider’s idempotency mechanism. The key is to use a stable, deterministic value - the payment_id itself works well - so retrying the adapter call with the same payment_id always sends the same provider key and the provider deduplicates on their side. Never generate a random idempotency key inside the retry loop; generate it once before the first attempt.
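One possible shape for the adapter mapping is sketched below. It derives a deterministic UUID from the provider name and `payment_id` using `uuid.uuid5`, which is an alternative to sending the raw `payment_id`; the namespace value and the `provider_idempotency_header` helper are hypothetical. Only the Stripe and PayPal header names mentioned above are included.

```python
import uuid

# Hypothetical fixed namespace; any UUID works as long as it never changes.
KEY_NAMESPACE = uuid.UUID("12345678-1234-5678-1234-567812345678")

# Header names as given in the answer above.
PROVIDER_HEADERS = {
    "stripe": "Idempotency-Key",
    "paypal": "PayPal-Request-Id",
}

def provider_idempotency_header(provider, payment_id):
    """Build the idempotency header for one provider call.

    uuid5 is deterministic: the same provider and payment_id always yield
    the same key, so every retry of the adapter call sends an identical
    header and the provider can deduplicate.
    """
    key = uuid.uuid5(KEY_NAMESPACE, f"{provider}:{payment_id}")
    return {PROVIDER_HEADERS[provider]: str(key)}

first_attempt = provider_idempotency_header("stripe", "pay_123")
retry_attempt = provider_idempotency_header("stripe", "pay_123")
paypal_attempt = provider_idempotency_header("paypal", "pay_123")
```

Including the provider name in the hashed input keeps keys distinct if the same payment is later routed to a second provider.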

Q: What happens when the reconciliation job finds a provider charge with no ledger entry?

A: The reconciliation job classifies the discrepancy by checking the saga_steps table. If charge_provider is SUCCEEDED but write_ledger is FAILED or COMPENSATED, this is a known partial saga failure. The reconciliation job verifies whether the compensation refund was issued (by querying the provider for a refund against that charge). If no refund exists, it creates a manual review ticket and optionally issues a refund automatically. If charge_provider has no corresponding saga_steps record at all, this indicates a provider-side duplicate charge caused by the provider’s own systems - escalate immediately to the provider’s support team with the charge ID and your transaction timeline.
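The decision tree in this answer could be sketched as a pure function. The outcome labels and the `classify_discrepancy` name are hypothetical; `saga_steps` is modeled as a plain dict of step name to status, or `None` when no saga record exists for the charge.

```python
def classify_discrepancy(saga_steps, refund_exists):
    """Classify a provider charge that has no ledger entry.

    saga_steps: dict mapping step name to status, or None if the charge
                has no saga record at all.
    refund_exists: whether the provider reports a refund for the charge.
    """
    if saga_steps is None:
        # No saga ever ran for this charge: treat as a provider-side duplicate.
        return "ESCALATE_TO_PROVIDER"
    charged = saga_steps.get("charge_provider") == "SUCCEEDED"
    ledger_missing = saga_steps.get("write_ledger") in ("FAILED", "COMPENSATED")
    if charged and ledger_missing:
        # Known partial saga failure: check whether compensation completed.
        return "COMPENSATION_CONFIRMED" if refund_exists else "MANUAL_REVIEW"
    return "UNCLASSIFIED"

partial = {"charge_provider": "SUCCEEDED", "write_ledger": "FAILED"}
outcomes = [
    classify_discrepancy(partial, refund_exists=True),
    classify_discrepancy(partial, refund_exists=False),
    classify_discrepancy(None, refund_exists=False),
]
```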

Q: Why store monetary amounts as integers instead of decimals or floats?

A: Binary floating-point cannot represent most decimal fractions exactly. 0.1 + 0.2 in IEEE 754 double precision does not equal 0.3. These rounding errors accumulate and become a serious source of ledger imbalance when summing thousands of transactions. Storing amounts in the smallest currency unit as a BIGINT means all arithmetic is exact integer arithmetic. A payment of $10.99 is stored as 1099 cents; a payment of 399.00 Indian rupees is stored as 39900 paise. The only division that ever occurs is when displaying to users, and that happens in the presentation layer with controlled rounding rules. This is the approach Stripe’s API takes and a common convention in payment infrastructure libraries.
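A short sketch makes the difference concrete. The helper names `to_minor_units` and `format_minor_units` are hypothetical, and the default exponent of 2 assumes a currency with two decimal places.

```python
from decimal import Decimal, ROUND_HALF_UP

# Float sums drift; integer sums of minor units do not.
float_equal = (0.1 + 0.2 == 0.3)
float_sum_exact = (sum([0.1] * 10) == 1.0)
int_sum_exact = (sum([10] * 10) == 100)  # ten payments of 10 cents

def to_minor_units(amount: str, exponent: int = 2) -> int:
    """Parse a decimal string such as "10.99" into integer minor units."""
    scaled = Decimal(amount).scaleb(exponent)
    return int(scaled.quantize(Decimal(1), rounding=ROUND_HALF_UP))

def format_minor_units(minor: int, exponent: int = 2) -> str:
    """Render integer minor units for display; presentation layer only."""
    return str(Decimal(minor).scaleb(-exponent))

usd = to_minor_units("10.99")
inr = to_minor_units("399.00")
shown = format_minor_units(usd)
```

Parsing from a string rather than a float matters: `Decimal("10.99")` is exact, whereas converting the float `10.99` would carry its binary rounding error into the ledger.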

Q: How do you prevent a user from submitting the same payment with different amounts under the same idempotency key?

A: The idempotency key is bound to the complete request fingerprint at creation time, not just the key string. When a new idempotency record is created, we store a hash of the request body alongside the key. On subsequent requests with the same key, we compare the incoming request body hash against the stored one. If they differ - same key, different amount - we return HTTP 422 Unprocessable Entity with a clear error: “idempotency key reused with different request parameters.” This prevents accidental amount mutations on retry and also prevents intentional fraud attempts that reuse keys to alter transaction amounts.
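A minimal sketch of the fingerprint check follows, using an in-memory dict as a stand-in for the Redis store. The class and method names and the canonical-JSON hashing scheme are assumptions; the article only specifies that a hash of the request body is stored with the key.

```python
import hashlib
import json

def fingerprint(body: dict) -> str:
    """Hash a canonical JSON encoding so key order does not affect the result."""
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

class IdempotencyStore:
    """In-memory stand-in for the idempotency store (illustration only)."""

    def __init__(self):
        self._fingerprints = {}

    def check(self, key: str, body: dict) -> str:
        fp = fingerprint(body)
        stored = self._fingerprints.get(key)
        if stored is None:
            self._fingerprints[key] = fp
            return "new"
        # Same key: allow a replay only if the request body is unchanged.
        return "replay" if stored == fp else "mismatch"

store = IdempotencyStore()
r1 = store.check("key-1", {"amount": 1099, "currency": "USD"})
r2 = store.check("key-1", {"currency": "USD", "amount": 1099})  # reordered
r3 = store.check("key-1", {"amount": 9999, "currency": "USD"})  # altered
```

A "mismatch" result is what the API layer would translate into the HTTP 422 response described above.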

Interview Questions

Q: Walk me through how you guarantee a user is never double-charged even when your API pod crashes between the provider charge and the ledger write.

Expected depth: Explain the saga step log: charge_provider writes a SUCCEEDED row to saga_steps before returning to the orchestrator. If the orchestrator crashes after that write, the resumed saga reads saga_steps, sees charge_provider=SUCCEEDED, and skips re-executing it - proceeding directly to write_ledger. Explain that the compensation path is the other branch: if write_ledger fails and cannot be retried, the compensator calls the provider refund API using the original payment_id as the refund idempotency key, ensuring the refund is issued at most once. The user sees a failed payment rather than a charged one with no ledger record.
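The resume behavior can be sketched with the step log modeled as an ordered dict. The `resume_saga` function and the `publish_event` step name are hypothetical; `charge_provider` and `write_ledger` come from the text.

```python
def resume_saga(step_log, handlers):
    """Re-run a saga from its step log, skipping steps that already succeeded.

    step_log: dict of step name to status, in execution order.
    handlers: dict of step name to a zero-argument callable.
    Returns the names of the steps executed during this resume.
    """
    executed = []
    for step in list(step_log):
        if step_log[step] == "SUCCEEDED":
            continue  # never repeat a completed side effect
        try:
            handlers[step]()
        except Exception:
            step_log[step] = "FAILED"
            break  # stop here; compensation or a later retry takes over
        step_log[step] = "SUCCEEDED"
        executed.append(step)
    return executed

# The orchestrator crashed after the charge was recorded as SUCCEEDED.
calls = []
step_log = {
    "charge_provider": "SUCCEEDED",
    "write_ledger": "PENDING",
    "publish_event": "PENDING",
}
handlers = {name: (lambda name=name: calls.append(name)) for name in step_log}
executed = resume_saga(step_log, handlers)
```

In this run the charge handler is never invoked, which is the property that prevents the double charge.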

Q: How would you design the reconciliation job to detect a discrepancy where Stripe charged the user but our system has no record of it?

Expected depth: Describe fetching all charges from the Stripe API for a time window using their list charges endpoint with created[gte] and created[lte] parameters. For each Stripe charge, look up the payment_id stored in Stripe’s metadata field (which we set when creating the charge). Query our ledger_entries table for a matching payment_id. A Stripe charge with no matching ledger entry is a discrepancy. Discuss three causes: saga compensation failed to refund, Stripe created a duplicate charge due to their own bug, or the charge came from a test/rogue API key. Classify each by checking saga_steps. Mention that this job should run at least once every 24 hours so that discrepancies are found and corrected before customers notice them and raise disputes.
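The cross-referencing step might look like the sketch below. It works on already-fetched data: `provider_charges` is assumed to be a list of charge dicts with `id` and `metadata` fields, and `ledger_payment_ids` a set of payment IDs read from ledger_entries. Fetching from the provider and the database is left out, and the function name is hypothetical.

```python
def find_unmatched_charges(provider_charges, ledger_payment_ids):
    """Return provider charges that have no matching ledger entry.

    A charge with no payment_id in its metadata is also reported, since it
    cannot be tied to any payment we created.
    """
    unmatched = []
    for charge in provider_charges:
        payment_id = charge.get("metadata", {}).get("payment_id")
        if payment_id is None or payment_id not in ledger_payment_ids:
            unmatched.append({"charge_id": charge["id"], "payment_id": payment_id})
    return unmatched

# Illustrative data only.
charges = [
    {"id": "ch_1", "metadata": {"payment_id": "pay_1"}},
    {"id": "ch_2", "metadata": {"payment_id": "pay_2"}},
    {"id": "ch_3", "metadata": {}},
]
ledger_ids = {"pay_1"}
discrepancies = find_unmatched_charges(charges, ledger_ids)
```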

Q: A flash sale causes 5,000 concurrent payment requests in the first second. How does your idempotency layer handle this without Redis becoming a bottleneck?

Expected depth: Discuss Redis Cluster with at least 3 shards, with idempotency keys distributed across shards by hash slot. 5,000 requests per second generate roughly 20,000 Redis ops/second (4 ops per request) - well within a Redis Cluster’s capacity of about 300,000 ops/second across 3 shards. The more important concern is hot keys: when many users retry their own payments simultaneously, each user’s key lives on one shard and is accessed independently, so there is no single hot key. Contrast with a naive single-node Redis, where one misconfigured client hammering retries on a single idempotency key could saturate the only instance and slow every other payment. Mention connection pooling to avoid connection exhaustion: use a pool of 20-50 connections per API node shared across all requests.
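To make the key-distribution argument concrete, the sketch below computes Redis Cluster hash slots (CRC16 of the key modulo 16384) and maps them onto three shards. The equal contiguous slot ranges per shard and the key naming scheme are assumptions, and hash tags in braces are ignored for simplicity.

```python
import binascii

SLOTS = 16384  # number of hash slots in a Redis Cluster

def key_slot(key: str) -> int:
    """CRC16 (XMODEM variant) of the key, modulo 16384. Hash tags are ignored."""
    return binascii.crc_hqx(key.encode("utf-8"), 0) % SLOTS

def shard_for(key: str, shards: int = 3) -> int:
    """Assumption: slots are split into equal contiguous ranges, one per shard."""
    return key_slot(key) * shards // SLOTS

# Load arithmetic from the answer above.
peak_ops = 5_000 * 4            # requests/second * Redis ops per request
cluster_capacity = 3 * 100_000  # shards * assumed ops/second per shard

# Hypothetical key scheme: one idempotency key per user and order.
keys = [f"idem:user-{n}:order-{n}" for n in range(3000)]
per_shard = [0, 0, 0]
for k in keys:
    per_shard[shard_for(k)] += 1
```

Distinct keys land on all three shards, while repeated requests for one key always hit the same shard, which is why a single abusive key is a per-shard problem rather than a cluster-wide one.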

Q: How would you extend this design to support split payments where one purchase is charged across two different payment methods?

Expected depth: Model a split payment as a parent payment record with two child payment_leg records, each with its own idempotency key derived from parent_payment_id + leg_index. The saga orchestrates both legs as parallel steps: both provider charges execute concurrently, and the ledger write only proceeds if both succeed. If leg 1 succeeds and leg 2 fails, the compensation must refund leg 1 using its idempotency key. Discuss the user-facing state machine: the parent payment shows PROCESSING until both legs complete; if any leg fails, the parent transitions to FAILED and compensation runs. The key constraint is that both legs must use different payment methods - prevent the same card from appearing as both legs to avoid split-charge detection by card networks.
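A sketch of the leg-key derivation and the compensation decision is shown below. The key format and both function names are hypothetical; the rule they encode is the one described above: the parent succeeds only if every leg succeeds, otherwise every successful leg is refunded.

```python
def leg_key(parent_payment_id: str, leg_index: int) -> str:
    """Deterministic idempotency key for one leg of a split payment."""
    return f"{parent_payment_id}:leg:{leg_index}"

def settle_split(parent_payment_id, leg_results):
    """Decide the parent status and which legs need a compensating refund.

    leg_results: list of booleans, one per leg, True if the charge succeeded.
    Returns (parent_status, idempotency keys of legs to refund).
    """
    if all(leg_results):
        return "SUCCEEDED", []
    to_refund = [
        leg_key(parent_payment_id, index)
        for index, succeeded in enumerate(leg_results)
        if succeeded
    ]
    return "FAILED", to_refund

both_ok = settle_split("pay_9", [True, True])
second_failed = settle_split("pay_9", [True, False])
```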

Q: Design the API contract for the payment endpoint to make idempotency key misuse as difficult as possible for client developers.

Expected depth: Require Idempotency-Key as a mandatory header - return 400 if absent. Document that the key must be a UUID generated per payment intent, not per retry. Return 422 with a distinct error code idempotency_key_mismatch if the same key is submitted with different request parameters. Return 200 with the cached response (not 201) for duplicate requests that completed, so clients can distinguish “new success” from “replayed success” using the X-Idempotent-Replayed: true response header. Return 202 for duplicates that are still IN_PROGRESS with a Location header pointing to the payment status endpoint for polling. Document the 24-hour expiry prominently - a common client bug is building idempotency keys from date-based components that repeat across days, causing next-day payments to collide with yesterday’s completed ones.
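The response rules above can be condensed into one decision function. This is a sketch under assumptions: the record shape, the `/payments/{id}` status path, and the `respond` name are hypothetical, and it assumes a first-time request is answered with 201.

```python
def respond(idempotency_key, record, request_fingerprint):
    """Map a request and its stored idempotency record to (status, headers, body).

    record is None for a key that has not been seen before; otherwise it is a
    dict with "fingerprint", "status", "payment_id" and, once completed,
    "response".
    """
    if not idempotency_key:
        return 400, {}, {"error": "missing_idempotency_key"}
    if record is None:
        return 201, {}, None  # first request: caller goes on to execute the payment
    if record["fingerprint"] != request_fingerprint:
        return 422, {}, {"error": "idempotency_key_mismatch"}
    if record["status"] == "IN_PROGRESS":
        return 202, {"Location": f"/payments/{record['payment_id']}"}, None
    # Completed duplicate: replay the stored response and flag it as a replay.
    return 200, {"X-Idempotent-Replayed": "true"}, record["response"]

done = {"fingerprint": "abc", "status": "COMPLETED",
        "payment_id": "pay_1", "response": {"id": "pay_1"}}
running = {"fingerprint": "abc", "status": "IN_PROGRESS", "payment_id": "pay_2"}

replayed = respond("key-1", done, "abc")
mismatch = respond("key-1", done, "xyz")
polling = respond("key-2", running, "abc")
```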
