Build Uber's Surge Pricing Engine

distributed-systems scalability performance

System Design Deep Dive

Uber Surge Pricing Engine

Detecting supply-demand imbalance per geographic zone and computing surge multipliers in under a second without race conditions

14 min read · Advanced · Dynamic Pricing

Think of surge pricing like a commodities market on a city grid. When crude oil demand spikes and supply is constrained, the market price rises instantly - no committee, no deliberation. The higher price incentivizes more production and reduces some demand until equilibrium returns. Surge pricing does exactly this for urban transportation, except the “price” must be set per zone of roughly 5 km², updated every 30 seconds, and communicated to millions of users on mobile apps - all without two users seeing different prices for the same ride at the same instant.

The mechanics sound simple: count the drivers, count the ride requests, compute the ratio, apply a multiplier. But the engineering challenge hides in the gaps. Driver and rider counts are moving targets - a driver accepting a trip in one zone might physically be at the edge of an adjacent zone. A rider who sees a 1.2x surge price and clicks “confirm” must be charged exactly 1.2x, even if the price ticks up to 1.4x between their click and the backend processing the request. And you have to do all of this at the scale of 70+ cities simultaneously, with each city having 50-200 distinct surge zones.

Every 30 seconds, the surge engine must re-evaluate roughly 28,000 zone-vehicle-class combinations across all metros (about 7,000 zones times 4 vehicle classes), publish the updated prices to all connected mobile clients, and ensure that riders currently mid-booking see a consistent price throughout their booking flow. A price change that happens mid-checkout is a support nightmare and a legal grey area in many jurisdictions.

We need to solve for sub-second zone re-computation, strong price consistency within a booking session, and predictive surge modeling that smooths out the sawtooth oscillation that naive reactive pricing creates.

Requirements and Constraints

Functional Requirements

Divide each metro into geographic surge zones (H3 cells at resolution 7, roughly 5 km² per cell)
Continuously aggregate supply (available drivers) and demand (open ride requests + demand forecast) per zone
Compute a surge multiplier for each zone-vehicle-class combination every 30 seconds
Publish updated prices to all active rider apps within 5 seconds of computation
Guarantee that a rider who sees a quoted price gets exactly that price if they confirm within 60 seconds
Apply surge multipliers to fare estimates shown during the pre-booking screen
Communicate surge level changes to drivers to influence repositioning behavior
Expose zone surge status to internal services (matching engine, analytics, marketing)

Non-Functional Requirements

Surge computation latency: P99 under 500ms for full metro re-computation
Price publication latency: P99 under 5 seconds from computation to all connected clients
Price consistency: a locked price (user confirmed) must not change; zero race conditions between price display and charge
Zone count: up to 10,000 active zones globally (avg 100 per metro, 70+ metros)
Throughput: handle 50,000 concurrent rider connections per metro for price updates
Historical retention: surge data retained for 90 days for analytics and audit

Constraints and Assumptions

Surge only applies to marketplace pricing, not flat-rate tiers (subscriptions out of scope)
We assume driver locations are available from the matching system’s geo index
Fare calculation service is a separate system; the surge engine only produces multipliers
Legal teams may apply city-specific maximum multiplier caps (e.g., 5x ceiling); these are configuration, not code
Mobile clients connect via WebSocket for real-time price updates; HTTP polling is the fallback

High-Level Architecture

The surge pricing engine comprises five major components: zone aggregation, multiplier computation, price versioning, client fan-out, and the price lock service.

Figure: Uber surge pricing engine architecture overview, from zone aggregation to client fan-out

The Zone Aggregator is a stream-processing job (Flink or Kafka Streams) that consumes the driver location event stream and ride request stream. It maintains a rolling 5-minute window of supply and demand counts per H3 zone. Every 30 seconds, it emits a snapshot of all zone metrics to the Surge Computation Service.
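As a rough illustration of the rolling window the Zone Aggregator maintains, the sketch below keeps per-zone event timestamps in a deque and evicts anything older than five minutes at snapshot time. The class name `RollingZoneCounter` and its methods are hypothetical, and the sketch assumes events arrive in timestamp order; a real Flink or Kafka Streams job would use the framework's windowed state rather than in-process memory.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Tuple

WINDOW_SEC = 300  # 5-minute rolling window, as described above

ZoneKey = Tuple[str, str]  # (zone_id, vehicle_class)


class RollingZoneCounter:
    """Counts events per zone over a sliding time window (illustrative only)."""

    def __init__(self, window_sec: int = WINDOW_SEC):
        self.window_sec = window_sec
        self._events: Dict[ZoneKey, Deque[int]] = defaultdict(deque)

    def record(self, key: ZoneKey, ts_sec: int) -> None:
        """Registers one event (e.g. a ride request) observed at ts_sec."""
        self._events[key].append(ts_sec)

    def snapshot(self, now_sec: int) -> Dict[ZoneKey, int]:
        """Evicts expired events and returns the live count per zone."""
        cutoff = now_sec - self.window_sec
        counts: Dict[ZoneKey, int] = {}
        for key, stamps in self._events.items():
            # Oldest events sit at the left end of the deque
            while stamps and stamps[0] <= cutoff:
                stamps.popleft()
            if stamps:
                counts[key] = len(stamps)
        return counts


counter = RollingZoneCounter()
counter.record(("zone_a", "sedan"), ts_sec=0)
counter.record(("zone_a", "sedan"), ts_sec=200)
counter.record(("zone_b", "sedan"), ts_sec=250)
print(counter.snapshot(now_sec=290))  # all three events are inside the window
print(counter.snapshot(now_sec=330))  # the event at t=0 has aged out
```

Calling `snapshot` every 30 seconds would give the periodic emission described above; the window length and the emission interval are independent knobs.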

The Surge Computation Service receives zone snapshots and applies the surge multiplier model. It reads the current multiplier from a versioned store, computes the new multiplier, applies smoothing to prevent oscillation, enforces city-specific caps, and writes the new price to the Surge Price Store with a monotonically increasing version number.

The Surge Price Store keeps one Redis hash per (zone_id, vehicle_class) pair, under a key of the form surge:{zone_id}:{vehicle_class}. Every write includes a version number. The version is the source of truth for consistency checks during booking.

The Price Fan-out Service subscribes to price update events and pushes delta updates to all connected rider apps via WebSocket. It uses a publish-subscribe topology - one writer, many readers - to handle the fan-out from a single zone update to 50,000 connected clients without O(N) writes in the computation path.

The Price Lock Service handles the critical consistency requirement. When a rider opens the fare estimate screen to start a booking, it records the current zone price and version into a short-lived reservation. When the booking completes, the fare service reads the locked price from this reservation rather than the current zone price.

Key Insight

The surge engine is fundamentally a read-heavy broadcast system. A single price update for one zone triggers reads by thousands of clients. The architecture must decouple write throughput (30-second updates for 10,000 zones) from read throughput (50,000 clients checking prices continuously).

Zone Aggregation and H3 Spatial Hierarchy

Zone-based demand aggregation is the foundation of the entire system. A naive approach assigns each driver and each ride request to a fixed rectangular grid cell. H3 hexagonal cells are better for three reasons: uniform neighbor distances (each hexagon’s center is roughly the same distance from all 6 neighbors, whereas a square’s diagonal neighbors are farther away than its edge neighbors), a regular hierarchy (every hexagonal cell has exactly 7 children at the next finer resolution, although the children only approximately cover the parent’s area), and accurate ring queries.

Figure: Zone aggregation data flow from driver pings through H3 cells to surge computation

The aggregation challenge is that drivers near zone boundaries should contribute to both the zone they are in and their adjacent zones. A driver waiting at the edge of Zone A is effectively supply for both Zone A and Zone B. We handle this with boundary weighting - a driver within 500m of a zone boundary contributes 0.6 to their primary zone and 0.4 to the adjacent zone in the supply count.

import h3
from dataclasses import dataclass
from typing import Dict, Tuple
from collections import defaultdict

SURGE_H3_RESOLUTION = 7    # ~5 km² per cell
BOUNDARY_WEIGHT = 0.6      # primary zone weight for boundary drivers
BOUNDARY_THRESHOLD_M = 500 # meters from the cell boundary that count as "near the edge"

@dataclass
class ZoneMetrics:
    zone_id: str
    vehicle_class: str
    supply_count: float       # available drivers (weighted)
    demand_count: float       # open requests + forecast
    supply_demand_ratio: float
    timestamp_sec: int

def assign_driver_to_zones(
    driver_lat: float,
    driver_lon: float,
    vehicle_class: str
) -> Dict[str, float]:
    """
    Returns a dict of {zone_id: weight} for a driver's position.
    Most drivers are fully inside one zone (weight=1.0).
    Boundary drivers split their contribution.
    """
    primary_cell = h3.latlng_to_cell(driver_lat, driver_lon, SURGE_H3_RESOLUTION)
    # Approximate the distance to the cell edge from the distance to the cell center
    cell_center = h3.cell_to_latlng(primary_cell)
    dist_from_center = haversine_distance_m(
        driver_lat, driver_lon, cell_center[0], cell_center[1]
    )
    # Inradius (apothem) of a regular hexagon = edge length * sqrt(3)/2 (~0.866)
    cell_inradius_m = h3.average_hexagon_edge_length(SURGE_H3_RESOLUTION, unit="m") * 0.866

    if dist_from_center < cell_inradius_m - BOUNDARY_THRESHOLD_M:
        # Well inside zone - full weight
        return {primary_cell: 1.0}

    # Near boundary - split weight across neighbors
    # grid_disk may return a list depending on the h3-py version; normalize to a set
    neighbors = set(h3.grid_disk(primary_cell, 1)) - {primary_cell}
    # Find which neighbor is closest to the driver
    closest_neighbor = min(
        neighbors,
        key=lambda c: haversine_distance_m(
            driver_lat, driver_lon, *h3.cell_to_latlng(c)
        )
    )
    return {
        primary_cell: BOUNDARY_WEIGHT,
        closest_neighbor: 1.0 - BOUNDARY_WEIGHT
    }

def aggregate_zone_metrics(
    driver_positions: list,   # list of (driver_id, lat, lon, vehicle_class, status)
    ride_requests: list,      # list of (request_id, lat, lon, vehicle_class)
    demand_forecast: Dict[Tuple[str, str], float]  # {(zone_id, vehicle_class): predicted_demand}
) -> Dict[Tuple[str, str], ZoneMetrics]:
    """Aggregates supply and demand per (zone_id, vehicle_class)."""
    supply: Dict[Tuple, float] = defaultdict(float)
    demand: Dict[Tuple, float] = defaultdict(float)

    for _, lat, lon, vclass, status in driver_positions:
        if status != "AVAILABLE":
            continue
        zone_weights = assign_driver_to_zones(lat, lon, vclass)
        for zone_id, weight in zone_weights.items():
            supply[(zone_id, vclass)] += weight

    for _, lat, lon, vclass in ride_requests:
        cell = h3.latlng_to_cell(lat, lon, SURGE_H3_RESOLUTION)
        demand[(cell, vclass)] += 1.0

    # Add forecasted demand (ML model predictions for next 5 minutes)
    for (zone_id, vclass), forecasted in demand_forecast.items():
        demand[(zone_id, vclass)] += forecasted * 0.3  # 30% weight on forecast

    all_zones = set(supply.keys()) | set(demand.keys())
    import time
    ts = int(time.time())
    result = {}
    for (zone_id, vclass) in all_zones:
        s = supply.get((zone_id, vclass), 0.0)
        d = demand.get((zone_id, vclass), 0.0)
        ratio = s / max(d, 1.0)  # avoid division by zero
        result[(zone_id, vclass)] = ZoneMetrics(
            zone_id=zone_id,
            vehicle_class=vclass,
            supply_count=s,
            demand_count=d,
            supply_demand_ratio=ratio,
            timestamp_sec=ts
        )
    return result

def haversine_distance_m(lat1, lon1, lat2, lon2) -> float:
    """Returns distance in meters between two GPS coordinates."""
    import math
    R = 6371000
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi/2)**2 + math.cos(phi1)*math.cos(phi2)*math.sin(dlambda/2)**2
    return R * 2 * math.atan2(math.sqrt(a), math.sqrt(1-a))

Real World

Uber uses H3 at multiple resolutions simultaneously. Resolution 7 cells (~5 km²) are used for surge computation. Resolution 9 cells (~0.1 km²) are used for fine-grained driver positioning. Because every resolution 9 cell maps to exactly one resolution 7 ancestor, you can aggregate resolution 9 counts into resolution 7 zones with a simple sum - no geometry intersection required.
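The roll-up described above can be sketched as a plain group-by on the parent cell. The function `roll_up_counts` and the `parent_of` callback are names invented for this example, and the cell IDs in the demo are made up so the snippet runs without the h3 library; with h3-py v4 the callback would be `lambda c: h3.cell_to_parent(c, 7)`.

```python
from collections import defaultdict
from typing import Callable, Dict


def roll_up_counts(
    fine_counts: Dict[str, float],
    parent_of: Callable[[str], str],
) -> Dict[str, float]:
    """Sums fine-resolution cell counts into their coarser parent cells."""
    coarse: Dict[str, float] = defaultdict(float)
    for cell, count in fine_counts.items():
        coarse[parent_of(cell)] += count
    return dict(coarse)


# Stand-in parent lookup with made-up cell IDs, used only for this demo.
DEMO_PARENTS = {"r9_a": "r7_x", "r9_b": "r7_x", "r9_c": "r7_y"}

res9_driver_counts = {"r9_a": 3.0, "r9_b": 2.0, "r9_c": 4.0}
print(roll_up_counts(res9_driver_counts, DEMO_PARENTS.__getitem__))
# {'r7_x': 5.0, 'r7_y': 4.0}
```

The sum is exact in the index hierarchy even though child hexagons only approximately tile the parent's area, so a driver near a parent boundary may be attributed to a neighboring zone geometrically.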

Surge Multiplier Model and Oscillation Prevention

The naive surge formula is multiplier = f(demand / supply). The problem with naive reactive pricing is oscillation: a high price attracts more drivers, overshooting supply, which drops the price, drivers leave the zone, price spikes again. Left unchecked, this creates a sawtooth pattern that riders and drivers both find maddening.

Figure: Surge multiplier computation with smoothing and demand forecasting components

The solution is a rate-limited multiplier - the price can only change by a maximum delta per re-computation cycle - combined with an exponential moving average that smooths out momentary demand spikes.

from dataclasses import dataclass
from typing import Optional
import math

# Each tier is (upper_bound, lower_bound, multiplier) on the supply/demand ratio.
# A tier matches when lower_bound <= ratio < upper_bound.
SURGE_TIERS = [
    (float('inf'), 2.0, 1.0),    # ratio >= 2.0 (plenty of supply) -> 1.0x
    (2.0, 1.5, 1.1),             # ratio 1.5-2.0 -> 1.1x (mild)
    (1.5, 1.0, 1.2),             # ratio 1.0-1.5 -> 1.2x
    (1.0, 0.8, 1.4),             # ratio 0.8-1.0 -> 1.4x
    (0.8, 0.6, 1.7),             # ratio 0.6-0.8 -> 1.7x
    (0.6, 0.4, 2.1),             # ratio 0.4-0.6 -> 2.1x
    (0.4, 0.2, 3.0),             # ratio 0.2-0.4 -> 3.0x
    (0.2, 0.0, 5.0),             # ratio < 0.2   -> max cap
]

MAX_MULTIPLIER_DELTA_PER_CYCLE = 0.5  # max change per 30-second cycle
EMA_ALPHA = 0.4  # exponential moving average smoothing factor (0.4 = moderate smoothing)

@dataclass
class SurgeState:
    zone_id: str
    vehicle_class: str
    current_multiplier: float
    ema_ratio: float              # smoothed supply-demand ratio
    version: int                  # monotonically increasing
    city_max_multiplier: float    # city-specific regulatory cap

def compute_new_multiplier(
    metrics: ZoneMetrics,
    current_state: SurgeState
) -> SurgeState:
    """
    Applies EMA smoothing then looks up surge tier.
    Rate-limits the delta to prevent oscillation.
    """
    raw_ratio = metrics.supply_demand_ratio

    # Apply exponential moving average to smooth noisy signal
    smoothed_ratio = (
        EMA_ALPHA * raw_ratio +
        (1 - EMA_ALPHA) * current_state.ema_ratio
    )

    # Look up target multiplier from tier table
    target_multiplier = 1.0
    for upper, lower, mult in SURGE_TIERS:
        if lower <= smoothed_ratio < upper:
            target_multiplier = mult
            break

    # Apply city-specific cap (regulatory or business policy)
    target_multiplier = min(target_multiplier, current_state.city_max_multiplier)

    # Rate-limit the change - prevent jumps of more than 0.5x per cycle
    current = current_state.current_multiplier
    delta = target_multiplier - current
    if abs(delta) > MAX_MULTIPLIER_DELTA_PER_CYCLE:
        target_multiplier = current + math.copysign(MAX_MULTIPLIER_DELTA_PER_CYCLE, delta)

    # Round to nearest 0.1x to avoid awkward prices like 1.37x
    target_multiplier = round(target_multiplier * 10) / 10

    return SurgeState(
        zone_id=metrics.zone_id,
        vehicle_class=metrics.vehicle_class,
        current_multiplier=target_multiplier,
        ema_ratio=smoothed_ratio,
        version=current_state.version + 1,
        city_max_multiplier=current_state.city_max_multiplier
    )

def write_surge_to_store(state: SurgeState, redis_client) -> bool:
    """
    Writes new surge state using optimistic locking.
    Returns False if a concurrent writer updated the version first.
    """
    key = f"surge:{state.zone_id}:{state.vehicle_class}"
    expected_version = state.version - 1  # version before this update

    # Lua script: compare version, then write atomically
    lua = """
local current_ver = tonumber(redis.call('HGET', KEYS[1], 'version') or '0')
if current_ver ~= tonumber(ARGV[1]) then
    return 0
end
redis.call('HSET', KEYS[1],
    'multiplier', ARGV[2],
    'ema_ratio', ARGV[3],
    'version', ARGV[4],
    'updated_at', ARGV[5]
)
redis.call('PUBLISH', 'surge:updates', KEYS[1])
return 1
"""
    import time
    result = redis_client.eval(
        lua,
        1,
        key,
        expected_version,
        state.current_multiplier,
        state.ema_ratio,
        state.version,
        int(time.time())
    )
    return bool(result)
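To make the rate limit concrete, here is a small worked example of how the published multiplier walks toward a new target over successive cycles. It is a simplified sketch that isolates the delta clamp and the 0.1x rounding; EMA smoothing and city caps are deliberately left out, and the helper names `step_toward` and `cycles_to_reach` are made up for illustration.

```python
import math

MAX_DELTA = 0.5  # mirrors MAX_MULTIPLIER_DELTA_PER_CYCLE above


def step_toward(current: float, target: float, max_delta: float = MAX_DELTA) -> float:
    """Moves current toward target by at most max_delta, rounded to 0.1x."""
    delta = target - current
    if abs(delta) > max_delta:
        target = current + math.copysign(max_delta, delta)
    return round(target * 10) / 10


def cycles_to_reach(start: float, target: float, max_cycles: int = 20) -> list:
    """Returns the published multiplier after each 30-second cycle."""
    path, current = [], start
    for _ in range(max_cycles):
        if current == target:
            break
        current = step_toward(current, target)
        path.append(current)
    return path


print(cycles_to_reach(1.0, 3.0))  # [1.5, 2.0, 2.5, 3.0]
print(cycles_to_reach(3.0, 1.2))  # [2.5, 2.0, 1.5, 1.2]
```

With a 0.5x cap per 30-second cycle, a jump from 1.0x to 3.0x takes four cycles, or about two minutes, which is the damping that keeps a momentary spike from whipsawing the price.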

Watch Out

The EMA alpha parameter is critically important. A high alpha (0.8+) makes the price very reactive - good for matching supply quickly, but creates the oscillation problem. A low alpha (0.1) makes price changes very sluggish - drivers don’t respond fast enough and supply never recovers. In practice, Uber likely uses different alpha values for price-up (fast, to attract supply) vs. price-down (slow, to prevent immediate driver departure) events.
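One way to express the asymmetric idea is to pick alpha based on the direction the ratio is moving. The sketch below is an assumption about how this could look, not a description of Uber's implementation; the two alpha values are placeholders chosen only to show the effect.

```python
ALPHA_PRICE_UP = 0.6    # hypothetical: react quickly when supply tightens
ALPHA_PRICE_DOWN = 0.2  # hypothetical: relax slowly when supply recovers


def asymmetric_ema(prev_ema: float, raw_ratio: float) -> float:
    """EMA of the supply/demand ratio with a direction-dependent alpha.

    A falling ratio means scarcer supply, which pushes the price up,
    so that direction gets the larger (faster) alpha.
    """
    alpha = ALPHA_PRICE_UP if raw_ratio < prev_ema else ALPHA_PRICE_DOWN
    return alpha * raw_ratio + (1 - alpha) * prev_ema


tightening = asymmetric_ema(prev_ema=1.0, raw_ratio=0.5)  # supply drops
recovering = asymmetric_ema(prev_ema=0.5, raw_ratio=1.0)  # supply returns
print(round(tightening, 2), round(recovering, 2))  # 0.7 0.6
```

For the same size of swing, the smoothed ratio moves 0.3 toward scarcity but only 0.1 back toward abundance, so prices rise faster than they fall.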

Price Consistency and the Lock Service

The hardest distributed systems problem in surge pricing is not computing the multiplier - it is ensuring price consistency across the booking flow. Consider the timeline: a rider sees “1.4x surge” on their app at T=0. At T=5 seconds, surge drops to 1.0x. At T=10 seconds, the rider confirms the booking. Which price do they pay?

The product contract assumed in this design is: if you confirm within 60 seconds of seeing the quoted price, you pay the quoted price. This requires a price lock - a short-lived reservation that pins a specific multiplier version to a specific booking session.

package pricelock

import (
    "context"
    "fmt"
    "time"
    "github.com/redis/go-redis/v9"
)

const PriceLockTTL = 60 * time.Second

type PriceLock struct {
    LockID       string
    RiderID      string
    ZoneID       string
    VehicleClass string
    Multiplier   float64
    Version      int64
    LockedAt     time.Time
    ExpiresAt    time.Time
}

type PriceLockService struct {
    redis *redis.Client
}

// CreatePriceLock pins the current zone multiplier to this booking session.
// Called when the rider opens the fare estimate screen.
func (s *PriceLockService) CreatePriceLock(
    ctx context.Context,
    riderID, zoneID, vehicleClass string,
) (*PriceLock, error) {
    // Read current surge from store
    surgeKey := fmt.Sprintf("surge:%s:%s", zoneID, vehicleClass)
    vals, err := s.redis.HGetAll(ctx, surgeKey).Result()
    if err != nil {
        return nil, fmt.Errorf("reading surge: %w", err)
    }

    var multiplier float64 = 1.0
    var version int64 = 0
    if m, ok := vals["multiplier"]; ok {
        fmt.Sscanf(m, "%f", &multiplier)
    }
    if v, ok := vals["version"]; ok {
        fmt.Sscanf(v, "%d", &version)
    }

    now := time.Now()
    lock := &PriceLock{
        LockID:       fmt.Sprintf("%s:%s:%d", riderID, zoneID, now.UnixMilli()),
        RiderID:      riderID,
        ZoneID:       zoneID,
        VehicleClass: vehicleClass,
        Multiplier:   multiplier,
        Version:      version,
        LockedAt:     now,
        ExpiresAt:    now.Add(PriceLockTTL),
    }

    // Store the lock and its expiry in one MULTI/EXEC transaction so a crash
    // between the two commands cannot leave behind a lock that never expires.
    lockKey := fmt.Sprintf("pricelock:%s", lock.LockID)
    _, err = s.redis.TxPipelined(ctx, func(pipe redis.Pipeliner) error {
        pipe.HSet(ctx, lockKey,
            "rider_id", lock.RiderID,
            "zone_id", lock.ZoneID,
            "vehicle_class", lock.VehicleClass,
            "multiplier", lock.Multiplier,
            "version", lock.Version,
            "locked_at", lock.LockedAt.Unix(),
        )
        pipe.ExpireAt(ctx, lockKey, lock.ExpiresAt)
        return nil
    })
    if err != nil {
        return nil, fmt.Errorf("writing lock: %w", err)
    }

    return lock, nil
}

// GetLockedMultiplier retrieves the pinned price for a booking confirmation.
// Returns current price if lock expired or not found (fail-open behavior).
func (s *PriceLockService) GetLockedMultiplier(
    ctx context.Context,
    lockID, zoneID, vehicleClass string,
) (float64, error) {
    lockKey := fmt.Sprintf("pricelock:%s", lockID)
    vals, err := s.redis.HGetAll(ctx, lockKey).Result()
    if err != nil || len(vals) == 0 {
        // Lock expired or not found - fall back to current price
        return s.getCurrentMultiplier(ctx, zoneID, vehicleClass)
    }

    var multiplier float64 = 1.0
    if m, ok := vals["multiplier"]; ok {
        fmt.Sscanf(m, "%f", &multiplier)
    }
    return multiplier, nil
}

func (s *PriceLockService) getCurrentMultiplier(
    ctx context.Context, zoneID, vehicleClass string,
) (float64, error) {
    surgeKey := fmt.Sprintf("surge:%s:%s", zoneID, vehicleClass)
    m, err := s.redis.HGet(ctx, surgeKey, "multiplier").Float64()
    if err == redis.Nil {
        return 1.0, nil // No surge active
    }
    return m, err
}

Key Insight

The price lock is fail-open, not fail-closed. If the lock expires or Redis is unavailable, we fall back to the current price rather than refusing the booking. A refused booking is worse than a slightly inaccurate price. The 60-second TTL means locks are tiny in volume and memory - at 10,000 concurrent bookings globally, the lock store is under 10MB.
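The fail-open decision can be written as a small pure function, which makes it easy to test separately from Redis. This is a minimal Python sketch under assumed names (`LockRecord`, `resolve_multiplier`); it also returns a source tag so a fallback can be logged for later review.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

PRICE_LOCK_TTL_SEC = 60


@dataclass
class LockRecord:
    multiplier: float
    locked_at_sec: int


def resolve_multiplier(
    lock: Optional[LockRecord],
    current_multiplier: float,
    now_sec: int,
) -> Tuple[float, str]:
    """Returns (multiplier_to_charge, source) with fail-open semantics."""
    if lock is None:
        # Lock never existed, was evicted, or the lock store was unreachable
        return current_multiplier, "current:lock_missing"
    if now_sec - lock.locked_at_sec > PRICE_LOCK_TTL_SEC:
        return current_multiplier, "current:lock_expired"
    return lock.multiplier, "locked"


quote = LockRecord(multiplier=1.4, locked_at_sec=0)
print(resolve_multiplier(quote, current_multiplier=1.0, now_sec=10))
print(resolve_multiplier(quote, current_multiplier=1.0, now_sec=75))
print(resolve_multiplier(None, current_multiplier=1.8, now_sec=10))
```

The first call honors the quoted 1.4x even though the zone has dropped to 1.0x; the other two fall back to the current price and say why.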

Data Model

-- Zone configuration - defines the surge zones per city
CREATE TABLE surge_zones (
    zone_id         VARCHAR(20) PRIMARY KEY,   -- H3 cell ID at resolution 7
    city_id         VARCHAR(20) NOT NULL,
    zone_name       VARCHAR(100),
    max_multiplier  REAL NOT NULL DEFAULT 5.0, -- city regulatory cap
    is_active       BOOLEAN NOT NULL DEFAULT TRUE,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_surge_zones_city ON surge_zones (city_id) WHERE is_active = TRUE;

-- Surge history - append-only audit log of all multiplier changes
CREATE TABLE surge_history (
    id              BIGSERIAL,
    zone_id         VARCHAR(20) NOT NULL,
    vehicle_class   VARCHAR(10) NOT NULL,
    multiplier      REAL NOT NULL,
    supply_count    REAL NOT NULL,
    demand_count    REAL NOT NULL,
    ema_ratio       REAL NOT NULL,
    version         BIGINT NOT NULL,
    computed_at     TIMESTAMPTZ NOT NULL DEFAULT NOW()
) PARTITION BY RANGE (computed_at);

CREATE TABLE surge_history_2026_06 PARTITION OF surge_history
    FOR VALUES FROM ('2026-06-01') TO ('2026-07-01');

CREATE INDEX idx_surge_history_zone_time ON surge_history (zone_id, computed_at DESC);
CREATE INDEX idx_surge_history_version ON surge_history (zone_id, vehicle_class, version DESC);

-- Price lock audit (only locks that were actually used for a booking)
CREATE TABLE price_lock_usage (
    lock_id             VARCHAR(100) PRIMARY KEY,
    rider_id            UUID NOT NULL,
    zone_id             VARCHAR(20) NOT NULL,
    vehicle_class       VARCHAR(10) NOT NULL,
    locked_multiplier   REAL NOT NULL,
    locked_version      BIGINT NOT NULL,
    current_multiplier  REAL NOT NULL, -- multiplier at booking confirmation time
    trip_id             UUID,
    locked_at           TIMESTAMPTZ NOT NULL,
    used_at             TIMESTAMPTZ NOT NULL
);

CREATE INDEX idx_price_lock_rider ON price_lock_usage (rider_id, locked_at DESC);
CREATE INDEX idx_price_lock_trip ON price_lock_usage (trip_id);

-- Demand forecasts (written by ML pipeline, read by aggregation service)
CREATE TABLE demand_forecasts (
    forecast_id     BIGSERIAL PRIMARY KEY,
    zone_id         VARCHAR(20) NOT NULL,
    vehicle_class   VARCHAR(10) NOT NULL,
    horizon_minutes SMALLINT NOT NULL,  -- 5, 10, 15, 30 minute horizons
    predicted_demand REAL NOT NULL,
    confidence      REAL NOT NULL,      -- model confidence [0,1]
    model_version   VARCHAR(20) NOT NULL,
    generated_at    TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    valid_until     TIMESTAMPTZ NOT NULL
);

CREATE INDEX idx_demand_forecasts_zone_horizon ON demand_forecasts
    (zone_id, vehicle_class, horizon_minutes, valid_until DESC);
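
The schema above creates a single monthly partition by hand. A scheduled job would normally create upcoming partitions ahead of time; the helper below is a hypothetical sketch that only builds the DDL string and leaves execution to whatever migration tooling is in use.

```python
from datetime import date


def month_partition_ddl(year: int, month: int, table: str = "surge_history") -> str:
    """Builds the CREATE TABLE statement for one monthly range partition."""
    start = date(year, month, 1)
    # The first day of the following month is the exclusive upper bound
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    name = f"{table}_{start:%Y_%m}"
    return (
        f"CREATE TABLE IF NOT EXISTS {name} PARTITION OF {table}\n"
        f"    FOR VALUES FROM ('{start.isoformat()}') TO ('{end.isoformat()}');"
    )


print(month_partition_ddl(2026, 7))
print(month_partition_ddl(2026, 12))
```

Dropping partitions older than the retention window is the mirror image of this and is usually cheaper than row-level deletes.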

Key Algorithms and Protocols

The fan-out problem is where most surge pricing implementations struggle. When zone 84c2b-sg17-sedan updates from 1.4x to 1.8x, you need to push that update to every rider in or near that zone who has the app open. At 50,000 concurrent connections per metro, a naive broadcast is 50,000 WebSocket writes per zone update.

The solution is a differential publish pattern: instead of the computation service directly sending to clients, it publishes to a Redis pub/sub channel. Fan-out workers subscribe to the channel and maintain mappings of which clients are in which zones. Each fan-out worker handles ~1,000 connections and only sends updates to the clients whose zones changed.

import asyncio
import json
import redis.asyncio as aioredis
from typing import Dict, Set

class SurgeFanoutWorker:
    """
    One worker per ~1,000 WebSocket connections.
    Subscribes to Redis surge:updates channel and pushes deltas to relevant clients.
    """

    def __init__(self, redis_url: str, websocket_manager):
        self.redis = aioredis.from_url(redis_url)
        self.ws_manager = websocket_manager
        # client_id -> set of zone_ids they are currently subscribed to
        self.client_zones: Dict[str, Set[str]] = {}
        # zone_id -> set of client_ids watching that zone
        self.zone_clients: Dict[str, Set[str]] = {}

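    async def run(self):
        pubsub = self.redis.pubsub()
        # The channel name is exact, so use subscribe rather than psubscribe
        await pubsub.subscribe("surge:updates")

        async for message in pubsub.listen():
            if message["type"] != "message":
                continue
            # Payload arrives as bytes because the client does not set decode_responses
            data = message["data"]
            zone_key = data.decode() if isinstance(data, bytes) else data
            # format: "surge:{zone_id}:{vehicle_class}"
            parts = zone_key.split(":", 2)
            if len(parts) != 3:
                continue
            _, zone_id, vehicle_class = parts
            await self._push_zone_update(zone_id, vehicle_class)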

    async def _push_zone_update(self, zone_id: str, vehicle_class: str):
        # Fetch the new multiplier once, then push to all watching clients
        surge_key = f"surge:{zone_id}:{vehicle_class}"
        vals = await self.redis.hgetall(surge_key)
        if not vals:
            return

        update_payload = json.dumps({
            "type": "surge_update",
            "zone_id": zone_id,
            "vehicle_class": vehicle_class,
            "multiplier": float(vals.get(b"multiplier", b"1.0")),
            "version": int(vals.get(b"version", b"0"))
        })

        watching_clients = self.zone_clients.get(zone_id, set())
        # Fan out to all clients watching this zone
        await asyncio.gather(*[
            self.ws_manager.send(client_id, update_payload)
            for client_id in watching_clients
        ], return_exceptions=True)

    def register_client(self, client_id: str, zone_ids: Set[str]):
        """Called when a rider's app connects and shares their location."""
        self.client_zones[client_id] = zone_ids
        for zone_id in zone_ids:
            self.zone_clients.setdefault(zone_id, set()).add(client_id)

    def unregister_client(self, client_id: str):
        """Called when the WebSocket connection closes."""
        zones = self.client_zones.pop(client_id, set())
        for zone_id in zones:
            self.zone_clients.get(zone_id, set()).discard(client_id)

Real World

Lyft published details about their “Prime Time” system, which is structurally similar. They noted that the hardest operational problem was not computation but debugging pricing anomalies - why did zone X show 3x when adjacent zones were 1.0x? They built a “surge explainer” internal tool that shows the supply/demand time series, each computation cycle’s EMA ratio, and the tier it landed in. This dramatically reduced support escalation time from hours to minutes.

Scaling and Performance

# Capacity Estimation - Global Surge Pricing
# ===========================================

# Zone computation
metros                    = 70
zones_per_metro           = 100
vehicle_classes           = 4
total_zone_class_combos   = 70 * 100 * 4 = 28,000
computation_interval_sec  = 30
computations_per_sec      = 28,000 / 30 = ~933 computations/sec

# Each computation is cheap - ~1ms including EMA + tier lookup + Redis write
computation_workers       = 10  # easily handles 933/sec with headroom

# Redis surge store
keys_per_metro            = 100 * 4 = 400 keys
total_keys                = 70 * 400 = 28,000 keys
redis_memory_per_key      = ~200 bytes (hash with 5 fields)
total_redis_memory        = 28,000 * 200B = ~5.6 MB  # trivially fits

# Fan-out (riders in active surge zones)
concurrent_rider_sessions = 50,000 per metro
surge_active_zones_pct    = 20%  # typically 20% of zones have surge > 1.0x
riders_in_surge_zones     = 50,000 * 20% = 10,000 per metro
update_events_per_sec     = 933 zone updates/sec (not all result in pub events)
# Only publish when multiplier changes - typically 10-20% of computations
actual_pub_events_per_sec = 933 * 15% = ~140 events/sec globally
fan_out_workers_per_metro  = 50,000 / 1,000 = 50 (each handles ~1,000 connections)

# Price lock store
concurrent_booking_flows  = ~5,000 globally (~0.15% of the ~3.5M connected riders)
locks_in_redis            = 5,000 keys * 300 bytes = ~1.5 MB
lock_write_rate           = 5,000 / 60 = ~83 writes/sec  # trivial

# PostgreSQL surge history
writes_per_sec            = 933 (one row per computation)
rows_per_day              = 933 * 86,400 = ~80 million
storage_per_row           = ~100 bytes
storage_per_day           = ~8 GB (before compression)
90_day_retention          = ~720 GB raw, ~180 GB assuming ~4x compression (e.g., LZ4)
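
The core arithmetic above is easy to reproduce and re-run with different inputs. The snippet below is a back-of-envelope calculator using the same assumptions (70 metros, 100 zones each, 4 vehicle classes, 30-second cycles); the byte sizes are the rough per-key and per-row guesses from the estimate, not measurements.

```python
METROS = 70
ZONES_PER_METRO = 100
VEHICLE_CLASSES = 4
INTERVAL_SEC = 30
BYTES_PER_SURGE_KEY = 200     # rough size of one Redis hash
BYTES_PER_HISTORY_ROW = 100   # rough size of one surge_history row
RETENTION_DAYS = 90

combos = METROS * ZONES_PER_METRO * VEHICLE_CLASSES
computations_per_sec = combos / INTERVAL_SEC
redis_mb = combos * BYTES_PER_SURGE_KEY / 1e6
rows_per_day = computations_per_sec * 86_400
history_gb_per_day = rows_per_day * BYTES_PER_HISTORY_ROW / 1e9
history_gb_retained = history_gb_per_day * RETENTION_DAYS

print(f"zone-class combos:      {combos:,}")
print(f"computations per sec:   {computations_per_sec:,.0f}")
print(f"Redis surge store:      {redis_mb:.1f} MB")
print(f"history rows per day:   {rows_per_day:,.0f}")
print(f"history per day:        {history_gb_per_day:.1f} GB")
print(f"history at {RETENTION_DAYS} days:     {history_gb_retained:,.0f} GB")
```

The results land within rounding of the figures above; small differences come from the estimate rounding 933.3 down to 933 before multiplying.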

Figure: Surge pricing scaling diagram showing Kafka stream processing to zone sharding

Key Insight

The surge computation load is dominated by the stream aggregation (reading driver and request events) rather than the multiplier math. The aggregation job needs to maintain rolling window state for 28,000 zone-class pairs. Flink handles this natively with keyed state windows. If you use Kafka Streams, partition your topics by (metro_id + zone_id) to ensure all updates for a zone land on the same partition and state is co-located.
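The co-location property comes down to deriving the partition from a key that contains only the metro and zone. The sketch below illustrates the idea with CRC32 and a made-up partition count; it is not Kafka's partitioner, whose default hashes the serialized key with murmur2. In practice you would set the message key to the same composite string and let the producer do the hashing.

```python
import zlib

NUM_PARTITIONS = 64  # hypothetical topic partition count


def zone_partition(metro_id: str, zone_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Maps (metro_id, zone_id) to a stable partition number.

    CRC32 is used only because it is deterministic across processes;
    vehicle class is left out of the key on purpose so that every class
    for a zone lands on the same partition.
    """
    key = f"{metro_id}:{zone_id}".encode("utf-8")
    return zlib.crc32(key) % num_partitions


driver_event_partition = zone_partition("sf", "zone_123")
request_event_partition = zone_partition("sf", "zone_123")
print(driver_event_partition == request_event_partition)  # True
```

Both the driver location topic and the ride request topic need the same key scheme and the same partition count for the join to stay local.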

Failure Modes and Recovery

Failure	Detection	Impact	Recovery
Zone aggregator lag (Kafka backpressure)	Consumer lag > 5 min	Surge prices become stale - computed from old data	Scale up aggregator replicas; stale prices are safe (conservative), not dangerous
Redis surge store unavailable	Health check + write errors	Surge computation writes fail; clients stop getting updates	In-memory fallback: computation service caches last-known prices; serves from cache for up to 5 minutes
Fan-out worker crash	Pod health check fails	Clients in crashed worker’s shard stop receiving updates	Kubernetes restarts pod in ~30s; clients fall back to polling `/surge/zone/{zone_id}` endpoint
Price lock Redis unavailable	HGet returns error	Riders cannot lock prices; booking uses current price	Fail-open: use current price. Log the incident for potential refund processing if price increased significantly
Demand forecast service down	API call timeout	Aggregation runs without forecast component	Proceed with reactive-only pricing (supply/demand actuals only); oscillation risk increases
Surge computation service OOM	Process crash	Zone prices not updated for 30-60 seconds	Prices remain at last-computed value; stale but not dangerous. Service restarts from Redis state
Incorrect multiplier published (bug)	Anomaly detection: multiplier exceeds the configured city cap (5x by default)	Riders charged incorrectly	Rollback: write version N+1 with corrected multiplier; refund price_lock_usage rows where locked_version maps to bad computation

Watch Out

The rollback scenario is the most operationally complex. If a bug causes a 10x surge to be published for 30 seconds before detection, you have riders who locked that price and drivers who saw the high price and repositioned. The rollback must: (1) publish the correct price, (2) identify all price_lock_usage rows with the bad version, (3) process refunds, and (4) do this without re-triggering the bug. Version numbers in the surge history table make this forensic query straightforward.
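Step (2) of that rollback is essentially a filter over the price lock audit rows. The sketch below shows the selection logic on in-memory dictionaries whose keys mirror the `price_lock_usage` columns; the function name and the `overcharge_multiplier` field are invented for this example, and turning the result into a refund amount would still need the fare service.

```python
from typing import Dict, List


def find_affected_locks(
    lock_rows: List[Dict],
    zone_id: str,
    vehicle_class: str,
    bad_versions: range,
    corrected_multiplier: float,
) -> List[Dict]:
    """Returns refund candidates: used locks pinned to a bad surge version."""
    affected = []
    for row in lock_rows:
        if (
            row["zone_id"] == zone_id
            and row["vehicle_class"] == vehicle_class
            and row["locked_version"] in bad_versions
            and row["locked_multiplier"] > corrected_multiplier
        ):
            affected.append({
                "lock_id": row["lock_id"],
                "trip_id": row["trip_id"],
                "overcharge_multiplier": round(
                    row["locked_multiplier"] - corrected_multiplier, 2
                ),
            })
    return affected


rows = [
    {"lock_id": "l1", "trip_id": "t1", "zone_id": "z1", "vehicle_class": "sedan",
     "locked_version": 41, "locked_multiplier": 1.4},
    {"lock_id": "l2", "trip_id": "t2", "zone_id": "z1", "vehicle_class": "sedan",
     "locked_version": 42, "locked_multiplier": 10.0},
    {"lock_id": "l3", "trip_id": "t3", "zone_id": "z2", "vehicle_class": "sedan",
     "locked_version": 42, "locked_multiplier": 10.0},
]
# Versions 42 and 43 of zone z1 were produced by the buggy computation
print(find_affected_locks(rows, "z1", "sedan", range(42, 44), corrected_multiplier=1.4))
```

Only `l2` is returned: `l1` was locked before the bad version and `l3` belongs to a different zone.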

Comparison of Approaches

Approach	Update Latency	Consistency	Operational Complexity	Scale
Database polling (cron every 30s)	30s avg, 60s worst	Strong via transactions	Low - simple SQL queries	Poor - O(N) reads per client
Redis pub/sub fan-out (this design)	1-5s P95	Price lock via Redis TTL	Medium - pub/sub + WebSocket management	High - O(1) write, O(N) fan-out distributed across workers
Server-sent events (SSE) + CDN	5-15s (CDN cache TTL)	Eventual - CDN may serve stale	Low - CDN handles fan-out	High - CDN scales to millions
Kafka consumer per client	Sub-second	Strong - Kafka offsets	Very high - millions of partitions	Poor - Kafka not designed for per-client partitions
GraphQL subscriptions	1-3s	Application-level	High - resolver complexity	Medium - subscription servers are stateful
gRPC bidirectional streaming	0.5-2s	Strong with sequence numbers	High - gRPC server multiplexing	High - but harder to load balance

Real World

DoorDash’s dynamic pricing system for delivery fees uses a similar zone-based approach with one key difference: they apply demand forecasting much more aggressively (60% weight on predictions vs. Uber’s estimated 30%) because food delivery demand is more predictable than ride demand. A lunch rush at noon is extremely predictable; a rainstorm that triggers simultaneous ride demand is not. The right forecast weight depends heavily on demand pattern regularity in your specific marketplace.

Key Takeaways

H3 hexagonal cells are superior to rectangular grids for zone-based aggregation: uniform neighbor distances, clean hierarchical containment, and well-distributed cell sizes globally.
EMA smoothing prevents oscillation: a reactive surge system without smoothing creates a feedback loop where high prices attract supply, overshoot, prices crash, supply leaves, prices spike. Rate-limiting the per-cycle delta adds a second layer of damping.
Price consistency is a distributed systems problem, not a math problem: the surge multiplier formula is trivial. Ensuring riders pay the price they were shown requires a price lock pattern with TTL-backed reservations.
Fan-out must be decoupled from computation: the computation path writes once; the fan-out path reads many. Redis pub/sub with dedicated fan-out workers scales the broadcast independently from the pricing logic.
Fail-open on price locks: refusing a booking because the lock service is down is worse than charging a slightly different price. Log the discrepancy for potential retroactive refunds.
Versioned writes prevent race conditions in computation: two computation workers running concurrent re-evaluations must use optimistic locking (compare-and-swap on version number) to prevent one overwriting the other’s result with stale data.
Demand forecasting shifts you from reactive to proactive pricing: pure supply/demand ratio is always lagging by one cycle. Blending in short-horizon demand forecasts lets prices start rising before the full demand wave hits.
Separate the audit log from the hot store: Redis for fast reads; PostgreSQL with time-range partitioning for forensic queries, refund processing, and regulatory reporting.
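The two damping layers from the smoothing takeaway can be sketched in a few lines of Python. The smoothing factor `alpha=0.3` and the per-cycle cap `max_delta=0.5` are illustrative assumptions, not tuned values.

```python
def ema(previous, sample, alpha=0.3):
    """Exponential moving average: alpha weights the newest sample."""
    return alpha * sample + (1.0 - alpha) * previous


def rate_limit(previous_multiplier, target_multiplier, max_delta=0.5):
    """Move toward the target, but by at most max_delta per cycle."""
    delta = target_multiplier - previous_multiplier
    delta = max(-max_delta, min(max_delta, delta))
    return previous_multiplier + delta


# A sudden imbalance: the target jumps to 3.0x while the published price is 1.0x.
published = 1.0
steps = []
for _ in range(5):
    published = rate_limit(published, 3.0)
    steps.append(published)
# steps climbs 1.5, 2.0, 2.5, 3.0 and then holds at 3.0
```

The EMA damps noise in the input ratio, while the clamp bounds how far the published output can move in one cycle, so a single bad sample cannot produce a large price swing.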

Surge pricing is ultimately a real-time control loop operating on a two-sided market. The system’s job is not to maximize price but to maintain marketplace liquidity - enough drivers to serve riders, prices high enough to attract drivers to undersupplied zones, prices communicated clearly enough that riders can make informed decisions. When it works, nobody thinks about it. When it breaks - wrong price charged, oscillating multipliers, or stale data shown to riders - it generates more support tickets than almost any other feature.

Frequently Asked Questions

Why use Redis for the surge store instead of a time-series database like InfluxDB?

The surge store needs two things a TSDB doesn’t provide well: key-value access patterns (look up current surge for a specific zone in under 1ms) and atomic compare-and-swap for conflict-free concurrent writes. InfluxDB and similar TSDBs are optimized for range queries and aggregations over time, not point lookups with atomic writes. We use PostgreSQL as the time-series audit log for historical analysis - a TSDB would be a valid alternative there, but Redis is specifically the right tool for the hot current-value store.
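The compare-and-swap semantics can be illustrated with an in-memory stand-in. This sketch is not Redis client code: the `SurgeStore` class is hypothetical, and a lock plays the role that a Redis transaction (`WATCH`/`MULTI`/`EXEC`) or a Lua script would play in a real deployment.

```python
import threading


class SurgeStore:
    """In-memory stand-in for the hot surge store (hypothetical class)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}  # zone_id -> (version, multiplier)

    def read(self, zone_id):
        """Return (version, multiplier); unknown zones start at version 0, 1.0x."""
        with self._lock:
            return self._data.get(zone_id, (0, 1.0))

    def compare_and_swap(self, zone_id, expected_version, multiplier):
        """Write only if nobody else has written since expected_version was read."""
        with self._lock:
            current_version, _ = self._data.get(zone_id, (0, 1.0))
            if current_version != expected_version:
                return False  # stale writer: re-read and recompute instead
            self._data[zone_id] = (expected_version + 1, multiplier)
            return True


store = SurgeStore()
version, _ = store.read("zone-a")
first = store.compare_and_swap("zone-a", version, 1.4)   # succeeds
second = store.compare_and_swap("zone-a", version, 1.1)  # rejected: version moved on
```

The second worker's write is rejected rather than silently overwriting the first, which is the property a plain `SET` cannot give you.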

Why not use a global multiplier instead of per-zone multipliers?

A global multiplier would cause massive over-pricing in well-served areas and under-pricing in genuinely imbalanced zones. Manhattan at 5pm has extreme variation by neighborhood - Midtown might be 2.5x while the Upper West Side is 1.0x. A global average of 1.7x would simultaneously annoy riders in the UWS and fail to attract drivers to Midtown. Zone granularity is the whole point.

What prevents a clever user from spoofing their GPS to stay in a low-surge zone and then riding from a high-surge origin?

The surge multiplier is applied at booking confirmation time, not at pickup time. The booking captures the pickup coordinates from the ride request, looks up the zone for those coordinates, and applies the current (or locked) multiplier for that zone. GPS spoofing the app’s displayed location doesn’t change the pickup coordinates in the backend request - those must match the user’s verified GPS signal, which the app reports separately from the display location.

How do you handle zone boundary effects where a rider’s app shows different surge than what the backend computes?

The rider’s app shows the surge for their current GPS position. The booking backend uses the pickup coordinates. If a rider walks 50 meters across a zone boundary between opening the app and confirming the ride, they may see a different zone’s price. This is by design - the price applies to where you are picked up, not where you opened the app. The 60-second price lock helps: if the rider confirms quickly, their displayed price is locked and applied to the booking, even if their current GPS drifts into an adjacent zone.
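The lock behavior described here can be sketched as a small TTL map. This is an in-memory illustration with hypothetical names (`PriceLocks`, `multiplier_for_booking`); a production version would rely on the key expiry of the backing store rather than comparing timestamps by hand.

```python
import time

LOCK_TTL_SECONDS = 60  # the 60-second price lock from the text


class PriceLocks:
    """Remembers the price a rider was shown, for a limited time."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock  # injectable so the expiry logic can be tested
        self._locks = {}     # rider_id -> (multiplier, version, expires_at)

    def lock(self, rider_id, multiplier, version):
        """Record the multiplier displayed to the rider."""
        expires_at = self._clock() + LOCK_TTL_SECONDS
        self._locks[rider_id] = (multiplier, version, expires_at)

    def multiplier_for_booking(self, rider_id, current_multiplier):
        """Use the locked price if still valid, else the pickup zone's current price."""
        entry = self._locks.get(rider_id)
        if entry is not None and self._clock() < entry[2]:
            return entry[0]
        return current_multiplier
```

The lock is keyed by rider rather than by zone, which is why a small GPS drift across a boundary within the TTL does not change the price applied to the booking.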

How would you add machine learning to improve surge accuracy?

Two integration points: (1) replace the tier lookup table with an ML model that predicts the optimal multiplier for a given zone given supply, demand, time of day, weather, and nearby event data; (2) replace or augment the simple demand forecast with a deep learning model (LSTM or Transformer) trained on historical demand patterns. The integration point with the system is clean - the compute service just needs to call a model serving endpoint instead of a table lookup. The hard part is training feedback: you need to measure whether the multiplier actually attracted more drivers, which requires counterfactual reasoning.

How do you test surge pricing changes safely in production?

Shadow mode and geographic canaries. In shadow mode, the new pricing model computes multipliers in parallel with the existing model but doesn’t apply them - you can compare outputs and measure discrepancies before going live. Geographic canary means you apply the new model to a small set of low-traffic zones (e.g., 5 zones in a single city) for a week, measure marketplace health metrics (match rate, driver earnings, rider cancel rate), and only roll out if metrics hold. Rollback is a single config flag that switches the computation service back to the old model.

Interview Questions

Walk through the full lifecycle of a surge event - concert ends, 10,000 people simultaneously request rides. What happens to the system?

Expected depth: Describe the ride request flood hitting the matching engine queue. The queue depth spikes, which the zone aggregator detects as a demand spike in the concert venue H3 cells. Within 30 seconds the surge computation cycle fires, detects a supply/demand ratio of ~0.1 (roughly 1 driver per 10 requests), and computes a target multiplier of 3.0x. Because the per-cycle rate limit caps the increase, the published price rises from the previous 1.0x to at most 1.5x in the first cycle. The engine publishes the new price via Redis pub/sub, and fan-out workers push it to all connected riders. Some riders cancel on seeing 1.5x. The high price signal reaches nearby drivers in adjacent zones, who reposition. Over the next 3-5 computation cycles (90-150 seconds), supply recovers and the price settles back toward 1.0x.

How would you detect that a bug in the surge computation code published incorrect prices globally?

Expected depth: Describe the anomaly detection layer that monitors the published multiplier stream. Alert thresholds: any zone > city_max_multiplier, any zone jumping more than 1.5x in a single cycle, more than 5% of zones in a metro simultaneously changing by > 0.5x. Each of these would fire a PagerDuty alert. The fix path: publish a correction version (version N+1 with correct multipliers) immediately, then audit the surge_history table to identify all riders who locked prices during the bad window, and process refunds for any overcharges.
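The three alert rules can be sketched as a single pass over one metro's published multipliers. The function name and alert labels are hypothetical, and the "more than 1.5x in a single cycle" rule is interpreted here as an absolute change of 1.5 in the multiplier, which is an assumption.

```python
def detect_anomalies(previous, current, city_max_multiplier,
                     max_jump=1.5, bulk_delta=0.5, bulk_fraction=0.05):
    """Compare two cycles of {zone_id: multiplier} and return a list of alerts."""
    alerts = []
    changed_zones = 0
    for zone_id, new_value in current.items():
        old_value = previous.get(zone_id, 1.0)  # unseen zones assumed to be at 1.0x
        change = abs(new_value - old_value)
        if new_value > city_max_multiplier:
            alerts.append(("above_city_max", zone_id))
        if change > max_jump:
            alerts.append(("single_cycle_jump", zone_id))
        if change > bulk_delta:
            changed_zones += 1
    # Metro-wide rule: too many zones moved at once.
    if current and changed_zones / len(current) > bulk_fraction:
        alerts.append(("bulk_change", None))
    return alerts
```

In practice each returned alert would be routed to the paging system; the sketch stops at producing the list so the rules can be unit-tested in isolation.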

Design the demand forecasting component that feeds into the aggregation service.

Expected depth: Discuss the features: time of day, day of week, historical demand for same zone and time, weather API integration, event calendar (concerts, games, conferences), school calendar, public transit disruption feed. Model architecture options: gradient boosting (LightGBM) for interpretability and fast inference, or LSTM/Transformer for sequence modeling. Discuss the serving infrastructure: model server (TF Serving or Triton), prediction cache keyed by (zone_id, horizon_minutes, time_bucket), batch predictions generated every 5 minutes for all zones and stored in Redis. Latency budget for forecast lookup: under 5ms to avoid delaying the aggregation cycle.
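The prediction cache key can be sketched as follows. The key format, the `forecast:` prefix, and the plain dict standing in for Redis are assumptions; the 5-minute bucket matches the batch cadence described above.

```python
BUCKET_SECONDS = 300  # batch predictions are regenerated every 5 minutes


def forecast_key(zone_id, horizon_minutes, epoch_seconds):
    """Build a cache key from (zone_id, horizon_minutes, time_bucket)."""
    time_bucket = int(epoch_seconds) // BUCKET_SECONDS
    return f"forecast:{zone_id}:{horizon_minutes}:{time_bucket}"


def lookup_forecast(cache, zone_id, horizon_minutes, epoch_seconds, fallback):
    """Return the cached prediction, or a fallback if the batch job is late."""
    key = forecast_key(zone_id, horizon_minutes, epoch_seconds)
    return cache.get(key, fallback)


# A dict stands in for the real cache in this sketch.
cache = {forecast_key("zone-a", 15, 600): 42.0}
hit = lookup_forecast(cache, "zone-a", 15, 899, fallback=0.0)   # same 5-minute bucket
miss = lookup_forecast(cache, "zone-a", 15, 900, fallback=0.0)  # next bucket, not yet written
```

Because the aggregation cycle only performs a key lookup, the under-5ms latency budget depends on the cache, not on model inference time.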

How would you implement a “soft surge warning” that tells riders the price is about to increase, before it actually increases?

Expected depth: This requires a leading indicator. Options: (1) publish a “surge imminent” flag when the smoothed_ratio drops below 0.8 (current threshold for 1.1x surge) even before the next computation cycle fires; (2) use the demand forecast to predict that the next cycle will trigger a tier change and pre-warn riders 30-60 seconds early. The price lock should apply even to the warning state - if a rider locks on the “1.0x, surge imminent” warning and confirms within 60 seconds, they should get 1.0x even if the cycle fires and moves to 1.2x during that window.
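Both leading indicators can be combined into one predicate. This is a minimal sketch: the function name and the `forecast_next_multiplier` parameter are hypothetical, and the 0.8 warning threshold is taken from option (1) above.

```python
def surge_imminent(smoothed_ratio, current_multiplier,
                   forecast_next_multiplier=None, warn_ratio=0.8):
    """Return True if riders should be warned that the price is about to rise."""
    # Option 1: the smoothed supply/demand ratio has already crossed the threshold.
    ratio_signal = smoothed_ratio < warn_ratio
    # Option 2: the forecast expects the next cycle to land on a higher multiplier.
    forecast_signal = (
        forecast_next_multiplier is not None
        and forecast_next_multiplier > current_multiplier
    )
    return ratio_signal or forecast_signal
```

The flag is advisory only: the price shown alongside it is still the current multiplier, which is the value a price lock would capture.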

What would change if you needed to support per-rider personalized pricing (loyalty discounts applied on top of surge)?

Expected depth: The surge multiplier is a zone-level signal, not a rider-level one. Personalized discounts are a rider-level overlay. The architecture change: the price lock service reads the zone multiplier AND the rider’s discount profile from a separate Loyalty Service. The locked price becomes: base_fare * zone_multiplier * loyalty_discount, where loyalty_discount is a multiplicative factor (for example, 0.9 for 10% off). The surge engine itself doesn’t change - it still computes zone-level multipliers. The fan-out to riders shows the blended price (not the raw multiplier), so different riders may see different prices for the same zone. The WebSocket message must therefore carry the personalized fare estimate rather than the raw multiplier, which means the fan-out service must call the Loyalty Service per client - a significant architectural change from broadcasting a single multiplier value to all zone watchers.
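The shift from one broadcast value to per-client messages can be sketched as below. The function names are hypothetical, the Loyalty Service is represented by a plain callable, and `loyalty_discount` is treated as a multiplicative factor (0.9 meaning 10% off), which is an assumption about how the discount is encoded.

```python
def personalized_fare(base_fare, zone_multiplier, loyalty_discount):
    """Locked price = base_fare * zone_multiplier * loyalty_discount (a factor)."""
    return round(base_fare * zone_multiplier * loyalty_discount, 2)


def fanout_messages(base_fare, zone_multiplier, watchers, discount_for):
    """Build one message per rider watching the zone.

    discount_for stands in for a per-client Loyalty Service call.
    """
    return {
        rider_id: personalized_fare(base_fare, zone_multiplier, discount_for(rider_id))
        for rider_id in watchers
    }


# Two riders in the same zone see different fares for the same 1.5x surge.
discounts = {"gold-rider": 0.9}
messages = fanout_messages(
    base_fare=10.0,
    zone_multiplier=1.5,
    watchers=["gold-rider", "new-rider"],
    discount_for=lambda rider_id: discounts.get(rider_id, 1.0),
)
```

The loop over `watchers` is the cost being described: work that used to be one publish per zone becomes one lookup and one message per connected rider.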
