Build Google Maps ETA Prediction Under Live Traffic
distributed-systems performance scalability
System Design Deep Dive
Google Maps ETA Prediction
Merging a planet-scale road graph with live probe data from millions of phones - without missing your exit
Think of a city’s road network as a living, breathing organism. The arteries - highways and major roads - carry the bulk of traffic. The capillaries - residential streets and alleys - handle local flow. And every few seconds, millions of phones whisper their GPS coordinates back to Google’s servers, collectively painting a real-time picture of where traffic is moving and where it has stopped. Your job, as the engineer building this system, is to take that firehose of location data, fuse it with a road graph that covers every paved surface on Earth, and answer one deceptively simple question: “How long will it take me to get there?”
The naive answer is to run Dijkstra’s algorithm on a static road graph using posted speed limits. That approach works at 2am on an empty freeway. It catastrophically fails at 5pm on a Tuesday when three lanes of the 101 merge into one because of a fender-bender. Real navigation requires two things a static graph cannot provide: knowledge of current conditions (the accident is slowing traffic to 8 mph) and knowledge of how conditions will change as the driver travels (the accident will clear by the time they arrive in 40 minutes). The ETA must predict the future state of roads along a route, not just their current state.
Scale makes this incomparably harder. Google Maps serves over a billion users. On a busy morning, tens of millions of simultaneous navigating sessions each require not just one ETA computation at trip start, but continuous recomputation - every 30 seconds or whenever a new traffic event is detected along the route. At the same time, hundreds of millions of additional devices (phones in pockets, Android Auto in dashboards) are passively reporting their GPS traces as probe vehicle data, feeding the traffic model that every ETA depends on. The write path (ingesting probe data) and the read path (computing ETAs) share the same underlying road graph representation and must stay in sync without either blocking the other.
The core architectural decisions we need to solve for simultaneously: how to represent a planetary road network as a graph that supports sub-200ms shortest-path queries; how to overlay real-time traffic speeds onto that graph with freshness measured in seconds; how to incorporate historical traffic patterns so we predict not just current speeds but future speeds along the route; and how to trigger rerouting exactly when staying on the current route is no longer the fastest option - without alert-fatiguing drivers with constant reroute suggestions.
Requirements and Constraints
Functional Requirements
- Compute route ETA between any two points on Earth incorporating live traffic conditions
- Serve ETA predictions within 200ms end-to-end (p99)
- Recalculate ETA every 30 seconds during active navigation and immediately on traffic events
- Trigger reroute suggestions when an alternative route saves at least 2 minutes
- Incorporate probe vehicle data from GPS-reporting devices within 60 seconds of the event
- Support ETA confidence intervals (not just a point estimate) to communicate uncertainty
- Provide historical traffic patterns by time-of-day and day-of-week as baseline fallback
Non-Functional Requirements
- 50 million concurrent active navigation sessions
- 500 million probe location pings per minute from passive devices
- Road graph coverage: 220+ countries, approximately 2 billion road segments
- ETA accuracy: within 10% of actual travel time for 80% of predictions
- Traffic data freshness: under 60 seconds from event to ETA incorporation
- System availability: 99.99% (less than 53 minutes downtime per year)
- Write throughput for probe ingestion: 8 million events per second at peak
Constraints and Assumptions
- We focus on driving ETA only (not walking, cycling, or transit)
- We assume GPS accuracy of 5-15 meters from probe devices
- Road graph topology changes (new roads, closures) are handled via a separate map editing pipeline
- Traffic incident reports (accidents, construction) arrive via a separate incident feed
- We assume devices report location every 3-5 seconds while navigating
High-Level Architecture
The system decomposes into five major subsystems: the Road Graph Store, the Probe Ingestion Pipeline, the Traffic Speed Layer, the ETA Computation Service, and the Route Monitoring and Reroute Engine.
The Road Graph Store holds the static topology of the road network - nodes (intersections), edges (road segments), and their attributes (speed limit, road class, number of lanes). This graph is pre-partitioned by geographic region and loaded into memory on ETA computation servers. It changes rarely (new roads, closures) and is updated via a separate offline pipeline.
The Probe Ingestion Pipeline accepts a firehose of GPS location pings from navigating devices. These raw coordinates are map-matched to specific road segments using a hidden Markov model, aggregating into observed speed readings per segment. The pipeline runs on a distributed stream processor (think Flink or a proprietary equivalent) and writes speed observations into a time-windowed buffer.
The Traffic Speed Layer is a low-latency key-value store mapping road segment IDs to their current observed speed (and confidence). It is continuously updated by the ingestion pipeline and read by the ETA computation service on every query. The freshness SLA here is the dominant constraint: a segment’s speed must reflect observations less than 60 seconds old.
The ETA Computation Service receives a route request, loads the relevant graph partition into memory, overlays the live speed layer onto edge weights, runs a modified shortest-path algorithm (A* with live weights), and returns an ETA with a confidence interval. This is the hot path and must complete in under 150ms to leave budget for network overhead.
The Route Monitoring and Reroute Engine runs as a background process for each active navigation session. Every 30 seconds, it re-evaluates the current route against alternatives. If an alternative saves more than the reroute threshold (default 2 minutes), it pushes a reroute suggestion to the device.
The critical architectural decision is to decouple the road graph topology (which changes daily) from the traffic speed overlay (which changes every second). Keeping them as separate data structures means you can update live speeds without reloading the entire graph, and you can run ETA computations against a consistent graph snapshot while speeds update concurrently.
Road Graph Representation
The road network is a weighted directed graph where nodes are intersections and edges are road segments. What most engineers underestimate is the size: a planetary road graph at full resolution has approximately 2 billion edges. Loading that into a single process’s memory is impossible, so geographic partitioning is mandatory.
Google’s approach uses a hierarchical tile system. The world is divided into tiles at multiple zoom levels. At the coarsest level, each tile covers a large geographic region (a US state, for instance). At the finest level, tiles cover individual neighborhoods. The graph is partitioned into these tiles, and connections between tiles are represented by boundary nodes - special intersection nodes that appear in both the tile they belong to and in the adjacent tile’s boundary index.
Each road segment edge stores:
segment_id: globally unique 64-bit integerfrom_node_id,to_node_id: intersection node referenceslength_meters: physical distancespeed_limit_mps: posted speed in meters per secondroad_class: enum (MOTORWAY, TRUNK, PRIMARY, SECONDARY, RESIDENTIAL)lanes: number of lanestravel_time_s: pre-computed baseline travel time (length / speed_limit)tile_id: which geographic tile this segment belongs to
The routing algorithm loads only the tiles needed for a given route. A trip from San Francisco to San Jose touches maybe 50 tiles. A cross-country trip touches thousands. The tile loader uses an LRU cache on each ETA server - popular urban tiles stay resident in memory permanently; remote tiles are evicted.
Why A instead of Dijkstra?* Dijkstra explores all nodes in order of accumulated cost from the source, expanding in a circle. A* uses a heuristic - the straight-line distance to the destination divided by the maximum possible road speed - to bias exploration toward the destination. On a continental road graph, A* typically explores 10-20x fewer nodes than Dijkstra for long-distance routes. The heuristic is admissible (never overestimates) because we never travel faster than the maximum speed on any road class.
# A* routing with live traffic weights on a road graph tile
import heapq
from dataclasses import dataclass, field
from typing import Dict, Optional
@dataclass(order=True)
class HeapEntry:
f_cost: float
node_id: int = field(compare=False)
g_cost: float = field(compare=False) # actual cost from source
parent: Optional[int] = field(compare=False, default=None)
def haversine_meters(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
"""Straight-line distance between two lat/lon points in meters."""
import math
R = 6_371_000
phi1, phi2 = math.radians(lat1), math.radians(lat2)
dphi = math.radians(lat2 - lat1)
dlambda = math.radians(lon2 - lon1)
a = math.sin(dphi/2)**2 + math.cos(phi1)*math.cos(phi2)*math.sin(dlambda/2)**2
return 2 * R * math.asin(math.sqrt(a))
MAX_SPEED_MPS = 40.0 # ~144 km/h, max motorway speed
def astar_eta(
graph: Dict, # adjacency list: node_id -> list of (neighbor_id, segment_id)
segments: Dict, # segment_id -> {length_m, travel_time_s, from_node, to_node}
nodes: Dict, # node_id -> {lat, lon}
traffic: Dict, # segment_id -> current_speed_mps (or None if no data)
source: int,
dest: int,
) -> tuple[float, list[int]]:
"""
Returns (eta_seconds, path_of_node_ids) using A* with live traffic weights.
Falls back to historical speed if no live data for a segment.
"""
dest_lat = nodes[dest]['lat']
dest_lon = nodes[dest]['lon']
def heuristic(node_id: int) -> float:
n = nodes[node_id]
dist = haversine_meters(n['lat'], n['lon'], dest_lat, dest_lon)
return dist / MAX_SPEED_MPS # optimistic travel time in seconds
def edge_cost(seg_id: int) -> float:
seg = segments[seg_id]
live_speed = traffic.get(seg_id)
if live_speed and live_speed > 0.5: # at least 0.5 m/s (1.8 km/h)
return seg['length_m'] / live_speed
return seg['travel_time_s'] # fallback to speed-limit baseline
open_set: list[HeapEntry] = []
g_costs: Dict[int, float] = {source: 0.0}
parents: Dict[int, Optional[int]] = {source: None}
heapq.heappush(open_set, HeapEntry(
f_cost=heuristic(source),
node_id=source,
g_cost=0.0
))
while open_set:
entry = heapq.heappop(open_set)
current = entry.node_id
if current == dest:
# Reconstruct path
path = []
node = dest
while node is not None:
path.append(node)
node = parents[node]
path.reverse()
return g_costs[dest], path
if entry.g_cost > g_costs.get(current, float('inf')):
continue # stale entry
for neighbor_id, seg_id in graph.get(current, []):
cost = edge_cost(seg_id)
new_g = g_costs[current] + cost
if new_g < g_costs.get(neighbor_id, float('inf')):
g_costs[neighbor_id] = new_g
parents[neighbor_id] = current
f = new_g + heuristic(neighbor_id)
heapq.heappush(open_set, HeapEntry(f, neighbor_id, new_g))
return float('inf'), [] # no path found
Google’s production routing uses a technique called Contraction Hierarchies (CH) on top of A*. CH pre-processes the graph by “contracting” unimportant nodes, creating shortcut edges that allow skipping long chains of low-importance roads. This reduces the number of nodes A* needs to explore by another 100-1000x, enabling sub-millisecond routing on continental-scale graphs. The contraction must be recomputed when road topology changes, but not when traffic speeds change.
Probe Vehicle Data Fusion
Every Android phone running Google Maps (navigating or passively) periodically reports its GPS coordinates, speed, and heading back to Google’s servers. This is the probe data pipeline - the lifeblood of real-time traffic. At peak, this represents 8 million GPS events per second globally.
The raw GPS events are not directly useful. A phone reports latitude/longitude, but what we need is: “segment X currently has a speed of Y meters per second.” The process of converting raw GPS traces to per-segment speed readings is map matching.
Map matching solves the problem: given a sequence of noisy GPS points, which road segments did this device actually travel? A single GPS fix is ambiguous - a point in the middle of a city block could be on the road, a parking lot, or inside a building. A sequence of points over time narrows this down dramatically. The standard approach is a Hidden Markov Model (HMM) where the hidden states are road segments and the observations are GPS coordinates.
# Simplified map matching using Viterbi HMM to snap GPS trace to road segments
from typing import List, Tuple
import math
def emission_probability(gps_point: Tuple[float,float], segment_midpoint: Tuple[float,float]) -> float:
"""
Probability that a GPS reading came from a device on this segment.
Uses a Gaussian model with sigma=20 meters (typical GPS error).
"""
sigma = 20.0 # meters
dist = haversine_meters(gps_point[0], gps_point[1],
segment_midpoint[0], segment_midpoint[1])
return math.exp(-(dist**2) / (2 * sigma**2))
def transition_probability(seg_a: int, seg_b: int, graph_distance: float, gps_distance: float) -> float:
"""
Probability of transitioning from seg_a to seg_b given the GPS distance traveled.
High probability when graph distance ~ GPS distance (straight-line movement along roads).
"""
delta = abs(graph_distance - gps_distance)
beta = 10.0 # tuned empirically, meters
return math.exp(-delta / beta)
def viterbi_map_match(
gps_trace: List[Tuple[float, float]], # [(lat, lon), ...]
candidate_segments_per_point: List[List[int]], # [[seg_id, ...], ...]
segment_midpoints: Dict, # seg_id -> (lat, lon)
segment_graph_distances: Dict, # (seg_a, seg_b) -> shortest path distance
) -> List[int]:
"""
Viterbi algorithm: returns the most likely sequence of road segments.
Each observation (GPS point) is matched to the most probable segment state.
"""
n = len(gps_trace)
if n == 0:
return []
# viterbi[t][seg] = max log-probability of being on seg at time t
viterbi = [{} for _ in range(n)]
backpointer = [{} for _ in range(n)]
# Initialize
for seg in candidate_segments_per_point[0]:
em = emission_probability(gps_trace[0], segment_midpoints[seg])
viterbi[0][seg] = math.log(em + 1e-10)
backpointer[0][seg] = None
# Forward pass
for t in range(1, n):
gps_dist = haversine_meters(
gps_trace[t-1][0], gps_trace[t-1][1],
gps_trace[t][0], gps_trace[t][1]
)
for seg in candidate_segments_per_point[t]:
em = emission_probability(gps_trace[t], segment_midpoints[seg])
best_prev_score = float('-inf')
best_prev_seg = None
for prev_seg in candidate_segments_per_point[t-1]:
graph_dist = segment_graph_distances.get((prev_seg, seg), float('inf'))
trans = transition_probability(prev_seg, seg, graph_dist, gps_dist)
score = viterbi[t-1][prev_seg] + math.log(trans + 1e-10)
if score > best_prev_score:
best_prev_score = score
best_prev_seg = prev_seg
viterbi[t][seg] = best_prev_score + math.log(em + 1e-10)
backpointer[t][seg] = best_prev_seg
# Backtrack
best_final = max(viterbi[n-1], key=lambda s: viterbi[n-1][s])
path = [best_final]
for t in range(n-1, 0, -1):
path.append(backpointer[t][path[-1]])
path.reverse()
return path
Once GPS traces are map-matched to segments, the pipeline computes a speed observation: the device traveled distance_of_segment / time_elapsed meters per second on that segment. Multiple observations per segment within a 60-second window are averaged (weighted by confidence of the map match). The resulting per-segment speed is written to the Traffic Speed Layer.
GPS spoofing and mismatched map data are real problems at scale. Phones in tunnels, parking garages, or dense urban canyons (the “urban canyon effect”) produce GPS readings that jump randomly between segments. Without outlier rejection - discarding readings where the implied speed exceeds 150 mph or the map match confidence is below threshold - you will poison your traffic layer with phantom traffic jams and phantom clear roads.
Live Traffic Speed Layer
The Traffic Speed Layer is the interface between the probe ingestion pipeline (writes) and the ETA computation service (reads). It must support:
- Writes at 500,000 segment speed updates per second (after aggregation from raw probes)
- Reads at 5 million segment lookups per second (from ETA queries)
- Freshness: every segment’s speed must reflect data less than 60 seconds old
- Fallback: if no recent probe data exists for a segment, fall back to historical speed by time-of-day
The data structure is a hash map from segment_id (64-bit integer) to a small struct:
# Speed record stored per road segment in the traffic layer
from dataclasses import dataclass
from typing import Optional
@dataclass
class SegmentSpeed:
segment_id: int
observed_speed_mps: float # meters per second from probes
confidence: float # 0.0 to 1.0 based on probe count
observation_count: int # number of probes in last 60s
last_updated_epoch_ms: int # timestamp of most recent observation
historical_speed_mps: float # baseline from historical model (time-of-day)
def effective_speed(self, current_epoch_ms: int) -> float:
"""
Returns the speed to use for ETA computation.
Live data expires after 90 seconds; falls back to historical.
"""
age_ms = current_epoch_ms - self.last_updated_epoch_ms
if age_ms < 90_000 and self.confidence > 0.3:
# Blend live and historical by confidence
w_live = min(self.confidence, 1.0)
w_hist = 1.0 - w_live
return w_live * self.observed_speed_mps + w_hist * self.historical_speed_mps
return self.historical_speed_mps
This layer is deployed as a distributed in-memory cache partitioned by segment_id % N_shards. Each shard holds approximately 2 billion / N_shards segments. With 256 shards and each segment record taking ~60 bytes, each shard holds about 8 million segments and uses roughly 480 MB of memory - manageable on modern servers.
The update protocol is write-through: the ingestion pipeline writes directly to the appropriate shard, no intermediate queue. Reads are always local - the ETA server that needs segment X’s speed fetches it directly from the shard that owns X.
Historical traffic patterns are not just a fallback - they are the prediction engine for future segments on a route. When computing ETA for a 40-minute drive, the segment you will reach in 35 minutes has no live probe data for that future moment. The correct weight to use is the historical average speed for that segment at the time of day you will arrive there, not the current speed right now.
Historical Traffic Patterns
Historical patterns are built from years of anonymized probe traces. For each road segment, we maintain a speed profile indexed by (day-of-week, hour-of-day, 15-minute-bucket). This gives 7 * 24 * 4 = 672 speed samples per segment, representing the typical speed at that time of week.
-- Historical traffic speed table (pre-aggregated offline)
-- Stores typical speed for each segment by time bucket
CREATE TABLE historical_segment_speed (
segment_id BIGINT NOT NULL,
day_of_week SMALLINT NOT NULL, -- 0=Sunday, 6=Saturday
hour_of_day SMALLINT NOT NULL, -- 0-23
minute_bucket SMALLINT NOT NULL, -- 0, 15, 30, 45
p50_speed_mps FLOAT NOT NULL, -- median observed speed
p85_speed_mps FLOAT NOT NULL, -- 85th percentile (optimistic)
p15_speed_mps FLOAT NOT NULL, -- 15th percentile (pessimistic)
sample_count INTEGER NOT NULL, -- observations used
last_computed_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
PRIMARY KEY (segment_id, day_of_week, hour_of_day, minute_bucket)
);
-- Index for time-based bulk loads during off-peak recomputation
CREATE INDEX idx_hist_speed_time
ON historical_segment_speed (day_of_week, hour_of_day, minute_bucket);
For ETA confidence intervals, we use the spread between p15 and p85 historical speeds as a proxy for uncertainty. A segment with very consistent historical speeds (p15 close to p85) has low uncertainty. A segment with high variability (p15 = 5 mph, p85 = 45 mph - a merge ramp in downtown) has high uncertainty and contributes more to the ETA confidence interval.
# ETA confidence interval computation
def compute_eta_confidence_interval(
path_segments: list[int],
arrival_times: list[float], # estimated arrival time at each segment (epoch seconds)
historical_speeds: dict, # segment_id -> HistoricalSpeedRecord
live_confidences: dict, # segment_id -> confidence in live data (0-1)
) -> tuple[float, float]:
"""
Returns (lower_bound_seconds, upper_bound_seconds) for the total ETA.
Lower bound uses p85 speed (optimistic), upper bound uses p15 (pessimistic).
"""
lower_total = 0.0
upper_total = 0.0
for seg_id, arrival_epoch in zip(path_segments, arrival_times):
import datetime
dt = datetime.datetime.fromtimestamp(arrival_epoch)
dow = dt.weekday()
hod = dt.hour
mbucket = (dt.minute // 15) * 15
hist = historical_speeds.get((seg_id, dow, hod, mbucket))
if hist is None:
continue
live_conf = live_confidences.get(seg_id, 0.0)
seg = segments[seg_id]
# High live confidence narrows the interval toward the live speed
blend_factor = live_conf # 0 = use full historical spread, 1 = no spread
p85 = hist['p85_speed_mps']
p15 = hist['p15_speed_mps']
# Blend: with high live confidence, p85 and p15 converge toward live speed
effective_p85 = p85 + blend_factor * (live_speeds.get(seg_id, p85) - p85)
effective_p15 = p15 + blend_factor * (live_speeds.get(seg_id, p15) - p15)
lower_total += seg['length_m'] / max(effective_p85, 0.1)
upper_total += seg['length_m'] / max(effective_p15, 0.1)
return lower_total, upper_total
Data Model
-- Core tables for the ETA prediction system
-- Road segments: static topology, updated by map editing pipeline
CREATE TABLE road_segment (
segment_id BIGINT PRIMARY KEY,
from_node_id BIGINT NOT NULL REFERENCES road_node(node_id),
to_node_id BIGINT NOT NULL REFERENCES road_node(node_id),
length_meters FLOAT NOT NULL CHECK (length_meters > 0),
speed_limit_mps FLOAT NOT NULL CHECK (speed_limit_mps > 0),
road_class SMALLINT NOT NULL, -- 1=motorway ... 7=residential
lanes SMALLINT NOT NULL DEFAULT 2,
tile_id INTEGER NOT NULL,
geometry_wkb BYTEA, -- encoded linestring for display
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_segment_tile ON road_segment (tile_id);
CREATE INDEX idx_segment_nodes ON road_segment (from_node_id, to_node_id);
-- Road intersections: nodes in the graph
CREATE TABLE road_node (
node_id BIGINT PRIMARY KEY,
lat DOUBLE PRECISION NOT NULL,
lon DOUBLE PRECISION NOT NULL,
tile_id INTEGER NOT NULL
);
CREATE INDEX idx_node_geo ON road_node USING GIST (
ST_SetSRID(ST_MakePoint(lon, lat), 4326)
);
-- Active navigation sessions: tracked for reroute monitoring
CREATE TABLE navigation_session (
session_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
user_id BIGINT NOT NULL,
device_id VARCHAR(64) NOT NULL,
origin_node_id BIGINT NOT NULL,
dest_node_id BIGINT NOT NULL,
current_segment BIGINT,
route_segment_ids BIGINT[] NOT NULL, -- ordered list of segments on current route
started_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
last_eta_s FLOAT,
last_eta_at TIMESTAMPTZ,
status VARCHAR(16) NOT NULL DEFAULT 'active'
CHECK (status IN ('active', 'completed', 'cancelled'))
);
CREATE INDEX idx_session_user ON navigation_session (user_id, status);
CREATE INDEX idx_session_segment ON navigation_session
USING GIN (route_segment_ids)
WHERE status = 'active';
-- Probe data raw ingest (short-lived, compacted by streaming pipeline)
CREATE TABLE probe_event_raw (
event_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
device_id VARCHAR(64) NOT NULL,
lat DOUBLE PRECISION NOT NULL,
lon DOUBLE PRECISION NOT NULL,
speed_mps FLOAT,
heading_deg FLOAT,
accuracy_m FLOAT,
observed_at TIMESTAMPTZ NOT NULL,
ingested_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
)
PARTITION BY RANGE (observed_at);
-- 5-minute partitions, retained 24 hours, then dropped
Key Algorithms and Protocols
Rerouting Trigger Logic
The route monitoring service re-evaluates the current route against alternatives every 30 seconds. The decision to suggest a reroute is not simply “is there a faster route?” - it requires accounting for the disruption cost of switching routes (a U-turn or unexpected exit is itself costly) and preventing alert fatigue.
# Rerouting trigger: decides when to suggest an alternative route
from dataclasses import dataclass
from typing import Optional
@dataclass
class RerouteCandidate:
route_segments: list[int]
eta_seconds: float
eta_lower: float
eta_upper: float
first_divergence_segment: int # where this route diverges from current
divergence_distance_m: float # how far along current route until divergence
REROUTE_THRESHOLD_S = 120.0 # must save at least 2 minutes
REROUTE_CONFIDENCE_RATIO = 0.8 # alternative must be faster with 80% confidence
MIN_DIVERGENCE_DISTANCE_M = 500 # must have at least 500m to reach the fork
def should_reroute(
current_eta_s: float,
current_eta_upper: float, # pessimistic bound
best_alternative: Optional[RerouteCandidate],
distance_driven_m: float,
last_reroute_time_s: float,
current_time_s: float,
) -> bool:
"""
Returns True if we should push a reroute suggestion to the driver.
"""
if best_alternative is None:
return False
# Don't reroute if we just rerouted recently (anti-flap hysteresis: 90 seconds)
if current_time_s - last_reroute_time_s < 90.0:
return False
# Can't reroute if divergence point has already passed
if best_alternative.divergence_distance_m < MIN_DIVERGENCE_DISTANCE_M:
return False
# Core check: alternative saves at least REROUTE_THRESHOLD_S seconds
savings = current_eta_s - best_alternative.eta_seconds
if savings < REROUTE_THRESHOLD_S:
return False
# Confidence check: alternative's upper bound is still better than current lower bound
# This prevents suggesting a reroute that's only faster in the optimistic case
if best_alternative.eta_upper >= current_eta_s - REROUTE_THRESHOLD_S * REROUTE_CONFIDENCE_RATIO:
return False
return True
Probe Data Aggregation with Exponential Decay
Raw probe observations must be aggregated into a single speed reading per segment. Older observations are less relevant - a speed reading from 90 seconds ago matters less than one from 5 seconds ago. We use exponential decay weighting:
# Exponential decay aggregation for segment speed from probe observations
import math
from typing import List, Tuple
def aggregate_probe_speeds(
observations: List[Tuple[float, float]], # [(speed_mps, age_seconds), ...]
decay_half_life_s: float = 30.0, # speed observations halve in weight every 30s
min_count: int = 3, # require at least 3 probes for confidence
) -> Tuple[float, float]:
"""
Returns (weighted_speed_mps, confidence) for a road segment.
Uses exponential decay: older observations contribute less.
"""
if not observations:
return 0.0, 0.0
total_weight = 0.0
weighted_speed = 0.0
decay_constant = math.log(2) / decay_half_life_s
for speed_mps, age_s in observations:
if speed_mps < 0 or speed_mps > 60.0: # reject impossible speeds (>216 km/h)
continue
weight = math.exp(-decay_constant * age_s)
weighted_speed += weight * speed_mps
total_weight += weight
if total_weight < 1e-6:
return 0.0, 0.0
avg_speed = weighted_speed / total_weight
# Confidence: based on effective probe count (sum of weights, normalized)
# Saturates at 1.0 with 10+ effective probes
effective_count = total_weight # approximation when weights sum to ~count for recent data
confidence = min(1.0, len(observations) / max(min_count, 1) * 0.5)
return avg_speed, confidence
The rerouting algorithm must be hysteretic - once you suggest a reroute and the driver accepts, you should not immediately suggest routing back to the original route even if conditions improve, because that oscillation destroys trust. The 90-second anti-flap window is critical to prevent the system from thrashing between two routes that are within the noise margin of each other.
Scaling and Performance
Capacity Estimation - Google Maps ETA System
Given:
- 50M concurrent active navigation sessions
- Each session recalculates ETA every 30 seconds
- Each ETA query reads ~200 segments (avg route)
- 500M passive probe devices pinging every 5 seconds
ETA Query Load:
- 50M sessions / 30s = ~1.67M ETA computations/second
- Each computation reads 200 segment speeds from Traffic Layer
- Traffic Layer reads: 1.67M * 200 = 333M segment lookups/second
- With 256 shards: ~1.3M reads/shard/second
Probe Ingestion Load:
- 500M devices / 5s = 100M raw GPS events/second (peak)
- After map matching: ~10M speed observations/second
- After segment aggregation (60s window): ~500K segment speed updates/second
Traffic Layer Memory:
- 2B segments * 60 bytes/record = 120 GB total
- With 256 shards: ~470 MB/shard (easily fits in 16GB nodes)
Road Graph Memory (per ETA server):
- Each server covers a geographic region: ~50M segments
- ~150 bytes/segment (nodes + edges + attributes): 7.5 GB/server
- Graph is read-only, shared between processes via mmap
Network Bandwidth (probe ingestion):
- 100M events/s * 100 bytes/event = 10 GB/s peak ingest bandwidth
- Distributed across 100+ ingest nodes: 100 MB/s each (manageable)
The system scales along two independent dimensions. ETA computation servers scale by geographic region - each server cluster handles route computations for a specific geographic area (North America West, Europe Central, etc.). Traffic layer shards scale by segment ID hash - probe updates and speed reads are routed to the correct shard via consistent hashing.
The hot-path bottleneck is the Traffic Layer read throughput. 333 million segment lookups per second globally is an enormous read volume. The primary mitigation is prefetching: when a navigation session is started, the ETA server pre-loads the speed records for all segments on the current route into a local buffer. Subsequent recalculations (every 30 seconds) only need to refresh segments where the cached value is stale. This reduces actual Traffic Layer reads by ~80%, since most segment speeds change slowly.
Caching strategy for the Traffic Layer uses a two-tier approach. Tier 1 is a local in-process cache on each ETA server, holding speeds for segments on active routes in its geographic area. This cache is invalidated by a pub-sub channel that the ingestion pipeline pushes updates to. Tier 2 is the distributed Traffic Layer shard, accessed on cache miss. The local cache hit rate on active routes is typically above 70%, dramatically reducing cross-service calls.
Waze (acquired by Google in 2013) pioneered the community-sourced traffic model where every navigating driver becomes a probe vehicle. Google Maps now fuses Waze community reports (accidents, police, road hazards) with its own passive probe data. The Waze incident layer adds a categorical dimension to the traffic model: not just “slow here” but “slow here because of an accident that will clear in 15 minutes” - which dramatically improves ETA accuracy for the minutes immediately after an incident resolves.
Failure Modes and Recovery
| Failure | Detection | Impact | Recovery |
|---|---|---|---|
| Traffic Layer shard crash | Health check fails within 10s, client sees connection refused | ETA queries for segments on that shard fall back to historical speeds | Shard restarts from snapshot (last 5-minute checkpoint); stale reads for up to 5 minutes |
| Probe ingestion lag spike | Consumer group lag > 60s on Kafka topic | Traffic layer becomes stale; ETAs revert to historical | Auto-scale ingestion workers; shed low-confidence probe events; circuit-breaker to historical-only mode |
| GPS clock skew on device | Probe timestamps in future or more than 5 minutes in past | Stale speed data injected with wrong age; decay calculation corrupted | Server-side timestamp override using ingest time for events with skew > 30s |
| ETA server geographic tile cache miss storm | P99 latency spike, tile loader queue depth rises | ETA queries for popular routes slow by 50-200ms | LRU eviction with minimum-resident policy for top-100 urban tile sets; tile prewarming on server start |
| Route recalculation storm after major incident | All active sessions in affected region simultaneously trigger reroute | Massive burst on ETA computation servers | Jittered recalculation (each session gets a random offset within its 30s window); priority queue to serve newest sessions first |
| Map matching failure in tunnel/dense urban canyon | High GPS error reported by device, map match confidence < 20% | Probe events discarded; affected segments lose live data | Dead reckoning using last known speed + heading + elapsed time to maintain approximate position |
The most operationally dangerous failure mode is the “phantom traffic jam” - where a cluster of GPS devices in a parking garage or stadium are map-matched to nearby highway segments, injecting false slow-speed observations. Without geographic plausibility checks (is this speed consistent with the segment’s road class? are multiple nearby segments also showing slow speeds?) a single high-density venue event can corrupt the traffic model for miles of surrounding roads.
Comparison of Approaches
| Approach | ETA Latency | Traffic Freshness | Accuracy | Complexity | Best Fit |
|---|---|---|---|---|---|
| Static graph (speed limits only) | < 50ms | N/A (static) | 40-60% within 10% | Low | Rural areas, sparse probe coverage |
| Historical patterns only | < 100ms | 1 week lag | 65-75% within 10% | Medium | Fallback when live data unavailable |
| Live probe overlay (current system) | 150-200ms | 60s | 80-90% within 10% | High | Urban areas, dense probe coverage |
| ML-based future state prediction | 300-500ms | N/A (predicts future) | 85-92% within 10% | Very high | Long routes (1+ hours) with complex conditions |
| User-reported incidents only (Waze classic) | < 100ms | 2-5 minutes | 70-80% within 10% | Medium | High engagement communities |
The live probe overlay approach is the right choice for the majority of navigating users. The latency cost vs. static routing (100-150ms more) is acceptable given the 30-40 percentage point accuracy improvement. For long-distance routes where most driving time is on highway segments with few probes, blending in the ML future-state predictor on a segment-by-segment basis gives the best accuracy without paying the full latency cost of end-to-end ML inference.
The most important non-obvious tradeoff here is that ML-based future state prediction is not simply “better” than the live overlay. It has higher average accuracy but wider confidence intervals - it is less confident about individual predictions. For navigation, a wrong-direction-confident ETA is worse than an honest wide-interval ETA, because the driver plans around it.
Key Takeaways
- Road graph representation: Hierarchical tile partitioning makes a 2-billion-edge planetary graph queryable in under 200ms by loading only the geographic tiles needed per route.
- A over Dijkstra*: The straight-line heuristic reduces explored nodes by 10-100x on long routes, making real-time routing at scale feasible.
- Probe data fusion: Map matching raw GPS traces to segments via Hidden Markov Models converts noisy location pings into clean per-segment speed readings.
- Temporal traffic prediction: Computing ETA means predicting future segment speeds, not just reading current ones - historical patterns give the prediction horizon that live data cannot.
- ETA confidence intervals: The p15/p85 spread of historical speeds, narrowed by live data confidence, gives drivers honest uncertainty bounds rather than falsely precise point estimates.
- Hysteretic rerouting: Anti-flap windows prevent oscillation between routes within noise margin of each other, which is critical for driver trust.
- Decoupled write and read paths: Probe ingestion and ETA computation both touch segment speed data but through different access patterns - writes are segment-local, reads are route-local - and must be isolated to prevent write storms from blocking reads.
- Geographic sharding is the primary scaling lever: Both graph storage and traffic layer partition by geographic region, not by user ID, because the access pattern is location-driven.
The deepest counter-intuitive lesson from this system is that the hardest problem is not the routing algorithm itself - A* on a road graph is a solved problem in competitive programming. The hardest problem is maintaining a globally consistent, sub-60-second-fresh speed overlay on 2 billion segments while simultaneously serving 1.67 million route computations per second, each touching 200 different segments. The routing algorithm is seconds of code; the data pipeline is years of engineering.
Frequently Asked Questions
Q: Why not use a neural network to directly predict ETA from origin and destination coordinates? A: Pure ML approaches struggle with cold paths (routes with insufficient training data), cannot explain their predictions (important for debugging when ETAs are wrong), and fail to incorporate real-time traffic events in a principled way. The graph-based approach with traffic overlay is interpretable, debuggable, and correctly handles novel situations like new road closures that the ML model has never seen.
Q: How does the system handle a route that crosses geographic tile boundaries? A: Boundary nodes serve as connection points between tiles. The routing algorithm runs a multi-tile version of A* that can “teleport” between tiles via these boundary connections. Each tile precomputes shortest paths between all its boundary nodes (the “boundary-to-boundary” shortcut table), allowing the multi-tile search to treat tile crossings as single-hop operations rather than re-traversing the interior of each tile.
Q: Why not use the same serving infrastructure for both probe ingestion and ETA queries? A: The two workloads have fundamentally different resource profiles. Probe ingestion is write-heavy with small records, sequential, and throughput-optimized. ETA queries are read-heavy, random-access across large graph data, and latency-optimized. Sharing infrastructure would force both workloads to compromise: the write path would be slowed by cache pollution from graph reads, and the read path would be starved by write I/O.
Q: Why is the reroute threshold 2 minutes rather than, say, 30 seconds? A: Driver behavior research shows that reroutes are disruptive - they require the driver to process new instructions, make unexpected turns, and rebuild their mental model of the route. A reroute that saves 30 seconds is likely not worth the cognitive overhead and the risk of a driver making an unsafe lane change to hit an unexpected exit. The 2-minute threshold is empirically tuned to maximize the ratio of accepted reroutes to suggested reroutes.
Q: How does the system handle areas with zero probe data coverage (rural roads, developing countries)? A: The historical model serves as the universal fallback. For segments with no probe history, the system uses the road class and speed limit to derive a conservative estimate (typically 70-80% of posted speed limit, accounting for curves, intersections, and unknown conditions). Confidence intervals are explicitly wide for these segments, and the ETA UI may show a range rather than a point estimate.
Q: Why does the traffic layer use 60-second freshness rather than real-time (under 5 seconds)? A: Real-time freshness requires synchronous write coordination that would create a bottleneck at 8 million events per second. The 60-second window allows the ingestion pipeline to batch and aggregate probe events before writing, dramatically reducing write amplification. The accuracy impact is minimal because speed on a road segment changes slowly (a traffic jam takes 5-10 minutes to form) - 60-second freshness captures all relevant traffic state changes.
Interview Questions
Q: How would you design the ETA recalculation trigger for active navigation sessions - when do you recalculate? Expected depth: Discuss time-based recalculation (every 30s), event-driven recalculation (traffic incident on current route), deviation detection (driver has gone off-route), and threshold-based recalculation (ETA has changed by more than 5 minutes since last computation). Cover how you’d implement the event-driven path: segments on the active route must be subscribed to traffic update events, and the session must be woken up when one of those segments changes significantly.
Q: How would you partition the road graph to support sub-200ms routing for cross-continental trips? Expected depth: Discuss hierarchical graph contraction (contracting unimportant nodes creates motorway-only shortcuts), tile-based partitioning with boundary nodes, precomputed boundary-to-boundary shortest paths within tiles, and the multi-level routing strategy that first routes on the high-level motorway network and then refines to surface streets at origin and destination.
Q: How would you ensure that a single large concert venue (100,000 people with phones) doesn’t create phantom traffic jams on nearby highways? Expected depth: Discuss GPS urban canyon / stationary device filtering (speed < 1 m/s sustained for 30+ seconds means the device is parked, not driving), segment plausibility checks (if 10,000 devices are on a 2-lane road that can physically hold 200 vehicles, reject excess readings), geographic clustering of anomalous readings as a hallmark of venue effects, and the importance of correlating neighboring segment speeds (real traffic jams propagate spatially, venue artifacts don’t).
Q: Walk me through how you would compute a confidence interval for a 2-hour route ETA. Expected depth: Discuss per-segment speed variance from historical data (p15/p85), the different sources of uncertainty (GPS accuracy, map match quality, traffic model staleness, future state uncertainty for distant segments), how to propagate per-segment uncertainty into a total-route interval (sum of independent variances for uncorrelated segments), and why the interval grows super-linearly with route length due to correlated congestion events (gridlock in a city affects all segments simultaneously, not independently).
Q: How would you design the system to handle a sudden major road closure (e.g., a bridge collapse) that affects hundreds of thousands of active routes? Expected depth: Discuss the incident feed ingestion path (authoritative closure overrides probe data), marking affected segments as impassable in the Traffic Layer (speed = 0, confidence = 1.0), the resulting reroute storm (all affected sessions simultaneously try to reroute), the need for a priority queue and shed-load mechanism to serve reroutes in priority order, and how the closed segment propagates from the incident feed to all active sessions within the 60-second freshness SLA.
Want to see how these patterns hold up when traffic spikes 50x at 3 AM? That's exactly what this Premium deep-dive covers.