Build a Geospatial Proximity Service


databases performance scalability

System Design Deep Dive

Geospatial Proximity Service

Sub-10ms “find everything within Y km” across 500 million moving entities - indexing the planet without slowing the planet down

⏱ 14 min read📐 Advanced🏗️ Geospatial

Imagine a post office sorting room the size of a continent. Five hundred million letters arrive at different addresses and they keep moving - a letter picked up, rerouted, dropped off, moved again - every few seconds. A customer walks up and asks: “Give me every letter that’s within two kilometers of this spot, right now.” And you have to answer in ten milliseconds.

That is the geospatial proximity problem. It sits at the core of Uber driver search, Yelp restaurant discovery, Doordash courier dispatch, Airbnb map views, and fleet management systems worldwide. The naive approach - store latitude and longitude in a relational database and run a distance formula against every row - collapses immediately. A full table scan over 500 million rows takes seconds even with a B-tree index, because distance is a two-dimensional relationship and B-trees are one-dimensional structures. They can answer “latitude between 37.5 and 38.0” efficiently, but they cannot answer “distance from point within radius” without fusing two orthogonal range scans and post-filtering millions of candidates.

The second naive approach - a standard 2D spatial index like PostGIS with an R-tree - works well up to tens of millions of static entities. The problem here is not query latency on the read side; R-trees are excellent at range queries. The problem is the write side. When 500 million entities are moving and each one sends a GPS update every 3 to 5 seconds, you’re absorbing roughly 100 to 160 million location writes per second at peak. Every write invalidates an entry in the spatial index. R-trees are not designed for that mutation rate. The index becomes a bottleneck and its benefit on the read side evaporates.

The fundamental tension is: write velocity versus query freshness versus read latency. We want to absorb tens of millions of location updates per second, keep the proximity index fresh enough that a query returns entities that are actually still nearby, and answer any given query in under 10ms. We need to solve for index write throughput, index read performance, and query precision vs recall simultaneously - accepting some approximation on the write side to stay within latency budgets on the read side.

Requirements and Constraints

Functional Requirements

  • Answer “find all entities within Y km of (lat, lon)” for configurable Y (0.5 km to 50 km)
  • Support entity types: drivers, couriers, restaurants, POIs, users
  • Accept location updates from moving entities (drivers, couriers) every 3 to 10 seconds
  • Return results sorted by ascending distance from the query point
  • Support entity metadata retrieval (name, type, status) in the same response

Non-Functional Requirements

  • Query latency: p99 under 10ms, p50 under 3ms
  • Write throughput: sustain 100 million location updates per second at peak (500M entities x 1 update per 5 seconds)
  • Entity count: 500 million active entities globally
  • Availability: 99.99% uptime (less than 53 minutes downtime per year)
  • Location freshness: query results must reflect entity positions within the last 5 seconds
  • Result accuracy: return all entities within the radius (recall 100%) with acceptable false positive rate under 5% (pruned in post-filter)

Constraints and Assumptions

  • Entities are distributed globally but density is concentrated in urban areas (power law distribution)
  • Static POIs (restaurants, stores) vastly outnumber moving entities - different update cadence
  • We do not need to store historical location trails (that is a separate analytics system)
  • Cross-entity-type queries are out of scope - each entity type has its own index namespace
  • The system is eventually consistent on writes: a newly moved entity may briefly appear in its old cell for up to one index update cycle (100 to 200ms)

High-Level Architecture

The architecture separates the read path (proximity queries) from the write path (location updates) with different latency and throughput profiles. Think of it as two independent conveyor belts running at different speeds that share the same warehouse floor - the fast read belt and the high-volume write belt never block each other.

Geospatial proximity service architecture showing client tier, processing tier, index tier, and streaming tier

The API Gateway handles authentication, rate limiting, and routes queries to the Query Service and updates to the Location Ingestion Service. The Query Service receives a (lat, lon, radius) triple, encodes the query point into a Geohash cell, expands to neighboring cells, fetches candidate entity IDs from the Geohash Index in Redis, runs a Haversine distance filter to prune false positives, then hydrates entity metadata from Cassandra. The entire read path touches only in-memory Redis and a Cassandra multi-get - no disk seeks in steady state.

The Location Ingestion Service receives GPS pings, writes them to a Kafka topic partitioned by geohash prefix, and returns immediately. A Flink stream processor consumes from Kafka, applies last-write-wins coalescing within 100ms windows per entity, and produces a deduplicated batch to the Index Updater. The Index Updater applies atomic SREM old_cell / SADD new_cell operations in Redis, updating the spatial index without a full rewrite of the entity’s record.

The Entity Store (Cassandra) holds full entity metadata and the precise lat/lon for Haversine post-filtering. It is written by the Index Updater alongside the Redis cell membership change. The Result Cache (Redis with 500ms TTL) sits in front of the Query Service for repeated identical queries - common in dense areas where many users query the same cell simultaneously.

Key Insight

The Geohash index in Redis is not the source of truth for location - Cassandra is. Redis holds only cell membership (which entity is in which cell). This separation means a Redis failure degrades query performance but never corrupts entity location data, and the index can be rebuilt by replaying Cassandra records.

Component Deep Dives

The Geohash Index

The Geohash index does one job: given a geohash cell string, return the set of entity IDs currently inside that cell.

Geohash is a hierarchical encoding of the Earth’s surface into a string of characters where longer strings mean smaller cells. The analogy is a zip code system: “94107” is more precise than “941”, which is more precise than “94”. Geohash cells at precision 6 cover roughly 1.2 km by 0.6 km - a useful granularity for proximity queries up to a few kilometers. Precision 4 covers 39 km by 20 km (city-level sharding key). Precision 8 covers 38 m by 19 m (fine-grained sort).

In Redis, we store each precision-6 cell as a Set keyed geo:cell:{geohash6}. A member of that set is an entity ID (UUID or snowflake). For a proximity query with radius 2 km, we encode the query point at precision 6, compute the 8 geographically neighboring cells using the known adjacency rules, and issue 9 SMEMBERS or SUNIONSTORE commands to Redis. This returns a superset of candidates that covers the query circle - but because geohash cells are rectangles, not circles, we always get some false positives at cell corners. The Haversine post-filter trims them.

# Geohash-based proximity lookup with neighbor expansion
import geohash2
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    # Exact great-circle distance between two points
    dlat = math.radians(lat2 - lat1)
    dlon = math.radians(lon2 - lon1)
    a = (math.sin(dlat / 2) ** 2 +
         math.cos(math.radians(lat1)) * math.cos(math.radians(lat2)) *
         math.sin(dlon / 2) ** 2)
    return EARTH_RADIUS_KM * 2 * math.asin(math.sqrt(a))

def find_nearby(redis_client, cassandra_session, lat, lon, radius_km, entity_type="driver"):
    # Step 1: encode to geohash precision 6 (covers ~1.2km x 0.6km)
    precision = 6
    center_hash = geohash2.encode(lat, lon, precision=precision)
    neighbors = geohash2.neighbors(center_hash)
    cells = [center_hash] + list(neighbors.values())

    # Step 2: fetch candidate entity IDs from all 9 cells
    pipe = redis_client.pipeline(transaction=False)
    for cell in cells:
        pipe.smembers(f"geo:cell:{entity_type}:{cell}")
    cell_results = pipe.execute()

    candidate_ids = set()
    for members in cell_results:
        candidate_ids.update(members)

    if not candidate_ids:
        return []

    # Step 3: fetch precise lat/lon from Cassandra for haversine filter
    placeholders = ",".join(["%s"] * len(candidate_ids))
    rows = cassandra_session.execute(
        f"SELECT entity_id, lat, lon, name, status FROM entities "
        f"WHERE entity_id IN ({placeholders})",
        list(candidate_ids)
    )

    # Step 4: exact distance filter - prune false positives
    results = []
    for row in rows:
        dist = haversine_km(lat, lon, row.lat, row.lon)
        if dist <= radius_km:
            results.append({"id": row.entity_id, "name": row.name,
                            "status": row.status, "distance_km": dist})

    results.sort(key=lambda x: x["distance_km"])
    return results
Real World

Uber’s Ringpop and H3-based geofencing, Yelp’s Vespa-backed geo queries, and Grab’s driver search all use cell-based pre-filtering followed by exact distance post-filtering. The pattern is nearly universal: approximate index for candidate retrieval, precise computation for final ranking.

For location updates, we atomically swap an entity from its old cell to its new cell:

# Atomic geohash cell swap for a moving entity
def update_entity_location(redis_client, cassandra_session,
                           entity_id, entity_type, new_lat, new_lon):
    # Encode new position at precision 6
    new_cell = geohash2.encode(new_lat, new_lon, precision=6)

    # Read current cell from Redis hash (fast O(1) lookup)
    current_cell = redis_client.hget(f"geo:loc:{entity_type}:{entity_id}", "cell6")

    pipe = redis_client.pipeline(transaction=True)
    if current_cell and current_cell != new_cell:
        # Remove from old cell
        pipe.srem(f"geo:cell:{entity_type}:{current_cell}", entity_id)

    # Add to new cell
    pipe.sadd(f"geo:cell:{entity_type}:{new_cell}", entity_id)
    # Update entity position hash
    pipe.hset(f"geo:loc:{entity_type}:{entity_id}",
              mapping={"cell6": new_cell, "lat": new_lat, "lon": new_lon,
                       "ts": int(time.time() * 1000)})
    pipe.execute()

    # Async write to Cassandra (durable store)
    cassandra_session.execute_async(
        "UPDATE entities SET lat=%s, lon=%s, updated_at=toTimestamp(now()) "
        "WHERE entity_id=%s",
        (new_lat, new_lon, entity_id)
    )
Watch Out

Never use Redis GEO commands (GEOADD, GEORADIUS) for this scale. Redis GEO internally uses a sorted set with geohash scores and scans a range - it does not support cell membership queries across neighbor cells without N separate range scans. At 500 million entities across a single Redis instance, the sorted set alone exceeds 20 GB and serialized range queries become the bottleneck.

The H3 Hexagonal Index

H3 is Uber’s open-source hierarchical hexagonal grid system. Where geohash uses rectangles of varying aspect ratios, H3 uses hexagons at uniform resolution levels. The practical advantage for proximity queries is that all 6 neighbors of a hexagon share an equal-length edge - the centroid-to-neighbor-centroid distance is the same in all directions. Geohash cells have two different neighbor distances (longer on East/West, shorter on North/South) which means a radius query must expand to 8 neighbors to guarantee coverage, and still gets rectangular corner false positives.

H3 at resolution 8 covers roughly 0.74 km per hexagonal edge (about 460 m cell-to-cell centroid distance). At resolution 7, cells are ~5x larger. The k-ring operation in H3 returns the set of hexagons within k hops from a center hexagon - a single operation that returns a geographically uniform ring.

# H3 hexagonal proximity expansion
import h3

def find_nearby_h3(redis_client, lat, lon, radius_km, entity_type="driver"):
    # Resolution 8: ~460m edge, good for 1-5km queries
    # Resolution 9: ~174m edge, better for sub-1km queries
    resolution = 8 if radius_km > 1.0 else 9

    center_hex = h3.latlng_to_cell(lat, lon, resolution)

    # k-ring radius: how many hexagon hops to cover the radius
    # At res 8, each hop is ~460m. For 2km radius, k=5 covers ~2.3km
    k = max(1, int(radius_km / 0.46) + 1)
    hex_cells = h3.grid_disk(center_hex, k)

    # Batch Redis lookup across all hex cells
    pipe = redis_client.pipeline(transaction=False)
    for hex_id in hex_cells:
        pipe.smembers(f"h3:cell:{entity_type}:{hex_id}")
    results = pipe.execute()

    candidate_ids = set()
    for members in results:
        candidate_ids.update(members)

    return candidate_ids

H3 is preferred over Geohash when you need density maps, hotspot analytics, or are building Voronoi-like zone systems (surge pricing zones, delivery zone assignment). Geohash remains simpler for basic proximity queries because the encoding/decoding is more widely supported and the neighbor computation is O(1) with a fixed lookup table.

Key Insight

H3’s equidistant hexagonal neighbors mean you can use a tighter k-ring expansion than geohash’s 8-neighbor rectangle expansion for the same radius coverage, reducing the number of candidate cells by 30-40% and cutting Redis pipeline commands proportionally.

The R-Tree and Bounding Box Pre-Filter

An R-tree is the workhorse spatial index in traditional databases. It recursively partitions space into Minimum Bounding Rectangles (MBRs), building a tree where each internal node bounds all objects in its subtree. PostGIS builds R-trees over geometry columns using the GiST index type.

The bounding box pre-filter is the trick that makes R-trees fast: before computing exact distances, the database finds all MBRs that overlap the query bounding box (a square enclosing the search circle). This prunes the search to a small fraction of entries without any expensive trigonometric computation. The exact Haversine check runs only on the survivors.

-- PostGIS R-tree proximity query with bounding box pre-filter
-- The && operator uses the R-tree index (bounding box overlap test)
-- ST_DWithin confirms exact distance after the index narrows candidates
SELECT
    entity_id,
    name,
    entity_type,
    ST_Distance(
        location::geography,
        ST_SetSRID(ST_MakePoint(-122.4194, 37.7749), 4326)::geography
    ) / 1000.0 AS distance_km
FROM entities
WHERE
    -- Bounding box pre-filter uses R-tree index (fast)
    location && ST_Expand(
        ST_SetSRID(ST_MakePoint(-122.4194, 37.7749), 4326)::geography,
        2000  -- 2000 meters bounding box
    )
    -- Exact distance filter (Haversine via geography type)
    AND ST_DWithin(
        location::geography,
        ST_SetSRID(ST_MakePoint(-122.4194, 37.7749), 4326)::geography,
        2000  -- 2000 meters exact
    )
    AND entity_type = 'driver'
    AND status = 'available'
ORDER BY distance_km
LIMIT 50;

R-trees excel for static or slowly-changing data. At 500 million moving entities with 100 million writes per second, the R-tree becomes unworkable because every write requires an R-tree node split/merge operation on the write-locked index. We use PostGIS R-trees for static entity types (restaurants, stores, service areas) and fall back to the Redis geohash index for high-churn moving entities.

Real World

Google Maps uses spatial indexing hierarchies for POI search (static data, R-tree friendly) but a separate cell-based system for live traffic and user positions. Foursquare’s Pilgrim SDK uses quadtree-based venue matching for geofence detection - different indexing approaches for different update cadences within the same product.

The Location Ingestion and Index Update Pipeline

The ingestion pipeline does one critical thing beyond just accepting writes: it decouples the GPS ping rate from the index write rate through index update batching.

A driver app sending GPS pings every 3 seconds does not need the index updated every 3 seconds. If the driver moves 50 meters between two pings, they likely remain in the same geohash-6 cell (which covers 1.2 km). An update that does not change cell membership is wasteful - it burns a Redis write but provides zero improvement to query accuracy. The Flink processor detects this:

# Flink-style location update coalescing and cell-change detection
from collections import defaultdict
import time

class LocationUpdateCoalescer:
    # Processes a 100ms window of updates, emits only cell-changing moves
    def __init__(self):
        self.last_cell = {}  # entity_id -> current geohash6 cell

    def process_window(self, updates):
        # updates: list of (entity_id, entity_type, lat, lon, ts)
        # Last-write-wins per entity within the window
        latest_per_entity = {}
        for entity_id, entity_type, lat, lon, ts in updates:
            key = (entity_id, entity_type)
            if key not in latest_per_entity or ts > latest_per_entity[key][4]:
                latest_per_entity[key] = (entity_id, entity_type, lat, lon, ts)

        index_updates = []
        for (entity_id, entity_type), (_, _, lat, lon, ts) in latest_per_entity.items():
            new_cell = geohash2.encode(lat, lon, precision=6)
            old_cell = self.last_cell.get((entity_id, entity_type))
            if old_cell != new_cell:
                # Cell changed - emit an index update
                index_updates.append({
                    "entity_id": entity_id,
                    "entity_type": entity_type,
                    "old_cell": old_cell,
                    "new_cell": new_cell,
                    "lat": lat,
                    "lon": lon,
                    "ts": ts
                })
                self.last_cell[(entity_id, entity_type)] = new_cell

        return index_updates

This batching has a dramatic effect on throughput. If only 15% of pings result in a cell change (a driver circling a block or stuck in traffic), we reduce index write volume from 100 million/second to 15 million/second - a 6.7x reduction that makes the difference between a Redis cluster that keeps up and one that falls over.

Location update frequency also needs adaptive throttling. A parked delivery truck at a restaurant does not need sub-second GPS updates. The server can send a “throttle to 30-second pings” signal to the client SDK when the entity has been stationary (same cell for 10 consecutive pings). The client resumes high-frequency pings as soon as it detects movement.

Watch Out

Never let the client choose its own update frequency without server-side enforcement. A misconfigured or malicious client sending 10 updates per second per entity would saturate Kafka partitions and starve legitimate updates. The ingestion service must enforce a per-entity rate limit (1 write per second max) with the excess silently dropped using a last-write-wins policy.

Data Model

The data model separates three concerns: cell membership (Redis, for fast set lookups), precise position (Cassandra, durable source of truth), and entity metadata (Cassandra, query-time hydration).

-- Cassandra entity table with geospatial fields
-- Partitioned by entity_type + geohash4 prefix for geographic locality
CREATE TABLE IF NOT EXISTS proximity.entities (
    entity_id     UUID,
    entity_type   TEXT,            -- 'driver', 'restaurant', 'courier'
    geohash4      TEXT,            -- precision-4 geohash, ~39km cell (shard key)
    geohash6      TEXT,            -- precision-6 geohash, ~1.2km cell
    lat           DOUBLE,
    lon           DOUBLE,
    name          TEXT,
    status        TEXT,            -- 'available', 'busy', 'offline'
    metadata      MAP<TEXT, TEXT>, -- flexible attributes
    created_at    TIMESTAMP,
    updated_at    TIMESTAMP,
    PRIMARY KEY ((entity_type, geohash4), entity_id)
) WITH CLUSTERING ORDER BY (entity_id ASC)
  AND compaction = {'class': 'LeveledCompactionStrategy'}
  AND gc_grace_seconds = 86400
  AND default_time_to_live = 0;

-- Secondary index for direct entity_id lookup (cross-partition)
CREATE INDEX ON proximity.entities (entity_id);

-- Location history table (separate, optional, with TTL)
CREATE TABLE IF NOT EXISTS proximity.location_history (
    entity_id   UUID,
    recorded_at TIMESTAMP,
    lat         DOUBLE,
    lon         DOUBLE,
    geohash6    TEXT,
    PRIMARY KEY (entity_id, recorded_at)
) WITH CLUSTERING ORDER BY (recorded_at DESC)
  AND default_time_to_live = 3600;  -- 1 hour history, then auto-expire

The Redis index structures:

geo:cell:{entity_type}:{geohash6}  -> SET of entity_ids
geo:loc:{entity_type}:{entity_id}  -> HASH {lat, lon, cell6, ts}
geo:shard:{geohash2_prefix}        -> ZSET (entity_id -> last_update_ts)

Partitioning strategy: Cassandra partitions by (entity_type, geohash4) so all entities of the same type in the same 39 km city-sized cell reside in the same Cassandra partition, enabling efficient multi-get by entity ID. The geohash4 prefix acts as a geographic colocation key. Redis shards by geohash2 prefix (the first two characters of the geohash, giving 1024 possible shard keys), distributing load evenly across Redis cluster nodes while keeping geographically nearby cells on the same node for pipeline efficiency.

Key Insight

Cassandra is the durable ground truth. Redis is the queryable cache of cell membership. When they diverge - due to a Redis node restart, a missed update, or a hot partition - the rebuild procedure is simply: scan Cassandra by entity_type, recompute geohash6 for each entity’s lat/lon, and repopulate Redis. This asymmetry (Redis as a derived view) makes the system far easier to reason about operationally.

Key Algorithms and Protocols

Geohash Encoding and Neighbor Expansion

Geohash encodes a (lat, lon) pair by recursively bisecting a bounding box - odd bits divide the longitude range, even bits divide the latitude range. The result is a Base32 string where shared prefix length indicates geographic proximity. Two hashes with the same 5-character prefix are within the same ~40 km cell; same 6 characters means within the same ~1.2 km cell.

The neighbor lookup algorithm computes the 8 adjacent cells for any given cell in O(1) time by manipulating the bit representation at the boundary. This is the core operation that enables proximity queries without a full scan:

# Geohash neighbor computation - O(1) bit manipulation
# Used to expand a center cell into a 3x3 grid for proximity coverage
NEIGHBOR_MAP = {
    "right":  {"even": "bc01fg45telegramhi89jkmns7", "odd": "p0r21436x8zb9dcf5h7kjnmqesgutwvy"},
    "left":   {"even": "238967debc01telegramfghi45jkmns", "odd": "14365h7k9dcfesgujnmqp0r2twvyx8telegramzb"},
    "top":    {"even": "p0r21436x8zb9dcf5h7kjnmqesgutwvy", "odd": "bc01fg45telegramhi89jkmns7"},
    "bottom": {"even": "14365h7k9dcfesgujnmqp0r2twvyx8zb", "odd": "238967debc01telegramfghi45jkmns"},
}

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_neighbor(geohash_str, direction):
    # Returns adjacent geohash cell in the given direction
    last_char = geohash_str[-1]
    parent = geohash_str[:-1]
    char_type = "odd" if len(geohash_str) % 2 else "even"

    neighbor_chars = NEIGHBOR_MAP[direction][char_type]
    if last_char in neighbor_chars[:8] and len(parent) > 0:
        parent = geohash_neighbor(parent, direction)

    return parent + BASE32[neighbor_chars.index(last_char)]

def get_all_neighbors(geohash_str):
    # Returns dict of all 8 neighboring cells
    n  = geohash_neighbor(geohash_str, "top")
    s  = geohash_neighbor(geohash_str, "bottom")
    e  = geohash_neighbor(geohash_str, "right")
    w  = geohash_neighbor(geohash_str, "left")
    ne = geohash_neighbor(e, "top")
    nw = geohash_neighbor(w, "top")
    se = geohash_neighbor(e, "bottom")
    sw = geohash_neighbor(w, "bottom")
    return {"n": n, "s": s, "e": e, "w": w, "ne": ne, "nw": nw, "se": se, "sw": sw}

Time complexity: O(precision) for encoding, O(1) for neighbor lookup given the character lookup tables. For precision 6, encoding takes under 1 microsecond in Python (negligible vs. network I/O).

Query Precision vs Recall

The key tradeoff in cell-based proximity is between precision (fraction of returned entities that are actually within the radius) and recall (fraction of in-radius entities that are returned).

Recall is straightforward: we always expand to all 8 neighbors of the center cell, ensuring 100% recall for any entity whose position is inside the query circle. There is a theoretical edge case at cell boundaries - an entity right on the border of a neighbor cell that falls just inside the query circle. The 9-cell expansion covers this entirely because any point within the query circle must be in either the center cell or one of its immediate 8 neighbors, by the geometric property that a geohash-6 cell diagonal (~1.34 km) is smaller than our minimum useful query radius (0.5 km) only if the query radius is less than ~0.4 km. For sub-0.5km queries we bump to precision 7 (171m x 86m cells) automatically.

Precision is controllable via the Haversine post-filter threshold. Setting the post-filter to exactly the query radius gives maximum precision with 100% recall. We can also introduce a small negative buffer (e.g., accept only entities within radius * 0.98) to reduce false positives from entities hugging the boundary - useful when the response payload is large.

Key Insight

Precision degradation at cell corners is bounded: the worst case false-positive rate from geohash-6 cell corners is under 22% of candidates (the corner triangle area as a fraction of cell area). After the Haversine filter, precision is 100%. The corner candidates add Redis lookup work but never corrupt the final result set.

Scaling and Performance

Geospatial proximity service sharding strategy showing geographic shard groups and hot cell detection

The system scales horizontally along geographic dimensions. The Redis geohash index shards by geohash-2 prefix, distributing the 1024 possible 2-character prefixes across Redis cluster nodes. Each node handles a geographically coherent subset of the world, so a multi-cell proximity query typically touches only 1 to 2 Redis nodes (the 9 query cells usually share a 2-character prefix).

Hot cell problem: Cities like Tokyo, Manhattan, or central London have dramatically higher entity density than rural areas. A geohash-6 cell in Times Square might contain 12,000 entities versus 3 in rural Montana. A SMEMBERS on a 12,000-element set takes ~2ms in Redis versus ~0.01ms for a 3-element set. We detect hot cells (SCARD exceeding a threshold, typically 5,000 members) and automatically promote them to precision-8 sub-indexing, where the hot cell is recursively divided into 32x32 = 1,024 finer cells.

Capacity estimation:

Given:
  - 500M active entities globally
  - Average 1 entity ID (16 bytes UUID) per Redis Set member
  - Geohash-6 produces ~600M distinct cells on Earth's land surface (75B / 1.2km^2 cells)
  - Average occupancy: ~0.83 entities/cell (sparse, long tail)
  - Hot cells: 99th percentile = 2,000 entities/cell, max = ~50,000

Redis memory (geohash index):
  - 500M entity IDs x 16 bytes = 8 GB raw
  - Redis Set overhead: ~80 bytes per set + 16 bytes per member
  - Occupied cells: ~100M cells with at least 1 entity = 8 GB overhead
  - Total Redis: ~18-25 GB (fits in 2x16GB Redis cluster nodes per region)

Redis memory (loc hash):
  - 500M entities x ~120 bytes/hash = 60 GB
  - Sharded across 8-10 Redis nodes x 8GB each = 64-80 GB cluster

Kafka write throughput:
  - 100M GPS pings/second x 60 bytes/message = 6 GB/second inbound
  - After coalescing (15% cell-change rate): ~15M index writes/second
  - Redis pipeline batches of 100 writes: 150,000 pipeline calls/second

Query throughput:
  - Assume 1M QPS (peak, all users refreshing maps simultaneously)
  - Each query: 9 Redis SMEMBERS + 1 Cassandra mget
  - Redis: 9M ops/second (standard Redis cluster handles 1M ops/second/node, need 9+ nodes)
  - Cassandra: 1M mget/second across 20-candidate reads = 20M reads/second (standard for Cassandra fleet)

Storage (Cassandra):
  - 500M entities x 200 bytes/row = 100 GB
  - Replication factor 3 = 300 GB total
  - Fits on 5-10 standard Cassandra nodes

The dominant bottleneck is Redis write throughput during peak location update windows. The Flink coalescing layer is the critical mitigation - without it, raw write pressure at 100M/second would require 100+ Redis nodes just for writes.

Real World

Uber’s H3-based geofencing system uses a tiered approach: a coarse R-tree (resolution 3, ~100km cells) eliminates 99.9% of non-candidate geofences before running fine-grained point-in-polygon checks. Grab’s GeoSearch service uses a similar two-pass approach with Redis Set membership for the first pass and PostGIS ST_Contains for the second. Both systems report p99 query latencies under 5ms at hundreds of millions of active entities.

Failure Modes and Recovery

FailureDetectionImpactRecovery
Redis node crashRedis Sentinel alerts, query latency spikeQueries return partial results (missing cells on failed node), precision dropsRedis Cluster automatic failover to replica in ~10s; rebuild index from Cassandra scan over ~30min
Kafka partition lagConsumer group lag metric exceeds 100k messagesLocation updates delayed; entities appear in stale cells beyond 5s SLOScale Flink parallelism, add Kafka partitions; no data loss since Kafka retains 24h
Cassandra node failureNodetool status, read timeout alertsEntity hydration returns partial results; Cassandra CL=QUORUM masks single-node failure transparentlyCassandra auto-repair, streaming from replica; read quorum still satisfied with 2/3 nodes
Hot cell spike (event)Cell SCARD monitor, Redis CPU spikeSingle Redis node CPU-bound on oversized Set; slow queries on affected cellAutomatic cell promotion to precision-8 sub-index; manual shard rebalance if needed
Location update stormKafka consumer lag, Redis write latencyIndex staleness exceeds SLO; GPS updates queuedRate limiting at ingestion service kicks in; Flink coalescing window extends automatically
Clock skew on entity updatesTimestamp delta monitoring per entityWrong entity position if stale update processed after fresh oneLast-write-wins by updated_at timestamp; NTP enforcement on all GPS clients (max 1s drift)
Watch Out

The most common operational mistake is letting the Redis index and Cassandra diverge silently. This happens when an index update write to Redis succeeds but the corresponding Cassandra write fails (or vice versa). Implement a daily reconciliation job that scans Cassandra, recomputes geohash for each entity, and patches any Redis cell discrepancies. Without this, stale entities accumulate in wrong cells and proximity results silently degrade.

Comparison of Approaches

ApproachQuery LatencyWrite ThroughputComplexityBest Fit
PostGIS R-tree (single node)1-5ms for small datasetsLow - R-tree updates are single-threadedLowStatic POIs, under 10M entities
Redis GEO commands (GEORADIUS)5-20ms at scaleMedium - ZADD is O(log N) per writeLowSimple apps, under 50M entities
Geohash + Redis Sets (this design)2-8ms p99High - SADD/SREM O(1) per cell, pipelinedMediumMoving entities, 100M-1B scale
H3 + Redis Sets2-8ms p99High - equivalent to geohash, uniform cellsMedium-HighDensity analytics, surge zones
Quadtree in memory (custom)1-3msMedium - tree rebalance on updatesVery HighIn-process, single datacenter
S2 Geometry (Google)2-10msHigh - cell hierarchy similar to H3HighComplex geofences, polygon queries

For this system we choose the Geohash + Redis Sets approach for moving entities and PostGIS R-tree for static POIs. The geohash approach gives O(1) per-cell writes (critical for 100M writes/second after coalescing), simple neighbor expansion, and a clear sharding strategy by prefix. PostGIS handles static data where write frequency is low and we want the full power of polygon operations, buffering, and spatial joins for analytics. H3 is added as a second index for surge zone computation and density heatmaps - workloads where its equidistant hexagonal properties justify the added complexity.

Key Takeaways

  • Geohash encodes 2D location into a 1D string prefix hierarchy, enabling O(1) neighbor expansion and set-based cell membership indexing that scales to hundreds of millions of entities.
  • H3 hexagonal indexing eliminates aspect ratio distortion in geohash cells, making it superior for density analytics and zone systems that require uniform spatial coverage.
  • R-trees are excellent for static or slowly-changing data but cannot absorb 100M+ writes per second without becoming a write bottleneck - separate your static and dynamic entity indexes.
  • Bounding box pre-filter is a two-pass strategy: cheap rectangular overlap check on the index, then expensive Haversine distance check only on the small survivor set.
  • Location update frequency must be decoupled from index write frequency - most GPS pings do not change cell membership, and writing them to the index wastes throughput without improving query accuracy.
  • Index update batching in the Flink coalescing layer reduces index write pressure 5-10x by emitting only cell-change events, not raw GPS events.
  • Query precision vs recall is a bounded tradeoff: 9-cell expansion guarantees 100% recall; Haversine post-filter restores 100% precision; the false positive rate from cell corners is at most ~22% before filtering.
  • Hot cells require automatic promotion to finer-grained sub-indexes - dense urban cells with thousands of entities blow up Redis Set latency without this mitigation.

The counter-intuitive lesson: the hardest part of this system is not the query algorithm - it is the write path. Geohash lookup is elegant and fast. Absorbing 100 million location updates per second without corrupting the index, without write amplification, and without stale data accumulation requires a purpose-built streaming pipeline with coalescing, rate limiting, and continuous reconciliation that together are more complex than the query logic they support.

Frequently Asked Questions

Q: Why not use PostgreSQL PostGIS for everything, including moving entities?

A: PostGIS with a GiST/R-tree index is excellent for static data but its index mutation throughput tops out around 10,000-50,000 updates per second on a single node before write contention on the index becomes the bottleneck. At 500 million entities each updating every 5 seconds, we need 100 million index writes per second - a 2,000-5,000x higher rate than PostGIS can handle. You could shard PostGIS across hundreds of nodes, but at that point you have the operational complexity of a distributed system without the performance characteristics of purpose-built in-memory structures. PostGIS is the right tool for polygon queries, spatial joins, and static entity search.

Q: Why not use Elasticsearch with geo_point type for proximity queries?

A: Elasticsearch uses a quad-tree based spatial index and supports geo_distance queries efficiently. The problem is write latency for index updates: Elasticsearch’s near-real-time (NRT) indexing refreshes every second by default, meaning updated entity positions are invisible to queries for up to 1 second. Reducing the refresh interval to 100ms increases write amplification significantly (more frequent segment flushes). More critically, Elasticsearch is not designed for the 100M writes/second scenario - it uses an LSM-based inverted index that works well for document indexing but is not the most efficient structure for pure set membership tracking of entity IDs.

Q: How do you handle cross-cell queries that span a shard boundary?

A: The 9-cell expansion for a proximity query might span multiple geohash-2 prefixes if the query point is near a prefix boundary. The Query Service detects when cells belong to different Redis shards and issues concurrent pipeline calls to each shard, then merges results. This adds one extra network round-trip (~0.5ms intra-cluster) in the worst case, which is acceptable. We could co-locate boundary cells by padding the shard key, but the implementation complexity outweighs the 0.5ms savings.

Q: Why not store entity positions in the Redis sorted set using geohash as the score?

A: This is exactly what Redis GEO (GEOADD, GEORADIUS) does internally. The problem at 500M entities is that a single sorted set would be enormous (500M members x ~60 bytes = 30 GB in a single Redis instance) and the GEORADIUS scan must traverse a continuous range of the sorted set corresponding to the bounding box. At 500M members, even the sorted set range scan becomes O(log N + K) where K is the number of matching members - and with 500M total members, even a small urban search returns thousands of candidates requiring serial score comparison. The Set per cell approach keeps individual set sizes small (average 1-2, max ~5,000 before cell promotion), making each SMEMBERS operation nearly O(1) in practice.

Q: How do you prevent the index from diverging between Redis and Cassandra during a partial write failure?

A: We use a write-ahead approach: the Cassandra write happens first with IF NOT EXISTS + LWT, and only after confirmation does the Redis index update proceed. If the Redis write fails after Cassandra succeeds, the daily reconciliation job catches the discrepancy. For the reverse case (Redis succeeds, Cassandra fails), the Redis entry is treated as unconfirmed and expires via a background TTL sweep. The system accepts eventual consistency windows up to 200ms on the index; the Cassandra record is always authoritative for the entity’s true position.

Q: Why not use a single unified spatial database like TigerGraph or Neo4j for both proximity and graph queries?

A: Graph databases are optimized for traversal operations along edges (friends of friends, supply chain paths). Proximity is fundamentally a spatial problem - two entities are “connected” not because of an explicit relationship but because of their physical coordinates. Mapping proximity to a graph requires materializing distance edges, which is a Cartesian product (O(N^2) edges for N entities) and completely infeasible at 500M scale. Purpose-built spatial indexes (geohash, H3, R-tree) exploit the geometric structure of the problem in ways a general graph database cannot.

Interview Questions

Q: Walk me through how you would design the data model and index structure for a “find nearby drivers” query at Uber scale.

Expected depth: Candidate should cover geohash or H3 for 2D to 1D encoding, Redis Set per cell for O(1) membership operations, neighbor expansion (8 or 6 neighbors depending on encoding), Haversine post-filter for false positive pruning, Cassandra or equivalent for durable entity storage, and sharding strategy by geohash prefix. Bonus: mention hot cell detection and precision-8 sub-indexing.

Q: How does your system handle 100 million location updates per second without the index falling behind?

Expected depth: Kafka partitioning by entity ID or geohash prefix for ordered processing, Flink coalescing window (100ms) applying last-write-wins per entity, cell-change detection to emit only ~15% of pings as actual index writes, adaptive client throttling for stationary entities, and per-entity server-side rate limiting. Candidate should quantify the 5-10x write reduction from coalescing.

Q: A user in Manhattan complains that nearby driver searches are slow. How do you diagnose and fix the hot cell problem?

Expected depth: SCARD monitoring on Redis cells to detect oversized sets, automatic threshold-based promotion to precision-8 sub-indexing for cells exceeding N members (typically 5,000), Redis shard rebalancing if the hot cell’s shard is CPU-bound, and potentially pre-computing “hot zone” result sets with a background process that refreshes every 500ms and serves from cache for popular query areas. Mention the tradeoff: precision-8 sub-indexing adds complexity but caps SMEMBERS latency.

Q: How would you extend this system to support polygon-based queries (e.g., “find all entities within this delivery zone shape”)?

Expected depth: Two-pass approach - use H3 k-ring or geohash cells that approximate the polygon bounding box for fast candidate retrieval, then use a point-in-polygon algorithm (ray casting or winding number) for exact containment. Pre-compute the set of H3 cells that overlap the polygon, store as a cached cell list. For complex polygons (concave, with holes), PostGIS ST_Within on a small candidate set is acceptable because the bounding-box pre-filter keeps the exact-test set small.

Q: How do you rebuild the geospatial index after a complete Redis cluster failure?

Expected depth: Full Cassandra scan (entity_type, geohash4) partitions in parallel across N workers, each worker reads its partition, recomputes geohash6 from lat/lon, issues Redis SADD and HSET commands in batches of 1000 via pipeline. At 500M entities and Cassandra read throughput of ~1M rows/second across a fleet, rebuild takes ~500 seconds (8-10 minutes). During rebuild, queries fall back to direct Cassandra spatial queries (slower but available). Redis cluster should be rebuilt partition by partition to allow progressive query restoration.

Premium Content

Unlock the full article along with everything else in the archive — all in one place.

In-depth analysis Expert insights Full archive access
Unlock Full Article