Build Google Search Autocomplete


scalability performance databases

System Design Deep Dive

Google Search Autocomplete

Serving 50ms prefix completions to billions of queries while injecting real-time trending signals - without blowing up your cache

⏱ 14 min read📐 Advanced🏗️ Autocomplete

Every time you type a single character into Google’s search box, a timer starts. You have 50 milliseconds before the user notices lag. In that window, the system must identify the prefix you typed, find the top 10 most relevant completions out of billions of historical queries, weigh in your personal search history, inject any trending signals from the last few minutes, deduplicate near-identical suggestions, and ship a JSON payload back to your browser. Repeat this for 99,000 other users typing simultaneously.

Think of the problem like a librarian who has memorized every book request ever made, knows what’s trending on social media right now, remembers what you personally checked out last week, and still needs to answer your half-finished question in the time it takes to blink. The naive answer - just grep a query log - fails the moment you try to serve millions of users. A full-text search engine is too slow. A simple hash map can’t handle prefix matching. Even a raw trie traversal is too expensive when the tree has hundreds of millions of nodes and you need to score and rank the entire subtree on each keystroke.

Three forces are in direct tension here. Latency demands that results come from memory - no disk reads, no network hops to a single backend. Freshness demands that when “World Cup 2026” starts trending, it appears in suggestions within minutes, which means the in-memory index must be mutable. And personalization demands per-user state that cannot be shared across the cache layer without defeating the purpose of caching. A cold cache for a popular prefix will cascade into thousands of simultaneous trie traversals, which is exactly the thundering herd problem dressed in autocomplete clothes.

We need to solve for O(1) prefix lookup at query time, near-real-time trending signal injection, and per-user result re-ranking simultaneously - while keeping the primary read path under 50ms at p99.

Requirements and Constraints

Functional Requirements

  • Return the top 10 completions for any search prefix typed by a user
  • Completions ranked by a blend of global query frequency and personalized user history
  • Trending queries from the last 5-15 minutes must appear in suggestions when relevant
  • Results must be deduplicated - no near-identical suggestions (e.g., “google maps” and “google map directions” should not both appear if space is tight)
  • Support at least English; architecture must be extensible to Unicode prefix matching
  • Suggestions must filter out queries that violate content policy

Non-Functional Requirements

  • Latency: p99 under 50ms end-to-end including network; p50 under 20ms
  • Throughput: 100,000 suggestion requests per second at peak; Google processes roughly 8.5 billion searches per day, meaning autocomplete fires 3-5x more often than search submissions
  • Availability: 99.99% uptime - autocomplete is on the critical user interaction path
  • Freshness: Trending queries injected within 10 minutes of spike onset
  • Storage: Index covers top 5 billion distinct query strings; personalization state for 4 billion active users

Constraints and Scope

  • We are not building the spell-correction or query understanding layer - those are downstream
  • We assume the content moderation pipeline runs asynchronously and flags are pre-applied to the index
  • We assume user auth tokens are passed on the suggestion request for personalization lookups
  • Geography-based trending (e.g., local trending) is out of scope for v1 but the architecture must support it

High-Level Architecture

The system decomposes into five distinct planes: a client debounce layer that throttles keypress events to reduce request volume, an API gateway handling auth and rate limiting, a suggestion service that orchestrates the read path, an in-memory trie index partitioned by prefix shard, and an async trending pipeline that continuously refreshes the index with real-time frequency scores.

Google Search Autocomplete system architecture overview

On the read path, a user’s keypress fires a debounced HTTP request to the API gateway. The gateway routes the request to the correct suggestion service instance based on the first two characters of the prefix - this is the fundamental sharding unit. The suggestion service first checks a Redis cache keyed on suggest:{prefix}:{user_id} for a personalized hit, or suggest:{prefix} for the anonymous version. On a cache miss, it fans out concurrently to the trie index service (for global candidates) and the personalization service (for per-user boost signals). Results are merged, re-ranked, deduplicated, and returned. Simultaneously, every query event is published to Kafka, where a Flink streaming job aggregates 5-minute rolling windows and feeds updated frequency scores back to the trie index.

Each component has a single responsibility. The Trie Index Service holds the in-memory prefix tree and exposes a GetTopK(prefix, k) RPC. The Personalization Service reads from a user profile store and returns a map of query strings to boost multipliers. The Suggestion Service merges and ranks. The Cache Layer (Redis cluster) absorbs the read amplification. The Trending Pipeline (Kafka + Flink + Trie Updater) closes the feedback loop.

Key Insight

The trie stores pre-computed top-K results at every node, not just at leaf nodes - this makes GetTopK O(prefix_length) rather than O(subtree_size), trading memory for the latency budget we can’t afford to spend at query time.

The Trie Index Service

The trie index service is the heart of the read path - its job is to take a prefix string and return the top-K globally ranked completions in microseconds, not milliseconds.

A library card catalog is the right mental model. If you ask for every book starting with “Ga”, the librarian doesn’t read every card - they walk straight to the “Ga” drawer and pull the pre-sorted top titles. The trie is that catalog, except each drawer (node) already has a pre-sorted list of the most popular entries in its entire subtree. You never need to open child drawers at query time.

The critical insight most engineers miss: a naive trie lets you find all strings with a given prefix in O(subtree_size) time. That’s unacceptable when “go” has hundreds of millions of descendant queries. The solution is cached top-K at every node. Each trie node stores not just its own query data, but a pre-merged sorted list of the top-K completions reachable from that node. When GetTopK("go", 10) arrives, we traverse 2 nodes and read the pre-computed list. O(prefix_length). No subtree traversal.

Trie data structure with top-K caching at each node
// Trie node with cached top-K completions - no subtree scan at query time
package trie

import "sync"

type Suggestion struct {
	Text        string
	GlobalScore float64
	TrendBoost  float64
	FinalScore  float64
}

type TrieNode struct {
	mu          sync.RWMutex
	children    map[rune]*TrieNode
	isEnd       bool
	topK        []Suggestion // pre-computed, sorted descending by FinalScore
	lastUpdated int64        // unix millis - used by updater to skip unchanged nodes
}

type Trie struct {
	root *TrieNode
}

func NewTrie() *Trie {
	return &Trie{root: &TrieNode{children: make(map[rune]*TrieNode)}}
}

// GetTopK returns cached top-k without subtree traversal
func (t *Trie) GetTopK(prefix string, k int) []Suggestion {
	node := t.root
	for _, ch := range prefix {
		node.mu.RLock()
		child, ok := node.children[ch]
		node.mu.RUnlock()
		if !ok {
			return nil
		}
		node = child
	}
	node.mu.RLock()
	defer node.mu.RUnlock()
	if len(node.topK) <= k {
		return node.topK
	}
	result := make([]Suggestion, k)
	copy(result, node.topK[:k])
	return result
}

The trie is partitioned across N shard servers using the first two characters of the prefix as the shard key - shard_id = hashFNV(prefix[0:2]) % num_shards. Each shard server holds a complete in-memory trie for its prefix range. With 5 billion query strings averaging 25 bytes each, total raw data is about 125 GB. With top-K lists at every node (roughly 100 million internal nodes, each storing 10 suggestions of ~50 bytes), the in-memory overhead grows to approximately 50 GB per shard for a 26-shard deployment. This is within range for a high-memory instance.

Watch Out

Updating top-K lists requires locking or copy-on-write semantics on the trie. A naive global lock during index rebuild causes a complete read stall for 30-60 seconds. Always build a shadow trie, swap the pointer atomically, and discard the old trie - never mutate the live tree under a reader lock.

The Query Frequency Scoring Layer

The frequency scoring layer’s job is to translate raw query occurrence counts into a stable, normalizable score that survives both volume spikes and the long tail.

Think of it like a Billboard Hot 100 chart: raw stream counts matter, but a song that was #1 for 20 consecutive weeks counts differently from one that spiked this morning. The scoring function must blend historical authority with recency.

The core formula for a query string q is:

# Query scoring function - blends historical frequency with time-decay and trend boost
import math

def compute_score(
    raw_count: int,
    days_active: int,
    click_through_rate: float,
    trend_boost: float,
    content_quality_penalty: float = 1.0,
) -> float:
    """
    raw_count: total historical query submissions
    days_active: number of days since query was first seen
    click_through_rate: fraction of times this suggestion was clicked when shown
    trend_boost: multiplier from trending pipeline (1.0 = neutral, >1.0 = trending)
    content_quality_penalty: 0.0 to 1.0, 1.0 = no penalty
    """
    # Log-normalize to compress the power law distribution
    log_count = math.log(1 + raw_count)

    # Time decay: older queries lose authority slowly (half-life ~365 days)
    time_decay = 1.0 / (1.0 + math.exp(-0.005 * (days_active - 180)))

    # CTR signal: queries people actually click on are more useful
    ctr_signal = 1.0 + (click_through_rate * 0.5)

    base_score = log_count * time_decay * ctr_signal * content_quality_penalty
    return base_score * trend_boost


def compute_personalized_score(
    base_score: float,
    user_query_count: int,
    user_last_used_days_ago: int,
) -> float:
    """Re-rank for a specific user on top of the global base score."""
    if user_query_count == 0:
        return base_score

    personal_boost = math.log(1 + user_query_count) * 2.0
    # Recency bonus: used in last 7 days gets a strong boost
    recency_multiplier = 3.0 if user_last_used_days_ago <= 7 else 1.2
    return base_score + (personal_boost * recency_multiplier)

The log normalization is essential. Query frequencies follow a power law - “google” has 10 billion occurrences, while “google search autocomplete system design” has 400. Without log normalization, the top-K list for every prefix would be dominated by the same short, generic queries and personalization signals would be too small to change the ordering.

Key Insight

Log-normalizing raw counts before combining with personalization boosts ensures that a user who searched “python dataclasses” 5 times can actually see it promoted above “python” (with 10 billion raw occurrences) in their personal results - without log-scaling, personal signals are always dwarfed.

The Personalization Layer

The personalization layer’s job is to take a global top-10 list and re-rank it for a specific user without adding more than 5ms to the total request latency.

The layer reads from a user profile store that maps (user_id, query_prefix) to a list of (query_string, use_count, last_used_ts) tuples. It does not run a full re-ranking pass from scratch - instead, it applies a boost multiplier to queries already in the global top-10, and can elevate personal queries from a secondary list if the user has a strong history signal.

# Personalization merge - boost global results with user signals
from dataclasses import dataclass
from typing import List, Dict
import time

@dataclass
class SuggestionResult:
    text: str
    score: float
    source: str  # "global" | "personal" | "trending"

def merge_personalized_results(
    global_top10: List[SuggestionResult],
    user_history: Dict[str, dict],  # query_text -> {count, last_used_ts}
    max_results: int = 10,
) -> List[SuggestionResult]:
    """
    Merges global suggestions with user history.
    User queries that appear in global top10 get a boost.
    User queries not in global top10 may be inserted if boost is strong enough.
    """
    now_ts = int(time.time())
    scored = []

    for suggestion in global_top10:
        boosted = suggestion.score
        if suggestion.text in user_history:
            h = user_history[suggestion.text]
            days_since = (now_ts - h["last_used_ts"]) / 86400
            personal_mult = min(3.0, 1.0 + (h["count"] / 10.0))
            recency_mult = 2.5 if days_since < 7 else (1.5 if days_since < 30 else 1.0)
            boosted = suggestion.score * personal_mult * recency_mult
        scored.append(SuggestionResult(suggestion.text, boosted, "global"))

    # Surface personal queries not already in global list
    global_texts = {s.text for s in global_top10}
    for query_text, hist in user_history.items():
        if query_text not in global_texts:
            days_since = (now_ts - hist["last_used_ts"]) / 86400
            if hist["count"] >= 3 and days_since < 30:
                personal_score = math.log(1 + hist["count"]) * 2.0
                scored.append(SuggestionResult(query_text, personal_score, "personal"))

    scored.sort(key=lambda s: s.score, reverse=True)
    return scored[:max_results]

The user profile store needs sub-5ms reads. Cassandra with a partition key of user_id and a clustering key of prefix works well here. For 4 billion users, each storing up to 500 prefix-query pairs, total storage is around 2 TB - manageable in a standard Cassandra cluster.

Real World

Google’s autocomplete personalization uses a client-side component as well: recent searches are stored locally in the browser and merged with server suggestions client-side, which means the personalization latency for recently used queries is effectively 0ms - no server round trip needed.

The trending signal pipeline’s job is to detect queries that are spiking in volume right now and inject a boost multiplier into the trie index within minutes, not hours.

Think of it like a stock ticker: the trending pipeline is a continuous feed of real-time prices, and the trie’s stored scores are the end-of-day close prices. The suggestion service blends both - the stable historical authority and the live momentum signal.

The pipeline is: Kafka (raw query events) -> Flink (windowed aggregation) -> Trie Updater (score injection) -> Redis invalidation (cache purge on hot prefixes).

// Flink streaming job for trending score aggregation - 5-minute tumbling window
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.time.Time;

public class TrendingScoreJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Source: Kafka topic with raw query events
        DataStream<QueryEvent> queries = env
            .addSource(new FlinkKafkaConsumer<>(
                "query-events",
                new QueryEventDeserializer(),
                kafkaProps
            ));

        // 5-minute tumbling window: count queries per query_text
        DataStream<TrendingScore> trending = queries
            .map(e -> Tuple2.of(e.queryText, 1L))
            .keyBy(t -> t.f0)
            .window(TumblingEventTimeWindows.of(Time.minutes(5)))
            .sum(1)
            .map(t -> {
                // Compare to previous 5-minute window to compute velocity
                long prevCount = windowStateStore.get(t.f0);
                double velocity = prevCount > 0
                    ? (double)(t.f1 - prevCount) / prevCount
                    : 1.0;
                double trendBoost = velocity > 2.0
                    ? Math.min(5.0, 1.0 + Math.log(velocity))
                    : 1.0;
                windowStateStore.put(t.f0, t.f1);
                return new TrendingScore(t.f0, trendBoost, System.currentTimeMillis());
            });

        // Sink: write to Trie Updater service via Kafka
        trending.addSink(new FlinkKafkaProducer<>("trending-scores", new TrendingScoreSerializer(), kafkaProps));
        env.execute("Trending Score Aggregation");
    }
}

The Trie Updater service consumes trending-scores and applies the boost multipliers to affected trie nodes. Critically, updating a score for “world cup 2026” requires re-computing top-K for every ancestor node up to the root - 15 nodes for a 15-character query. This is the bottom-up propagation step and it must happen atomically on each node using copy-on-write.

Watch Out

Do not apply trending boosts to the global trie and cache simultaneously in the same transaction. Update the trie first, then invalidate the relevant cache keys. If you do it in reverse, you’ll repopulate the cache with stale scores before the trie update lands, locking in stale results for the full TTL.

The Cache Warming Strategy

The cache layer’s job is to absorb the 100,000 QPS read volume so that trie traversals happen on cache misses only - which should be less than 5% of total traffic.

A cold cache on deploy is like opening a new library branch on day one with no books on the shelves: every visitor waits while staff manually fetches books from the warehouse. The solution is cache warming - pre-populating the cache before any user traffic arrives.

# Cache warming script - pre-populate top prefixes before traffic starts
import redis
import itertools
import string
from concurrent.futures import ThreadPoolExecutor

def warm_cache(
    redis_client: redis.Redis,
    trie_service,
    top_prefix_count: int = 50000,
    workers: int = 20,
) -> None:
    """
    Generates the most common 1-, 2-, and 3-character prefixes
    and pre-populates the Redis cache with top-10 suggestions.
    """
    chars = string.ascii_lowercase + string.digits

    # All 1-char and 2-char prefixes
    prefixes_1 = list(chars)
    prefixes_2 = [''.join(p) for p in itertools.product(chars, repeat=2)]
    # Top 3-char prefixes from historical frequency data
    prefixes_3 = load_top_prefixes_from_analytics(limit=top_prefix_count)

    all_prefixes = prefixes_1 + prefixes_2 + prefixes_3

    def warm_single(prefix: str) -> None:
        suggestions = trie_service.get_top_k(prefix, k=10)
        if suggestions:
            cache_key = f"suggest:{prefix}"
            pipe = redis_client.pipeline()
            pipe.set(cache_key, serialize(suggestions))
            pipe.expire(cache_key, 3600)  # 1 hour TTL for warmed entries
            pipe.execute()

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(warm_single, all_prefixes))

    print(f"Warmed {len(all_prefixes)} prefixes")

The TTL strategy is not uniform. Short, highly popular prefixes like “go” or “th” get a 300-second TTL and are refreshed continuously by real traffic. Longer 5-character prefixes get a 3600-second TTL. Personalized cache entries (suggest:{prefix}:{user_id}) get a 60-second TTL since they reflect recent user behavior that can change quickly.

Real World

Bing’s autocomplete uses a hierarchical cache where the CDN edge layer caches non-personalized suggestions for common prefixes, and the origin layer handles personalized queries. This means users without login state get suggestions from the edge in under 10ms globally, while logged-in users pay the extra round trip to origin for personalization.

The Result Deduplication Step

The deduplication step’s job is to ensure the final 10 suggestions shown to the user are semantically distinct - not just syntactically unique strings.

Without deduplication, the top-10 for prefix “python” might include “python”, “python 3”, “python 3.11”, “python 3.12”, “python 3.12.3”, “python 3.12 download” - all near-identical from a user intent perspective. The user sees an unhelpful list. Deduplication collapses these into a representative subset.

The approach is edit-distance clustering with a threshold, applied only to the final top-N candidates (not to the full trie subtree):

# Result deduplication using edit-distance clustering on the final candidate set
from difflib import SequenceMatcher
from typing import List

def deduplicate_suggestions(
    candidates: List[str],
    max_results: int = 10,
    similarity_threshold: float = 0.82,
) -> List[str]:
    """
    Remove near-duplicate suggestions using sequence similarity.
    Keeps the highest-scoring candidate from each similarity cluster.
    Candidates must be pre-sorted by score (highest first).
    """
    kept: List[str] = []

    for candidate in candidates:
        is_duplicate = False
        for kept_item in kept:
            ratio = SequenceMatcher(None, candidate, kept_item).ratio()
            if ratio >= similarity_threshold:
                is_duplicate = True
                break
        if not is_duplicate:
            kept.append(candidate)
        if len(kept) >= max_results:
            break

    return kept

This runs on the merged candidate list of 20-30 results after scoring, not on the raw trie output. The O(n^2) pairwise comparison is acceptable at this small scale - 30 candidates with strings under 100 characters is trivially fast. The similarity threshold of 0.82 is tunable; lower values are more aggressive at collapsing near-duplicates.

Data Model

-- Core schema for the autocomplete system

-- Query frequency store (replicated to trie at index build time)
CREATE TABLE query_frequencies (
    query_text       TEXT         NOT NULL,
    raw_count        BIGINT       NOT NULL DEFAULT 0,
    click_count      BIGINT       NOT NULL DEFAULT 0,
    first_seen_date  DATE         NOT NULL,
    last_seen_ts     TIMESTAMPTZ  NOT NULL,
    global_score     DOUBLE PRECISION NOT NULL DEFAULT 0.0,
    is_blocked       BOOLEAN      NOT NULL DEFAULT FALSE,
    PRIMARY KEY (query_text)
);

CREATE INDEX idx_qf_score ON query_frequencies (global_score DESC)
    WHERE is_blocked = FALSE;

-- User search history for personalization
-- Partitioned by user_id for Cassandra-style horizontal scaling
CREATE TABLE user_search_history (
    user_id          UUID         NOT NULL,
    query_text       TEXT         NOT NULL,
    search_count     INT          NOT NULL DEFAULT 1,
    last_searched_ts TIMESTAMPTZ  NOT NULL,
    click_through    BOOLEAN      NOT NULL DEFAULT FALSE,
    PRIMARY KEY (user_id, query_text)
);

CREATE INDEX idx_ush_user_ts ON user_search_history (user_id, last_searched_ts DESC);

-- Trending scores (written by Flink, read by Trie Updater)
CREATE TABLE trending_scores (
    query_text       TEXT         NOT NULL,
    window_start_ts  TIMESTAMPTZ  NOT NULL,
    window_count     BIGINT       NOT NULL,
    previous_count   BIGINT       NOT NULL DEFAULT 0,
    velocity         DOUBLE PRECISION NOT NULL,
    trend_boost      DOUBLE PRECISION NOT NULL DEFAULT 1.0,
    PRIMARY KEY (query_text, window_start_ts)
);

CREATE INDEX idx_ts_boost ON trending_scores (trend_boost DESC, window_start_ts DESC);

-- Cache invalidation log (used by Trie Updater to purge Redis on score changes)
CREATE TABLE cache_invalidation_queue (
    id               BIGSERIAL    PRIMARY KEY,
    prefix           TEXT         NOT NULL,
    reason           TEXT         NOT NULL,
    created_ts       TIMESTAMPTZ  NOT NULL DEFAULT NOW(),
    processed_ts     TIMESTAMPTZ
);

The query_frequencies table is the source of truth for the offline index rebuild. user_search_history is partitioned by user_id and lives in Cassandra in production - the SQL DDL above represents the logical schema. trending_scores is written by the Flink job every 5 minutes and consumed by the Trie Updater.

The sharding key for the trie index is prefix[0:2] - the first two characters of the prefix string. This distributes load across 26x26 = 676 possible buckets, typically consolidated into 32-64 shard servers. The partition key for user_search_history in Cassandra is user_id - reads always target a single user’s history, so this is the natural partition unit.

Data flow through the autocomplete system from keypress to response
Key Insight

The data model separates global scores (updated in batch every 24 hours) from trending boosts (updated every 5 minutes) and personalization (real-time per request). This separation allows each update frequency to evolve independently without contention on the same rows.

Key Algorithms and Protocols

Prefix Index Sharding via Consistent Hashing on Bigrams

The routing function maps a prefix to a trie shard. We use a simple modulo hash on the first two characters rather than consistent hashing, because prefix assignment is static - we never move prefix ownership between shards without a full rebuild.

// Shard routing for prefix lookups - deterministic, no rehashing needed
package router

import (
	"hash/fnv"
)

type ShardRouter struct {
	numShards int
	shards    []string // shard addresses
}

func NewShardRouter(shards []string) *ShardRouter {
	return &ShardRouter{numShards: len(shards), shards: shards}
}

func (r *ShardRouter) GetShard(prefix string) string {
	if len(prefix) == 0 {
		return r.shards[0]
	}
	key := prefix
	if len(prefix) > 2 {
		key = prefix[:2] // shard only on first 2 chars
	}
	h := fnv.New32a()
	h.Write([]byte(key))
	shardIdx := int(h.Sum32()) % r.numShards
	return r.shards[shardIdx]
}

func (r *ShardRouter) GetShardIndex(prefix string) int {
	if len(prefix) == 0 {
		return 0
	}
	key := prefix
	if len(prefix) > 2 {
		key = prefix[:2]
	}
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32()) % r.numShards
}

Time complexity: O(1) per routing decision. Space: O(num_shards) for the routing table.

Bottom-Up Top-K Propagation After Score Update

When a trending query’s score changes, every ancestor node in the trie must have its top-K list updated. The algorithm walks from the updated leaf up to the root, re-merging children’s top-K lists at each level.

// Bottom-up top-K propagation after a score update on a single query
func (t *Trie) UpdateQueryScore(query string, newScore float64) {
	// Find the path from root to the terminal node
	path := make([]*TrieNode, 0, len(query)+1)
	node := t.root
	path = append(path, node)

	for _, ch := range query {
		node.mu.RLock()
		child, ok := node.children[ch]
		node.mu.RUnlock()
		if !ok {
			return // query not in trie, nothing to update
		}
		node = child
		path = append(path, node)
	}

	// Update score at the terminal node
	terminal := path[len(path)-1]
	terminal.mu.Lock()
	for i := range terminal.topK {
		if terminal.topK[i].Text == query {
			terminal.topK[i].FinalScore = newScore
			break
		}
	}
	sortTopK(terminal.topK)
	terminal.mu.Unlock()

	// Propagate upward: re-merge top-K at each ancestor
	for i := len(path) - 2; i >= 0; i-- {
		parent := path[i]
		parent.mu.Lock()
		parent.topK = mergeChildrenTopK(parent, 50) // keep top-50 internally, serve top-10
		parent.mu.Unlock()
	}
}

func mergeChildrenTopK(node *TrieNode, k int) []Suggestion {
	merged := make([]Suggestion, 0, k*len(node.children))
	for _, child := range node.children {
		child.mu.RLock()
		merged = append(merged, child.topK...)
		child.mu.RUnlock()
	}
	sortTopK(merged)
	if len(merged) > k {
		return merged[:k]
	}
	return merged
}

Time complexity: O(prefix_length * K * fan_out) per update, where fan_out is the average number of children at each level. At depth 2, fan_out is up to 36 (alphanumeric). Keeping 50 internal candidates and serving top-10 to users provides headroom for personalization to re-order results.

Key Insight

Keeping more candidates internally (top-50) than are served externally (top-10) is what makes personalization possible without a second round-trip to the trie - the personalization layer can surface a personal result from positions 11-50 without needing a new prefix lookup.

Scaling and Performance

Prefix sharding strategy and horizontal scaling for autocomplete

Capacity Estimation

Given:
  - 100,000 suggestion requests/second at peak
  - Average prefix length: 3 characters
  - Response payload: 10 suggestions * 50 bytes = 500 bytes
  - 4 billion active users, 500 personal queries per user
  - 5 billion distinct global queries, avg 25 bytes each

Read throughput:
  - 100,000 req/s * 500 bytes = 50 MB/s outbound
  - At p95 cache hit rate of 92%: ~8,000 req/s hit the trie directly
  - Each trie lookup: ~2ms -> 8,000 * 2ms = 16 CPU-core-seconds/s across shard fleet

Trie memory per shard (32 shards):
  - Global queries per shard: 5B / 32 = 156M queries
  - Trie nodes: ~156M * 1.3 overhead = ~200M nodes
  - Each node: 50 bytes struct + 10 suggestions * 50 bytes = 550 bytes
  - Per shard: 200M * 550 bytes = ~110 GB RAM

Personalization storage:
  - 4B users * 500 queries * 80 bytes/entry = 160 TB (Cassandra)
  - Hot working set (~5% of users active at peak): 8 TB Redis cache

Cache storage (Redis):
  - Top 50,000 prefixes * 500 bytes/entry = 25 MB (trivial)
  - Personalized prefix entries: 100M users * top-20 prefixes * 500 bytes = ~1 TB
  - Redis cluster: 10 nodes * 128 GB = 1.28 TB capacity

Trending pipeline:
  - 100,000 queries/s into Kafka = 5 MB/s write throughput
  - Flink processes 5-minute windows: 30M events/window
  - Score updates per window: ~500,000 changed queries -> ~500K trie updates

The dominant bottleneck at this scale is trie memory. With 110 GB per shard, each shard server needs 128-256 GB of RAM. A 32-shard deployment requires 32 high-memory instances. This is where vertical scaling (bigger instances) wins over horizontal scaling - splitting into more shards increases routing complexity without reducing per-shard memory, because each shard still needs the full top-K structure for its prefix range.

The secondary bottleneck is Redis under cache invalidation bursts. When a trending query spikes, every cached prefix that is a substring of that query needs invalidation. “world cup” trending means invalidating “w”, “wo”, “wor”, “worl”, “world”, “world ”, “world c”, “world cu”, “world cup” - 9 cache keys. With 500,000 trending queries changing per 5-minute window, that’s up to 4.5 million Redis invalidations per window, or 15,000 DEL operations per second. Redis handles this comfortably at cluster scale, but the key is batching invalidations rather than sending individual DEL commands.

# Batched Redis cache invalidation for trending score changes
import redis
from typing import List

def invalidate_prefix_cache_batch(
    redis_cluster: redis.Redis,
    changed_queries: List[str],
    batch_size: int = 500,
) -> int:
    """
    For each changed query, invalidate all prefix cache keys.
    Returns total number of keys deleted.
    """
    total_deleted = 0
    keys_to_delete = set()

    for query in changed_queries:
        # Add all prefix cache keys for this query
        for length in range(1, min(len(query) + 1, 8)):
            prefix = query[:length]
            keys_to_delete.add(f"suggest:{prefix}")

    key_list = list(keys_to_delete)
    for i in range(0, len(key_list), batch_size):
        batch = key_list[i:i + batch_size]
        deleted = redis_cluster.delete(*batch)
        total_deleted += deleted

    return total_deleted
Real World

Elasticsearch’s Completion Suggester, used by many production autocomplete systems, stores a finite-state transducer (FST) rather than a standard trie. FSTs are memory-compressed trie variants that can represent the same prefix structure in 10-20x less memory - Google’s internal systems use similar FST-based indices to fit the full global query set on fewer machines.

Failure Modes and Recovery

FailureDetectionImpactRecovery
Trie shard crashHealth check fails, p99 latency spike on affected prefix rangeAll queries starting with affected bigrams return empty or stale resultsReplica promotes to primary within 30s; trie rebuild from checkpoint takes 5-15 min
Redis cache cluster partial failureCache hit rate drops below 80% (alert threshold)Trie hit rate spikes; latency increases to 20-40ms but stays functionalRedis Cluster automatic slot migration; backfill from surviving nodes
Flink job lag (trending pipeline stalls)Kafka consumer lag exceeds 10 million messagesTrending scores freeze; trending queries stop getting boostsRestart Flink job from last checkpoint; replay buffered Kafka events
Personalization service timeoutp99 latency on personalization RPC exceeds 10msDegrade gracefully: return global top-10 without personalizationCircuit breaker opens; suggestion service falls back to global results for 30s
Cache stampede on popular prefixSudden p99 spike when cache key expires for “go”, “th”, etc.Thousands of concurrent trie lookups; shard CPU spikesJittered TTL (add random 0-60s to base TTL); serve stale while recomputing
Trie updater falls behindUpdate queue depth exceeds 1 million; trending boosts age > 30 minutesStale trending scores; trending queries lose boost visibilityScale updater horizontally; partition update queue by prefix shard
Watch Out

The most common operational mistake is setting a uniform TTL for all cache keys. Short-prefix keys (“g”, “go”) are fetched millions of times per second and must have short TTLs (60-120s) with active refresh-ahead logic. Long-prefix keys (“googlr”) are rarely fetched and can tolerate 1-hour TTLs. A uniform 300s TTL on all keys causes stampedes on short prefixes and wastes memory on long ones.

Comparison of Approaches

ApproachLatencyMemory UsageUpdate ComplexityBest Fit
Trie with top-K caching at every nodeO(prefix_len), sub-msVery high (top-K per node)Complex - bottom-up propagationHigh-volume prod autocomplete
Raw trie + subtree traversal at query timeO(subtree_size), 10-100msLow (no cached lists)Simple - update leaf onlySmall dataset, low QPS
Inverted index (Elasticsearch Completion)5-20ms (FST lookup)Medium (FST compressed)Medium - rebuild FST on updateGeneral purpose, moderate QPS
Database full-text prefix search (LIKE ‘go%‘)50-500msNone (no index)NonePrototypes only
Redis Sorted Sets per prefixO(1), sub-msVery high (set per prefix)Simple - ZADD on writeSmall prefix universe only
Approximate nearest neighbor (vector similarity)5-15msHigh (embedding vectors)Complex - re-embed on changeSemantic / fuzzy autocomplete

The trie with top-K caching wins for a Google-scale system because it is the only approach that delivers sub-millisecond lookup times regardless of how many queries share a prefix. Redis sorted sets per prefix would work but require storing one sorted set per possible prefix string - at billions of distinct prefixes, the memory cost is prohibitive. Elasticsearch’s FST is the right choice for teams without the infrastructure to maintain a custom trie service; it delivers acceptable latency at manageable memory cost.

Key Takeaways

  • Trie top-K caching: Storing pre-computed top-K lists at every internal node converts prefix lookup from O(subtree_size) to O(prefix_length), which is the foundational optimization that makes sub-50ms latency achievable at scale.
  • Log-normalized scoring: Raw query frequency follows a power law; log-normalizing before combining with personalization boosts ensures personal signals are not dwarfed by globally popular queries.
  • Prefix sharding on bigrams: Partitioning the trie on the first two characters of the prefix distributes load across shards with a static, deterministic routing function - no consistent hash ring needed.
  • Hierarchical TTL caching: Short prefixes get short TTLs with refresh-ahead; long prefixes tolerate long TTLs. A uniform TTL strategy causes stampedes or memory waste depending on which direction you tune.
  • Personalization without extra round trips: Keeping top-50 internal candidates per node (serving top-10) allows the personalization layer to surface personal results by re-ranking the expanded set without a second trie lookup.
  • Trending via velocity, not absolute counts: A query that doubled in volume in 5 minutes is trending; a query that has always had high volume is not. The Flink aggregation computes rate-of-change, not raw count.
  • Result deduplication as a final gate: Near-duplicate suggestions (high edit-distance similarity) are collapsed at the end of the pipeline, after scoring and personalization, to ensure the top-10 represent genuinely distinct user intents.
  • Graceful degradation on personalization failure: The suggestion service must be able to return globally ranked results when the personalization service is unavailable - a circuit breaker here protects latency SLAs during personalization outages.

The counter-intuitive lesson in this system is that the expensive data structure (trie with top-K at every node) is also the one that enables the most aggressive caching. By pre-computing results at index build time, each query does almost no work at request time - the expensive computation was already done. Systems that defer work to query time always struggle to cache effectively, because the query-time computation is too late to cache its inputs.

Frequently Asked Questions

Q: Why not just use Elasticsearch for this? A: Elasticsearch’s Completion Suggester works well up to a few hundred thousand QPS. Beyond that, the FST lookup overhead and JVM GC pauses add unpredictable tail latency. Google-scale autocomplete needs sub-millisecond trie lookups with zero GC pauses, which requires a custom in-memory service with manual memory management. For most production systems handling under 50,000 QPS, Elasticsearch is the right answer.

Q: How do you handle Unicode and multi-language queries? A: The trie key is a Unicode code point sequence, not ASCII bytes. The shard key uses the first two Unicode code points instead of the first two bytes. For CJK characters, a single code point can encode a full word (logographic), so the bigram prefix might only cover 2 characters of semantic content - in practice, CJK autocomplete uses trigrams for sharding and the trie structure remains identical.

Q: How do you bootstrap the trie when deploying a new shard? A: The offline build process generates a serialized trie snapshot from the query_frequencies table. A new shard downloads the snapshot, loads it into memory (5-15 minutes for 110 GB), and starts serving traffic. During the load window, the router sends traffic to an existing replica. Once loaded, the new shard joins the pool and replicas are rebalanced. No downtime.

Q: Why not use CDN caching for all autocomplete requests? A: Non-personalized suggestions for the top 10,000 prefixes can and should be CDN-cached - this eliminates origin load for anonymous users. But personalized suggestions cannot be CDN-cached without routing all requests for a given user to the same CDN PoP, which defeats geographic load distribution. The practical split is: anonymous -> CDN cache; authenticated -> origin with Redis cache.

Q: How does the system handle content policy violations in real time? A: A is_blocked flag on the query_frequencies table gates suggestions at trie build time. For real-time blocks (e.g., breaking news events), the moderation system writes directly to a Redis blocklist (suggest:block:{query_text}), and the suggestion service checks this list before returning results. This allows sub-minute blocking without requiring a full trie rebuild.

Q: Why does the trending pipeline use 5-minute windows rather than 1-minute? A: 1-minute windows cause high false-positive trending signals from query bursts due to a single viral social media post. 5-minute windows smooth out these transients while still delivering trending signal well within the 10-minute freshness SLA. Google reportedly uses a combination of short (1-2 min) and medium (10-15 min) windows to separate flash trends from sustained trends, with different boost multipliers for each.

Interview Questions

Q: Walk me through how a single keystroke “g” becomes 10 autocomplete suggestions in under 50ms.

Expected depth: Trace the path: client debounce, API gateway routing, cache lookup (Redis key format, cache hit path), cache miss path through trie shard, concurrent personalization fetch, top-K merge algorithm, deduplication step, JSON serialization, response. Discuss why each step needs to be async or parallel and how the 50ms budget is allocated (10ms network, 5ms gateway, 20ms trie+personalize, 5ms merge, 10ms buffer).

Q: The “g” prefix is used by hundreds of millions of queries. How do you prevent the trie node for “g” from being a bottleneck?

Expected depth: Discuss read replicas for hot shard shards, the fact that “g” results are almost always served from Redis cache (not the trie directly), refresh-ahead cache population to avoid TTL expiry stampedes, and the option of splitting the “g” shard into “ga-gl” and “gm-gz” sub-shards if write throughput to the trie updater becomes a bottleneck.

Q: How would you update autocomplete suggestions when a new trending topic like “World Cup 2026” starts spiking?

Expected depth: Kafka ingestion of raw query events, Flink 5-minute window aggregation computing velocity (rate of change vs. previous window), trend boost multiplier capped at 5x, Trie Updater applying bottom-up top-K propagation for affected queries, Redis cache invalidation for affected prefix keys (batched DEL), latency from event to visible suggestion under 10 minutes end-to-end.

Q: How would you add geographic trending - showing “local” trending queries per region?

Expected depth: Partition the Kafka query stream by (user_region, query_text), run separate Flink aggregations per region, maintain per-region trending score tables, extend the suggestion service to accept a region parameter, add a region dimension to Redis cache keys (suggest:{region}:{prefix}), discuss memory implications (32 shards x 50 regions = 1,600 shard instances if fully isolated vs. a delta overlay approach).

Q: Design the schema and update strategy for the user personalization store to handle 100,000 profile reads per second.

Expected depth: Cassandra partition key is user_id, clustering key is last_searched_ts DESC (for recency queries); secondary index on (user_id, query_text) for upserts. Read path: Cassandra with L1 Memtable serving hot users, Redis TTL-60s cache for users active in the last 10 minutes. Write path: async Kafka event -> consumer batch-upserts every 5 seconds using Cassandra lightweight transactions to increment counters. Discuss the tradeoff between strong consistency (user always sees their latest search) vs. eventual consistency (cheaper, sufficient for autocomplete).

Premium Content

Unlock the full article along with everything else in the archive — all in one place.

In-depth analysis Expert insights Full archive access
Unlock Full Article