Build Google Search Autocomplete
scalability performance databases
System Design Deep Dive
Google Search Autocomplete
Serving 50ms prefix completions to billions of queries while injecting real-time trending signals - without blowing up your cache
Every time you type a single character into Google’s search box, a timer starts. You have 50 milliseconds before the user notices lag. In that window, the system must identify the prefix you typed, find the top 10 most relevant completions out of billions of historical queries, weigh in your personal search history, inject any trending signals from the last few minutes, deduplicate near-identical suggestions, and ship a JSON payload back to your browser. Repeat this for 99,000 other users typing simultaneously.
Think of the problem like a librarian who has memorized every book request ever made, knows what’s trending on social media right now, remembers what you personally checked out last week, and still needs to answer your half-finished question in the time it takes to blink. The naive answer - just grep a query log - fails the moment you try to serve millions of users. A full-text search engine is too slow. A simple hash map can’t handle prefix matching. Even a raw trie traversal is too expensive when the tree has hundreds of millions of nodes and you need to score and rank the entire subtree on each keystroke.
Three forces are in direct tension here. Latency demands that results come from memory - no disk reads, no network hops to a single backend. Freshness demands that when “World Cup 2026” starts trending, it appears in suggestions within minutes, which means the in-memory index must be mutable. And personalization demands per-user state that cannot be shared across the cache layer without defeating the purpose of caching. A cold cache for a popular prefix will cascade into thousands of simultaneous trie traversals, which is exactly the thundering herd problem dressed in autocomplete clothes.
We need to solve for O(1) prefix lookup at query time, near-real-time trending signal injection, and per-user result re-ranking simultaneously - while keeping the primary read path under 50ms at p99.
Requirements and Constraints
Functional Requirements
- Return the top 10 completions for any search prefix typed by a user
- Completions ranked by a blend of global query frequency and personalized user history
- Trending queries from the last 5-15 minutes must appear in suggestions when relevant
- Results must be deduplicated - no near-identical suggestions (e.g., “google maps” and “google map directions” should not both appear if space is tight)
- Support at least English; architecture must be extensible to Unicode prefix matching
- Suggestions must filter out queries that violate content policy
Non-Functional Requirements
- Latency: p99 under 50ms end-to-end including network; p50 under 20ms
- Throughput: 100,000 suggestion requests per second at peak; Google processes roughly 8.5 billion searches per day, meaning autocomplete fires 3-5x more often than search submissions
- Availability: 99.99% uptime - autocomplete is on the critical user interaction path
- Freshness: Trending queries injected within 10 minutes of spike onset
- Storage: Index covers top 5 billion distinct query strings; personalization state for 4 billion active users
Constraints and Scope
- We are not building the spell-correction or query understanding layer - those are downstream
- We assume the content moderation pipeline runs asynchronously and flags are pre-applied to the index
- We assume user auth tokens are passed on the suggestion request for personalization lookups
- Geography-based trending (e.g., local trending) is out of scope for v1 but the architecture must support it
High-Level Architecture
The system decomposes into five distinct planes: a client debounce layer that throttles keypress events to reduce request volume, an API gateway handling auth and rate limiting, a suggestion service that orchestrates the read path, an in-memory trie index partitioned by prefix shard, and an async trending pipeline that continuously refreshes the index with real-time frequency scores.
On the read path, a user’s keypress fires a debounced HTTP request to the API gateway. The gateway routes the request to the correct suggestion service instance based on the first two characters of the prefix - this is the fundamental sharding unit. The suggestion service first checks a Redis cache keyed on suggest:{prefix}:{user_id} for a personalized hit, or suggest:{prefix} for the anonymous version. On a cache miss, it fans out concurrently to the trie index service (for global candidates) and the personalization service (for per-user boost signals). Results are merged, re-ranked, deduplicated, and returned. Simultaneously, every query event is published to Kafka, where a Flink streaming job aggregates 5-minute rolling windows and feeds updated frequency scores back to the trie index.
Each component has a single responsibility. The Trie Index Service holds the in-memory prefix tree and exposes a GetTopK(prefix, k) RPC. The Personalization Service reads from a user profile store and returns a map of query strings to boost multipliers. The Suggestion Service merges and ranks. The Cache Layer (Redis cluster) absorbs the read amplification. The Trending Pipeline (Kafka + Flink + Trie Updater) closes the feedback loop.
The trie stores pre-computed top-K results at every node, not just at leaf nodes - this makes GetTopK O(prefix_length) rather than O(subtree_size), trading memory for the latency budget we can’t afford to spend at query time.
The Trie Index Service
The trie index service is the heart of the read path - its job is to take a prefix string and return the top-K globally ranked completions in microseconds, not milliseconds.
A library card catalog is the right mental model. If you ask for every book starting with “Ga”, the librarian doesn’t read every card - they walk straight to the “Ga” drawer and pull the pre-sorted top titles. The trie is that catalog, except each drawer (node) already has a pre-sorted list of the most popular entries in its entire subtree. You never need to open child drawers at query time.
The critical insight most engineers miss: a naive trie lets you find all strings with a given prefix in O(subtree_size) time. That’s unacceptable when “go” has hundreds of millions of descendant queries. The solution is cached top-K at every node. Each trie node stores not just its own query data, but a pre-merged sorted list of the top-K completions reachable from that node. When GetTopK("go", 10) arrives, we traverse 2 nodes and read the pre-computed list. O(prefix_length). No subtree traversal.
// Trie node with cached top-K completions - no subtree scan at query time
package trie
import "sync"
type Suggestion struct {
Text string
GlobalScore float64
TrendBoost float64
FinalScore float64
}
type TrieNode struct {
mu sync.RWMutex
children map[rune]*TrieNode
isEnd bool
topK []Suggestion // pre-computed, sorted descending by FinalScore
lastUpdated int64 // unix millis - used by updater to skip unchanged nodes
}
type Trie struct {
root *TrieNode
}
func NewTrie() *Trie {
return &Trie{root: &TrieNode{children: make(map[rune]*TrieNode)}}
}
// GetTopK returns cached top-k without subtree traversal
func (t *Trie) GetTopK(prefix string, k int) []Suggestion {
node := t.root
for _, ch := range prefix {
node.mu.RLock()
child, ok := node.children[ch]
node.mu.RUnlock()
if !ok {
return nil
}
node = child
}
node.mu.RLock()
defer node.mu.RUnlock()
if len(node.topK) <= k {
return node.topK
}
result := make([]Suggestion, k)
copy(result, node.topK[:k])
return result
}
The trie is partitioned across N shard servers using the first two characters of the prefix as the shard key - shard_id = hashFNV(prefix[0:2]) % num_shards. Each shard server holds a complete in-memory trie for its prefix range. With 5 billion query strings averaging 25 bytes each, total raw data is about 125 GB. With top-K lists at every node (roughly 100 million internal nodes, each storing 10 suggestions of ~50 bytes), the in-memory overhead grows to approximately 50 GB per shard for a 26-shard deployment. This is within range for a high-memory instance.
Updating top-K lists requires locking or copy-on-write semantics on the trie. A naive global lock during index rebuild causes a complete read stall for 30-60 seconds. Always build a shadow trie, swap the pointer atomically, and discard the old trie - never mutate the live tree under a reader lock.
The Query Frequency Scoring Layer
The frequency scoring layer’s job is to translate raw query occurrence counts into a stable, normalizable score that survives both volume spikes and the long tail.
Think of it like a Billboard Hot 100 chart: raw stream counts matter, but a song that was #1 for 20 consecutive weeks counts differently from one that spiked this morning. The scoring function must blend historical authority with recency.
The core formula for a query string q is:
# Query scoring function - blends historical frequency with time-decay and trend boost
import math
def compute_score(
raw_count: int,
days_active: int,
click_through_rate: float,
trend_boost: float,
content_quality_penalty: float = 1.0,
) -> float:
"""
raw_count: total historical query submissions
days_active: number of days since query was first seen
click_through_rate: fraction of times this suggestion was clicked when shown
trend_boost: multiplier from trending pipeline (1.0 = neutral, >1.0 = trending)
content_quality_penalty: 0.0 to 1.0, 1.0 = no penalty
"""
# Log-normalize to compress the power law distribution
log_count = math.log(1 + raw_count)
# Time decay: older queries lose authority slowly (half-life ~365 days)
time_decay = 1.0 / (1.0 + math.exp(-0.005 * (days_active - 180)))
# CTR signal: queries people actually click on are more useful
ctr_signal = 1.0 + (click_through_rate * 0.5)
base_score = log_count * time_decay * ctr_signal * content_quality_penalty
return base_score * trend_boost
def compute_personalized_score(
base_score: float,
user_query_count: int,
user_last_used_days_ago: int,
) -> float:
"""Re-rank for a specific user on top of the global base score."""
if user_query_count == 0:
return base_score
personal_boost = math.log(1 + user_query_count) * 2.0
# Recency bonus: used in last 7 days gets a strong boost
recency_multiplier = 3.0 if user_last_used_days_ago <= 7 else 1.2
return base_score + (personal_boost * recency_multiplier)
The log normalization is essential. Query frequencies follow a power law - “google” has 10 billion occurrences, while “google search autocomplete system design” has 400. Without log normalization, the top-K list for every prefix would be dominated by the same short, generic queries and personalization signals would be too small to change the ordering.
Log-normalizing raw counts before combining with personalization boosts ensures that a user who searched “python dataclasses” 5 times can actually see it promoted above “python” (with 10 billion raw occurrences) in their personal results - without log-scaling, personal signals are always dwarfed.
The Personalization Layer
The personalization layer’s job is to take a global top-10 list and re-rank it for a specific user without adding more than 5ms to the total request latency.
The layer reads from a user profile store that maps (user_id, query_prefix) to a list of (query_string, use_count, last_used_ts) tuples. It does not run a full re-ranking pass from scratch - instead, it applies a boost multiplier to queries already in the global top-10, and can elevate personal queries from a secondary list if the user has a strong history signal.
# Personalization merge - boost global results with user signals
from dataclasses import dataclass
from typing import List, Dict
import time
@dataclass
class SuggestionResult:
text: str
score: float
source: str # "global" | "personal" | "trending"
def merge_personalized_results(
global_top10: List[SuggestionResult],
user_history: Dict[str, dict], # query_text -> {count, last_used_ts}
max_results: int = 10,
) -> List[SuggestionResult]:
"""
Merges global suggestions with user history.
User queries that appear in global top10 get a boost.
User queries not in global top10 may be inserted if boost is strong enough.
"""
now_ts = int(time.time())
scored = []
for suggestion in global_top10:
boosted = suggestion.score
if suggestion.text in user_history:
h = user_history[suggestion.text]
days_since = (now_ts - h["last_used_ts"]) / 86400
personal_mult = min(3.0, 1.0 + (h["count"] / 10.0))
recency_mult = 2.5 if days_since < 7 else (1.5 if days_since < 30 else 1.0)
boosted = suggestion.score * personal_mult * recency_mult
scored.append(SuggestionResult(suggestion.text, boosted, "global"))
# Surface personal queries not already in global list
global_texts = {s.text for s in global_top10}
for query_text, hist in user_history.items():
if query_text not in global_texts:
days_since = (now_ts - hist["last_used_ts"]) / 86400
if hist["count"] >= 3 and days_since < 30:
personal_score = math.log(1 + hist["count"]) * 2.0
scored.append(SuggestionResult(query_text, personal_score, "personal"))
scored.sort(key=lambda s: s.score, reverse=True)
return scored[:max_results]
The user profile store needs sub-5ms reads. Cassandra with a partition key of user_id and a clustering key of prefix works well here. For 4 billion users, each storing up to 500 prefix-query pairs, total storage is around 2 TB - manageable in a standard Cassandra cluster.
Google’s autocomplete personalization uses a client-side component as well: recent searches are stored locally in the browser and merged with server suggestions client-side, which means the personalization latency for recently used queries is effectively 0ms - no server round trip needed.
The Trending Signal Injection
The trending signal pipeline’s job is to detect queries that are spiking in volume right now and inject a boost multiplier into the trie index within minutes, not hours.
Think of it like a stock ticker: the trending pipeline is a continuous feed of real-time prices, and the trie’s stored scores are the end-of-day close prices. The suggestion service blends both - the stable historical authority and the live momentum signal.
The pipeline is: Kafka (raw query events) -> Flink (windowed aggregation) -> Trie Updater (score injection) -> Redis invalidation (cache purge on hot prefixes).
// Flink streaming job for trending score aggregation - 5-minute tumbling window
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.time.Time;
public class TrendingScoreJob {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Source: Kafka topic with raw query events
DataStream<QueryEvent> queries = env
.addSource(new FlinkKafkaConsumer<>(
"query-events",
new QueryEventDeserializer(),
kafkaProps
));
// 5-minute tumbling window: count queries per query_text
DataStream<TrendingScore> trending = queries
.map(e -> Tuple2.of(e.queryText, 1L))
.keyBy(t -> t.f0)
.window(TumblingEventTimeWindows.of(Time.minutes(5)))
.sum(1)
.map(t -> {
// Compare to previous 5-minute window to compute velocity
long prevCount = windowStateStore.get(t.f0);
double velocity = prevCount > 0
? (double)(t.f1 - prevCount) / prevCount
: 1.0;
double trendBoost = velocity > 2.0
? Math.min(5.0, 1.0 + Math.log(velocity))
: 1.0;
windowStateStore.put(t.f0, t.f1);
return new TrendingScore(t.f0, trendBoost, System.currentTimeMillis());
});
// Sink: write to Trie Updater service via Kafka
trending.addSink(new FlinkKafkaProducer<>("trending-scores", new TrendingScoreSerializer(), kafkaProps));
env.execute("Trending Score Aggregation");
}
}
The Trie Updater service consumes trending-scores and applies the boost multipliers to affected trie nodes. Critically, updating a score for “world cup 2026” requires re-computing top-K for every ancestor node up to the root - 15 nodes for a 15-character query. This is the bottom-up propagation step and it must happen atomically on each node using copy-on-write.
Do not apply trending boosts to the global trie and cache simultaneously in the same transaction. Update the trie first, then invalidate the relevant cache keys. If you do it in reverse, you’ll repopulate the cache with stale scores before the trie update lands, locking in stale results for the full TTL.
The Cache Warming Strategy
The cache layer’s job is to absorb the 100,000 QPS read volume so that trie traversals happen on cache misses only - which should be less than 5% of total traffic.
A cold cache on deploy is like opening a new library branch on day one with no books on the shelves: every visitor waits while staff manually fetches books from the warehouse. The solution is cache warming - pre-populating the cache before any user traffic arrives.
# Cache warming script - pre-populate top prefixes before traffic starts
import redis
import itertools
import string
from concurrent.futures import ThreadPoolExecutor
def warm_cache(
redis_client: redis.Redis,
trie_service,
top_prefix_count: int = 50000,
workers: int = 20,
) -> None:
"""
Generates the most common 1-, 2-, and 3-character prefixes
and pre-populates the Redis cache with top-10 suggestions.
"""
chars = string.ascii_lowercase + string.digits
# All 1-char and 2-char prefixes
prefixes_1 = list(chars)
prefixes_2 = [''.join(p) for p in itertools.product(chars, repeat=2)]
# Top 3-char prefixes from historical frequency data
prefixes_3 = load_top_prefixes_from_analytics(limit=top_prefix_count)
all_prefixes = prefixes_1 + prefixes_2 + prefixes_3
def warm_single(prefix: str) -> None:
suggestions = trie_service.get_top_k(prefix, k=10)
if suggestions:
cache_key = f"suggest:{prefix}"
pipe = redis_client.pipeline()
pipe.set(cache_key, serialize(suggestions))
pipe.expire(cache_key, 3600) # 1 hour TTL for warmed entries
pipe.execute()
with ThreadPoolExecutor(max_workers=workers) as pool:
list(pool.map(warm_single, all_prefixes))
print(f"Warmed {len(all_prefixes)} prefixes")
The TTL strategy is not uniform. Short, highly popular prefixes like “go” or “th” get a 300-second TTL and are refreshed continuously by real traffic. Longer 5-character prefixes get a 3600-second TTL. Personalized cache entries (suggest:{prefix}:{user_id}) get a 60-second TTL since they reflect recent user behavior that can change quickly.
Bing’s autocomplete uses a hierarchical cache where the CDN edge layer caches non-personalized suggestions for common prefixes, and the origin layer handles personalized queries. This means users without login state get suggestions from the edge in under 10ms globally, while logged-in users pay the extra round trip to origin for personalization.
The Result Deduplication Step
The deduplication step’s job is to ensure the final 10 suggestions shown to the user are semantically distinct - not just syntactically unique strings.
Without deduplication, the top-10 for prefix “python” might include “python”, “python 3”, “python 3.11”, “python 3.12”, “python 3.12.3”, “python 3.12 download” - all near-identical from a user intent perspective. The user sees an unhelpful list. Deduplication collapses these into a representative subset.
The approach is edit-distance clustering with a threshold, applied only to the final top-N candidates (not to the full trie subtree):
# Result deduplication using edit-distance clustering on the final candidate set
from difflib import SequenceMatcher
from typing import List
def deduplicate_suggestions(
candidates: List[str],
max_results: int = 10,
similarity_threshold: float = 0.82,
) -> List[str]:
"""
Remove near-duplicate suggestions using sequence similarity.
Keeps the highest-scoring candidate from each similarity cluster.
Candidates must be pre-sorted by score (highest first).
"""
kept: List[str] = []
for candidate in candidates:
is_duplicate = False
for kept_item in kept:
ratio = SequenceMatcher(None, candidate, kept_item).ratio()
if ratio >= similarity_threshold:
is_duplicate = True
break
if not is_duplicate:
kept.append(candidate)
if len(kept) >= max_results:
break
return kept
This runs on the merged candidate list of 20-30 results after scoring, not on the raw trie output. The O(n^2) pairwise comparison is acceptable at this small scale - 30 candidates with strings under 100 characters is trivially fast. The similarity threshold of 0.82 is tunable; lower values are more aggressive at collapsing near-duplicates.
Data Model
-- Core schema for the autocomplete system
-- Query frequency store (replicated to trie at index build time)
CREATE TABLE query_frequencies (
query_text TEXT NOT NULL,
raw_count BIGINT NOT NULL DEFAULT 0,
click_count BIGINT NOT NULL DEFAULT 0,
first_seen_date DATE NOT NULL,
last_seen_ts TIMESTAMPTZ NOT NULL,
global_score DOUBLE PRECISION NOT NULL DEFAULT 0.0,
is_blocked BOOLEAN NOT NULL DEFAULT FALSE,
PRIMARY KEY (query_text)
);
CREATE INDEX idx_qf_score ON query_frequencies (global_score DESC)
WHERE is_blocked = FALSE;
-- User search history for personalization
-- Partitioned by user_id for Cassandra-style horizontal scaling
CREATE TABLE user_search_history (
user_id UUID NOT NULL,
query_text TEXT NOT NULL,
search_count INT NOT NULL DEFAULT 1,
last_searched_ts TIMESTAMPTZ NOT NULL,
click_through BOOLEAN NOT NULL DEFAULT FALSE,
PRIMARY KEY (user_id, query_text)
);
CREATE INDEX idx_ush_user_ts ON user_search_history (user_id, last_searched_ts DESC);
-- Trending scores (written by Flink, read by Trie Updater)
CREATE TABLE trending_scores (
query_text TEXT NOT NULL,
window_start_ts TIMESTAMPTZ NOT NULL,
window_count BIGINT NOT NULL,
previous_count BIGINT NOT NULL DEFAULT 0,
velocity DOUBLE PRECISION NOT NULL,
trend_boost DOUBLE PRECISION NOT NULL DEFAULT 1.0,
PRIMARY KEY (query_text, window_start_ts)
);
CREATE INDEX idx_ts_boost ON trending_scores (trend_boost DESC, window_start_ts DESC);
-- Cache invalidation log (used by Trie Updater to purge Redis on score changes)
CREATE TABLE cache_invalidation_queue (
id BIGSERIAL PRIMARY KEY,
prefix TEXT NOT NULL,
reason TEXT NOT NULL,
created_ts TIMESTAMPTZ NOT NULL DEFAULT NOW(),
processed_ts TIMESTAMPTZ
);
The query_frequencies table is the source of truth for the offline index rebuild. user_search_history is partitioned by user_id and lives in Cassandra in production - the SQL DDL above represents the logical schema. trending_scores is written by the Flink job every 5 minutes and consumed by the Trie Updater.
The sharding key for the trie index is prefix[0:2] - the first two characters of the prefix string. This distributes load across 26x26 = 676 possible buckets, typically consolidated into 32-64 shard servers. The partition key for user_search_history in Cassandra is user_id - reads always target a single user’s history, so this is the natural partition unit.
The data model separates global scores (updated in batch every 24 hours) from trending boosts (updated every 5 minutes) and personalization (real-time per request). This separation allows each update frequency to evolve independently without contention on the same rows.
Key Algorithms and Protocols
Prefix Index Sharding via Consistent Hashing on Bigrams
The routing function maps a prefix to a trie shard. We use a simple modulo hash on the first two characters rather than consistent hashing, because prefix assignment is static - we never move prefix ownership between shards without a full rebuild.
// Shard routing for prefix lookups - deterministic, no rehashing needed
package router
import (
"hash/fnv"
)
type ShardRouter struct {
numShards int
shards []string // shard addresses
}
func NewShardRouter(shards []string) *ShardRouter {
return &ShardRouter{numShards: len(shards), shards: shards}
}
func (r *ShardRouter) GetShard(prefix string) string {
if len(prefix) == 0 {
return r.shards[0]
}
key := prefix
if len(prefix) > 2 {
key = prefix[:2] // shard only on first 2 chars
}
h := fnv.New32a()
h.Write([]byte(key))
shardIdx := int(h.Sum32()) % r.numShards
return r.shards[shardIdx]
}
func (r *ShardRouter) GetShardIndex(prefix string) int {
if len(prefix) == 0 {
return 0
}
key := prefix
if len(prefix) > 2 {
key = prefix[:2]
}
h := fnv.New32a()
h.Write([]byte(key))
return int(h.Sum32()) % r.numShards
}
Time complexity: O(1) per routing decision. Space: O(num_shards) for the routing table.
Bottom-Up Top-K Propagation After Score Update
When a trending query’s score changes, every ancestor node in the trie must have its top-K list updated. The algorithm walks from the updated leaf up to the root, re-merging children’s top-K lists at each level.
// Bottom-up top-K propagation after a score update on a single query
func (t *Trie) UpdateQueryScore(query string, newScore float64) {
// Find the path from root to the terminal node
path := make([]*TrieNode, 0, len(query)+1)
node := t.root
path = append(path, node)
for _, ch := range query {
node.mu.RLock()
child, ok := node.children[ch]
node.mu.RUnlock()
if !ok {
return // query not in trie, nothing to update
}
node = child
path = append(path, node)
}
// Update score at the terminal node
terminal := path[len(path)-1]
terminal.mu.Lock()
for i := range terminal.topK {
if terminal.topK[i].Text == query {
terminal.topK[i].FinalScore = newScore
break
}
}
sortTopK(terminal.topK)
terminal.mu.Unlock()
// Propagate upward: re-merge top-K at each ancestor
for i := len(path) - 2; i >= 0; i-- {
parent := path[i]
parent.mu.Lock()
parent.topK = mergeChildrenTopK(parent, 50) // keep top-50 internally, serve top-10
parent.mu.Unlock()
}
}
func mergeChildrenTopK(node *TrieNode, k int) []Suggestion {
merged := make([]Suggestion, 0, k*len(node.children))
for _, child := range node.children {
child.mu.RLock()
merged = append(merged, child.topK...)
child.mu.RUnlock()
}
sortTopK(merged)
if len(merged) > k {
return merged[:k]
}
return merged
}
Time complexity: O(prefix_length * K * fan_out) per update, where fan_out is the average number of children at each level. At depth 2, fan_out is up to 36 (alphanumeric). Keeping 50 internal candidates and serving top-10 to users provides headroom for personalization to re-order results.
Keeping more candidates internally (top-50) than are served externally (top-10) is what makes personalization possible without a second round-trip to the trie - the personalization layer can surface a personal result from positions 11-50 without needing a new prefix lookup.
Scaling and Performance
Capacity Estimation
Given:
- 100,000 suggestion requests/second at peak
- Average prefix length: 3 characters
- Response payload: 10 suggestions * 50 bytes = 500 bytes
- 4 billion active users, 500 personal queries per user
- 5 billion distinct global queries, avg 25 bytes each
Read throughput:
- 100,000 req/s * 500 bytes = 50 MB/s outbound
- At p95 cache hit rate of 92%: ~8,000 req/s hit the trie directly
- Each trie lookup: ~2ms -> 8,000 * 2ms = 16 CPU-core-seconds/s across shard fleet
Trie memory per shard (32 shards):
- Global queries per shard: 5B / 32 = 156M queries
- Trie nodes: ~156M * 1.3 overhead = ~200M nodes
- Each node: 50 bytes struct + 10 suggestions * 50 bytes = 550 bytes
- Per shard: 200M * 550 bytes = ~110 GB RAM
Personalization storage:
- 4B users * 500 queries * 80 bytes/entry = 160 TB (Cassandra)
- Hot working set (~5% of users active at peak): 8 TB Redis cache
Cache storage (Redis):
- Top 50,000 prefixes * 500 bytes/entry = 25 MB (trivial)
- Personalized prefix entries: 100M users * top-20 prefixes * 500 bytes = ~1 TB
- Redis cluster: 10 nodes * 128 GB = 1.28 TB capacity
Trending pipeline:
- 100,000 queries/s into Kafka = 5 MB/s write throughput
- Flink processes 5-minute windows: 30M events/window
- Score updates per window: ~500,000 changed queries -> ~500K trie updates
The dominant bottleneck at this scale is trie memory. With 110 GB per shard, each shard server needs 128-256 GB of RAM. A 32-shard deployment requires 32 high-memory instances. This is where vertical scaling (bigger instances) wins over horizontal scaling - splitting into more shards increases routing complexity without reducing per-shard memory, because each shard still needs the full top-K structure for its prefix range.
The secondary bottleneck is Redis under cache invalidation bursts. When a trending query spikes, every cached prefix that is a substring of that query needs invalidation. “world cup” trending means invalidating “w”, “wo”, “wor”, “worl”, “world”, “world ”, “world c”, “world cu”, “world cup” - 9 cache keys. With 500,000 trending queries changing per 5-minute window, that’s up to 4.5 million Redis invalidations per window, or 15,000 DEL operations per second. Redis handles this comfortably at cluster scale, but the key is batching invalidations rather than sending individual DEL commands.
# Batched Redis cache invalidation for trending score changes
import redis
from typing import List
def invalidate_prefix_cache_batch(
redis_cluster: redis.Redis,
changed_queries: List[str],
batch_size: int = 500,
) -> int:
"""
For each changed query, invalidate all prefix cache keys.
Returns total number of keys deleted.
"""
total_deleted = 0
keys_to_delete = set()
for query in changed_queries:
# Add all prefix cache keys for this query
for length in range(1, min(len(query) + 1, 8)):
prefix = query[:length]
keys_to_delete.add(f"suggest:{prefix}")
key_list = list(keys_to_delete)
for i in range(0, len(key_list), batch_size):
batch = key_list[i:i + batch_size]
deleted = redis_cluster.delete(*batch)
total_deleted += deleted
return total_deleted
Elasticsearch’s Completion Suggester, used by many production autocomplete systems, stores a finite-state transducer (FST) rather than a standard trie. FSTs are memory-compressed trie variants that can represent the same prefix structure in 10-20x less memory - Google’s internal systems use similar FST-based indices to fit the full global query set on fewer machines.
Failure Modes and Recovery
| Failure | Detection | Impact | Recovery |
|---|---|---|---|
| Trie shard crash | Health check fails, p99 latency spike on affected prefix range | All queries starting with affected bigrams return empty or stale results | Replica promotes to primary within 30s; trie rebuild from checkpoint takes 5-15 min |
| Redis cache cluster partial failure | Cache hit rate drops below 80% (alert threshold) | Trie hit rate spikes; latency increases to 20-40ms but stays functional | Redis Cluster automatic slot migration; backfill from surviving nodes |
| Flink job lag (trending pipeline stalls) | Kafka consumer lag exceeds 10 million messages | Trending scores freeze; trending queries stop getting boosts | Restart Flink job from last checkpoint; replay buffered Kafka events |
| Personalization service timeout | p99 latency on personalization RPC exceeds 10ms | Degrade gracefully: return global top-10 without personalization | Circuit breaker opens; suggestion service falls back to global results for 30s |
| Cache stampede on popular prefix | Sudden p99 spike when cache key expires for “go”, “th”, etc. | Thousands of concurrent trie lookups; shard CPU spikes | Jittered TTL (add random 0-60s to base TTL); serve stale while recomputing |
| Trie updater falls behind | Update queue depth exceeds 1 million; trending boosts age > 30 minutes | Stale trending scores; trending queries lose boost visibility | Scale updater horizontally; partition update queue by prefix shard |
The most common operational mistake is setting a uniform TTL for all cache keys. Short-prefix keys (“g”, “go”) are fetched millions of times per second and must have short TTLs (60-120s) with active refresh-ahead logic. Long-prefix keys (“googlr”) are rarely fetched and can tolerate 1-hour TTLs. A uniform 300s TTL on all keys causes stampedes on short prefixes and wastes memory on long ones.
Comparison of Approaches
| Approach | Latency | Memory Usage | Update Complexity | Best Fit |
|---|---|---|---|---|
| Trie with top-K caching at every node | O(prefix_len), sub-ms | Very high (top-K per node) | Complex - bottom-up propagation | High-volume prod autocomplete |
| Raw trie + subtree traversal at query time | O(subtree_size), 10-100ms | Low (no cached lists) | Simple - update leaf only | Small dataset, low QPS |
| Inverted index (Elasticsearch Completion) | 5-20ms (FST lookup) | Medium (FST compressed) | Medium - rebuild FST on update | General purpose, moderate QPS |
| Database full-text prefix search (LIKE ‘go%‘) | 50-500ms | None (no index) | None | Prototypes only |
| Redis Sorted Sets per prefix | O(1), sub-ms | Very high (set per prefix) | Simple - ZADD on write | Small prefix universe only |
| Approximate nearest neighbor (vector similarity) | 5-15ms | High (embedding vectors) | Complex - re-embed on change | Semantic / fuzzy autocomplete |
The trie with top-K caching wins for a Google-scale system because it is the only approach that delivers sub-millisecond lookup times regardless of how many queries share a prefix. Redis sorted sets per prefix would work but require storing one sorted set per possible prefix string - at billions of distinct prefixes, the memory cost is prohibitive. Elasticsearch’s FST is the right choice for teams without the infrastructure to maintain a custom trie service; it delivers acceptable latency at manageable memory cost.
Key Takeaways
- Trie top-K caching: Storing pre-computed top-K lists at every internal node converts prefix lookup from O(subtree_size) to O(prefix_length), which is the foundational optimization that makes sub-50ms latency achievable at scale.
- Log-normalized scoring: Raw query frequency follows a power law; log-normalizing before combining with personalization boosts ensures personal signals are not dwarfed by globally popular queries.
- Prefix sharding on bigrams: Partitioning the trie on the first two characters of the prefix distributes load across shards with a static, deterministic routing function - no consistent hash ring needed.
- Hierarchical TTL caching: Short prefixes get short TTLs with refresh-ahead; long prefixes tolerate long TTLs. A uniform TTL strategy causes stampedes or memory waste depending on which direction you tune.
- Personalization without extra round trips: Keeping top-50 internal candidates per node (serving top-10) allows the personalization layer to surface personal results by re-ranking the expanded set without a second trie lookup.
- Trending via velocity, not absolute counts: A query that doubled in volume in 5 minutes is trending; a query that has always had high volume is not. The Flink aggregation computes rate-of-change, not raw count.
- Result deduplication as a final gate: Near-duplicate suggestions (high edit-distance similarity) are collapsed at the end of the pipeline, after scoring and personalization, to ensure the top-10 represent genuinely distinct user intents.
- Graceful degradation on personalization failure: The suggestion service must be able to return globally ranked results when the personalization service is unavailable - a circuit breaker here protects latency SLAs during personalization outages.
The counter-intuitive lesson in this system is that the expensive data structure (trie with top-K at every node) is also the one that enables the most aggressive caching. By pre-computing results at index build time, each query does almost no work at request time - the expensive computation was already done. Systems that defer work to query time always struggle to cache effectively, because the query-time computation is too late to cache its inputs.
Frequently Asked Questions
Q: Why not just use Elasticsearch for this? A: Elasticsearch’s Completion Suggester works well up to a few hundred thousand QPS. Beyond that, the FST lookup overhead and JVM GC pauses add unpredictable tail latency. Google-scale autocomplete needs sub-millisecond trie lookups with zero GC pauses, which requires a custom in-memory service with manual memory management. For most production systems handling under 50,000 QPS, Elasticsearch is the right answer.
Q: How do you handle Unicode and multi-language queries? A: The trie key is a Unicode code point sequence, not ASCII bytes. The shard key uses the first two Unicode code points instead of the first two bytes. For CJK characters, a single code point can encode a full word (logographic), so the bigram prefix might only cover 2 characters of semantic content - in practice, CJK autocomplete uses trigrams for sharding and the trie structure remains identical.
Q: How do you bootstrap the trie when deploying a new shard?
A: The offline build process generates a serialized trie snapshot from the query_frequencies table. A new shard downloads the snapshot, loads it into memory (5-15 minutes for 110 GB), and starts serving traffic. During the load window, the router sends traffic to an existing replica. Once loaded, the new shard joins the pool and replicas are rebalanced. No downtime.
Q: Why not use CDN caching for all autocomplete requests? A: Non-personalized suggestions for the top 10,000 prefixes can and should be CDN-cached - this eliminates origin load for anonymous users. But personalized suggestions cannot be CDN-cached without routing all requests for a given user to the same CDN PoP, which defeats geographic load distribution. The practical split is: anonymous -> CDN cache; authenticated -> origin with Redis cache.
Q: How does the system handle content policy violations in real time?
A: A is_blocked flag on the query_frequencies table gates suggestions at trie build time. For real-time blocks (e.g., breaking news events), the moderation system writes directly to a Redis blocklist (suggest:block:{query_text}), and the suggestion service checks this list before returning results. This allows sub-minute blocking without requiring a full trie rebuild.
Q: Why does the trending pipeline use 5-minute windows rather than 1-minute? A: 1-minute windows cause high false-positive trending signals from query bursts due to a single viral social media post. 5-minute windows smooth out these transients while still delivering trending signal well within the 10-minute freshness SLA. Google reportedly uses a combination of short (1-2 min) and medium (10-15 min) windows to separate flash trends from sustained trends, with different boost multipliers for each.
Interview Questions
Q: Walk me through how a single keystroke “g” becomes 10 autocomplete suggestions in under 50ms.
Expected depth: Trace the path: client debounce, API gateway routing, cache lookup (Redis key format, cache hit path), cache miss path through trie shard, concurrent personalization fetch, top-K merge algorithm, deduplication step, JSON serialization, response. Discuss why each step needs to be async or parallel and how the 50ms budget is allocated (10ms network, 5ms gateway, 20ms trie+personalize, 5ms merge, 10ms buffer).
Q: The “g” prefix is used by hundreds of millions of queries. How do you prevent the trie node for “g” from being a bottleneck?
Expected depth: Discuss read replicas for hot shard shards, the fact that “g” results are almost always served from Redis cache (not the trie directly), refresh-ahead cache population to avoid TTL expiry stampedes, and the option of splitting the “g” shard into “ga-gl” and “gm-gz” sub-shards if write throughput to the trie updater becomes a bottleneck.
Q: How would you update autocomplete suggestions when a new trending topic like “World Cup 2026” starts spiking?
Expected depth: Kafka ingestion of raw query events, Flink 5-minute window aggregation computing velocity (rate of change vs. previous window), trend boost multiplier capped at 5x, Trie Updater applying bottom-up top-K propagation for affected queries, Redis cache invalidation for affected prefix keys (batched DEL), latency from event to visible suggestion under 10 minutes end-to-end.
Q: How would you add geographic trending - showing “local” trending queries per region?
Expected depth: Partition the Kafka query stream by (user_region, query_text), run separate Flink aggregations per region, maintain per-region trending score tables, extend the suggestion service to accept a region parameter, add a region dimension to Redis cache keys (suggest:{region}:{prefix}), discuss memory implications (32 shards x 50 regions = 1,600 shard instances if fully isolated vs. a delta overlay approach).
Q: Design the schema and update strategy for the user personalization store to handle 100,000 profile reads per second.
Expected depth: Cassandra partition key is user_id, clustering key is last_searched_ts DESC (for recency queries); secondary index on (user_id, query_text) for upserts. Read path: Cassandra with L1 Memtable serving hot users, Redis TTL-60s cache for users active in the last 10 minutes. Write path: async Kafka event -> consumer batch-upserts every 5 seconds using Cassandra lightweight transactions to increment counters. Discuss the tradeoff between strong consistency (user always sees their latest search) vs. eventual consistency (cheaper, sufficient for autocomplete).
Premium Content
Unlock the full article along with everything else in the archive — all in one place.