Build Instagram Explore Real-Time Recommendation Engine
scalability data-engineering performance
System Design Deep Dive
Instagram Explore Recommendation Engine
Serving 2 billion personalized grids in real-time while surfacing viral content within minutes of it breaking
Imagine you are the librarian for a collection of a billion photographs, responsible for handing every visitor a personally curated stack the moment they walk through the door - before they have even told you what they want. Now imagine you have 2 billion visitors per day, each staying only 30 seconds if the first page does not captivate them, and half your collection was uploaded in the last 24 hours. That is the engineering problem behind Instagram Explore.
The naive approach - score every post in the catalog against every user profile - collapses immediately. 2 billion users times 100 billion pieces of content equals 200 quintillion pairs. Even a microsecond comparison per pair would take more compute than exists on Earth. The solution is a staged funnel that discards 99.99% of candidates before any expensive model sees them, preserves diversity so the grid does not collapse into an echo chamber, and injects trending signals fast enough that a post going viral appears on relevant Explore grids within minutes rather than hours.
Three tensions define every decision in this system. Relevance versus freshness: the most personalized recommendations come from a model trained on weeks of engagement history, but content that went viral three minutes ago has no engagement history at all. Personalization versus diversity: a pure relevance model converges on a narrow slice of content that matches past behavior, suffocating discovery and accelerating user churn. Latency versus depth: deeper neural re-ranking produces better results but burns more compute; shallow retrieval is fast but imprecise.
We need to solve for candidate retrieval at billion-user scale, embedding freshness under content churn, real-time trend injection, and sub-200ms end-to-end latency simultaneously.
Requirements and Constraints
Functional Requirements
- Generate a personalized Explore grid of 100-200 posts for any user on demand
- Refresh recommendations as the user scrolls (pagination via cursor)
- Surface content from accounts the user does not follow
- Detect trending content and inject it into relevant grids within 5 minutes of a trend breaking
- Respect content policy: exclude blocked accounts, reported posts, restricted categories per user settings
Non-Functional Requirements
- 2 billion users, ~400 million daily active Explore sessions
- Peak: ~5 million grid requests per second during prime time
- End-to-end latency: P99 under 200ms, P50 under 80ms
- Recall@100 (top-100 retrieved items contain at least one that will be engaged): >60%
- Embedding freshness: new content indexed within 60 seconds of upload
- Trend detection: trending posts appear in relevant grids within 5 minutes
- 99.99% availability (Explore is a major engagement surface)
Constraints
- Cold-start users (new signups) receive a popularity-based fallback grid
- Content from private accounts is excluded unless the user follows them
- We do not build a real-time bidding or ads layer here - that is a separate system
- Geographic content restrictions enforced at retrieval time, not re-ranking
High-Level Architecture
The system is a four-stage funnel: candidate retrieval, lightweight ranking, neural re-ranking, and post-processing (diversity + trend injection). Surrounding this funnel are two supporting systems: an embedding pipeline that keeps content and user vectors fresh, and a trend detection pipeline that feeds viral signals into the top of the funnel.
A user opens Explore. The request hits an API gateway that authenticates the session and routes to a recommendation orchestrator. The orchestrator fans out to a candidate retrieval service which runs approximate nearest-neighbor (ANN) search against a 100-billion-vector index, returning 2,000-5,000 candidates. A lightweight ranker - a linear model running in under 10ms - scores these candidates and cuts the list to 500. A neural re-ranker - a two-tower deep model - re-scores the top 500 and cuts to 200. Finally a post-processor applies diversity filtering, injects trending posts, enforces blocks and restrictions, and returns the final grid.
The funnel’s dramatic narrowing (100B candidates → 5,000 → 500 → 200) is deliberate: each stage is orders of magnitude more expensive than the previous, so every cut must be precise enough to retain the true positives that matter.
The Embedding Pipeline
The embedding pipeline is the engine that converts raw content into the vectors the retrieval stage searches. Think of it as a translation service: raw pixels, captions, and engagement counts become points in a shared 256-dimensional space where semantic similarity corresponds to proximity.
When a post is uploaded, a media processing service extracts a visual embedding using a vision transformer (ViT) fine-tuned on Instagram’s content distribution. A text encoder processes the caption and hashtags. An engagement encoder computes early engagement signals (saves-to-impressions ratio, share velocity) as a sparse feature vector. These three signals are concatenated and projected through a learned fusion layer to produce the final 256-dimensional content embedding.
The resulting vector is written to two stores: a hot embedding cache (Redis, 7-day TTL) for recently uploaded content, and a persistent ANN index (FAISS with HNSW graph) for the full catalog. Index ingestion must happen within 60 seconds of upload to meet the freshness SLA.
# Content embedding fusion layer (inference path)
import numpy as np
def fuse_content_embedding(
visual_emb: np.ndarray, # 512-dim ViT output
text_emb: np.ndarray, # 256-dim text encoder output
engagement_features: np.ndarray, # 64-dim sparse engagement
fusion_weights: np.ndarray, # learned 832x256 projection
fusion_bias: np.ndarray, # 256-dim bias
) -> np.ndarray:
# Concatenate multi-modal signals
raw = np.concatenate([visual_emb, text_emb, engagement_features]) # 832-dim
# Project to shared embedding space
fused = np.tanh(raw @ fusion_weights + fusion_bias) # 256-dim
# L2 normalize for cosine similarity in ANN index
return fused / (np.linalg.norm(fused) + 1e-9)
User embeddings are computed offline nightly via a separate pipeline that aggregates a user’s 90-day engagement history - every like, save, share, and extended view - through the same encoder stack, producing a single 256-dim user interest vector. However, nightly embeddings miss intra-session signal: if a user just spent 10 minutes engaging with travel content, their nightly embedding may reflect last week’s cooking obsession. A session context module computes a lightweight online adjustment vector from the last 20 actions in the current session, which is added at query time before the ANN search.
Pinterest’s PinSage uses a similar multi-modal fusion approach, combining visual features from ResNet with text features from fastText, then projecting through a learned combiner - the architectural pattern predates Instagram’s two-tower but validates the multi-modal fusion strategy.
Candidate Retrieval
Candidate retrieval must scan 100 billion content vectors and return 5,000 relevant candidates in under 30ms. The only practical approach is approximate nearest neighbor (ANN) search.
The content embedding index is sharded by content ID modulo N across 200 ANN index shards. Each shard holds roughly 500 million vectors. A query against the full index fans out to all 200 shards simultaneously, each returning its top-50 candidates, then a merge step combines and deduplicates the 10,000 results to produce the top-5,000 by cosine similarity.
The ANN algorithm of choice is HNSW (Hierarchical Navigable Small World). HNSW builds a multi-layer proximity graph where each node connects to its M nearest neighbors at each layer. Search descends from the top sparse layer through increasingly dense layers, trading a small recall penalty (typically 2-5%) for query latency of 1-5ms per shard versus hours for brute force.
# HNSW index query with session-adjusted user vector
import faiss
import numpy as np
def retrieve_candidates(
base_user_emb: np.ndarray, # 256-dim nightly user embedding
session_delta: np.ndarray, # 256-dim session adjustment
index_shards: list, # list of faiss.IndexHNSWFlat shards
top_k_per_shard: int = 50,
total_candidates: int = 5000,
) -> list[int]:
# Combine nightly + session signal
query_vec = base_user_emb + 0.3 * session_delta
query_vec = query_vec / (np.linalg.norm(query_vec) + 1e-9)
all_results = []
for shard in index_shards:
distances, ids = shard.search(
query_vec.reshape(1, -1).astype(np.float32),
top_k_per_shard
)
all_results.extend(zip(distances[0], ids[0]))
# Deduplicate and return top candidates by similarity
seen = set()
ranked = sorted(all_results, reverse=True)
final = []
for score, cid in ranked:
if cid not in seen and len(final) < total_candidates:
seen.add(cid)
final.append(cid)
return final
Without content-freshness boosting at retrieval time, new content (uploaded in the last 24 hours) will almost never appear - the HNSW graph was built before these vectors existed, and newly added nodes sit at leaf positions with few connections. Separate “hot content” pools that bypass ANN and get injected directly address this.
The Two-Tower Re-Ranker
The two-tower model is the system’s most accurate - and most expensive - component. Think of it as a matchmaker that was trained by watching billions of past engagements: it encodes user context and candidate content through two separate neural towers, then computes the dot product to estimate engagement probability.
The user tower takes the 256-dim user embedding, the session context vector, device type, time of day, and recent engagement sequence (last 5 interactions as a transformer-encoded sequence) and outputs a 128-dim context vector.
The content tower takes the 256-dim content embedding, engagement velocity (saves per hour since upload), creator authority score, and content age, and outputs a 128-dim content vector.
The score for a (user, content) pair is sigmoid(user_vec · content_vec), predicting probability of a “meaningful engagement” (save or 3-second view, per Instagram’s internal definition).
# Two-tower model inference (simplified)
import torch
import torch.nn as nn
class TwoTowerScorer(nn.Module):
def __init__(self, user_dim=256, content_dim=256, hidden=128):
super().__init__()
self.user_tower = nn.Sequential(
nn.Linear(user_dim + 32, hidden * 2),
nn.ReLU(),
nn.Linear(hidden * 2, hidden),
)
self.content_tower = nn.Sequential(
nn.Linear(content_dim + 16, hidden * 2),
nn.ReLU(),
nn.Linear(hidden * 2, hidden),
)
def forward(self, user_features, content_features):
u = self.user_tower(user_features) # (batch, 128)
c = self.content_tower(content_features) # (batch, 128)
# Dot product + sigmoid for engagement probability
return torch.sigmoid((u * c).sum(dim=-1))
Running the two-tower model on 500 candidates per request at 5 million requests per second would require ~2.5 billion inferences per second. The model runs on GPU inference servers with batching: requests are held for at most 5ms to fill a batch, then scored in a single forward pass.
The two-tower architecture’s power comes from pre-computation: the content tower runs once when a post is created and caches the result. At query time, only the user tower runs live - so inference cost is proportional to the number of users, not users times candidates.
Trend Detection Pipeline
Trending content breaks faster than any nightly embedding update cycle can capture. The trend detection pipeline runs as a parallel stream: it reads the post engagement event stream (likes, saves, shares) and computes a velocity score - engagements per minute divided by the account’s historical engagement baseline - for every post published in the last 2 hours.
-- Velocity scoring query (runs every 60 seconds on streaming aggregation table)
-- Engagement velocity relative to creator's historical baseline
SELECT
p.post_id,
p.creator_id,
p.created_at,
(e.eng_last_5min / NULLIF(h.avg_eng_per_5min, 0)) AS velocity_ratio,
e.eng_last_5min AS raw_velocity,
p.embedding_vector
FROM posts p
JOIN post_engagement_5min e ON p.post_id = e.post_id
JOIN creator_engagement_baseline h ON p.creator_id = h.creator_id
WHERE p.created_at > NOW() - INTERVAL '2 hours'
AND velocity_ratio > 5.0 -- 5x above creator's normal engagement rate
ORDER BY velocity_ratio DESC
LIMIT 10000;
Posts with velocity ratio above 5x are promoted to a trending pool - an in-memory list sharded by content category (food, travel, fashion, etc.) maintained in Redis. The post-processing stage checks this pool and injects 1-3 trending posts per grid page, weighted by overlap between the post’s content category and the user’s inferred interest categories.
Twitter’s trending topics system uses a similar velocity-over-baseline approach (Count-Min Sketch for approximate counting at stream scale), distinguishing genuine breakout trends from topics that have always been popular - the same principle applies here at the post level.
Data Model
-- Content embedding store
CREATE TABLE content_embeddings (
post_id BIGINT PRIMARY KEY,
creator_id BIGINT NOT NULL,
embedding vector(256) NOT NULL, -- pgvector for warm tier
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
indexed_at TIMESTAMPTZ, -- when added to HNSW index
content_type VARCHAR(16) NOT NULL, -- 'photo', 'reel', 'carousel'
is_public BOOLEAN NOT NULL DEFAULT true,
category_id SMALLINT,
engagement_velocity FLOAT DEFAULT 0.0
);
CREATE INDEX ON content_embeddings USING hnsw (embedding vector_cosine_ops);
CREATE INDEX ON content_embeddings (created_at DESC);
CREATE INDEX ON content_embeddings (category_id, engagement_velocity DESC);
-- User embedding store
CREATE TABLE user_embeddings (
user_id BIGINT PRIMARY KEY,
embedding vector(256) NOT NULL,
computed_at TIMESTAMPTZ NOT NULL,
interest_categories INT[] NOT NULL DEFAULT '{}',
embedding_version INT NOT NULL DEFAULT 1
);
-- Impression log (used for training feedback)
CREATE TABLE explore_impressions (
impression_id BIGINT GENERATED ALWAYS AS IDENTITY,
user_id BIGINT NOT NULL,
post_id BIGINT NOT NULL,
shown_at TIMESTAMPTZ NOT NULL DEFAULT now(),
rank_position SMALLINT NOT NULL,
retrieval_score FLOAT,
ranker_score FLOAT,
engaged BOOLEAN DEFAULT NULL, -- filled in async
engagement_type VARCHAR(16) -- 'like', 'save', 'share', 'view'
) PARTITION BY RANGE (shown_at);
CREATE INDEX ON explore_impressions (user_id, shown_at DESC);
CREATE INDEX ON explore_impressions (post_id, shown_at DESC);
-- Trending pool (maintained in Redis, mirrored here for auditability)
CREATE TABLE trending_posts (
post_id BIGINT NOT NULL,
category_id SMALLINT NOT NULL,
velocity_ratio FLOAT NOT NULL,
added_at TIMESTAMPTZ NOT NULL DEFAULT now(),
expires_at TIMESTAMPTZ NOT NULL,
PRIMARY KEY (post_id, category_id)
);
The partitioning strategy for explore_impressions uses daily range partitions with a 30-day retention window. Impressions are the training signal for model retraining - they need fast writes but are only read in bulk during offline training jobs, never on the hot request path.
Key Algorithms and Protocols
Embedding Freshness via Incremental Index Updates
HNSW indexes are expensive to rebuild from scratch (hours for 100 billion vectors). Freshness is maintained through incremental insertion: new vectors are inserted into the live HNSW graph as posts arrive. HNSW supports online inserts with acceptable recall degradation (typically 1-3% recall drop versus a fresh build). A background job rebuilds each shard from scratch weekly to restore full recall.
# Incremental HNSW insert with freshness tracking
import faiss
import time
def add_to_index(
shard: faiss.IndexHNSWFlat,
post_id: int,
embedding: list[float],
freshness_cache: dict, # post_id -> insert_timestamp
) -> None:
vec = [embedding]
ids = [post_id]
shard.add_with_ids(
__import__('numpy').array(vec, dtype='float32'),
__import__('numpy').array(ids, dtype='int64')
)
freshness_cache[post_id] = time.time()
Diversity Injection via MMR
Maximum Marginal Relevance (MMR) prevents the post-processor from returning 200 near-identical posts. For each candidate in ranked order, MMR computes a score that balances relevance against similarity to already-selected posts:
MMR_score(c) = lambda * relevance(user, c) - (1 - lambda) * max_sim(c, selected)
where max_sim is the cosine similarity to the most similar already-selected post.
# Maximum Marginal Relevance diversity selection
import numpy as np
def mmr_select(
candidates: list[tuple[int, float, np.ndarray]], # (post_id, score, embedding)
k: int = 200,
lambda_param: float = 0.7,
) -> list[int]:
selected = []
selected_embs = []
remaining = list(candidates)
while len(selected) < k and remaining:
if not selected_embs:
# First item: take highest relevance score
best = max(remaining, key=lambda x: x[1])
else:
emb_matrix = np.stack(selected_embs) # (n_selected, 256)
best, best_score = None, -float('inf')
for pid, rel_score, emb in remaining:
sims = emb_matrix @ emb # cosine sim to all selected
max_sim = sims.max()
mmr = lambda_param * rel_score - (1 - lambda_param) * max_sim
if mmr > best_score:
best, best_score = (pid, rel_score, emb), mmr
selected.append(best[0])
selected_embs.append(best[2])
remaining.remove(best)
return selected
lambda_param of 0.7 means 70% weight on relevance, 30% on diversity. Tuning this parameter is empirically done against a “session depth” metric (how many posts a user scrolls through before leaving) - higher diversity consistently improves session depth up to a point before hurting engagement rate.
Scaling and Performance
The four-stage funnel scales each component independently because their bottlenecks are completely different.
Candidate Retrieval is memory-bound. Each ANN shard holds 500M vectors of 256 floats = 500GB per shard uncompressed. With product quantization (PQ16), this compresses to ~8GB per shard. 200 shards = 1.6TB of total index memory, spread across ~200 servers with 16GB of HNSW index per server.
Lightweight Ranking is CPU-bound and stateless. It scales horizontally without coordination; each node independently scores its batch of candidates.
Neural Re-ranking is GPU-bound. The two-tower model runs on A100 GPUs. At 5M requests/second with 500 candidates each, we need ~25 billion forward passes per second through the content tower alone - which is why content tower results are pre-cached.
Capacity Estimation:
Given:
- 400M daily active Explore sessions
- Peak: 5M requests/second
- 500 candidates per re-ranking call
ANN Index Memory:
- 100B vectors x 256-dim x 4 bytes = 100TB raw
- With PQ16 compression: ~100TB / 16 = 6.25TB
- Spread across 200 shards with replication: ~200 servers
GPU Inference:
- Content tower pre-cached: 1 forward pass per new post (not per request)
- User tower: 5M forward passes/sec x 1ms GPU time = 5,000 GPU-ms/sec
- At A100 throughput of ~100,000 inferences/sec: ~50 A100 GPUs for user tower
Training Data:
- 400M sessions x 100 impressions = 40B impression events/day
- At 64 bytes/event: ~2.5TB of training data per day
YouTube’s recommendation system (described in their 2016 Deep Neural Networks paper) pioneered the same two-stage architecture: candidate generation via approximate nearest neighbor, then neural ranking - with explicit separation between user and content towers to enable content-side pre-computation.
Failure Modes and Recovery
| Failure | Detection | Impact | Recovery |
|---|---|---|---|
| ANN shard down | Health check miss + error rate spike | 1/200 of candidates missing, reduced recall | Traffic rerouted to replica shard automatically |
| User embedding stale (>24h) | Embedding age check at retrieval | Explore grid uses yesterday’s preferences | Fall back to category-level interest signals from session history |
| Trend pipeline lag | Velocity score staleness alert | Trending content delayed by lag amount | Burst processing; trending posts have 5-min TTL so stale ones expire naturally |
| Neural re-ranker GPU OOM | GPU memory alert, inference latency spike | Re-ranking skipped, lightweight ranker results used directly | Graceful degradation: serve lightweight-only results at lower quality |
| New content not indexed | Freshness SLA breach alert | New posts invisible in Explore | Hot content pool bypass inserts new posts from last 30min directly into candidate pool |
| Cold-start user | Missing embedding lookup | No personalized vector available | Serve curated “popular in your region” grid; backfill embedding from first 5 interactions |
The most dangerous operational mistake is letting the impression feedback loop stall. If the async job that marks impressions as “engaged” falls behind, the model keeps training on incomplete labels - teaching it that most impressions were ignored, which degrades ranking quality over days before anyone notices.
Comparison of Approaches
| Approach | Retrieval Latency | Recall | Complexity | Best Fit |
|---|---|---|---|---|
| ANN (HNSW) + Two-Tower | 20-50ms | ~85% | High | Large-scale personalized recommendations |
| Collaborative filtering only | 100-500ms | ~70% | Medium | Smaller user bases, catalog-heavy domains |
| Content-based only | 10-20ms | ~55% | Low | New platforms with thin engagement history |
| Hybrid BM25 + embedding | 30-80ms | ~75% | Medium | Text-heavy content (articles, tweets) |
| Full neural (no retrieval funnel) | >1s | ~95% | Very High | Impractical at 2B users without funnel |
The two-tower ANN funnel is the right choice at Instagram’s scale because it achieves near-optimal recall while keeping per-request compute bounded. Collaborative filtering alone fails because it struggles with new content and does not encode visual/semantic content signals. Full neural scoring is impractical at 5M RPS without the retrieval stage to cut candidates by 99.9% first.
Key Takeaways
- Two-tower architecture separates user and content encoding so the content tower pre-computes once per post rather than once per (user, post) pair.
- ANN with HNSW trades 2-5% recall for 100x latency improvement versus brute force - at this scale, that tradeoff is non-negotiable.
- Session context delta corrects nightly embeddings for intra-session interest shifts, improving immediate relevance without recomputing the full user embedding.
- Trend velocity ratio normalizes for creator popularity - a creator with 10M followers going 5x above their baseline is more notable than an account with 100 followers going 5x above theirs.
- MMR diversity injection with a tunable lambda lets product teams dial between relevance-heavy and discovery-heavy grids through a single parameter.
- Hot content bypass pool solves the ANN freshness problem for newly uploaded content without requiring index rebuilds.
- Graceful degradation at every stage ensures a result is always returned, even if quality degrades when GPU re-rankers are overloaded.
The counter-intuitive lesson is that making recommendations less accurate - intentionally injecting diversity through MMR and trends - measurably increases total session engagement. Pure relevance optimization converges on a local maximum; diversity is not a quality concession but an engagement amplifier.
Frequently Asked Questions
Q: Why not just use collaborative filtering? It worked for Netflix in 2007.
A: Collaborative filtering (matrix factorization) captures “users like you also engaged with” signals but is blind to visual and semantic content features. On a visual platform like Instagram, two photos can look completely different yet appeal to the same user - and a new post has zero collaborative signal until it accumulates thousands of engagements. The two-tower model encodes content features from day one.
Q: Why HNSW over alternatives like IVF-PQ or ScaNN?
A: HNSW supports online inserts without index rebuilding, which is critical given Instagram’s content upload rate. IVF (Inverted File Index) requires full index retraining when the distribution shifts. HNSW’s recall-latency tradeoff is also more favorable at high recall targets (90%+). ScaNN can outperform HNSW on raw throughput but lacks efficient online update support.
Q: How do you prevent the recommendation system from creating filter bubbles?
A: Three mechanisms: MMR diversity injection (explicit dissimilarity penalty), topic quota limits (no more than 30% of a grid from a single category), and “exploration budget” - 10% of positions are reserved for posts outside the user’s inferred interest graph, sampled from globally popular content. The exploration budget is how new interests are discovered.
Q: How does the system handle A/B testing of ranking model changes?
A: User IDs are hashed into experiment buckets at the orchestrator layer. Each bucket receives a different model version identifier, which is passed through the entire funnel. Impression logs record the model version alongside every shown/engaged event, enabling offline metric comparison. Shadow mode evaluation runs a new model in parallel for 48 hours, logging its scores without using them, before any traffic is shifted.
Q: Why do you need a separate lightweight ranker? Why not just use the two-tower model on all 5,000 candidates?
A: At 5M requests/second with 5,000 candidates each, running the full two-tower model on every candidate would require 25 billion GPU inferences per second - roughly 250,000 A100 GPUs. The lightweight ranker (a logistic regression over 50 hand-engineered features) cuts this to 500 candidates before the expensive model runs, reducing GPU requirements by 10x while losing less than 5% recall.
Interview Questions
Q: Walk me through how you’d design the candidate retrieval stage for 2 billion users.
Expected depth: Discuss ANN algorithms (HNSW vs IVF-PQ), index sharding strategy (by content ID range or hash), query fan-out and merge, memory requirements with quantization, online update support for new content, and the freshness problem for recently uploaded posts.
Q: How would you ensure the Explore grid surfaces trending content within 5 minutes of a viral post breaking?
Expected depth: Cover streaming engagement aggregation, velocity-over-baseline scoring (why raw count is insufficient), trending pool design (Redis, TTL, category sharding), injection mechanics at post-processing stage, and why the ANN index cannot be relied on for trending signal.
Q: The two-tower model is giving great offline metrics but engagement dropped in A/B test. What do you investigate?
Expected depth: Discuss training-serving skew (features computed differently offline vs online), label leakage (engagement signal used as both feature and label), diversity collapse (model learned a narrow high-engagement niche), position bias in training data, and feedback loop staleness in impression labels.
Q: How would you handle a new user with zero engagement history?
Expected depth: Cold-start strategies: onboarding interest selection, geo-based popularity grid, zero-shot content-only recommendations using only content embeddings without user tower, progressive personalization as first interactions arrive, and how quickly the embedding pipeline converges to a useful representation (typically 20-50 engagements).
Premium Content
Unlock the full article along with everything else in the archive — all in one place.