Build Amazon's Product Search Ranking

data-engineering scalability performance

System Design Deep Dive

Amazon’s Product Search Ranking

Ranking 500 million products in real-time while personalizing every result for 300 million users

⏱ 14 min read📐 Advanced🏗️ Search

Imagine a librarian responsible for 500 million books. Every second, some books go out of stock, prices change, and new bestseller lists arrive. A customer walks up and asks for “wireless headphones.” The librarian must, in 200 milliseconds, recall which books match, weigh which are in stock at the right price, factor in that this particular customer always buys Sony over Bose, and hand over a ranked list of 20 results - while simultaneously serving 100,000 other customers asking different questions. That is Amazon product search.

The challenge is not retrieval alone. Any search engine can find products matching a query. The hard part is that three forces are permanently in tension. Freshness demands that a product showing “In Stock” in search results is actually purchasable right now - Amazon loses revenue every time a customer clicks a result only to find it unavailable. But reindexing 500 million products fast enough to reflect every price and inventory change is expensive. Personalization demands that the same query “protein powder” surfaces whey isolate for a bodybuilder and plant-based options for a vegan - but per-user ranking at 100K QPS is a latency nightmare. And recall quality demands that a relevant product is never buried so deep it becomes invisible, even if it lacks the keyword density of competitors who optimize their listings.

The naive solution - build one big Elasticsearch cluster, query it at search time, and sort by a formula - fails at three different seams. Elasticsearch cannot ingest 500 million document updates fast enough to keep inventory fresh. A single ranking formula cannot personalize for 300 million users. And a static index cannot blend real-time signals like sales velocity without rebuilding the index on every change. We need a multi-stage architecture that separates recall from ranking, static signals from real-time signals, and global relevance from personal preference.

Requirements and Constraints

Functional Requirements

Query-time ranking of products matching a search term, ordered by a relevance and conversion score
Real-time signal injection: price, inventory availability, and sales velocity must reflect state within 30 seconds of change
Per-user personalization: results must be re-ranked based on the individual user’s purchase and browse history
Full catalog indexing: 500 million products must be discoverable; new products must appear in results within 5 minutes of creation

Non-Functional Requirements

Latency: P99 under 200ms end-to-end including network; P50 under 80ms
Throughput: 100,000 search requests per second at peak
Catalog size: 500 million active product listings
Signal freshness: Price and inventory signals must reflect real state within 30 seconds; sales velocity within 5 minutes
Availability: 99.99% - search is on the revenue-critical path; every second of downtime costs thousands of dollars

Constraints and Scope

Re-ranking is applied only to the top 1,000 candidates from the recall phase - not the entire matching set
Spell correction, query understanding, and synonym expansion are upstream systems; we receive a clean tokenized query
The advertising auction (sponsored products) runs in a separate system; we only rank organic results here
User auth token is present on every search request; anonymous users receive global ranking with no personalization

High-Level Architecture

The system has five layers. The Search API accepts queries and orchestrates the pipeline. The Recall Engine retrieves up to 1,000 candidate products using an inverted index. The Feature Store provides real-time signals for each candidate. The Ranking Service scores and orders candidates using a learning-to-rank model. The Personalization Layer applies per-user adjustments to the final ranking before returning results.

The read path is strictly sequential within layers but parallel across sub-components. The Search API fans out the tokenized query to all relevant index shards simultaneously (scatter), collects up to 1,000 candidates (gather), then passes the candidate set to the Ranking Service. The Ranking Service calls the Feature Store in parallel for all 1,000 candidates in a single batch request, constructs feature vectors, runs the L2R model, then passes the scored list to the Personalization Layer.

Key Insight

Separating recall from ranking is the foundational decision that makes this system possible at scale. Recall (inverted index) is fast but produces an unordered set. Ranking (L2R model + real-time features) is slow but runs only on the top 1,000 candidates. Never try to rank all matching products - a popular query like “laptop” matches 2 million products; ranking all 2 million is physically impossible at 200ms P99.

The Catalog Indexing Pipeline

Every product in Amazon’s catalog starts as a row in a product database. To become searchable, it must be transformed into a set of index tokens and written into an inverted index shard. This is the catalog indexing pipeline, and it has two modes: a full rebuild that runs nightly, and an incremental delta path that keeps the index fresh in near-real-time.

The full rebuild pipeline reads the entire product catalog from the source database, extracts features (title tokens, category, brand, bullet points), computes static signals (historical sales rank, review count), builds inverted index segments, and deploys them to the index shard cluster. This runs on a Spark cluster over 4-6 hours, producing ~500GB of raw inverted index data.

The incremental path handles changes between full rebuilds. A change data capture (CDC) feed from the product database emits an event for every product that changes - price update, new description, inventory change, new review. The Feature Extractor processes these events and writes delta index segments. Delta segments are merged into the live index on a rolling basis.

# Index document builder - converts raw product record to index entry
from dataclasses import dataclass, field
from typing import List, Dict
import re

@dataclass
class ProductIndexDoc:
    product_id: str
    tokens: List[str]
    title: str
    brand: str
    category_path: List[str]
    static_score: float       # pre-computed at index time, rarely changes
    review_count: int
    avg_rating: float

def build_index_doc(product: dict) -> ProductIndexDoc:
    """
    Convert a raw product record into an index document.
    Static signals are computed here and stored in the index.
    Real-time signals (price, inventory) are NOT stored here -
    they are injected at query time from the Feature Store.
    """
    title_tokens = tokenize(product["title"])
    bullet_tokens = []
    for bullet in product.get("bullet_points", []):
        bullet_tokens.extend(tokenize(bullet))

    # De-duplicate tokens, keep title tokens weighted higher
    all_tokens = list(set(title_tokens + bullet_tokens))

    # Static quality score: function of review count and rating
    review_count = product.get("review_count", 0)
    avg_rating = product.get("avg_rating", 0.0)
    static_score = compute_static_score(review_count, avg_rating)

    return ProductIndexDoc(
        product_id=product["asin"],
        tokens=all_tokens,
        title=product["title"],
        brand=product.get("brand", ""),
        category_path=product.get("category_path", []),
        static_score=static_score,
        review_count=review_count,
        avg_rating=avg_rating,
    )

def tokenize(text: str) -> List[str]:
    """Lowercase, strip punctuation, split on whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return [t for t in text.split() if len(t) > 1]

def compute_static_score(review_count: int, avg_rating: float) -> float:
    """Wilson lower bound approximation for review quality signal."""
    if review_count == 0:
        return 0.0
    import math
    z = 1.96  # 95% confidence
    n = review_count
    p = avg_rating / 5.0
    lower = (p + z*z/(2*n) - z * math.sqrt((p*(1-p) + z*z/(4*n))/n)) / (1 + z*z/n)
    return lower

Index freshness is the hardest operational challenge here. The delta path must process product changes fast enough that inventory and pricing in the index never diverge more than 30 seconds from reality. In practice, price and inventory signals are so volatile that they are never stored in the inverted index at all - they live exclusively in the Feature Store and are injected at query time. Only the structural product data (title, description, category) lives in the inverted index, which changes slowly enough for the delta path to keep up.

Real World

Elasticsearch and Apache Solr both support near-real-time indexing via segment refresh intervals (default 1 second in Elasticsearch). Amazon A9 (the search team) historically built custom index infrastructure rather than adopting Lucene-based systems, because at 500 million products and 100K QPS, the JVM GC pauses in Elasticsearch create unacceptable tail latency spikes at P99. Custom C++ index servers with predictable memory layout eliminate this problem.

Recall: The Inverted Index and BM25

The inverted index is the engine that turns a query like “wireless noise cancelling headphones” into a candidate set of 1,000 products in under 20ms. The structure is simple: for every token in the vocabulary, maintain a posting list - a sorted list of (product_id, term_frequency) pairs for every product containing that token.

At query time, the query is tokenized into terms. For each term, the posting lists are fetched. The candidate set is the union of all products appearing in any posting list. Each candidate is then scored by BM25, the industry-standard probabilistic retrieval function that balances term frequency against document length.

BM25 formula:

BM25(q, d) = sum over terms t in q of:
    IDF(t) * (TF(t,d) * (k1 + 1)) / (TF(t,d) + k1 * (1 - b + b * |d| / avgdl))

Where IDF(t) = log((N - df(t) + 0.5) / (df(t) + 0.5) + 1), N is total documents, df(t) is documents containing term t, TF(t,d) is term frequency in document d, |d| is document length, avgdl is average document length, k1 = 1.2, b = 0.75.

# BM25 scorer - returns score for a single document given a query
import math
from typing import List, Dict

class BM25Scorer:
    def __init__(self, k1: float = 1.2, b: float = 0.75):
        self.k1 = k1
        self.b = b

    def score(
        self,
        query_terms: List[str],
        doc_term_freqs: Dict[str, int],
        doc_length: int,
        avg_doc_length: float,
        term_doc_counts: Dict[str, int],
        total_docs: int,
    ) -> float:
        """
        Compute BM25 score for a single document.

        query_terms: tokenized query
        doc_term_freqs: {term: count} for the candidate document
        doc_length: total token count in document
        avg_doc_length: average token count across all documents
        term_doc_counts: {term: number_of_docs_containing_term}
        total_docs: total number of documents in index
        """
        score = 0.0
        length_norm = 1 - self.b + self.b * (doc_length / avg_doc_length)

        for term in set(query_terms):  # de-duplicate query terms
            if term not in doc_term_freqs:
                continue
            tf = doc_term_freqs[term]
            df = term_doc_counts.get(term, 0)
            if df == 0:
                continue

            # Inverse document frequency with smoothing
            idf = math.log((total_docs - df + 0.5) / (df + 0.5) + 1)

            # Normalized term frequency
            tf_norm = (tf * (self.k1 + 1)) / (tf + self.k1 * length_norm)

            score += idf * tf_norm

        return score

    def batch_score(
        self,
        query_terms: List[str],
        candidates: List[dict],
        avg_doc_length: float,
        term_doc_counts: Dict[str, int],
        total_docs: int,
    ) -> List[tuple]:
        """Score all candidates and return sorted (product_id, bm25_score) list."""
        results = []
        for doc in candidates:
            s = self.score(
                query_terms,
                doc["term_freqs"],
                doc["length"],
                avg_doc_length,
                term_doc_counts,
                total_docs,
            )
            results.append((doc["product_id"], s))
        results.sort(key=lambda x: x[1], reverse=True)
        return results[:1000]  # return top 1000 candidates for ranking

The inverted index is sharded across multiple servers. Each shard holds a subset of the product catalog. A query fans out to all shards simultaneously, each shard returns its local top-1000, and the orchestrator merges and re-scores to produce the global top-1000. This scatter-gather pattern is the standard approach and adds only the inter-shard network round-trip to the latency budget.

Key Insight

BM25 is a strong recall baseline but a poor final ranker. It rewards keyword density and penalizes long documents - which means a terse product title stuffed with keywords beats a well-written detailed description. This is precisely why a learning-to-rank layer sits on top: it can correct for BM25’s known biases using signals BM25 cannot see, like conversion rate, click-through rate, and real-time inventory.

Ranking: Learning-to-Rank

BM25 gives us 1,000 candidates ordered by textual relevance. The learning-to-rank (L2R) model’s job is to reorder them by the probability of conversion - will the user actually buy this product if it appears at position 1?

L2R comes in three flavors. Pointwise approaches score each document independently (regression or classification). Pairwise approaches learn from pairs of documents - “product A should rank above product B for this query” (RankSVM, RankNet). Listwise approaches optimize the ranking of the entire list directly (LambdaMART, ListNet). At Amazon’s scale, LambdaMART - implemented as gradient-boosted decision trees (XGBoost or LightGBM) - is the standard approach. It directly optimizes NDCG (Normalized Discounted Cumulative Gain), the metric that most closely tracks revenue impact.

The feature vector for each candidate product is the heart of the L2R model. Features fall into three categories: query-document features (how well does this product match this query?), document features (how good is this product independent of the query?), and context features (what does the system know about this user right now?).

# Feature vector construction for L2R ranking
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class RankingFeatures:
    # Query-document features
    bm25_score: float
    title_exact_match: float        # 1.0 if query appears verbatim in title
    brand_query_match: float        # 1.0 if query term matches brand name
    category_relevance: float       # cosine sim between query and category embedding

    # Document features (static, from index)
    static_quality_score: float     # Wilson lower bound on reviews
    review_count_log: float         # log(1 + review_count)
    avg_rating: float
    days_since_launch_norm: float   # newer products get a freshness boost

    # Real-time document features (from Feature Store, injected at query time)
    is_in_stock: float              # 0.0 or 1.0
    price_percentile: float         # where this product sits in category price range
    sales_velocity_7d: float        # normalized sales count last 7 days
    inventory_depth_norm: float     # clipped to [0, 1], 1 = well stocked

    # Context features
    user_category_affinity: float   # how often user buys from this category
    user_brand_affinity: float      # how often user buys this brand
    query_price_sensitivity: float  # inferred from user's purchase history

def build_feature_vector(features: RankingFeatures) -> np.ndarray:
    return np.array([
        features.bm25_score,
        features.title_exact_match,
        features.brand_query_match,
        features.category_relevance,
        features.static_quality_score,
        features.review_count_log,
        features.avg_rating,
        features.days_since_launch_norm,
        features.is_in_stock,
        features.price_percentile,
        features.sales_velocity_7d,
        features.inventory_depth_norm,
        features.user_category_affinity,
        features.user_brand_affinity,
        features.query_price_sensitivity,
    ], dtype=np.float32)

def rank_candidates(
    candidates: list,
    model,           # pre-loaded LightGBM or XGBoost booster
    feature_builder,
) -> list:
    """
    Rank the top-1000 candidates from recall using the L2R model.
    Returns candidates sorted by predicted conversion score.
    """
    if not candidates:
        return []

    # Build feature matrix: shape (num_candidates, num_features)
    feature_matrix = np.stack([
        build_feature_vector(feature_builder(c)) for c in candidates
    ])

    # Model inference: single batch call
    scores = model.predict(feature_matrix)

    # Pair scores with candidates and sort descending
    scored = list(zip(candidates, scores))
    scored.sort(key=lambda x: x[1], reverse=True)
    return [c for c, _ in scored]

The model is trained offline on logged search sessions. Each session provides (query, product, position, clicked, purchased) tuples. Purchases are weighted 10x more than clicks in the loss function, since purchase is the direct revenue signal. The model is retrained daily on a rolling 90-day window and deployed as a serialized booster binary loaded into the Ranking Service at startup.

Watch Out

Training-serving skew is the silent killer in L2R systems. If the feature pipeline at training time computes sales_velocity_7d by averaging over a 7-day window, but the Feature Store at serving time returns the past 24-hour velocity (because 7-day aggregates are too expensive to maintain in real-time), the model will produce garbage predictions. Every feature definition must be bit-for-bit identical between training and serving. Use a shared feature computation library, not separate implementations.

Real-Time Feature Injection

The L2R model needs real-time signals - is this product in stock right now? What is today’s price? How fast is it selling this week? These signals change constantly and cannot live in the inverted index (which would require continuous reindexing) or in the L2R model weights (which are retrained daily). They must be injected at query time from a low-latency feature store.

The feature store for this system is a Redis cluster keyed on product_id. For each product, the value is a compact binary-encoded struct of all real-time signals. The Ranking Service issues a single MGET for all 1,000 candidate product IDs - one network round-trip fetches all 1,000 feature sets.

The signals are written to Redis by separate upstream services:

Inventory Service writes stock status within 5 seconds of a warehouse event
Pricing Service writes the current buy-box price on every price change
Sales Velocity Aggregator runs a Flink job computing rolling 7-day and 24-hour sales counts, writing to Redis every 60 seconds

# Feature Store client - Redis-backed, batch lookup for all candidates
import redis
import struct
from typing import List, Dict, Optional

FEATURE_STRUCT_FORMAT = "!fffffBxx"  # 5 floats + 1 byte + 2 padding bytes = 22 bytes
FEATURE_STRUCT_SIZE = struct.calcsize(FEATURE_STRUCT_FORMAT)

class FeatureStoreClient:
    def __init__(self, redis_cluster: redis.RedisCluster):
        self.redis = redis_cluster

    def batch_get(self, product_ids: List[str]) -> Dict[str, dict]:
        """
        Fetch real-time features for all candidates in a single MGET.
        Returns {product_id: feature_dict} for products found in store.
        """
        if not product_ids:
            return {}

        keys = [f"feat:{pid}" for pid in product_ids]
        raw_values = self.redis.mget(keys)

        result = {}
        for pid, raw in zip(product_ids, raw_values):
            if raw is None:
                # Product not in feature store - use safe defaults
                result[pid] = self._default_features()
                continue
            result[pid] = self._decode(raw)
        return result

    def _decode(self, raw: bytes) -> dict:
        if len(raw) < FEATURE_STRUCT_SIZE:
            return self._default_features()
        price, sales_vel_7d, sales_vel_24h, inventory_depth, price_percentile, in_stock_byte, *_ = struct.unpack(
            FEATURE_STRUCT_FORMAT, raw[:FEATURE_STRUCT_SIZE]
        )
        return {
            "price": price,
            "sales_velocity_7d": sales_vel_7d,
            "sales_velocity_24h": sales_vel_24h,
            "inventory_depth": min(1.0, inventory_depth),
            "price_percentile": price_percentile,
            "is_in_stock": float(in_stock_byte),
        }

    def _default_features(self) -> dict:
        """Safe fallback when product is absent from feature store."""
        return {
            "price": 0.0,
            "sales_velocity_7d": 0.0,
            "sales_velocity_24h": 0.0,
            "inventory_depth": 0.5,
            "price_percentile": 0.5,
            "is_in_stock": 1.0,
        }

    def write_inventory_signal(self, product_id: str, in_stock: bool, depth: float) -> None:
        """Called by Inventory Service on every warehouse event."""
        key = f"feat:{product_id}"
        # Use HSET to update only inventory fields, avoid overwriting price
        self.redis.hset(key, mapping={
            "in_stock": int(in_stock),
            "inventory_depth": depth,
        })
        self.redis.expire(key, 3600)  # TTL of 1 hour - refreshed on every write

The single MGET for 1,000 keys is the critical design choice. A naive implementation might fetch features per candidate in a loop - 1,000 round trips at 0.5ms each = 500ms, which violates the entire latency budget. A batch MGET takes roughly the same time as a single GET - about 1-2ms for 1,000 keys in a co-located Redis cluster.

Real World

Uber’s Michelangelo feature store and Twitter’s Feature Store use similar Redis-backed architectures for online serving. The common pattern is a two-tier store: Redis for hot real-time features (sub-5ms reads), and a columnar store like Apache Parquet on S3 for training data. Features written to Redis are also mirrored to S3 for offline training, ensuring training-serving consistency.

The Personalization Layer

Every user who searches Amazon has a history: categories browsed, brands purchased, price points preferred, products clicked but not bought. The Personalization Layer takes the L2R-ranked list of 20 results and applies per-user adjustments that reflect this history.

Personalization operates on two signals. Direct affinity is computed from explicit purchase and browse history: if a user has bought Nike shoes three times in the last year, their Nike affinity in the footwear category is high. Collaborative filtering provides implicit signals: users with similar purchase histories tend to prefer similar products, even for categories where the individual user has no direct history.

The personalization service returns a boost vector: a map from product_id to a multiplier in the range [0.5, 3.0]. A product the user has bought before gets a boost of ~2.5. A product from a brand the user dislikes gets a penalty of 0.5. The final ranking blends the L2R score with the personalization multiplier.

# Personalization score blending
from dataclasses import dataclass
from typing import List, Dict

@dataclass
class RankedProduct:
    product_id: str
    l2r_score: float

def blend_personalization(
    l2r_ranked: List[RankedProduct],
    user_boosts: Dict[str, float],   # product_id -> multiplier from personalization service
    alpha: float = 0.3,              # weight for personalization vs pure L2R
) -> List[RankedProduct]:
    """
    Blend L2R scores with personalization boosts.

    alpha=0.0 means pure L2R (no personalization).
    alpha=1.0 means pure personalization (ignore L2R score).
    alpha=0.3 is a typical starting point - tune via A/B test.

    Final score = (1 - alpha) * l2r_score + alpha * (l2r_score * boost)
               = l2r_score * ((1 - alpha) + alpha * boost)
               = l2r_score * (1 + alpha * (boost - 1))
    """
    blended = []
    for product in l2r_ranked:
        boost = user_boosts.get(product.product_id, 1.0)
        # Clamp boost to prevent extreme re-ordering
        boost = max(0.5, min(3.0, boost))
        final_score = product.l2r_score * (1.0 + alpha * (boost - 1.0))
        blended.append(RankedProduct(product.product_id, final_score))

    blended.sort(key=lambda x: x.l2r_score, reverse=True)
    return blended


def compute_category_affinity(
    user_purchase_history: List[dict],
    target_category: str,
    decay_halflife_days: int = 90,
) -> float:
    """
    Compute affinity score for a category based on purchase history.
    Recent purchases weighted more heavily via exponential decay.
    """
    import math
    import time

    now = time.time()
    score = 0.0
    for purchase in user_purchase_history:
        if purchase.get("category") != target_category:
            continue
        days_ago = (now - purchase["timestamp"]) / 86400
        decay = math.exp(-math.log(2) * days_ago / decay_halflife_days)
        score += decay

    # Normalize to [0, 1] range with sigmoid
    return 1.0 / (1.0 + math.exp(-score + 2.0))

The personalization service is called with the user’s ID and the list of top-20 product IDs (not all 1,000 candidates). Computing collaborative filtering boosts for 1,000 products per query would be prohibitive. Instead, personalization runs on the post-L2R top-20, accepting that some personally relevant products might be outside the top-20 and therefore missed. This is the designed tradeoff between recall and latency.

Key Insight

The alpha parameter in score blending is not a constant - it should be a function of how strong the personalization signal is for this user. A user with 200 purchases in a category has strong personalization signal; alpha should be 0.4 or higher. A new user with 2 purchases has weak signal; alpha should be 0.05 to avoid noisy boosts dominating the L2R signal. Adaptive alpha is often a larger ranking quality improvement than the model itself.

Data Model

Product Index Document Schema (Protobuf)

// Product index document - stored in inverted index shards
syntax = "proto3";

message ProductIndexDoc {
  string product_id = 1;         // ASIN
  string title = 2;
  string brand = 3;
  repeated string category_path = 4;
  repeated string index_tokens = 5;      // tokenized, lowercased, de-duped
  map<string, int32> term_freqs = 6;     // token -> count in this document
  int32 doc_length = 7;                  // total token count

  // Static signals - computed at index build time
  float static_quality_score = 8;        // Wilson lower bound on reviews
  int32 review_count = 9;
  float avg_rating = 10;
  int64 launch_epoch_seconds = 11;

  // Shard routing key - derived from product_id hash
  int32 shard_id = 12;
}

Real-Time Feature Store Schema (Redis Hash)

Key pattern: feat:{product_id}
TTL: 3600 seconds (refreshed on every write)

Hash fields:
  price            float32   current buy-box price in USD
  in_stock         uint8     1 = in stock, 0 = out of stock
  inventory_depth  float32   normalized [0,1], 1 = well stocked
  sales_vel_7d     float32   normalized units sold in last 7 days
  sales_vel_24h    float32   normalized units sold in last 24 hours
  price_percentile float32   percentile within category [0,1]
  updated_at       int64     unix epoch seconds of last write

User Preference Store Schema (PostgreSQL)

-- User category and brand affinity, updated daily
CREATE TABLE user_category_affinity (
    user_id          UUID            NOT NULL,
    category_id      VARCHAR(64)     NOT NULL,
    affinity_score   FLOAT           NOT NULL,
    purchase_count   INT             NOT NULL DEFAULT 0,
    last_purchase_ts TIMESTAMPTZ,
    updated_at       TIMESTAMPTZ     NOT NULL DEFAULT NOW(),
    PRIMARY KEY (user_id, category_id)
) PARTITION BY HASH (user_id);

CREATE TABLE user_brand_affinity (
    user_id          UUID            NOT NULL,
    brand_id         VARCHAR(64)     NOT NULL,
    affinity_score   FLOAT           NOT NULL,
    purchase_count   INT             NOT NULL DEFAULT 0,
    last_purchase_ts TIMESTAMPTZ,
    updated_at       TIMESTAMPTZ     NOT NULL DEFAULT NOW(),
    PRIMARY KEY (user_id, brand_id)
) PARTITION BY HASH (user_id);

-- Partition into 64 hash buckets to distribute 300 million users
-- Each partition averages ~4.7 million users

The product index is partitioned by a hash of the product_id, distributing 500 million products across N shards. The user affinity tables are partitioned by user_id hash, distributing 300 million users across 64 partitions. The feature store in Redis is partitioned by Redis Cluster slot assignment based on product_id CRC16 hash.

Key Algorithms and Protocols

BM25 with Field Weighting

Amazon’s product titles, brands, and bullet points are not equal. A query term appearing in a product title is far more relevant than the same term buried in a long description. Field-weighted BM25 applies different k1 and b parameters per field and sums weighted BM25 scores:

# Field-weighted BM25 for multi-field product documents
import math
from typing import Dict

FIELD_WEIGHTS = {
    "title": 3.0,
    "brand": 2.0,
    "bullet_points": 1.5,
    "description": 1.0,
}

FIELD_PARAMS = {
    "title":         {"k1": 1.2, "b": 0.5},
    "brand":         {"k1": 1.0, "b": 0.0},  # brand names shouldn't be length-penalized
    "bullet_points": {"k1": 1.5, "b": 0.75},
    "description":   {"k1": 1.2, "b": 0.9},
}

def field_weighted_bm25(
    query_terms: list,
    doc_fields: Dict[str, Dict[str, int]],    # {field: {term: count}}
    field_lengths: Dict[str, int],
    avg_field_lengths: Dict[str, float],
    term_doc_counts: Dict[str, int],
    total_docs: int,
) -> float:
    total_score = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        params = FIELD_PARAMS[field]
        k1 = params["k1"]
        b = params["b"]
        field_tfs = doc_fields.get(field, {})
        field_len = field_lengths.get(field, 0)
        avg_len = avg_field_lengths.get(field, 10.0)
        length_norm = 1 - b + b * (field_len / avg_len)

        field_score = 0.0
        for term in set(query_terms):
            tf = field_tfs.get(term, 0)
            if tf == 0:
                continue
            df = term_doc_counts.get(term, 1)
            idf = math.log((total_docs - df + 0.5) / (df + 0.5) + 1)
            tf_norm = (tf * (k1 + 1)) / (tf + k1 * length_norm)
            field_score += idf * tf_norm

        total_score += weight * field_score

    return total_score

Key Insight

Setting b=0.0 for the brand field is deliberate. Brand names are single tokens regardless of whether the brand name is “Nike” or “Beats by Dr. Dre” - length normalization would penalize long brand names for no reason. Field-specific BM25 parameters are more impactful than they appear; tuning b from 0.75 to 0.5 on the title field alone can measurably shift recall quality for multi-word queries.

Consistent Hashing for Index Shard Assignment

Index shards use consistent hashing to assign products to shards. This ensures that when a shard is added or removed, only a minimal fraction of products need to be reassigned:

# Consistent hashing ring for index shard assignment
import hashlib
import bisect
from typing import List

class ConsistentHashRing:
    def __init__(self, shards: List[str], virtual_nodes: int = 150):
        self.ring: dict = {}
        self.sorted_keys: List[int] = []
        self.virtual_nodes = virtual_nodes

        for shard in shards:
            self.add_shard(shard)

    def add_shard(self, shard: str) -> None:
        for i in range(self.virtual_nodes):
            key = self._hash(f"{shard}:vn{i}")
            self.ring[key] = shard
            bisect.insort(self.sorted_keys, key)

    def remove_shard(self, shard: str) -> None:
        for i in range(self.virtual_nodes):
            key = self._hash(f"{shard}:vn{i}")
            if key in self.ring:
                del self.ring[key]
                self.sorted_keys.remove(key)

    def get_shard(self, product_id: str) -> str:
        if not self.ring:
            raise ValueError("No shards available")
        h = self._hash(product_id)
        idx = bisect.bisect_right(self.sorted_keys, h)
        if idx == len(self.sorted_keys):
            idx = 0
        return self.ring[self.sorted_keys[idx]]

    def _hash(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

Score Fusion: Combining BM25 and L2R

After the L2R model scores all 1,000 candidates, the scores must be fused with the BM25 recall score. Raw scores from two different models are not directly comparable. Reciprocal Rank Fusion (RRF) is a robust score-independent method:

# Reciprocal Rank Fusion for combining BM25 and L2R rankings
from typing import List, Tuple

def reciprocal_rank_fusion(
    bm25_ranked: List[str],        # product_ids, best first
    l2r_ranked: List[str],         # product_ids, best first
    k: int = 60,                   # RRF smoothing constant
    bm25_weight: float = 0.3,
    l2r_weight: float = 0.7,
) -> List[str]:
    """
    Fuse two ranked lists using weighted Reciprocal Rank Fusion.
    Returns product_ids sorted by fused score, best first.
    """
    scores: dict = {}

    for rank, product_id in enumerate(bm25_ranked, start=1):
        scores[product_id] = scores.get(product_id, 0.0)
        scores[product_id] += bm25_weight * (1.0 / (k + rank))

    for rank, product_id in enumerate(l2r_ranked, start=1):
        scores[product_id] = scores.get(product_id, 0.0)
        scores[product_id] += l2r_weight * (1.0 / (k + rank))

    return sorted(scores.keys(), key=lambda pid: scores[pid], reverse=True)

Key Insight

RRF is preferred over score normalization for fusion because it is robust to score distribution differences between models. BM25 scores are unbounded and query-length-dependent. L2R scores are typically probabilities in [0, 1]. Normalizing and adding these directly produces garbage. RRF cares only about rank position, not score magnitude, making it model-agnostic and numerically stable.

Scaling and Performance

Back-of-Envelope Calculations

Catalog size:
  500M products * 1KB average index entry = 500GB raw inverted index
  With 32 shards: ~15.6GB per shard (fits in memory on 32GB instances)
  Replication factor 3: 32 shards * 3 = 96 index server instances

Query throughput:
  100K QPS, 200ms P99 budget
  Concurrent queries in flight: 100,000 * 0.2s = 20,000 simultaneous
  Per shard receives: 100K QPS fan-out / 32 shards = 3,125 QPS per shard
  Each shard query: ~10ms (BM25 over 15M products) -> shard capacity: 100 QPS/core
  Shard instances needed: 3,125 QPS / 100 QPS/core = ~32 cores per shard group
  With 8-core instances: 4 instances per shard, 128 index instances total

Feature Store:
  500M products * 5 real-time signals * 8 bytes = 20GB hot data (fits in Redis)
  Read: 100K QPS * 1,000 product IDs = 100M Redis key reads/sec via MGET
  Redis MGET 1000 keys: ~2ms latency, ~500K ops/s per node
  Redis nodes needed: 100K QPS * 1,000 keys / 500K ops/node = 200 nodes
  With Redis Cluster at 16 slots/node: 200 nodes (standard large cluster)

Ranking Service:
  100K QPS, each ranks 1,000 candidates
  LightGBM inference: ~5ms for 1,000 candidates (vectorized)
  Ranking service instances: 100K * 0.005s = 500 concurrent -> ~50 instances

Personalization:
  Feature vector per user: ~50 category affinities + ~100 brand affinities = ~1.2KB
  300M users * 1.2KB = 360GB total (Cassandra cluster)
  Hot working set (5% of users): 18GB (fits in Redis with TTL)

Index freshness SLA check:
  CDC events: assume 1M product changes/minute (price, inventory)
  Delta indexer: 1M events/min / 60s = ~17K events/sec
  At 100 bytes/event: 1.7MB/s Kafka throughput (trivial)
  Index write: delta segment merge runs every 30s, achieving 30s freshness SLA

The dominant scaling bottleneck is the Feature Store. A single MGET for 1,000 keys times 100K QPS means 100 billion Redis operations per second across the cluster. Even with Redis Cluster and pipelining, this requires a very large cluster. The practical mitigation is batching: the Ranking Service groups multiple user queries into a single MGET, amortizing the per-request overhead. With batching at the application layer, 10 queries sharing an MGET for their overlapping candidate sets can reduce Redis load by 3-5x.

Real World

Google Shopping handles a similar product ranking problem with a twist: they use a two-tower neural model for dense retrieval (instead of BM25) and a separate pointwise scoring model. The two-tower model embeds queries and products into a shared vector space, enabling semantic matching beyond exact keyword overlap. Amazon’s A9 team has published research on similar neural retrieval approaches, though the production system reportedly still uses BM25 for the recall stage due to its interpretability and debuggability advantages.

Failure Modes and Recovery

Failure	Detection	Impact	Recovery
Index shard unavailable	Health check fails, scatter-gather timeout	Queries missing that shard’s products return partial results	Serve stale replica; auto-failover to replica shard within 10s
Feature store (Redis) down	MGET timeout exceeds 5ms	All L2R features default to safe values; ranking degrades to static quality score	Circuit breaker opens; serve BM25-only results with static signals for up to 60s
Ranking model crash	Process exits, no predictions returned	Fallback to BM25-only ranking without L2R re-scoring	Watchdog restarts process; pre-loaded model in memory; recovery in <5s
Stale index (delta pipeline lag)	Freshness monitor detects index age >60s	Out-of-stock products appear in results; prices stale	Alert on-call; drain delta pipeline; is_in_stock filter applied at serving time as safety valve
Personalization service timeout	P99 latency exceeds 30ms	Personalization skipped; global L2R ranking returned	Circuit breaker opens; degrade gracefully; no user-visible error
Catalog indexing pipeline failure	Full rebuild does not complete in 8h window	New products missing from index; stale static signals	Alert; rerun from last checkpoint; new products served via delta path immediately

Watch Out

The most dangerous failure mode is a silent feature store degradation where Redis returns stale data rather than timing out. If the Inventory Service stops writing to Redis but Redis continues serving cached values, the system will happily rank out-of-stock products highly. Always maintain a staleness timestamp in each Redis key and treat features older than 5 minutes as missing (use defaults) rather than trusting them.

Comparison of Approaches

Approach	Latency	Ranking Quality	Complexity	Best Fit
BM25 only (no L2R)	Very low - <20ms	Moderate - good keyword match, poor conversion prediction	Low	MVP, small catalog, no click data
BM25 + L2R (LambdaMART)	Low - 50-100ms	High - directly optimizes conversion, handles feature interactions	Medium	Production at scale with click logs
Neural/Dense retrieval (two-tower)	Medium - 100-200ms (ANN search)	Very high - semantic matching beyond keywords	High	When keyword recall misses semantically relevant products
Hybrid (BM25 recall + neural rerank)	Medium - 80-150ms	Highest - best of both worlds	Very high	Mature system with ML infrastructure

The BM25 + L2R approach is the right choice for this system. It delivers strong ranking quality, is interpretable (feature importance is visible), and is debuggable when rankings seem wrong. Dense retrieval is compelling for handling synonyms and semantic gaps in recall, but it adds complexity and latency that is only justified once BM25 + L2R has been thoroughly optimized.

Key Takeaways

Inverted index for recall, L2R for ranking: Never try to rank all matching products. Use BM25 to get 1,000 candidates fast, then apply the expensive L2R model only on those candidates.
Real-time feature injection without reindexing: Price, inventory, and sales velocity live in a separate feature store (Redis), not in the inverted index. This decouples the high-change-rate signals from the low-change-rate index structure.
Index freshness via two paths: A nightly full rebuild handles structural product changes; a CDC-driven delta path handles high-velocity changes. Never try to keep a single index perfectly fresh - separate batch and streaming concerns.
BM25 baseline is deceptively strong: A well-tuned BM25 with field weighting (higher weight for title than description) beats many naive neural approaches. Always establish a BM25 baseline before adding L2R complexity.
Training-serving skew is the L2R failure mode: Features must be computed identically in training and serving. Divergence between batch training features and real-time serving features silently degrades model quality without any error signals.
Personalization is a multiplier, not a replacement: Blend L2R scores with personalization boosts using a tunable alpha parameter. Pure personalization ignores global relevance; pure L2R ignores individual preferences. The blend is the product.
Scatter-gather adds a fixed latency floor: Each additional index shard adds coordination overhead. At 32 shards, the scatter-gather merge takes ~5ms. Design the shard count based on data size, not query latency - the merge overhead is sub-linear.
Circuit breakers preserve degraded availability: Every downstream call (feature store, personalization, model server) must have a circuit breaker. When personalization is down, serve L2R results. When L2R is down, serve BM25 results. The index itself is the last line of defense.

Frequently Asked Questions

Q: Why not use a single Elasticsearch cluster for the whole thing? A: Elasticsearch works well for catalogs up to ~50 million products. At 500 million products, two problems emerge. First, JVM GC pauses in Elasticsearch create P99 latency spikes that violate the 200ms SLA. Second, Elasticsearch’s built-in ranking (BM25) cannot incorporate real-time signals without expensive scripted queries that bypass caching entirely. Custom index servers with C++ or Go eliminate the GC problem, and the separate Feature Store solves the real-time signal problem.

Q: Why not build the L2R model into the index at index time? A: Because L2R scores depend on the query. A product’s relevance to “wireless headphones” is completely different from its relevance to “headphones for running.” You cannot precompute a product’s rank independent of the query it will be matched against. Only static signals like review count and quality score can be embedded into the index at build time.

Q: Why is the personalization layer applied after L2R instead of before? A: Because personalization boosts based on user history can contradict relevance. If you boost before L2R, a product the user bought before (even if irrelevant to this query) might make the top-1,000 candidate set, crowding out genuinely relevant products. Applying personalization after L2R as a re-ranking multiplier means the candidate set is always query-relevant, and personalization only adjusts the ordering within relevance.

Q: How do you handle new products with no review data or sales history? A: New products get a “cold start” boost in the static quality score - a configurable multiplier applied to newly listed products for the first 30 days. This ensures new products appear in recall results and get some user exposure. Once a product accumulates clicks and purchases, the L2R model’s learned signals take over and the cold-start boost is phased out.

Q: Why not use vector similarity search (like FAISS) for recall instead of BM25? A: Vector search enables semantic matching - finding “noise-cancelling audio device” when the query is “headphones for focus.” But it requires embedding both the query and all 500 million products into the same vector space, which is computationally expensive and slow to update. BM25 is simpler, faster, and more interpretable. The right architecture is hybrid: BM25 for high-precision keyword recall plus vector search for semantic fallback when BM25 returns fewer than 100 candidates.

Q: How do you handle catalog size growth - what happens at 1 billion products? A: The architecture scales horizontally. Doubling the catalog size means doubling the index shard count from 32 to 64. The scatter-gather merge overhead increases modestly (from 5ms to 8ms). The Feature Store doubles in size from 20GB to 40GB - still fits in a Redis cluster. The main constraint is the nightly full rebuild, which would take 10-12 hours instead of 5-6; at 1 billion products, you likely need to move to a continuous rebuild approach where segments are rebuilt on a rolling 12-hour cycle rather than a single nightly job.

Interview Questions

Q: Walk me through what happens when a user searches for “wireless headphones” on Amazon. Trace the request end-to-end.

Expected depth: Tokenization to [“wireless”, “headphones”], fan-out to all 32 index shards simultaneously, each shard returns local top-1,000 using BM25, orchestrator merges to global top-1,000 using field-weighted BM25 scores, Ranking Service batches a single MGET to Redis Feature Store for all 1,000 product IDs (price, inventory, sales velocity), constructs 15-feature vectors, runs LightGBM inference to produce conversion score for each candidate, takes top-20, calls Personalization Service with user_id and top-20 product IDs, applies boost multipliers, returns final ranked list. Total latency budget: 15ms index shards + 2ms MGET + 5ms LightGBM + 10ms personalization + overhead = ~50ms P50, ~150ms P99.

Q: The inventory signal shows “in stock” in search results but the product page shows “out of stock.” What went wrong and how do you fix it?

Expected depth: Identify the root cause: staleness in the Feature Store. The Inventory Service wrote “out of stock” to the product database, but the write to Redis either failed, was delayed, or Redis cached an old value. Fix requires: (1) the Inventory Service to write directly to Redis on every warehouse event, not asynchronously; (2) a staleness timestamp in every Redis key; (3) the Ranking Service to treat features older than 5 minutes as missing and apply safe defaults (is_in_stock=0 means do not demote, just show as “check availability”); (4) a reconciliation job that scans Redis for stale keys every 10 minutes.

Q: How would you add support for multi-modal search - searching by image instead of text?

Expected depth: Add a two-tower embedding model: image encoder + product image encoder. At index time, embed every product’s primary image into a 256-dimensional vector and store in a vector index (FAISS or HNSW). At query time, if the query is an image, embed it and do ANN search in the vector index to get top-1,000 visually similar products. Pass these candidates to the same Ranking Service (L2R model still works - BM25 score is 0 for image queries, but all other features remain valid). The personalization layer is unchanged. The hardest part is training the product image encoder at 500 million product scale.

Q: Your L2R model is 6 months old. Sales patterns have shifted significantly. How do you detect model staleness and what is the retraining pipeline?

Expected depth: Detect staleness via shadow deployment - run the new candidate model alongside the live model on 1% of traffic and compare NDCG and conversion rate offline. Also monitor feature drift: if the distribution of the sales_velocity_7d feature shifts significantly (KL divergence or PSI test), the model’s learned weights may no longer apply. Retraining pipeline: daily Spark job exports (query, product, position, clicked, purchased) from the last 90 days from the search event log; LambdaMART training on a GPU cluster takes 2-4 hours; output is a serialized booster binary; shadow-tested for 24 hours; promoted to production via blue-green deploy with instant rollback on NDCG regression.

Q: How would you design the system to support faceted search filtering - e.g., “filter to only show products under $50 in stock”?

Expected depth: Filters are applied at the recall phase, not the ranking phase. The inverted index stores the is_in_stock flag and price tier as separate fields. A faceted query adds posting list intersections: candidates must appear in both the query token posting lists AND the in_stock=true posting list AND the price_bucket=under_50 posting list. The BM25 score is computed only on the intersected candidate set. This keeps the ranking pipeline identical - filters reduce the candidate set size, which actually makes L2R faster. The Feature Store price signal is still checked at ranking time as a freshness safety valve, since the index price tier may be slightly stale.

Premium Content

Unlock the full article along with everything else in the archive — all in one place.

In-depth analysis Expert insights Full archive access

Unlock Full Article