Build Google Photos Duplicate Detection Pipeline

data-engineering scalability performance

System Design Deep Dive

Google Photos Duplicate Detection

Detecting near-duplicate images at 28 billion uploads per day without storing a full copy of every photo

⏱ 14 min read📐 Advanced🏗️ Ml-Pipeline

Imagine a library where every new book donation is compared against every existing book in the collection before being shelved - but the collection has 4 trillion books, the donations arrive at a rate of 28 billion per day, and “same book” includes not just identical copies but also slightly worn copies, cropped copies, and photocopies of select pages. That is roughly the challenge Google Photos faces with duplicate detection. The goal is not to refuse storage - users expect every photo they upload to be accessible - but to avoid storing redundant bytes for content that is functionally identical, and to present users with deduplicated views of their library.

The first wrong instinct is byte-level hashing (MD5, SHA-256). Two JPEGs of the same physical photograph taken five minutes apart, or the same photo saved at two different quality settings, will produce completely different hashes despite representing the same image. Screenshots of the same meme shared across messaging apps are compressed differently by each platform. Photos burst-shot on a phone look nearly identical but differ in pixel values due to motion blur and focus variance. Byte hashing catches only literal duplicates - the same file uploaded twice. It misses the 95% of “duplicates” that matter.

Perceptual hashing solves this by transforming an image into a fingerprint that is robust to minor changes. Two perceptually similar images (same subject, slightly different crop, brightness, or compression) will produce the same or similar fingerprints. But perceptual hashing alone at Google Photos scale - 4 trillion stored photos, 28 billion new uploads per day - creates a matching problem. You can’t compare each incoming photo’s hash against 4 trillion stored hashes in constant time. The hash comparison space is astronomically large.

This is where Locality-Sensitive Hashing (LSH) enters. LSH is a probabilistic technique that organizes similar fingerprints into the same “bucket” with high probability, while dissimilar fingerprints land in different buckets. Instead of comparing one incoming hash against 4 trillion hashes, you only compare it against the hashes in its LSH bucket - typically a few thousand. This makes approximate nearest-neighbor search tractable at billion-scale without sacrificing too much recall.

The core architectural decisions we need to solve for simultaneously: how to generate perceptual fingerprints that are robust to real-world image variations; how to implement LSH bucketing that achieves acceptable precision and recall at 4 trillion scale; how to tune the similarity threshold to avoid both over-deduplication (merging non-duplicate photos) and under-deduplication (storing redundant copies); and how to reclaim storage for detected duplicates without making photos inaccessible to their owners.

Requirements and Constraints

Functional Requirements

Detect near-duplicate photos at upload time and link them to an existing canonical copy
Support perceptual similarity (not just byte-level) - same image at different resolutions, crops, quality settings
Present users with a deduplicated view of their library (show one copy, not N copies)
Allow users to access their original upload even when deduplication has occurred
Reclaim storage for true duplicates: multiple users uploading the same stock photo or viral meme
Detect duplicates within a single user’s library (burst photos, accidental re-uploads)

Non-Functional Requirements

28 billion uploads per day (peak: 5x average = ~1.6 million uploads per second)
4 trillion total stored images as the index to search against
Duplicate detection latency: under 2 seconds per upload (asynchronous, not blocking storage)
False positive rate (incorrectly merging non-duplicate photos): under 0.001%
False negative rate (missing actual duplicates): under 5%
Index storage: must not exceed 2x the storage of the photos themselves
Similarity search p99 latency: under 100ms
System availability: 99.99%

Constraints and Assumptions

We focus on photographic images (JPEG, PNG, HEIC) - not videos, GIFs, or documents
“Duplicate” is defined as visual similarity above a tunable threshold, not semantic similarity
Cross-user deduplication (two users uploading the same photo) is handled at the storage layer, not the index layer
We assume photos are stored in object storage (GCS equivalent) after upload
Hash computation happens on a decoded image thumbnail, not the raw file
We do not handle adversarial attacks (images designed to collide in the hash space)

High-Level Architecture

The pipeline has five major stages that form an assembly line: Image Ingestion, Feature Extraction, LSH Indexing, Similarity Search, and Storage Reclamation.

Google Photos duplicate detection pipeline architecture showing all five stages and data flows

Image Ingestion receives the raw upload, stores it durably in object storage, and places a processing job on a distributed queue. This stage is synchronous to the user - they see the photo in their library immediately. Everything downstream is asynchronous.

Feature Extraction consumes from the queue, decodes the image, resizes it to a standard thumbnail (typically 32x32 for pHash), and computes the perceptual hash. For higher-accuracy matching, it also runs a CNN-based feature extractor that produces a 512-dimensional embedding vector. The hash is a 64-bit integer; the embedding is 512 floats.

LSH Indexing takes the 64-bit pHash and runs it through L LSH hash functions (typically L=8), generating L “band hashes” that determine which LSH buckets this image belongs to. The image is inserted into all L buckets, and its 512-d embedding is written to the vector feature store.

Similarity Search queries the LSH bucket(s) that the incoming image maps to, retrieves candidate matches, and re-ranks them by exact Hamming distance (for pHash) or cosine similarity (for embeddings). If any candidate exceeds the similarity threshold, the upload is flagged as a near-duplicate.

Storage Reclamation is an offline background job that finds confirmed duplicate groups, designates one as the canonical copy, updates metadata pointers, and physically deletes redundant object storage copies. Users still see their photo; they just retrieve it from the canonical copy.

Key Insight

The LSH step is a recall amplification mechanism, not a precision mechanism. It efficiently narrows the candidate set from 4 trillion photos to a few thousand - but it will miss some true duplicates (false negatives). The precision and recall knobs are the number of LSH bands (more bands = higher recall, more candidates to check) and the similarity threshold (lower = more aggressive deduplication, higher false positive risk).

Perceptual Hashing

Perceptual hashing is the process of converting an image to a compact fingerprint that changes slowly as the image changes gradually - much like how a piece of music’s melody is recognizable even when played slightly faster or in a different key.

The most widely used algorithm is pHash (Discrete Cosine Transform hash). It works by:

Resize the image to 32x32 pixels (discards fine detail, retains structure)
Convert to grayscale (color variations don’t affect structural similarity)
Compute the 2D Discrete Cosine Transform (DCT) of the 32x32 grid
Take the top-left 8x8 DCT coefficients (the low-frequency components - the “melody” of the image)
Compute the median of these 64 values
For each of the 64 values: set bit to 1 if above median, 0 if below

The result is a 64-bit integer. Two images with similar structure produce hashes that differ in only a few bits. The Hamming distance between two hashes (number of differing bits) is the similarity metric: distance 0 means identical structure, distance 1-5 means very similar, distance > 10 typically means different images.

# Perceptual hash (pHash) implementation using DCT
import numpy as np
from PIL import Image
from scipy.fftpack import dct

def compute_phash(image_bytes: bytes, hash_size: int = 8) -> int:
    """
    Compute a 64-bit perceptual hash of an image.
    hash_size=8 produces an 8x8=64 bit hash.
    Returns the hash as a Python integer.
    """
    # Decode and preprocess
    from io import BytesIO
    img = Image.open(BytesIO(image_bytes))

    # Convert to grayscale and resize to hash_size * 4 = 32x32
    # The 4x scale gives the DCT more data to work with
    high_freq_factor = 4
    img_size = hash_size * high_freq_factor  # 32x32
    img = img.convert('L').resize((img_size, img_size), Image.LANCZOS)
    pixels = np.array(img, dtype=float)

    # 2D DCT - compute frequency components
    dct_coeffs = dct(dct(pixels, axis=0, norm='ortho'), axis=1, norm='ortho')

    # Take the top-left hash_size x hash_size low-frequency coefficients
    low_freq = dct_coeffs[:hash_size, :hash_size]

    # Compute median and binarize
    median = np.median(low_freq)
    bits = (low_freq > median).flatten()

    # Pack 64 bits into a single integer
    hash_val = 0
    for bit in bits:
        hash_val = (hash_val << 1) | int(bit)

    return hash_val

def hamming_distance(hash_a: int, hash_b: int, hash_bits: int = 64) -> int:
    """Hamming distance between two perceptual hashes (number of differing bits)."""
    xor = hash_a ^ hash_b
    # Count set bits (popcount) - Python has bin() for this
    return bin(xor).count('1')

def are_perceptual_duplicates(hash_a: int, hash_b: int, threshold: int = 8) -> bool:
    """
    Two images are near-duplicates if their Hamming distance <= threshold.
    threshold=8 means they can differ in up to 8/64 = 12.5% of bits.
    Typical tuning: 4-6 for strict matching, 8-10 for loose matching.
    """
    return hamming_distance(hash_a, hash_b) <= threshold

Watch Out

pHash has known failure modes: two images with the same dominant frequency structure but different content (e.g., two solid-color images, or two images with the same composition but different subjects) can produce similar hashes. This is why the pipeline uses a two-pass approach: pHash for fast candidate filtering, then CNN embedding cosine similarity for precise confirmation. Never use pHash alone as the final deduplication decision.

Locality-Sensitive Hashing

LSH solves the scalability problem. With 4 trillion stored hashes, brute-force Hamming distance comparison at 1.6 million uploads per second is computationally impossible - that’s 6.4 x 10^18 comparisons per second. LSH trades some recall for tractable query time.

The intuition behind LSH for Hamming distance: if two hashes differ in only 4 bits out of 64, and we randomly select a subset of 16 bits from each hash, there’s a high probability that both hashes have identical values in those 16 bits. Two dissimilar hashes (differing in 30 bits) are much less likely to agree on any random 16-bit subset.

Band hashing is the practical implementation. We partition the 64-bit pHash into L bands of B bits each (e.g., 8 bands of 8 bits). For each band, we compute a hash of that band’s value along with a band ID to produce a bucket ID. An image is placed into L buckets. At query time, the incoming image is hashed into the same L buckets, and candidates are retrieved from the union of those buckets.

# Locality-Sensitive Hashing for 64-bit pHash
import hashlib
from typing import List, Tuple, Set

class PHashLSH:
    """
    LSH index for 64-bit perceptual hashes.
    Uses band hashing: partitions the 64-bit hash into L bands of B bits.
    Higher L = better recall but more candidate checks.
    Smaller B = more candidates per bucket.
    """

    def __init__(self, num_bands: int = 8, bits_per_band: int = 8):
        assert num_bands * bits_per_band == 64, "bands * bits_per_band must equal 64"
        self.num_bands = num_bands
        self.bits_per_band = bits_per_band
        self.band_mask = (1 << bits_per_band) - 1
        # In production: Redis hash maps or a distributed KV store
        # Here simulated as a Python dict: bucket_id -> list of (image_id, full_hash)
        self._buckets: dict[int, list[tuple[str, int]]] = {}

    def _band_bucket_id(self, phash: int, band_idx: int) -> int:
        """Extract the band_idx-th B-bit segment of the hash and compute bucket ID."""
        shift = band_idx * self.bits_per_band
        band_value = (phash >> shift) & self.band_mask
        # Include band_idx in the hash key to prevent cross-band collisions
        combined = (band_idx << 32) | band_value
        return combined

    def insert(self, image_id: str, phash: int) -> List[int]:
        """
        Insert an image into all L LSH buckets.
        Returns the list of bucket IDs for debugging.
        """
        bucket_ids = []
        for band_idx in range(self.num_bands):
            bkt = self._band_bucket_id(phash, band_idx)
            if bkt not in self._buckets:
                self._buckets[bkt] = []
            self._buckets[bkt].append((image_id, phash))
            bucket_ids.append(bkt)
        return bucket_ids

    def query(self, phash: int, hamming_threshold: int = 8) -> List[Tuple[str, int]]:
        """
        Find all indexed images within hamming_threshold of the query hash.
        Returns list of (image_id, hamming_distance) sorted by distance.
        """
        # Gather all candidates from all bands (union)
        seen_ids: Set[str] = set()
        candidates: List[Tuple[str, int]] = []

        for band_idx in range(self.num_bands):
            bkt = self._band_bucket_id(phash, band_idx)
            for img_id, stored_hash in self._buckets.get(bkt, []):
                if img_id in seen_ids:
                    continue
                seen_ids.add(img_id)
                dist = hamming_distance(phash, stored_hash)
                if dist <= hamming_threshold:
                    candidates.append((img_id, dist))

        # Sort by Hamming distance (most similar first)
        candidates.sort(key=lambda x: x[1])
        return candidates

def lsh_theoretical_recall(
    num_bands: int, bits_per_band: int, hamming_distance: int
) -> float:
    """
    Probability that two hashes at a given Hamming distance
    will share at least one LSH band bucket (i.e., will be found as candidates).
    """
    # Probability that a single bit agrees: (64 - d) / 64
    # Probability that all bits in a band agree: ((64-d)/64)^B
    # Probability that at least one band agrees: 1 - (1 - ((64-d)/64)^B)^L
    p_band_agree = ((64 - hamming_distance) / 64) ** bits_per_band
    recall = 1.0 - (1.0 - p_band_agree) ** num_bands
    return recall

For 8 bands of 8 bits, the theoretical recall is:

At Hamming distance 4: recall ~ 90%
At Hamming distance 8: recall ~ 58%
At Hamming distance 12: recall ~ 27%

This means at our threshold of 8, we miss about 42% of true duplicates in the first LSH pass. To compensate, we run multiple independent LSH tables with different band configurations and take the union of candidates. With 4 independent LSH tables, recall at distance 8 rises to ~91%.

Key Insight

LSH recall is not the final recall of the deduplication system. The CNN embedding-based ANN search runs in parallel on the same upload and catches duplicates that pHash LSH misses (especially images that have been color-graded, filtered, or had text overlaid - which change pHash significantly but are visually the same photo).

CNN Feature Vectors and ANN Search

For images where pHash is unreliable (heavily filtered photos, screenshots of photos, photos with text overlaid), the system uses a CNN-based embedding. A pre-trained image encoder (similar to Google’s Vision Transformer or a fine-tuned ResNet) maps each image to a 512-dimensional vector where visually similar images cluster near each other in the embedding space.

The ANN (Approximate Nearest Neighbor) index for 512-d vectors at 4 trillion scale is a different beast from the LSH bucket approach. The system uses a hierarchical HNSW (Hierarchical Navigable Small World) index, but even HNSW at 4 trillion scale requires sharding across hundreds of machines with an index-of-indexes routing layer.

# ANN search using HNSW index (faiss-based)
import numpy as np
import faiss

class EmbeddingANNIndex:
    """
    Approximate Nearest Neighbor index for 512-d CNN embeddings.
    Uses HNSW index for low-latency queries.
    In production this is sharded: each shard holds ~100M vectors.
    """

    def __init__(self, dim: int = 512, m: int = 32, ef_construction: int = 200):
        """
        m: number of connections per node in HNSW (higher = better recall, more memory)
        ef_construction: search width during build (higher = better index quality)
        """
        self.dim = dim
        # HNSW index from faiss - ~200 bytes per vector + HNSW overhead
        self.index = faiss.IndexHNSWFlat(dim, m)
        self.index.hnsw.efConstruction = ef_construction
        self.index.hnsw.efSearch = 64  # search width at query time
        # Map from faiss internal ID to image_id
        self.id_map: dict[int, str] = {}
        self._next_id = 0

    def add(self, image_id: str, embedding: np.ndarray) -> None:
        """Add a single 512-d embedding to the index."""
        assert embedding.shape == (self.dim,), f"Expected ({self.dim},), got {embedding.shape}"
        vec = embedding.astype(np.float32).reshape(1, -1)
        # Normalize to unit sphere for cosine similarity via inner product
        faiss.normalize_L2(vec)
        self.index.add(vec)
        self.id_map[self._next_id] = image_id
        self._next_id += 1

    def search(self, query_embedding: np.ndarray, k: int = 20) -> list[tuple[str, float]]:
        """
        Returns up to k nearest neighbors as (image_id, cosine_similarity).
        cosine_similarity of 1.0 = identical, 0.0 = orthogonal.
        """
        vec = query_embedding.astype(np.float32).reshape(1, -1)
        faiss.normalize_L2(vec)
        distances, indices = self.index.search(vec, k)
        results = []
        for dist, idx in zip(distances[0], indices[0]):
            if idx == -1:
                continue  # fewer than k results
            image_id = self.id_map.get(idx)
            if image_id:
                results.append((image_id, float(dist)))  # dist is cosine similarity (after L2 norm)
        return results

    def is_duplicate(self, embedding: np.ndarray, threshold: float = 0.95) -> tuple[bool, str | None]:
        """
        Returns (is_duplicate, canonical_image_id) if similarity >= threshold.
        threshold=0.95 corresponds to very high visual similarity.
        """
        results = self.search(embedding, k=1)
        if results and results[0][1] >= threshold:
            return True, results[0][0]
        return False, None

Real World

Pinterest built a near-identical system called “Visual Discovery” for deduplication of their 300 billion pins. They use pHash for fast initial filtering and a fine-tuned ResNet for high-precision re-ranking. The key lesson from their engineering blog: the similarity threshold is not a constant but a function of the image’s content category. Stock photos require a strict threshold (0.98+) because similar stock photos are intentionally distinct products. User vacation photos can use a looser threshold (0.90) because burst shots of the same scene are true duplicates to deduplicate.

Dedup Pipeline Stages

The full deduplication pipeline is a multi-stage assembly line. Each stage is independently scalable and fails independently without blocking the upload.

LSH internals showing band hashing, bucket assignment, and candidate retrieval for perceptual hash lookup

# Multi-stage deduplication pipeline orchestrator
from dataclasses import dataclass
from enum import Enum
from typing import Optional
import time

class DedupStatus(Enum):
    UNIQUE = "unique"
    NEAR_DUPLICATE = "near_duplicate"
    EXACT_DUPLICATE = "exact_duplicate"
    PENDING = "pending"

@dataclass
class DedupResult:
    image_id: str
    status: DedupStatus
    canonical_id: Optional[str]       # ID of the canonical copy if duplicate
    phash: int
    embedding: list[float]
    hamming_distance: Optional[int]    # if pHash match found
    cosine_similarity: Optional[float] # if embedding match found
    processing_time_ms: float

def run_dedup_pipeline(
    image_id: str,
    image_bytes: bytes,
    phash_lsh_index,       # PHashLSH instance (distributed in production)
    embedding_ann_index,   # EmbeddingANNIndex instance (sharded in production)
    cnn_model,             # Feature extractor
    phash_threshold: int = 8,
    embedding_threshold: float = 0.95,
) -> DedupResult:
    """
    Full deduplication pipeline. Runs pHash and embedding checks in parallel.
    Returns the final dedup decision.
    """
    start_ms = time.monotonic() * 1000

    # Stage 1: Compute pHash (fast, ~5ms)
    phash = compute_phash(image_bytes)

    # Stage 2: Compute CNN embedding (slower, ~50ms on GPU)
    # In production: batched across multiple uploads for throughput
    embedding = cnn_model.encode(image_bytes)  # returns np.ndarray of shape (512,)

    # Stage 3: LSH candidate retrieval (~10ms)
    phash_candidates = phash_lsh_index.query(phash, hamming_threshold=phash_threshold)

    # Stage 4: ANN embedding search (~20ms)
    embedding_candidates = embedding_ann_index.search(embedding, k=10)

    # Stage 5: Candidate re-ranking and decision
    # Check exact SHA-256 match first (instant dedup at zero cost)
    import hashlib
    sha256 = hashlib.sha256(image_bytes).hexdigest()

    result_status = DedupStatus.UNIQUE
    canonical_id = None
    hamming_dist = None
    cosine_sim = None

    # Priority: exact byte match > pHash near-match > embedding near-match
    # (exact SHA-256 match would be caught upstream at the storage layer)

    if phash_candidates:
        best_phash_match = phash_candidates[0]  # (image_id, distance)
        hamming_dist = best_phash_match[1]
        if hamming_dist <= phash_threshold:
            result_status = DedupStatus.NEAR_DUPLICATE
            canonical_id = best_phash_match[0]

    if embedding_candidates:
        best_emb_match = embedding_candidates[0]  # (image_id, similarity)
        cosine_sim = best_emb_match[1]
        if cosine_sim >= embedding_threshold:
            # Embedding match overrides pHash if higher confidence
            if result_status == DedupStatus.UNIQUE or cosine_sim > (1.0 - hamming_dist/64.0 if hamming_dist else 0):
                result_status = DedupStatus.NEAR_DUPLICATE
                canonical_id = best_emb_match[0]

    end_ms = time.monotonic() * 1000

    return DedupResult(
        image_id=image_id,
        status=result_status,
        canonical_id=canonical_id,
        phash=phash,
        embedding=embedding.tolist(),
        hamming_distance=hamming_dist,
        cosine_similarity=cosine_sim,
        processing_time_ms=end_ms - start_ms,
    )

Data Model

Data flow from photo upload through deduplication pipeline to storage reclamation

-- Core schema for the duplicate detection pipeline

-- Perceptual hash index: one row per uploaded image
CREATE TABLE image_phash (
    image_id          UUID         PRIMARY KEY,
    user_id           BIGINT       NOT NULL,
    phash_64          BIGINT       NOT NULL,  -- 64-bit perceptual hash
    sha256_hex        CHAR(64)     NOT NULL,  -- exact byte hash for fast exact dedup
    file_size_bytes   BIGINT       NOT NULL,
    width_px          INT          NOT NULL,
    height_px         INT          NOT NULL,
    format            VARCHAR(8)   NOT NULL,  -- jpeg, png, heic, webp
    uploaded_at       TIMESTAMPTZ  NOT NULL DEFAULT NOW(),
    dedup_status      VARCHAR(16)  NOT NULL DEFAULT 'pending'
        CHECK (dedup_status IN ('pending', 'unique', 'near_duplicate', 'exact_duplicate')),
    canonical_image_id UUID        REFERENCES image_phash(image_id),
    hamming_distance  SMALLINT,    -- distance to canonical, if duplicate
    cosine_similarity FLOAT,       -- embedding similarity to canonical, if duplicate
    dedup_processed_at TIMESTAMPTZ
);

-- Fast exact-duplicate lookup by SHA-256
CREATE UNIQUE INDEX idx_phash_sha256 ON image_phash (sha256_hex);
-- Lookup by user for within-library dedup
CREATE INDEX idx_phash_user ON image_phash (user_id, uploaded_at DESC);
-- Lookup all non-canonical images pointing to a canonical (for storage reclamation)
CREATE INDEX idx_phash_canonical ON image_phash (canonical_image_id)
    WHERE canonical_image_id IS NOT NULL;

-- LSH bucket table: maps band bucket IDs to image IDs
-- This is the in-memory structure externalized for persistence
-- In production: Redis hash maps keyed by bucket_id
CREATE TABLE lsh_bucket (
    bucket_id         BIGINT       NOT NULL,  -- (band_idx << 32) | band_value
    image_id          UUID         NOT NULL REFERENCES image_phash(image_id),
    phash_64          BIGINT       NOT NULL,  -- denormalized for fast Hamming check
    inserted_at       TIMESTAMPTZ  NOT NULL DEFAULT NOW(),
    PRIMARY KEY (bucket_id, image_id)
);
CREATE INDEX idx_lsh_bucket_lookup ON lsh_bucket (bucket_id, phash_64);

-- Dedup group: tracks a set of images that are near-duplicates of each other
CREATE TABLE dedup_group (
    group_id          UUID         PRIMARY KEY DEFAULT gen_random_uuid(),
    canonical_image_id UUID        NOT NULL REFERENCES image_phash(image_id),
    member_count      INT          NOT NULL DEFAULT 1,
    total_bytes_saved BIGINT       NOT NULL DEFAULT 0,  -- bytes reclaimed
    created_at        TIMESTAMPTZ  NOT NULL DEFAULT NOW(),
    last_reclaimed_at TIMESTAMPTZ
);

-- Storage reclamation log
CREATE TABLE storage_reclamation_event (
    reclaim_id        BIGINT       GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    group_id          UUID         NOT NULL REFERENCES dedup_group(group_id),
    reclaimed_image_id UUID        NOT NULL,
    bytes_freed       BIGINT       NOT NULL,
    storage_path      TEXT         NOT NULL,  -- GCS path of deleted object
    reclaimed_at      TIMESTAMPTZ  NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_reclaim_group ON storage_reclamation_event (group_id, reclaimed_at);

-- Feature vector storage: separate table (in practice, a vector DB)
-- Stores 512-d CNN embeddings for ANN search
CREATE TABLE image_embedding (
    image_id          UUID         PRIMARY KEY REFERENCES image_phash(image_id),
    embedding_model   VARCHAR(32)  NOT NULL DEFAULT 'vit-base-patch16',
    embedding_bytes   BYTEA        NOT NULL,  -- 512 x float32 = 2048 bytes
    computed_at       TIMESTAMPTZ  NOT NULL DEFAULT NOW()
);

The sharding key for image_phash is user_id for within-library dedup queries. For cross-library dedup (same photo uploaded by multiple users - the viral meme problem), queries are against the global pHash + LSH index, which is partitioned by phash_64 % N_shards to co-locate similar hashes on the same shard.

Key Algorithms and Protocols

Similarity Threshold Tuning

The similarity threshold is the most consequential operational decision in the dedup pipeline. Setting it too low causes false positives (merging non-duplicate photos, destroying user trust). Setting it too high causes false negatives (storing redundant copies, wasting money). The right threshold varies by use case.

# Threshold tuning analysis using labeled test set
from dataclasses import dataclass
from typing import List, Tuple
import numpy as np

@dataclass
class LabeledPair:
    hash_a: int
    hash_b: int
    is_duplicate: bool  # ground truth label

def precision_recall_curve(
    labeled_pairs: List[LabeledPair],
    thresholds: List[int] = None,
) -> List[Tuple[int, float, float, float]]:
    """
    Returns (threshold, precision, recall, f1) for each threshold.
    Used to select the optimal Hamming distance threshold.
    """
    if thresholds is None:
        thresholds = list(range(0, 32))

    results = []
    total_positives = sum(1 for p in labeled_pairs if p.is_duplicate)

    for threshold in thresholds:
        tp = 0  # correctly identified as duplicate
        fp = 0  # non-duplicate incorrectly flagged as duplicate
        fn = 0  # duplicate not caught (missed)

        for pair in labeled_pairs:
            dist = hamming_distance(pair.hash_a, pair.hash_b)
            predicted_dup = dist <= threshold
            if predicted_dup and pair.is_duplicate:
                tp += 1
            elif predicted_dup and not pair.is_duplicate:
                fp += 1
            elif not predicted_dup and pair.is_duplicate:
                fn += 1

        precision = tp / (tp + fp) if (tp + fp) > 0 else 1.0
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0

        results.append((threshold, precision, recall, f1))

    return results

def select_threshold_at_precision(
    labeled_pairs: List[LabeledPair],
    min_precision: float = 0.9999,  # 0.001% false positive rate target
) -> int:
    """
    Select the highest threshold (most aggressive dedup) that still meets
    the minimum precision requirement. This biases toward recall.
    """
    curve = precision_recall_curve(labeled_pairs)
    best_threshold = 0
    for threshold, precision, recall, f1 in curve:
        if precision >= min_precision:
            best_threshold = threshold
    return best_threshold

Storage Reclamation

Once a duplicate group is identified, storage reclamation runs as an offline batch job. It must be careful: reclaiming storage before verifying the canonical copy is durably stored (with redundancy) would cause data loss.

# Storage reclamation - safely free duplicate copies
from typing import List
import time

def reclaim_duplicate_group(
    group_id: str,
    canonical_image_id: str,
    duplicate_image_ids: List[str],
    storage_client,     # GCS or equivalent object storage client
    db_client,          # database client
    min_canonical_replicas: int = 3,  # safety check before deletion
) -> int:
    """
    Free storage for duplicate copies. Returns bytes freed.
    Runs in a transaction to prevent partial state.
    """
    # Safety check: verify canonical exists and is replicated
    canonical_meta = storage_client.get_object_metadata(canonical_image_id)
    if canonical_meta is None:
        raise ValueError(f"Canonical image {canonical_image_id} not found in storage")
    if canonical_meta.replica_count < min_canonical_replicas:
        raise ValueError(f"Canonical has only {canonical_meta.replica_count} replicas, need {min_canonical_replicas}")

    bytes_freed = 0

    for dup_id in duplicate_image_ids:
        if dup_id == canonical_image_id:
            continue  # never delete the canonical

        dup_meta = storage_client.get_object_metadata(dup_id)
        if dup_meta is None:
            continue  # already reclaimed

        # Update metadata pointer before deleting: user reads now redirect to canonical
        # This must commit BEFORE the physical deletion
        db_client.execute("""
            UPDATE image_phash
            SET dedup_status = 'exact_duplicate',
                canonical_image_id = %s,
                dedup_processed_at = NOW()
            WHERE image_id = %s
        """, (canonical_image_id, dup_id))

        # Now safe to delete the physical object
        storage_client.delete_object(dup_id)

        # Log the reclamation for audit
        db_client.execute("""
            INSERT INTO storage_reclamation_event
                (group_id, reclaimed_image_id, bytes_freed, storage_path)
            VALUES (%s, %s, %s, %s)
        """, (group_id, dup_id, dup_meta.size_bytes, dup_meta.storage_path))

        bytes_freed += dup_meta.size_bytes

    # Update group totals
    db_client.execute("""
        UPDATE dedup_group
        SET total_bytes_saved = total_bytes_saved + %s,
            last_reclaimed_at = NOW()
        WHERE group_id = %s
    """, (bytes_freed, group_id))

    return bytes_freed

Key Insight

Storage reclamation must always update the metadata pointer (redirecting the duplicate image_id to the canonical) before deleting the physical object. If you delete first and then crash before updating metadata, users see a broken photo link. The reverse order - update metadata, then delete - means the worst-case scenario is a temporary extra storage cost, not data loss.

Scaling and Performance

Capacity Estimation - Google Photos Dedup Pipeline

Given:
  - 28B uploads per day = 324K uploads/sec average, 1.6M/sec peak (5x)
  - Average image size: 3MB (after compression to HEIC/JPEG)
  - 4 trillion stored images total
  - pHash: 8 bytes per image
  - CNN embedding: 512 floats x 4 bytes = 2048 bytes per image
  - LSH index: 8 buckets x 20 bytes per entry = 160 bytes per image

Index Storage:
  - pHash index: 4T * 8 bytes = 32 TB
  - Embedding index: 4T * 2048 bytes = 8 PB (must be sharded heavily)
  - LSH bucket index: 4T * 160 bytes = 640 TB (stored in Redis/KV)
  - Total index overhead: ~9 PB vs ~12 PB raw image storage (75% overhead - acceptable)

Compute for pHash at peak:
  - 1.6M uploads/sec * 5ms per pHash = 8,000 CPU-seconds/sec = 8,000 CPU cores
  - With 64-core servers: 125 pHash worker servers at peak

Compute for CNN embedding at peak:
  - 1.6M uploads/sec * 50ms per embedding (GPU) = need 80K GPU-ms/sec
  - Modern A100 GPU: ~200 images/sec in batch mode
  - At 1.6M uploads/sec: 8,000 GPUs at peak (batched)
  - With 5x peak factor: steady state ~1,600 GPUs for embeddings

LSH lookup latency:
  - 8 band lookups per image
  - Each bucket lookup: ~1ms (Redis pipeline)
  - Total LSH query: ~8ms with pipelining, ~3ms batched

ANN search latency (HNSW):
  - Each shard: 100M vectors, HNSW with M=32, ef=64: ~5ms per query
  - With routing layer overhead: ~15ms end-to-end

Horizontal scaling of dedup pipeline showing GPU embedding workers, LSH shards, and ANN index sharding

The scaling bottleneck is the CNN embedding computation, not the search. pHash computation is CPU-bound and horizontally scalable. The ANN search is memory-bound and distributable. But running 1.6 million GPU-accelerated CNN forward passes per second requires a dedicated GPU cluster and careful batching to amortize the per-image GPU launch overhead.

The mitigation is selective embedding computation. The pHash LSH check runs first and is cheap. If pHash finds no candidates within a threshold of 12 (a broader pre-filter), there is no need to compute the CNN embedding at all - the image is definitively unique. Approximately 70-80% of uploads in practice have no pHash candidates (they are truly new images), so only 20-30% of uploads actually need CNN embedding, reducing the GPU load by 3-4x.

Real World

Facebook’s “Similar Photos” feature uses an equivalent pipeline. Their engineering team published that they use a two-stage approach: a fast 128-bit pHash check to filter candidates, followed by a ResNet-50 embedding similarity check. The key finding from their paper: combining pHash (which is robust to crops and resizes) with CNN embeddings (which are robust to filters and color grading) achieves near-human-level precision on the dedup decision. No single hash type handles all the ways users create “near-duplicate” photos.

Failure Modes and Recovery

Failure	Detection	Impact	Recovery
LSH shard crash	Health check + increased error rate on bucket lookups	Uploads during crash window bypass LSH check, stored as “unique”	Shard restarts from snapshot; background re-processing job re-queues images stored as unique during outage
CNN embedding worker GPU OOM	Worker exits with OOM, job returns to queue with retry	Image stuck in “pending” dedup status; storage not reclaimed	Auto-restart worker; implement per-batch memory limit to prevent cascade OOM
Storage reclamation crash mid-group	Metadata pointer updated but physical delete incomplete	Small storage waste (orphaned objects); no data loss	Reclamation job is idempotent: re-running it skips images already marked as reclaimed
False positive dedup (wrong merge)	User reports “my photo is wrong”; error rate monitoring	User sees wrong photo when accessing their image	Rollback: restore original pointer, undelete canonical if needed (with 7-day deletion hold)
ANN index corruption	Query returns zero results for known duplicates; integration test failure	High false negative rate on embedding-based matching; dedup misses increases	Roll back to previous HNSW index snapshot; rebuild from embedding store asynchronously
Hash collision attack (malicious uploads)	Anomalous cluster of images all hashing to same pHash value	Storage reclamation erroneously groups unrelated images	Content safety pipeline pre-filters uploads; pHash collision without embedding match is not reclaimed

Watch Out

The 7-day deletion hold on reclaimed storage objects is not optional - it is the entire safety net for false positive dedup events. If you immediately hard-delete reclaimed objects, a false positive merge has permanent consequences (user permanently loses their photo). The hold gives the user support team and automated rollback systems time to detect and reverse the error before the physical bytes are gone.

Comparison of Approaches

Approach	Precision	Recall	Index Size	Query Latency	Best Fit
SHA-256 exact hash	100%	approx 5% (exact duplicates only)	32 bytes/image	under 1ms	Same file uploaded twice
pHash alone (threshold 8)	99.5%	65%	8 bytes/image + LSH	under 15ms	Resize, crop, re-export duplicates
CNN embedding ANN only	98%	88%	2KB/image	15-30ms	Filtered, color-graded duplicates
pHash + CNN embedding (two-pass)	99.9%	91%	2KB/image	20-40ms	Production: handles all duplicate types
Video-frame pHash (for burst photos)	99.2%	85%	8 bytes/frame	under 15ms	Burst photos, video thumbnails

The two-pass approach (pHash pre-filter + CNN embedding confirmation) is the right choice for production. The key non-obvious tradeoff: pHash is the speed champion and CNN is the accuracy champion. Using pHash alone at a strict threshold gives high precision but misses 35% of real duplicates (particularly filtered photos, which is a common user workflow). Using CNN alone is too expensive at 1.6M uploads/second without the pHash pre-filter.

The combined approach uses pHash as a coarse filter (threshold 12, high recall at the expense of precision) to narrow candidates, then CNN embedding for the final decision. This preserves recall while keeping GPU costs manageable. The two hash types are complementary: pHash is sensitive to geometric transforms (crop, flip) but robust to color; CNN embedding is robust to both, but expensive.

Key Takeaways

Perceptual hashing: pHash uses DCT frequency components to produce a 64-bit fingerprint that changes slowly with gradual image changes, making Hamming distance a robust similarity metric.
LSH scales the search: Brute-force comparison of 4 trillion hashes is impossible; LSH band hashing narrows candidates to thousands, making similarity search tractable.
Dual-hash strategy: pHash handles geometric near-duplicates; CNN embedding handles filtered/color-graded near-duplicates; combining both achieves 91% recall at 99.9% precision.
Similarity threshold tuning: The threshold is not a fixed constant but should be calibrated per content category using a labeled test set and precision-recall analysis.
ANN not exact search: HNSW-based approximate nearest neighbor search sacrifices a small amount of recall for a 1000x speedup over exact search - acceptable for dedup where the fallback is just slightly more storage.
Selective CNN embedding: Running pHash first and only escalating to CNN embedding for images with pHash candidates reduces GPU compute by 3-4x, making the pipeline cost-effective.
Storage reclamation safety: Always update metadata pointers before physical deletion; maintain a deletion hold window for rollback; storage reclamation jobs must be idempotent.
Feature vector storage: The 2KB CNN embedding per image at 4 trillion scale (8 PB) is the dominant cost of the dedup index - making embedding model selection a cost-vs-accuracy tradeoff, not just an accuracy question.

The surprising counter-intuitive lesson from this system is that the hardest engineering problem is not the similarity search algorithm - it is the operational safety net around false positives. A dedup system that achieves 99.9% precision sounds excellent, but at 28 billion uploads per day that still means 28 million incorrectly merged photos per day. The system’s quality depends entirely on fast detection of false positives (via user reports and automated integrity checks) and a reliable rollback path that does not permanently destroy user data.

Frequently Asked Questions

Q: Why not use byte-level hashing (SHA-256) for everything? A: Byte-level hashing only catches literal file duplicates - the same bytes uploaded twice. It misses re-exported JPEGs, photos saved at different quality levels, cropped versions of the same photo, and screenshots of photos. In practice, these perceptual near-duplicates vastly outnumber byte-level duplicates. SHA-256 is still run as the first stage (it’s essentially free) but catches only ~5% of actual duplicates.

Q: Why not use a simpler approach like average hashing (aHash) instead of pHash? A: Average hashing (set bits based on above/below mean pixel value in a downscaled image) is simpler but less robust. It is easily fooled by brightness shifts (a slightly underexposed version of a photo will have a very different aHash). pHash using DCT frequencies is invariant to these global brightness changes because the low-frequency DCT components are not sensitive to a uniform brightness offset. The extra complexity of DCT is worth the robustness.

Q: How do you handle a user who genuinely wants to store two very similar photos as separate items? A: The dedup pipeline marks photos as near-duplicates but does not force a merge at the user-visible level. The user’s library view always shows both photos. The dedup only affects backend storage: if two users upload the same viral meme, one physical copy is stored and both metadata records point to it. Within a single user’s library, near-duplicate photos are grouped visually (“you have 12 similar photos - would you like to clean up?”) but not forced-merged.

Q: Why not just hash a thumbnail instead of running a full DCT on a resized image? A: Simple thumbnail comparison (byte hashing a 32x32 thumbnail) is not robust to cropping - a photo cropped to remove a border has a completely different pixel arrangement. pHash is crop-robust because DCT low-frequency components represent global image structure rather than specific pixel positions. A 10% crop changes only ~10% of DCT coefficients, not 100% of pixel values.

Q: How do you prevent the LSH bucket for a popular meme from growing to millions of entries? A: Viral images that are uploaded by millions of users will cause their LSH buckets to accumulate millions of entries. The mitigation is bucket overflow routing: when a bucket exceeds a threshold size (e.g., 10,000 entries), subsequent insertions are redirected to an overflow shard dedicated to that bucket, and queries fan out to both the primary bucket and the overflow shard. Very high-frequency buckets are also tagged as “hot buckets” and their query results are precomputed and cached.

Q: What’s the latency budget and how do you keep dedup from blocking photo availability? A: Dedup is entirely asynchronous. The user’s photo is available in their library immediately after upload. The dedup pipeline runs within 2 seconds as a background job. During that window, the photo appears as a normal, non-deduplicated entry. If dedup finds a match, the metadata is updated and the canonical pointer is set. Users never experience a delay waiting for dedup to complete.

Interview Questions

Q: Walk through how you’d design the LSH bucket structure for 4 trillion pHash values distributed across 100 shard servers. Expected depth: Discuss consistent hashing of bucket_id to shard servers, the tradeoff between number of bands (more bands = more bucket lookups = higher recall but more cross-shard fan-out), bucket size distribution (viral images create hot buckets requiring overflow handling), replication of hot buckets for read scalability, and how to handle shard failures during a dedup query (partial results vs. wait-for-all).

Q: How would you tune the similarity threshold for a new content category (e.g., AI-generated images) where you have no labeled training data? Expected depth: Discuss bootstrap labeling (human annotation of a small sample, ~1,000 pairs), conservative starting threshold (prefer high precision / low recall to avoid false merges), gradual threshold relaxation with monitoring of user complaint rates, the challenge that AI-generated images may have different pHash distributions than natural photos (same subject generated twice may have high visual similarity but are legitimately separate), and A/B testing the threshold change on a 1% traffic slice before global rollout.

Q: The CEO sees that storage reclamation is only freeing 2% of expected bytes. How would you debug this? Expected depth: Discuss the dedup funnel: total uploads -> pHash candidates found -> embedding confirmation -> reclamation job run -> bytes freed. Each stage has a conversion rate. Debug by: checking if the LSH index is stale (not receiving updates), checking if reclamation job is running (cron/scheduler issue), checking if the dedup status is being set correctly in the database, measuring what fraction of uploads are cross-user vs. within-user (cross-user dedup yields much more savings), and comparing theoretical dedup rate with observed (the input dataset may genuinely have fewer duplicates than assumed).

Q: How would you extend the pipeline to detect near-duplicate videos in addition to photos? Expected depth: Discuss keyframe extraction (sample 1 frame per second, hash each frame), video-level pHash as the XOR of frame hashes (robust to minor variations across the full video), the challenge of temporal offset (same video trimmed by 5 seconds at the start is still a duplicate), using audio fingerprinting as a complementary signal, and the storage implications (a 1-minute video has ~60 frames to hash vs. 1 for a photo).

Q: Two users upload photos from the same wedding. Many of their photos are near-duplicates of each other but belong to different people. How does dedup handle this correctly? Expected depth: Discuss the distinction between storage-layer dedup (the physical bytes may be the same) and user-visible dedup (both users should see their own photos). Cover the metadata pointer architecture (each user’s image_id is preserved; only the storage object is potentially shared), privacy implications (user A should not be able to discover that user B has the same photo - canonical pointers must not leak user IDs across accounts), the role of access control at the canonical object level, and cases where dedup should be suppressed (a photo that is a user’s only copy of a memorable moment should never be reclaimed regardless of duplicates).

Premium Content

Unlock the full article along with everything else in the archive — all in one place.

In-depth analysis Expert insights Full archive access

Unlock Full Article