Build Google Photos Duplicate Detection Pipeline
data-engineering scalability performance
System Design Deep Dive
Google Photos Duplicate Detection
Detecting near-duplicate images at 28 billion uploads per day without storing a full copy of every photo
Imagine a library where every new book donation is compared against every existing book in the collection before being shelved - but the collection has 4 trillion books, the donations arrive at a rate of 28 billion per day, and “same book” includes not just identical copies but also slightly worn copies, cropped copies, and photocopies of select pages. That is roughly the challenge Google Photos faces with duplicate detection. The goal is not to refuse storage - users expect every photo they upload to be accessible - but to avoid storing redundant bytes for content that is functionally identical, and to present users with deduplicated views of their library.
The first wrong instinct is byte-level hashing (MD5, SHA-256). Two JPEGs of the same physical photograph taken five minutes apart, or the same photo saved at two different quality settings, will produce completely different hashes despite representing the same image. Screenshots of the same meme shared across messaging apps are compressed differently by each platform. Photos burst-shot on a phone look nearly identical but differ in pixel values due to motion blur and focus variance. Byte hashing catches only literal duplicates - the same file uploaded twice. It misses the 95% of “duplicates” that matter.
Perceptual hashing solves this by transforming an image into a fingerprint that is robust to minor changes. Two perceptually similar images (same subject, slightly different crop, brightness, or compression) will produce the same or similar fingerprints. But perceptual hashing alone at Google Photos scale - 4 trillion stored photos, 28 billion new uploads per day - creates a matching problem. You can’t compare each incoming photo’s hash against 4 trillion stored hashes in constant time. The hash comparison space is astronomically large.
This is where Locality-Sensitive Hashing (LSH) enters. LSH is a probabilistic technique that organizes similar fingerprints into the same “bucket” with high probability, while dissimilar fingerprints land in different buckets. Instead of comparing one incoming hash against 4 trillion hashes, you only compare it against the hashes in its LSH bucket - typically a few thousand. This makes approximate nearest-neighbor search tractable at billion-scale without sacrificing too much recall.
The core architectural decisions we need to solve for simultaneously: how to generate perceptual fingerprints that are robust to real-world image variations; how to implement LSH bucketing that achieves acceptable precision and recall at 4 trillion scale; how to tune the similarity threshold to avoid both over-deduplication (merging non-duplicate photos) and under-deduplication (storing redundant copies); and how to reclaim storage for detected duplicates without making photos inaccessible to their owners.
Requirements and Constraints
Functional Requirements
- Detect near-duplicate photos at upload time and link them to an existing canonical copy
- Support perceptual similarity (not just byte-level) - same image at different resolutions, crops, quality settings
- Present users with a deduplicated view of their library (show one copy, not N copies)
- Allow users to access their original upload even when deduplication has occurred
- Reclaim storage for true duplicates: multiple users uploading the same stock photo or viral meme
- Detect duplicates within a single user’s library (burst photos, accidental re-uploads)
Non-Functional Requirements
- 28 billion uploads per day (peak: 5x average = ~1.6 million uploads per second)
- 4 trillion total stored images as the index to search against
- Duplicate detection latency: under 2 seconds per upload (asynchronous, not blocking storage)
- False positive rate (incorrectly merging non-duplicate photos): under 0.001%
- False negative rate (missing actual duplicates): under 5%
- Index storage: must not exceed 2x the storage of the photos themselves
- Similarity search p99 latency: under 100ms
- System availability: 99.99%
Constraints and Assumptions
- We focus on photographic images (JPEG, PNG, HEIC) - not videos, GIFs, or documents
- “Duplicate” is defined as visual similarity above a tunable threshold, not semantic similarity
- Cross-user deduplication (two users uploading the same photo) is handled at the storage layer, not the index layer
- We assume photos are stored in object storage (GCS equivalent) after upload
- Hash computation happens on a decoded image thumbnail, not the raw file
- We do not handle adversarial attacks (images designed to collide in the hash space)
High-Level Architecture
The pipeline has five major stages that form an assembly line: Image Ingestion, Feature Extraction, LSH Indexing, Similarity Search, and Storage Reclamation.
Image Ingestion receives the raw upload, stores it durably in object storage, and places a processing job on a distributed queue. This stage is synchronous to the user - they see the photo in their library immediately. Everything downstream is asynchronous.
Feature Extraction consumes from the queue, decodes the image, resizes it to a standard thumbnail (typically 32x32 for pHash), and computes the perceptual hash. For higher-accuracy matching, it also runs a CNN-based feature extractor that produces a 512-dimensional embedding vector. The hash is a 64-bit integer; the embedding is 512 floats.
LSH Indexing takes the 64-bit pHash and runs it through L LSH hash functions (typically L=8), generating L “band hashes” that determine which LSH buckets this image belongs to. The image is inserted into all L buckets, and its 512-d embedding is written to the vector feature store.
Similarity Search queries the LSH bucket(s) that the incoming image maps to, retrieves candidate matches, and re-ranks them by exact Hamming distance (for pHash) or cosine similarity (for embeddings). If any candidate exceeds the similarity threshold, the upload is flagged as a near-duplicate.
Storage Reclamation is an offline background job that finds confirmed duplicate groups, designates one as the canonical copy, updates metadata pointers, and physically deletes redundant object storage copies. Users still see their photo; they just retrieve it from the canonical copy.
The LSH step is a recall amplification mechanism, not a precision mechanism. It efficiently narrows the candidate set from 4 trillion photos to a few thousand - but it will miss some true duplicates (false negatives). The precision and recall knobs are the number of LSH bands (more bands = higher recall, more candidates to check) and the similarity threshold (lower = more aggressive deduplication, higher false positive risk).
Perceptual Hashing
Perceptual hashing is the process of converting an image to a compact fingerprint that changes slowly as the image changes gradually - much like how a piece of music’s melody is recognizable even when played slightly faster or in a different key.
The most widely used algorithm is pHash (Discrete Cosine Transform hash). It works by:
- Resize the image to 32x32 pixels (discards fine detail, retains structure)
- Convert to grayscale (color variations don’t affect structural similarity)
- Compute the 2D Discrete Cosine Transform (DCT) of the 32x32 grid
- Take the top-left 8x8 DCT coefficients (the low-frequency components - the “melody” of the image)
- Compute the median of these 64 values
- For each of the 64 values: set bit to 1 if above median, 0 if below
The result is a 64-bit integer. Two images with similar structure produce hashes that differ in only a few bits. The Hamming distance between two hashes (number of differing bits) is the similarity metric: distance 0 means identical structure, distance 1-5 means very similar, distance > 10 typically means different images.
# Perceptual hash (pHash) implementation using DCT
import numpy as np
from PIL import Image
from scipy.fftpack import dct
def compute_phash(image_bytes: bytes, hash_size: int = 8) -> int:
"""
Compute a 64-bit perceptual hash of an image.
hash_size=8 produces an 8x8=64 bit hash.
Returns the hash as a Python integer.
"""
# Decode and preprocess
from io import BytesIO
img = Image.open(BytesIO(image_bytes))
# Convert to grayscale and resize to hash_size * 4 = 32x32
# The 4x scale gives the DCT more data to work with
high_freq_factor = 4
img_size = hash_size * high_freq_factor # 32x32
img = img.convert('L').resize((img_size, img_size), Image.LANCZOS)
pixels = np.array(img, dtype=float)
# 2D DCT - compute frequency components
dct_coeffs = dct(dct(pixels, axis=0, norm='ortho'), axis=1, norm='ortho')
# Take the top-left hash_size x hash_size low-frequency coefficients
low_freq = dct_coeffs[:hash_size, :hash_size]
# Compute median and binarize
median = np.median(low_freq)
bits = (low_freq > median).flatten()
# Pack 64 bits into a single integer
hash_val = 0
for bit in bits:
hash_val = (hash_val << 1) | int(bit)
return hash_val
def hamming_distance(hash_a: int, hash_b: int, hash_bits: int = 64) -> int:
"""Hamming distance between two perceptual hashes (number of differing bits)."""
xor = hash_a ^ hash_b
# Count set bits (popcount) - Python has bin() for this
return bin(xor).count('1')
def are_perceptual_duplicates(hash_a: int, hash_b: int, threshold: int = 8) -> bool:
"""
Two images are near-duplicates if their Hamming distance <= threshold.
threshold=8 means they can differ in up to 8/64 = 12.5% of bits.
Typical tuning: 4-6 for strict matching, 8-10 for loose matching.
"""
return hamming_distance(hash_a, hash_b) <= threshold
pHash has known failure modes: two images with the same dominant frequency structure but different content (e.g., two solid-color images, or two images with the same composition but different subjects) can produce similar hashes. This is why the pipeline uses a two-pass approach: pHash for fast candidate filtering, then CNN embedding cosine similarity for precise confirmation. Never use pHash alone as the final deduplication decision.
Locality-Sensitive Hashing
LSH solves the scalability problem. With 4 trillion stored hashes, brute-force Hamming distance comparison at 1.6 million uploads per second is computationally impossible - that’s 6.4 x 10^18 comparisons per second. LSH trades some recall for tractable query time.
The intuition behind LSH for Hamming distance: if two hashes differ in only 4 bits out of 64, and we randomly select a subset of 16 bits from each hash, there’s a high probability that both hashes have identical values in those 16 bits. Two dissimilar hashes (differing in 30 bits) are much less likely to agree on any random 16-bit subset.
Band hashing is the practical implementation. We partition the 64-bit pHash into L bands of B bits each (e.g., 8 bands of 8 bits). For each band, we compute a hash of that band’s value along with a band ID to produce a bucket ID. An image is placed into L buckets. At query time, the incoming image is hashed into the same L buckets, and candidates are retrieved from the union of those buckets.
# Locality-Sensitive Hashing for 64-bit pHash
import hashlib
from typing import List, Tuple, Set
class PHashLSH:
"""
LSH index for 64-bit perceptual hashes.
Uses band hashing: partitions the 64-bit hash into L bands of B bits.
Higher L = better recall but more candidate checks.
Smaller B = more candidates per bucket.
"""
def __init__(self, num_bands: int = 8, bits_per_band: int = 8):
assert num_bands * bits_per_band == 64, "bands * bits_per_band must equal 64"
self.num_bands = num_bands
self.bits_per_band = bits_per_band
self.band_mask = (1 << bits_per_band) - 1
# In production: Redis hash maps or a distributed KV store
# Here simulated as a Python dict: bucket_id -> list of (image_id, full_hash)
self._buckets: dict[int, list[tuple[str, int]]] = {}
def _band_bucket_id(self, phash: int, band_idx: int) -> int:
"""Extract the band_idx-th B-bit segment of the hash and compute bucket ID."""
shift = band_idx * self.bits_per_band
band_value = (phash >> shift) & self.band_mask
# Include band_idx in the hash key to prevent cross-band collisions
combined = (band_idx << 32) | band_value
return combined
def insert(self, image_id: str, phash: int) -> List[int]:
"""
Insert an image into all L LSH buckets.
Returns the list of bucket IDs for debugging.
"""
bucket_ids = []
for band_idx in range(self.num_bands):
bkt = self._band_bucket_id(phash, band_idx)
if bkt not in self._buckets:
self._buckets[bkt] = []
self._buckets[bkt].append((image_id, phash))
bucket_ids.append(bkt)
return bucket_ids
def query(self, phash: int, hamming_threshold: int = 8) -> List[Tuple[str, int]]:
"""
Find all indexed images within hamming_threshold of the query hash.
Returns list of (image_id, hamming_distance) sorted by distance.
"""
# Gather all candidates from all bands (union)
seen_ids: Set[str] = set()
candidates: List[Tuple[str, int]] = []
for band_idx in range(self.num_bands):
bkt = self._band_bucket_id(phash, band_idx)
for img_id, stored_hash in self._buckets.get(bkt, []):
if img_id in seen_ids:
continue
seen_ids.add(img_id)
dist = hamming_distance(phash, stored_hash)
if dist <= hamming_threshold:
candidates.append((img_id, dist))
# Sort by Hamming distance (most similar first)
candidates.sort(key=lambda x: x[1])
return candidates
def lsh_theoretical_recall(
num_bands: int, bits_per_band: int, hamming_distance: int
) -> float:
"""
Probability that two hashes at a given Hamming distance
will share at least one LSH band bucket (i.e., will be found as candidates).
"""
# Probability that a single bit agrees: (64 - d) / 64
# Probability that all bits in a band agree: ((64-d)/64)^B
# Probability that at least one band agrees: 1 - (1 - ((64-d)/64)^B)^L
p_band_agree = ((64 - hamming_distance) / 64) ** bits_per_band
recall = 1.0 - (1.0 - p_band_agree) ** num_bands
return recall
For 8 bands of 8 bits, the theoretical recall is:
- At Hamming distance 4: recall ~ 90%
- At Hamming distance 8: recall ~ 58%
- At Hamming distance 12: recall ~ 27%
This means at our threshold of 8, we miss about 42% of true duplicates in the first LSH pass. To compensate, we run multiple independent LSH tables with different band configurations and take the union of candidates. With 4 independent LSH tables, recall at distance 8 rises to ~91%.
LSH recall is not the final recall of the deduplication system. The CNN embedding-based ANN search runs in parallel on the same upload and catches duplicates that pHash LSH misses (especially images that have been color-graded, filtered, or had text overlaid - which change pHash significantly but are visually the same photo).
CNN Feature Vectors and ANN Search
For images where pHash is unreliable (heavily filtered photos, screenshots of photos, photos with text overlaid), the system uses a CNN-based embedding. A pre-trained image encoder (similar to Google’s Vision Transformer or a fine-tuned ResNet) maps each image to a 512-dimensional vector where visually similar images cluster near each other in the embedding space.
The ANN (Approximate Nearest Neighbor) index for 512-d vectors at 4 trillion scale is a different beast from the LSH bucket approach. The system uses a hierarchical HNSW (Hierarchical Navigable Small World) index, but even HNSW at 4 trillion scale requires sharding across hundreds of machines with an index-of-indexes routing layer.
# ANN search using HNSW index (faiss-based)
import numpy as np
import faiss
class EmbeddingANNIndex:
"""
Approximate Nearest Neighbor index for 512-d CNN embeddings.
Uses HNSW index for low-latency queries.
In production this is sharded: each shard holds ~100M vectors.
"""
def __init__(self, dim: int = 512, m: int = 32, ef_construction: int = 200):
"""
m: number of connections per node in HNSW (higher = better recall, more memory)
ef_construction: search width during build (higher = better index quality)
"""
self.dim = dim
# HNSW index from faiss - ~200 bytes per vector + HNSW overhead
self.index = faiss.IndexHNSWFlat(dim, m)
self.index.hnsw.efConstruction = ef_construction
self.index.hnsw.efSearch = 64 # search width at query time
# Map from faiss internal ID to image_id
self.id_map: dict[int, str] = {}
self._next_id = 0
def add(self, image_id: str, embedding: np.ndarray) -> None:
"""Add a single 512-d embedding to the index."""
assert embedding.shape == (self.dim,), f"Expected ({self.dim},), got {embedding.shape}"
vec = embedding.astype(np.float32).reshape(1, -1)
# Normalize to unit sphere for cosine similarity via inner product
faiss.normalize_L2(vec)
self.index.add(vec)
self.id_map[self._next_id] = image_id
self._next_id += 1
def search(self, query_embedding: np.ndarray, k: int = 20) -> list[tuple[str, float]]:
"""
Returns up to k nearest neighbors as (image_id, cosine_similarity).
cosine_similarity of 1.0 = identical, 0.0 = orthogonal.
"""
vec = query_embedding.astype(np.float32).reshape(1, -1)
faiss.normalize_L2(vec)
distances, indices = self.index.search(vec, k)
results = []
for dist, idx in zip(distances[0], indices[0]):
if idx == -1:
continue # fewer than k results
image_id = self.id_map.get(idx)
if image_id:
results.append((image_id, float(dist))) # dist is cosine similarity (after L2 norm)
return results
def is_duplicate(self, embedding: np.ndarray, threshold: float = 0.95) -> tuple[bool, str | None]:
"""
Returns (is_duplicate, canonical_image_id) if similarity >= threshold.
threshold=0.95 corresponds to very high visual similarity.
"""
results = self.search(embedding, k=1)
if results and results[0][1] >= threshold:
return True, results[0][0]
return False, None
Pinterest built a near-identical system called “Visual Discovery” for deduplication of their 300 billion pins. They use pHash for fast initial filtering and a fine-tuned ResNet for high-precision re-ranking. The key lesson from their engineering blog: the similarity threshold is not a constant but a function of the image’s content category. Stock photos require a strict threshold (0.98+) because similar stock photos are intentionally distinct products. User vacation photos can use a looser threshold (0.90) because burst shots of the same scene are true duplicates to deduplicate.
Dedup Pipeline Stages
The full deduplication pipeline is a multi-stage assembly line. Each stage is independently scalable and fails independently without blocking the upload.
# Multi-stage deduplication pipeline orchestrator
from dataclasses import dataclass
from enum import Enum
from typing import Optional
import time
class DedupStatus(Enum):
UNIQUE = "unique"
NEAR_DUPLICATE = "near_duplicate"
EXACT_DUPLICATE = "exact_duplicate"
PENDING = "pending"
@dataclass
class DedupResult:
image_id: str
status: DedupStatus
canonical_id: Optional[str] # ID of the canonical copy if duplicate
phash: int
embedding: list[float]
hamming_distance: Optional[int] # if pHash match found
cosine_similarity: Optional[float] # if embedding match found
processing_time_ms: float
def run_dedup_pipeline(
image_id: str,
image_bytes: bytes,
phash_lsh_index, # PHashLSH instance (distributed in production)
embedding_ann_index, # EmbeddingANNIndex instance (sharded in production)
cnn_model, # Feature extractor
phash_threshold: int = 8,
embedding_threshold: float = 0.95,
) -> DedupResult:
"""
Full deduplication pipeline. Runs pHash and embedding checks in parallel.
Returns the final dedup decision.
"""
start_ms = time.monotonic() * 1000
# Stage 1: Compute pHash (fast, ~5ms)
phash = compute_phash(image_bytes)
# Stage 2: Compute CNN embedding (slower, ~50ms on GPU)
# In production: batched across multiple uploads for throughput
embedding = cnn_model.encode(image_bytes) # returns np.ndarray of shape (512,)
# Stage 3: LSH candidate retrieval (~10ms)
phash_candidates = phash_lsh_index.query(phash, hamming_threshold=phash_threshold)
# Stage 4: ANN embedding search (~20ms)
embedding_candidates = embedding_ann_index.search(embedding, k=10)
# Stage 5: Candidate re-ranking and decision
# Check exact SHA-256 match first (instant dedup at zero cost)
import hashlib
sha256 = hashlib.sha256(image_bytes).hexdigest()
result_status = DedupStatus.UNIQUE
canonical_id = None
hamming_dist = None
cosine_sim = None
# Priority: exact byte match > pHash near-match > embedding near-match
# (exact SHA-256 match would be caught upstream at the storage layer)
if phash_candidates:
best_phash_match = phash_candidates[0] # (image_id, distance)
hamming_dist = best_phash_match[1]
if hamming_dist <= phash_threshold:
result_status = DedupStatus.NEAR_DUPLICATE
canonical_id = best_phash_match[0]
if embedding_candidates:
best_emb_match = embedding_candidates[0] # (image_id, similarity)
cosine_sim = best_emb_match[1]
if cosine_sim >= embedding_threshold:
# Embedding match overrides pHash if higher confidence
if result_status == DedupStatus.UNIQUE or cosine_sim > (1.0 - hamming_dist/64.0 if hamming_dist else 0):
result_status = DedupStatus.NEAR_DUPLICATE
canonical_id = best_emb_match[0]
end_ms = time.monotonic() * 1000
return DedupResult(
image_id=image_id,
status=result_status,
canonical_id=canonical_id,
phash=phash,
embedding=embedding.tolist(),
hamming_distance=hamming_dist,
cosine_similarity=cosine_sim,
processing_time_ms=end_ms - start_ms,
)
Data Model
-- Core schema for the duplicate detection pipeline
-- Perceptual hash index: one row per uploaded image
CREATE TABLE image_phash (
image_id UUID PRIMARY KEY,
user_id BIGINT NOT NULL,
phash_64 BIGINT NOT NULL, -- 64-bit perceptual hash
sha256_hex CHAR(64) NOT NULL, -- exact byte hash for fast exact dedup
file_size_bytes BIGINT NOT NULL,
width_px INT NOT NULL,
height_px INT NOT NULL,
format VARCHAR(8) NOT NULL, -- jpeg, png, heic, webp
uploaded_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
dedup_status VARCHAR(16) NOT NULL DEFAULT 'pending'
CHECK (dedup_status IN ('pending', 'unique', 'near_duplicate', 'exact_duplicate')),
canonical_image_id UUID REFERENCES image_phash(image_id),
hamming_distance SMALLINT, -- distance to canonical, if duplicate
cosine_similarity FLOAT, -- embedding similarity to canonical, if duplicate
dedup_processed_at TIMESTAMPTZ
);
-- Fast exact-duplicate lookup by SHA-256
CREATE UNIQUE INDEX idx_phash_sha256 ON image_phash (sha256_hex);
-- Lookup by user for within-library dedup
CREATE INDEX idx_phash_user ON image_phash (user_id, uploaded_at DESC);
-- Lookup all non-canonical images pointing to a canonical (for storage reclamation)
CREATE INDEX idx_phash_canonical ON image_phash (canonical_image_id)
WHERE canonical_image_id IS NOT NULL;
-- LSH bucket table: maps band bucket IDs to image IDs
-- This is the in-memory structure externalized for persistence
-- In production: Redis hash maps keyed by bucket_id
CREATE TABLE lsh_bucket (
bucket_id BIGINT NOT NULL, -- (band_idx << 32) | band_value
image_id UUID NOT NULL REFERENCES image_phash(image_id),
phash_64 BIGINT NOT NULL, -- denormalized for fast Hamming check
inserted_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
PRIMARY KEY (bucket_id, image_id)
);
CREATE INDEX idx_lsh_bucket_lookup ON lsh_bucket (bucket_id, phash_64);
-- Dedup group: tracks a set of images that are near-duplicates of each other
CREATE TABLE dedup_group (
group_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
canonical_image_id UUID NOT NULL REFERENCES image_phash(image_id),
member_count INT NOT NULL DEFAULT 1,
total_bytes_saved BIGINT NOT NULL DEFAULT 0, -- bytes reclaimed
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
last_reclaimed_at TIMESTAMPTZ
);
-- Storage reclamation log
CREATE TABLE storage_reclamation_event (
reclaim_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
group_id UUID NOT NULL REFERENCES dedup_group(group_id),
reclaimed_image_id UUID NOT NULL,
bytes_freed BIGINT NOT NULL,
storage_path TEXT NOT NULL, -- GCS path of deleted object
reclaimed_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_reclaim_group ON storage_reclamation_event (group_id, reclaimed_at);
-- Feature vector storage: separate table (in practice, a vector DB)
-- Stores 512-d CNN embeddings for ANN search
CREATE TABLE image_embedding (
image_id UUID PRIMARY KEY REFERENCES image_phash(image_id),
embedding_model VARCHAR(32) NOT NULL DEFAULT 'vit-base-patch16',
embedding_bytes BYTEA NOT NULL, -- 512 x float32 = 2048 bytes
computed_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
The sharding key for image_phash is user_id for within-library dedup queries. For cross-library dedup (same photo uploaded by multiple users - the viral meme problem), queries are against the global pHash + LSH index, which is partitioned by phash_64 % N_shards to co-locate similar hashes on the same shard.
Key Algorithms and Protocols
Similarity Threshold Tuning
The similarity threshold is the most consequential operational decision in the dedup pipeline. Setting it too low causes false positives (merging non-duplicate photos, destroying user trust). Setting it too high causes false negatives (storing redundant copies, wasting money). The right threshold varies by use case.
# Threshold tuning analysis using labeled test set
from dataclasses import dataclass
from typing import List, Tuple
import numpy as np
@dataclass
class LabeledPair:
hash_a: int
hash_b: int
is_duplicate: bool # ground truth label
def precision_recall_curve(
labeled_pairs: List[LabeledPair],
thresholds: List[int] = None,
) -> List[Tuple[int, float, float, float]]:
"""
Returns (threshold, precision, recall, f1) for each threshold.
Used to select the optimal Hamming distance threshold.
"""
if thresholds is None:
thresholds = list(range(0, 32))
results = []
total_positives = sum(1 for p in labeled_pairs if p.is_duplicate)
for threshold in thresholds:
tp = 0 # correctly identified as duplicate
fp = 0 # non-duplicate incorrectly flagged as duplicate
fn = 0 # duplicate not caught (missed)
for pair in labeled_pairs:
dist = hamming_distance(pair.hash_a, pair.hash_b)
predicted_dup = dist <= threshold
if predicted_dup and pair.is_duplicate:
tp += 1
elif predicted_dup and not pair.is_duplicate:
fp += 1
elif not predicted_dup and pair.is_duplicate:
fn += 1
precision = tp / (tp + fp) if (tp + fp) > 0 else 1.0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
results.append((threshold, precision, recall, f1))
return results
def select_threshold_at_precision(
labeled_pairs: List[LabeledPair],
min_precision: float = 0.9999, # 0.001% false positive rate target
) -> int:
"""
Select the highest threshold (most aggressive dedup) that still meets
the minimum precision requirement. This biases toward recall.
"""
curve = precision_recall_curve(labeled_pairs)
best_threshold = 0
for threshold, precision, recall, f1 in curve:
if precision >= min_precision:
best_threshold = threshold
return best_threshold
Storage Reclamation
Once a duplicate group is identified, storage reclamation runs as an offline batch job. It must be careful: reclaiming storage before verifying the canonical copy is durably stored (with redundancy) would cause data loss.
# Storage reclamation - safely free duplicate copies
from typing import List
import time
def reclaim_duplicate_group(
group_id: str,
canonical_image_id: str,
duplicate_image_ids: List[str],
storage_client, # GCS or equivalent object storage client
db_client, # database client
min_canonical_replicas: int = 3, # safety check before deletion
) -> int:
"""
Free storage for duplicate copies. Returns bytes freed.
Runs in a transaction to prevent partial state.
"""
# Safety check: verify canonical exists and is replicated
canonical_meta = storage_client.get_object_metadata(canonical_image_id)
if canonical_meta is None:
raise ValueError(f"Canonical image {canonical_image_id} not found in storage")
if canonical_meta.replica_count < min_canonical_replicas:
raise ValueError(f"Canonical has only {canonical_meta.replica_count} replicas, need {min_canonical_replicas}")
bytes_freed = 0
for dup_id in duplicate_image_ids:
if dup_id == canonical_image_id:
continue # never delete the canonical
dup_meta = storage_client.get_object_metadata(dup_id)
if dup_meta is None:
continue # already reclaimed
# Update metadata pointer before deleting: user reads now redirect to canonical
# This must commit BEFORE the physical deletion
db_client.execute("""
UPDATE image_phash
SET dedup_status = 'exact_duplicate',
canonical_image_id = %s,
dedup_processed_at = NOW()
WHERE image_id = %s
""", (canonical_image_id, dup_id))
# Now safe to delete the physical object
storage_client.delete_object(dup_id)
# Log the reclamation for audit
db_client.execute("""
INSERT INTO storage_reclamation_event
(group_id, reclaimed_image_id, bytes_freed, storage_path)
VALUES (%s, %s, %s, %s)
""", (group_id, dup_id, dup_meta.size_bytes, dup_meta.storage_path))
bytes_freed += dup_meta.size_bytes
# Update group totals
db_client.execute("""
UPDATE dedup_group
SET total_bytes_saved = total_bytes_saved + %s,
last_reclaimed_at = NOW()
WHERE group_id = %s
""", (bytes_freed, group_id))
return bytes_freed
Storage reclamation must always update the metadata pointer (redirecting the duplicate image_id to the canonical) before deleting the physical object. If you delete first and then crash before updating metadata, users see a broken photo link. The reverse order - update metadata, then delete - means the worst-case scenario is a temporary extra storage cost, not data loss.
Scaling and Performance
Capacity Estimation - Google Photos Dedup Pipeline
Given:
- 28B uploads per day = 324K uploads/sec average, 1.6M/sec peak (5x)
- Average image size: 3MB (after compression to HEIC/JPEG)
- 4 trillion stored images total
- pHash: 8 bytes per image
- CNN embedding: 512 floats x 4 bytes = 2048 bytes per image
- LSH index: 8 buckets x 20 bytes per entry = 160 bytes per image
Index Storage:
- pHash index: 4T * 8 bytes = 32 TB
- Embedding index: 4T * 2048 bytes = 8 PB (must be sharded heavily)
- LSH bucket index: 4T * 160 bytes = 640 TB (stored in Redis/KV)
- Total index overhead: ~9 PB vs ~12 PB raw image storage (75% overhead - acceptable)
Compute for pHash at peak:
- 1.6M uploads/sec * 5ms per pHash = 8,000 CPU-seconds/sec = 8,000 CPU cores
- With 64-core servers: 125 pHash worker servers at peak
Compute for CNN embedding at peak:
- 1.6M uploads/sec * 50ms per embedding (GPU) = need 80K GPU-ms/sec
- Modern A100 GPU: ~200 images/sec in batch mode
- At 1.6M uploads/sec: 8,000 GPUs at peak (batched)
- With 5x peak factor: steady state ~1,600 GPUs for embeddings
LSH lookup latency:
- 8 band lookups per image
- Each bucket lookup: ~1ms (Redis pipeline)
- Total LSH query: ~8ms with pipelining, ~3ms batched
ANN search latency (HNSW):
- Each shard: 100M vectors, HNSW with M=32, ef=64: ~5ms per query
- With routing layer overhead: ~15ms end-to-end
The scaling bottleneck is the CNN embedding computation, not the search. pHash computation is CPU-bound and horizontally scalable. The ANN search is memory-bound and distributable. But running 1.6 million GPU-accelerated CNN forward passes per second requires a dedicated GPU cluster and careful batching to amortize the per-image GPU launch overhead.
The mitigation is selective embedding computation. The pHash LSH check runs first and is cheap. If pHash finds no candidates within a threshold of 12 (a broader pre-filter), there is no need to compute the CNN embedding at all - the image is definitively unique. Approximately 70-80% of uploads in practice have no pHash candidates (they are truly new images), so only 20-30% of uploads actually need CNN embedding, reducing the GPU load by 3-4x.
Facebook’s “Similar Photos” feature uses an equivalent pipeline. Their engineering team published that they use a two-stage approach: a fast 128-bit pHash check to filter candidates, followed by a ResNet-50 embedding similarity check. The key finding from their paper: combining pHash (which is robust to crops and resizes) with CNN embeddings (which are robust to filters and color grading) achieves near-human-level precision on the dedup decision. No single hash type handles all the ways users create “near-duplicate” photos.
Failure Modes and Recovery
| Failure | Detection | Impact | Recovery |
|---|---|---|---|
| LSH shard crash | Health check + increased error rate on bucket lookups | Uploads during crash window bypass LSH check, stored as “unique” | Shard restarts from snapshot; background re-processing job re-queues images stored as unique during outage |
| CNN embedding worker GPU OOM | Worker exits with OOM, job returns to queue with retry | Image stuck in “pending” dedup status; storage not reclaimed | Auto-restart worker; implement per-batch memory limit to prevent cascade OOM |
| Storage reclamation crash mid-group | Metadata pointer updated but physical delete incomplete | Small storage waste (orphaned objects); no data loss | Reclamation job is idempotent: re-running it skips images already marked as reclaimed |
| False positive dedup (wrong merge) | User reports “my photo is wrong”; error rate monitoring | User sees wrong photo when accessing their image | Rollback: restore original pointer, undelete canonical if needed (with 7-day deletion hold) |
| ANN index corruption | Query returns zero results for known duplicates; integration test failure | High false negative rate on embedding-based matching; dedup misses increases | Roll back to previous HNSW index snapshot; rebuild from embedding store asynchronously |
| Hash collision attack (malicious uploads) | Anomalous cluster of images all hashing to same pHash value | Storage reclamation erroneously groups unrelated images | Content safety pipeline pre-filters uploads; pHash collision without embedding match is not reclaimed |
The 7-day deletion hold on reclaimed storage objects is not optional - it is the entire safety net for false positive dedup events. If you immediately hard-delete reclaimed objects, a false positive merge has permanent consequences (user permanently loses their photo). The hold gives the user support team and automated rollback systems time to detect and reverse the error before the physical bytes are gone.
Comparison of Approaches
| Approach | Precision | Recall | Index Size | Query Latency | Best Fit |
|---|---|---|---|---|---|
| SHA-256 exact hash | 100% | approx 5% (exact duplicates only) | 32 bytes/image | under 1ms | Same file uploaded twice |
| pHash alone (threshold 8) | 99.5% | 65% | 8 bytes/image + LSH | under 15ms | Resize, crop, re-export duplicates |
| CNN embedding ANN only | 98% | 88% | 2KB/image | 15-30ms | Filtered, color-graded duplicates |
| pHash + CNN embedding (two-pass) | 99.9% | 91% | 2KB/image | 20-40ms | Production: handles all duplicate types |
| Video-frame pHash (for burst photos) | 99.2% | 85% | 8 bytes/frame | under 15ms | Burst photos, video thumbnails |
The two-pass approach (pHash pre-filter + CNN embedding confirmation) is the right choice for production. The key non-obvious tradeoff: pHash is the speed champion and CNN is the accuracy champion. Using pHash alone at a strict threshold gives high precision but misses 35% of real duplicates (particularly filtered photos, which is a common user workflow). Using CNN alone is too expensive at 1.6M uploads/second without the pHash pre-filter.
The combined approach uses pHash as a coarse filter (threshold 12, high recall at the expense of precision) to narrow candidates, then CNN embedding for the final decision. This preserves recall while keeping GPU costs manageable. The two hash types are complementary: pHash is sensitive to geometric transforms (crop, flip) but robust to color; CNN embedding is robust to both, but expensive.
Key Takeaways
- Perceptual hashing: pHash uses DCT frequency components to produce a 64-bit fingerprint that changes slowly with gradual image changes, making Hamming distance a robust similarity metric.
- LSH scales the search: Brute-force comparison of 4 trillion hashes is impossible; LSH band hashing narrows candidates to thousands, making similarity search tractable.
- Dual-hash strategy: pHash handles geometric near-duplicates; CNN embedding handles filtered/color-graded near-duplicates; combining both achieves 91% recall at 99.9% precision.
- Similarity threshold tuning: The threshold is not a fixed constant but should be calibrated per content category using a labeled test set and precision-recall analysis.
- ANN not exact search: HNSW-based approximate nearest neighbor search sacrifices a small amount of recall for a 1000x speedup over exact search - acceptable for dedup where the fallback is just slightly more storage.
- Selective CNN embedding: Running pHash first and only escalating to CNN embedding for images with pHash candidates reduces GPU compute by 3-4x, making the pipeline cost-effective.
- Storage reclamation safety: Always update metadata pointers before physical deletion; maintain a deletion hold window for rollback; storage reclamation jobs must be idempotent.
- Feature vector storage: The 2KB CNN embedding per image at 4 trillion scale (8 PB) is the dominant cost of the dedup index - making embedding model selection a cost-vs-accuracy tradeoff, not just an accuracy question.
The surprising counter-intuitive lesson from this system is that the hardest engineering problem is not the similarity search algorithm - it is the operational safety net around false positives. A dedup system that achieves 99.9% precision sounds excellent, but at 28 billion uploads per day that still means 28 million incorrectly merged photos per day. The system’s quality depends entirely on fast detection of false positives (via user reports and automated integrity checks) and a reliable rollback path that does not permanently destroy user data.
Frequently Asked Questions
Q: Why not use byte-level hashing (SHA-256) for everything? A: Byte-level hashing only catches literal file duplicates - the same bytes uploaded twice. It misses re-exported JPEGs, photos saved at different quality levels, cropped versions of the same photo, and screenshots of photos. In practice, these perceptual near-duplicates vastly outnumber byte-level duplicates. SHA-256 is still run as the first stage (it’s essentially free) but catches only ~5% of actual duplicates.
Q: Why not use a simpler approach like average hashing (aHash) instead of pHash? A: Average hashing (set bits based on above/below mean pixel value in a downscaled image) is simpler but less robust. It is easily fooled by brightness shifts (a slightly underexposed version of a photo will have a very different aHash). pHash using DCT frequencies is invariant to these global brightness changes because the low-frequency DCT components are not sensitive to a uniform brightness offset. The extra complexity of DCT is worth the robustness.
Q: How do you handle a user who genuinely wants to store two very similar photos as separate items? A: The dedup pipeline marks photos as near-duplicates but does not force a merge at the user-visible level. The user’s library view always shows both photos. The dedup only affects backend storage: if two users upload the same viral meme, one physical copy is stored and both metadata records point to it. Within a single user’s library, near-duplicate photos are grouped visually (“you have 12 similar photos - would you like to clean up?”) but not forced-merged.
Q: Why not just hash a thumbnail instead of running a full DCT on a resized image? A: Simple thumbnail comparison (byte hashing a 32x32 thumbnail) is not robust to cropping - a photo cropped to remove a border has a completely different pixel arrangement. pHash is crop-robust because DCT low-frequency components represent global image structure rather than specific pixel positions. A 10% crop changes only ~10% of DCT coefficients, not 100% of pixel values.
Q: How do you prevent the LSH bucket for a popular meme from growing to millions of entries? A: Viral images that are uploaded by millions of users will cause their LSH buckets to accumulate millions of entries. The mitigation is bucket overflow routing: when a bucket exceeds a threshold size (e.g., 10,000 entries), subsequent insertions are redirected to an overflow shard dedicated to that bucket, and queries fan out to both the primary bucket and the overflow shard. Very high-frequency buckets are also tagged as “hot buckets” and their query results are precomputed and cached.
Q: What’s the latency budget and how do you keep dedup from blocking photo availability? A: Dedup is entirely asynchronous. The user’s photo is available in their library immediately after upload. The dedup pipeline runs within 2 seconds as a background job. During that window, the photo appears as a normal, non-deduplicated entry. If dedup finds a match, the metadata is updated and the canonical pointer is set. Users never experience a delay waiting for dedup to complete.
Interview Questions
Q: Walk through how you’d design the LSH bucket structure for 4 trillion pHash values distributed across 100 shard servers.
Expected depth: Discuss consistent hashing of bucket_id to shard servers, the tradeoff between number of bands (more bands = more bucket lookups = higher recall but more cross-shard fan-out), bucket size distribution (viral images create hot buckets requiring overflow handling), replication of hot buckets for read scalability, and how to handle shard failures during a dedup query (partial results vs. wait-for-all).
Q: How would you tune the similarity threshold for a new content category (e.g., AI-generated images) where you have no labeled training data? Expected depth: Discuss bootstrap labeling (human annotation of a small sample, ~1,000 pairs), conservative starting threshold (prefer high precision / low recall to avoid false merges), gradual threshold relaxation with monitoring of user complaint rates, the challenge that AI-generated images may have different pHash distributions than natural photos (same subject generated twice may have high visual similarity but are legitimately separate), and A/B testing the threshold change on a 1% traffic slice before global rollout.
Q: The CEO sees that storage reclamation is only freeing 2% of expected bytes. How would you debug this? Expected depth: Discuss the dedup funnel: total uploads -> pHash candidates found -> embedding confirmation -> reclamation job run -> bytes freed. Each stage has a conversion rate. Debug by: checking if the LSH index is stale (not receiving updates), checking if reclamation job is running (cron/scheduler issue), checking if the dedup status is being set correctly in the database, measuring what fraction of uploads are cross-user vs. within-user (cross-user dedup yields much more savings), and comparing theoretical dedup rate with observed (the input dataset may genuinely have fewer duplicates than assumed).
Q: How would you extend the pipeline to detect near-duplicate videos in addition to photos? Expected depth: Discuss keyframe extraction (sample 1 frame per second, hash each frame), video-level pHash as the XOR of frame hashes (robust to minor variations across the full video), the challenge of temporal offset (same video trimmed by 5 seconds at the start is still a duplicate), using audio fingerprinting as a complementary signal, and the storage implications (a 1-minute video has ~60 frames to hash vs. 1 for a photo).
Q: Two users upload photos from the same wedding. Many of their photos are near-duplicates of each other but belong to different people. How does dedup handle this correctly? Expected depth: Discuss the distinction between storage-layer dedup (the physical bytes may be the same) and user-visible dedup (both users should see their own photos). Cover the metadata pointer architecture (each user’s image_id is preserved; only the storage object is potentially shared), privacy implications (user A should not be able to discover that user B has the same photo - canonical pointers must not leak user IDs across accounts), the role of access control at the canonical object level, and cases where dedup should be suppressed (a photo that is a user’s only copy of a memorable moment should never be reclaimed regardless of duplicates).
Premium Content
Unlock the full article along with everything else in the archive — all in one place.