Build Gmail's Spam Detection Pipeline


System Design Deep Dive

Gmail Spam Detection Pipeline

Classifying 300 billion emails per day in under 100ms while keeping up with spammers who adapt in real time.

Imagine a postal sorting office that processes 3.5 million letters per second, and every single one needs a fraud check before it reaches its recipient - in under 100 milliseconds. That is the operating reality for Gmail’s spam detection system. At roughly 300 billion emails per day, this is one of the largest real-time machine learning inference systems on the planet.

The challenge is not just volume. It is the adversarial nature of the problem. A static rule-based system would be defeated within hours. Spammers A/B test their campaigns, rotate IP ranges, generate new domains, and adjust word patterns specifically to evade detection. Your system must learn continuously from user feedback - when someone clicks “report spam,” that signal must feed back into the classifier, improve it, and be deployed while emails continue flowing at full throughput. All of this must happen without disrupting delivery for the 99.9% of legitimate email.

Latency and accuracy are in direct tension. The most accurate classifiers - large transformer models - take hundreds of milliseconds per inference. But the pipeline must complete in under 100ms to stay within the SMTP delivery window. A naive approach of running a single deep neural network on every email would either miss the latency budget or require a GPU cluster so large it makes the cost structure untenable.

The system must also handle heavily skewed traffic. By some historical estimates, as much as 85% of all email sent on the internet is spam, yet Gmail users experience a near-spam-free inbox. Classifier precision must be very high, because blocking a legitimate email costs real user trust. We need to solve for three things: sub-100ms classification latency, continuous model updates that don’t interrupt service, and a false-positive rate below 0.05% even as spam patterns shift.

Requirements and Constraints

Functional Requirements

  • Classify every inbound email as spam, ham, or uncertain before delivery
  • Support multi-class classification (spam, phishing, malware, bulk, ham)
  • Collect user feedback signals (report spam, not spam, moved from spam folder)
  • Retrain models continuously using feedback signals
  • Support A/B testing of new model versions against live traffic via shadow mode
  • Propagate new model versions to all inference nodes within 10 minutes
  • Provide per-sender reputation scoring updated in near-real-time

Non-Functional Requirements

  • Throughput: 300B emails/day ≈ 3.5M emails/second on average, ~5M emails/second at peak
  • Latency: p99 classification under 100ms per email
  • False positive rate: under 0.05% (legitimate email classified as spam)
  • False negative rate: under 1% (spam reaching inbox)
  • Model freshness: new model deployed within 10 minutes of approval
  • Availability: 99.99% uptime (spam detection must not block delivery)
  • Storage: 90-day retention of features and decisions for audit

Constraints

  • Classification must be stateless per email - no cross-email dependencies in the hot path
  • SMTP protocol imposes a hard latency ceiling on the classification step
  • Models must be versioned and rollback-capable in under 5 minutes
  • User-reported spam must never cause false positives for other users without model retraining

High-Level Architecture

The system divides into three planes: the hot path (per-email classification, must be fast), the warm path (feedback collection and model evaluation), and the cold path (model retraining and registry).

Gmail spam detection pipeline architecture showing ingest, feature extraction, ML inference, feedback loop, and model registry

The Email Ingest MTA receives all incoming SMTP connections and places emails onto a distributed ingest queue. Emails are deduplicated at this stage to handle retransmission storms from misconfigured senders. The Feature Extractor runs in parallel on sharded workers, pulling IP reputation data, parsing email headers (SPF, DKIM, DMARC authentication), extracting URLs for reputation lookup, and generating content n-grams - all in memory with locally cached lookup tables.

The Naive Bayes Scorer handles roughly 85% of emails in under 5ms with a clear decision. The Neural Classifier activates only for uncertain cases (score between 0.1 and 0.9), adding 40-80ms to those emails. The Decision Router applies the final threshold and routes to inbox, spam folder, or block. Downstream, the Feedback Collection Pipeline captures user actions asynchronously and feeds the Model Trainer, which runs incremental updates and pushes new model versions through the Model Registry to all inference nodes.

Key Insight

The ensemble of a fast Naive Bayes plus an on-demand neural classifier is the load-bearing architectural decision - roughly 85% of emails get classified in 5ms, and the neural model’s expensive compute is reserved for the 15% of cases where a cheap model is genuinely uncertain.
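The routing decision itself is small. A minimal sketch, assuming the Naive Bayes stage emits a spam probability in [0, 1]; the function name and path labels are illustrative, and the band boundaries are passed in because they are tuning knobs rather than fixed constants:

```python
def route(nb_score: float, low: float, high: float) -> str:
    """Pick the path for one email from its Naive Bayes spam probability."""
    if nb_score < low:
        return "ham_fast_path"    # confidently legitimate: deliver without GPU work
    if nb_score > high:
        return "spam_fast_path"   # confidently spam: filter without GPU work
    return "neural"               # uncertain band: escalate to the neural classifier


# Illustrative boundaries; in practice they are tuned so that only a small
# share of traffic lands in the uncertain band.
scores = [0.01, 0.05, 0.45, 0.97, 0.99]
paths = [route(s, low=0.1, high=0.9) for s in scores]
neural_share = paths.count("neural") / len(paths)
```

Widening the band sends more traffic to the GPU tier; narrowing it trades accuracy on borderline emails for lower cost.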

Feature Extraction Service

The feature extractor’s job is to transform a raw email into a vector of ~500 numerical signals within 10ms.

Think of it like a TSA security checkpoint with multiple parallel scanners running simultaneously. Each scanner checks a different property - one checks the ID (IP reputation), one runs the bag through X-ray (content analysis), one checks the boarding pass against the list (domain authentication). All run in parallel, and the combined result takes no longer than the slowest scanner.

Single email classification data flow showing timing through each stage

The extractor divides signals into four groups:

Network signals come from the sending mail server’s IP address. A reputation database (updated daily from aggregated blocklists and Google’s own spam reports) is cached locally on every feature extraction worker as a compact bloom filter plus hash map. An IP lookup is a constant-time in-memory operation, not a network call. This eliminates a potential 20-50ms round trip that would kill the latency budget.

Authentication signals verify domain ownership. SPF, DKIM, and DMARC checks run in the MTA and are passed as header fields - the feature extractor just reads them. An email that claims to come from a gmail.com address but fails DKIM gets an immediate high-weight spam signal.

Content signals use n-gram extraction and URL analysis. The text body is tokenized, and frequency counts of known spam phrases are computed against an in-memory vocabulary model. URLs are hash-checked against a safe-browsing-style blocklist, again cached locally.

Behavioral signals capture sender history: how many emails this sender sent in the last hour, the ratio of reported spam in those emails, the age of the sending domain. These are maintained in a distributed counter store (Redis cluster) with per-sender keys and sliding TTL windows.
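To make the sliding-window idea concrete, here is a minimal in-memory sketch of per-sender counters. It is a stand-in for the Redis-backed store, not a Redis client: the class and method names are hypothetical, and timestamps are passed in explicitly so the behavior is easy to follow.

```python
from collections import defaultdict, deque


class SlidingWindowCounter:
    """Trailing-window counts per sender (in-memory stand-in for the Redis store)."""

    def __init__(self, window_seconds: float = 3600.0):
        self.window = window_seconds
        self.sent = defaultdict(deque)      # sender -> timestamps of emails sent
        self.reports = defaultdict(deque)   # sender -> timestamps of spam reports

    def _live(self, q: deque, now: float) -> int:
        # Drop timestamps that have aged out of the window, then count the rest
        while q and q[0] <= now - self.window:
            q.popleft()
        return len(q)

    def record_sent(self, sender: str, now: float) -> None:
        self.sent[sender].append(now)

    def record_report(self, sender: str, now: float) -> None:
        self.reports[sender].append(now)

    def volume(self, sender: str, now: float) -> int:
        return self._live(self.sent[sender], now)

    def spam_ratio(self, sender: str, now: float) -> float:
        sent = self.volume(sender, now)
        if sent == 0:
            return 0.0
        return self._live(self.reports[sender], now) / sent
```

In a shared store the same effect is usually achieved with time-bucketed keys that expire on their own, which avoids keeping one timestamp per email.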

# Feature extraction for a single email - runs in parallel threads
import hashlib
from dataclasses import dataclass

@dataclass
class EmailFeatures:
    ip_reputation_score: float      # 0.0 (clean) to 1.0 (known spammer)
    spf_pass: bool
    dkim_pass: bool
    dmarc_pass: bool
    domain_age_days: int
    sender_spam_ratio_1h: float     # fraction of this sender's emails reported spam
    sender_volume_1h: int           # emails sent by this sender in last hour
    body_spam_ngram_score: float    # Naive Bayes prior on content
    url_reputation_score: float     # max score across all URLs in email
    html_to_text_ratio: float       # high ratio = likely bulk HTML email
    subject_caps_ratio: float       # ALL CAPS subjects = spam signal
    attachment_hash_malware: bool   # hash of attachments vs malware DB

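# Helpers (strip_html, tokenize, extract_urls, url_hash, get_sender_ratio,
# get_sender_volume, compute_html_ratio, caps_ratio, check_attachment_hashes)
# are assumed to be defined elsewhere. redis_client is passed in explicitly
# because the behavioral lookups below depend on it.
def extract_features(email: dict, ip_cache: dict, ngram_model, url_cache: set, redis_client) -> EmailFeatures: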
    sender_ip = email["received_from_ip"]
    # O(1) local cache lookup - no network call
    ip_score = ip_cache.get(sender_ip, 0.5)

    body_text = strip_html(email["body"])
    ngram_score = ngram_model.score(tokenize(body_text))

    urls = extract_urls(email["body"])
    url_score = max((1.0 if url_hash(u) in url_cache else 0.0 for u in urls), default=0.0)

    return EmailFeatures(
        ip_reputation_score=ip_score,
        spf_pass=email["headers"].get("spf") == "pass",
        dkim_pass=email["headers"].get("dkim") == "pass",
        dmarc_pass=email["headers"].get("dmarc") == "pass",
        domain_age_days=email.get("domain_age_days", 0),
        sender_spam_ratio_1h=get_sender_ratio(email["from"], redis_client),
        sender_volume_1h=get_sender_volume(email["from"], redis_client),
        body_spam_ngram_score=ngram_score,
        url_reputation_score=url_score,
        html_to_text_ratio=compute_html_ratio(email["body"]),
        subject_caps_ratio=caps_ratio(email["subject"]),
        attachment_hash_malware=check_attachment_hashes(email.get("attachments", []))
    )
Watch Out

Sending IP reputation lookups over the network during classification is the most common latency mistake. At 3.5M emails/second, even a 1ms network call per email means about 3,500 lookups in flight at every instant, plus 3.5M requests per second hitting the reputation service. Cache all reputation data locally and refresh it asynchronously every few minutes.
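The concurrency figure follows from Little's law: requests in flight equal arrival rate times time per request. A quick check using the 3.5M emails/second rate from the requirements; the 20ms and 50ms cases reuse the round-trip range quoted earlier, and the function name is illustrative:

```python
def in_flight_requests(arrival_rate_per_s: float, latency_s: float) -> float:
    """Little's law: average number of requests outstanding at any instant."""
    return arrival_rate_per_s * latency_s


RATE = 3_500_000  # emails/second, from the requirements

for latency_ms in (1, 20, 50):
    outstanding = in_flight_requests(RATE, latency_ms / 1000)
    print(f"{latency_ms:>2} ms lookup -> {outstanding:,.0f} lookups in flight")
```

Every extra millisecond of lookup latency adds another 3,500 outstanding requests, which is why the hot path reads from local memory instead.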

Naive Bayes Baseline Classifier

The Naive Bayes classifier is the fast lane of the pipeline - and the reason roughly 85% of emails never need GPU inference.

A Naive Bayes classifier for text is like a library card catalog where each word’s card records how often it appeared in spam books versus non-spam books. To classify a new email, you look up every word’s card and multiply the probabilities. Because multiplication of many small floats underflows, the practical implementation uses log-probability sums.

The key property making Naive Bayes suitable here: it is a linear model. Classification is a dot product between the word-count vector and a per-class weight vector - a single BLAS matrix-vector operation that runs in microseconds on CPU. The entire trained model for a vocabulary of 100,000 words fits in about 800KB of RAM (two classes of 4-byte log-probabilities). Every inference node loads it on startup, and it is small enough to stay resident in L3 cache.

import numpy as np
from scipy.special import logsumexp

class NaiveBayesSpamClassifier:
    def __init__(self, log_class_priors, log_feature_probs_spam, log_feature_probs_ham):
        # log_class_priors: [log P(spam), log P(ham)]
        # log_feature_probs: shape (vocab_size,) - log P(word | class)
        # Stored as an array so batch_score can broadcast it with [:, None]
        self.log_priors = np.asarray(log_class_priors, dtype=float)
        self.log_probs = np.stack([log_feature_probs_spam, log_feature_probs_ham])

    def score(self, word_counts: np.ndarray) -> float:
        # word_counts: sparse vector of shape (vocab_size,)
        # Returns P(spam | email) as float in [0, 1]
        log_likelihoods = self.log_probs @ word_counts  # shape (2,)
        log_posteriors = log_likelihoods + self.log_priors
        # Numerically stable softmax
        log_p_spam = log_posteriors[0] - logsumexp(log_posteriors)
        return float(np.exp(log_p_spam))

    def batch_score(self, word_count_matrix: np.ndarray) -> np.ndarray:
        # word_count_matrix: shape (batch_size, vocab_size)
        log_likelihoods = self.log_probs @ word_count_matrix.T  # (2, batch_size)
        log_posteriors = log_likelihoods + self.log_priors[:, None]
        log_p_spam = log_posteriors[0] - logsumexp(log_posteriors, axis=0)
        return np.exp(log_p_spam)

The model is retrained nightly on the previous week’s labeled data using Laplace smoothing to handle unseen vocabulary. The model file is written atomically to a shared object store, and each inference node polls for updates every 5 minutes and hot-swaps the model in memory without restarting.
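One way to hot-swap without a restart is to keep the active model behind a single reference and replace that reference only after the new model has fully loaded. A minimal sketch under that assumption; `HotSwappableModel`, `fetch_latest_version`, and `load_model` are hypothetical names, and the two callables stand in for the object-store poll and the model loader:

```python
import threading


class HotSwappableModel:
    """Holds the active (model, version) pair and replaces it atomically."""

    def __init__(self, model, version: str):
        self._lock = threading.Lock()
        self._current = (model, version)

    def get(self):
        # Readers take one reference to the tuple, so they always see a
        # matching model/version pair, never a half-updated one.
        return self._current

    def maybe_refresh(self, fetch_latest_version, load_model) -> bool:
        """Poll for a newer version; load it fully, then swap. Returns True on swap."""
        latest = fetch_latest_version()
        if latest == self._current[1]:
            return False
        new_model = load_model(latest)      # slow part happens outside the lock
        with self._lock:                    # serialize concurrent refreshers
            self._current = (new_model, latest)
        return True
```

A background thread would call `maybe_refresh` on the polling interval, while request handlers call `get()` once per email and score with whatever pair they received.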

Real World

SpamAssassin, the widely deployed open-source spam filter, combines a Bayesian classifier with rule-based scoring. The design described here extends that idea with a neural re-scoring layer for uncertain cases - the same architectural pattern: cheap model first, expensive model only when needed.

Neural Classifier

The neural classifier covers the adversarial middle ground - emails that look mostly legitimate but have subtle patterns that Naive Bayes misses.

Think of Naive Bayes as checking individual ingredients in a recipe, while the neural classifier reads the whole dish. A sophisticated spam campaign might use perfectly clean vocabulary but arrange it in contextual patterns that signal intent: urgency language, impersonation of known brands, subtle Unicode lookalike characters. These are relationships between tokens, not just token frequencies.

The neural architecture is a BERT-style transformer encoder, fine-tuned on email classification. It processes the subject and body as a token sequence, and the [CLS] token’s representation is fed into a two-class classification head.

import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class EmailSpamClassifier(nn.Module):
    def __init__(self, pretrained_model_name: str, num_classes: int = 2):
        super().__init__()
        self.encoder = BertModel.from_pretrained(pretrained_model_name)
        hidden_size = self.encoder.config.hidden_size
        self.classifier = nn.Sequential(
            nn.Dropout(0.1),
            nn.Linear(hidden_size, 256),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(256, num_classes)
        )

    def forward(self, input_ids, attention_mask, token_type_ids=None):
        outputs = self.encoder(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids
        )
        cls_representation = outputs.last_hidden_state[:, 0, :]
        logits = self.classifier(cls_representation)
        return logits

# Serving: batch uncertain emails together for efficient GPU utilization
class NeuralInferenceServer:
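    def __init__(self, model_path: str, tokenizer_name: str, batch_size: int = 64, max_wait_ms: int = 20):
        # model_path is a pickled EmailSpamClassifier. Recent PyTorch versions need
        # weights_only=False to unpickle a full module - only load trusted files.
        self.model = torch.load(model_path, weights_only=False).cuda().eval()
        # tokenizer_name: the checkpoint name or directory the encoder was fine-tuned from
        self.tokenizer = BertTokenizer.from_pretrained(tokenizer_name)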
        self.batch_size = batch_size
        self.max_wait_ms = max_wait_ms
        self.pending_queue = []

    @torch.inference_mode()
    def score_batch(self, emails: list[str]) -> list[float]:
        encoded = self.tokenizer(
            emails,
            max_length=512,
            truncation=True,
            padding=True,
            return_tensors="pt"
        ).to("cuda")
        logits = self.model(**encoded)
        probs = torch.softmax(logits, dim=-1)
        return probs[:, 1].cpu().tolist()  # P(spam)

The neural nodes run on GPU instances. Critically, they accept batches of uncertain emails - the inference server waits up to 20ms to accumulate a batch of up to 64 emails before running a GPU forward pass. This is the standard GPU batching trick: in this design’s estimate, a forward pass on 64 sequences takes only about 2x as long as a pass on a single sequence, because at batch size 1 the GPU spends most of its time streaming model weights from memory while its compute units sit largely idle. Batching reuses each weight read across the whole batch.

Key Insight

Batching uncertain emails with a 20ms wait window means you’re trading a small fixed latency penalty for a roughly 32x reduction in GPU cost per email (64 emails processed in about 2x the time of one). For the 15% of emails that need neural scoring, the total latency becomes 10ms (feature extraction) + 5ms (Naive Bayes) + 20ms (batch wait) + 40ms (GPU inference) = 75ms - still under budget.
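The wait-then-flush behavior can be sketched with a standard thread-safe queue. This is an illustration of the technique, not the serving code: `collect_batch` is a hypothetical helper, and its defaults mirror the 64-email, 20ms figures above.

```python
import queue
import time


def collect_batch(pending: queue.Queue, batch_size: int = 64, max_wait_ms: float = 20.0) -> list:
    """Block for the first email, then gather more until the batch fills or the window closes."""
    batch = [pending.get()]                              # wait for at least one email
    deadline = time.monotonic() + max_wait_ms / 1000.0   # window starts at the first arrival
    while len(batch) < batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break                                        # window closed: flush a partial batch
        try:
            batch.append(pending.get(timeout=remaining))
        except queue.Empty:
            break                                        # nothing else arrived in time
    return batch
```

A GPU worker would loop over `collect_batch(...)` and hand each result to `score_batch`. Under heavy load the batch fills immediately and the wait is never paid; under light load no email waits longer than the window.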

Feedback Loop and Model Versioning

The feedback loop is the system’s immune response. Without it, the classifier would degrade as spam patterns evolve.

ML model internals showing training path versus serving path and model registry

User actions generate feedback signals: reporting an email as spam, moving a spam folder email to inbox, or interacting with an email that the classifier scored highly. These events are written to a Kafka topic with the email’s feature vector and the classifier’s original decision.

-- Schema for feedback events
CREATE TABLE feedback_events (
    event_id        UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    email_hash      BYTEA NOT NULL,          -- SHA-256 of email content
    user_id         BIGINT NOT NULL,
    event_type      TEXT NOT NULL CHECK (event_type IN ('report_spam', 'not_spam', 'opened', 'replied')),
    classifier_score FLOAT NOT NULL,          -- score at time of delivery
    model_version   TEXT NOT NULL,
    feature_vector  JSONB NOT NULL,           -- stored for retraining
    created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_feedback_events_model_version ON feedback_events(model_version, created_at DESC);
CREATE INDEX idx_feedback_events_type ON feedback_events(event_type, created_at DESC);

-- Aggregated signals per sender for fast feature lookup
CREATE TABLE sender_reputation (
    sender_domain   TEXT NOT NULL,
    sender_email    TEXT,
    window_start    TIMESTAMPTZ NOT NULL,
    emails_sent     INT NOT NULL DEFAULT 0,
    spam_reports    INT NOT NULL DEFAULT 0,
    not_spam_marks  INT NOT NULL DEFAULT 0,
    updated_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    PRIMARY KEY (sender_domain, window_start)
);

CREATE INDEX idx_sender_rep_domain ON sender_reputation(sender_domain, window_start DESC);

Feedback signal collection is the process of turning raw user actions into training labels. Not every action is equally reliable. A single user reporting spam might be a false positive (the user just didn’t want the email). A coordinated pattern of 100 users all marking the same sender as spam is a high-confidence label. The aggregator applies minimum-count thresholds and inter-annotator agreement logic before adding to the training corpus.

Model versioning follows semantic versioning with a registry that records model hash, training dataset fingerprint, offline evaluation metrics, and shadow-mode test results. A new model version must pass three gates before deployment: offline precision/recall above baseline, shadow-mode false-positive rate below 0.05%, and a canary deployment (1% of traffic for 30 minutes) without regression.
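The three gates can be expressed as one check that returns the list of failures. A minimal sketch: the dataclass, field names, and failure labels are hypothetical, while the 0.05% and 30-minute thresholds come from the paragraph above.

```python
from dataclasses import dataclass


@dataclass
class CandidateReport:
    precision: float                    # offline evaluation
    recall: float                       # offline evaluation
    shadow_false_positive_rate: float   # measured during shadow mode
    canary_minutes: int                 # how long the 1% canary has run
    canary_regressed: bool              # any regression seen during the canary


def deployment_gate_failures(
    candidate: CandidateReport,
    baseline_precision: float,
    baseline_recall: float,
    max_shadow_fpr: float = 0.0005,     # 0.05%
    min_canary_minutes: int = 30,
) -> list[str]:
    """Return the names of the gates the candidate fails; an empty list means deployable."""
    failures = []
    if candidate.precision <= baseline_precision or candidate.recall <= baseline_recall:
        failures.append("offline_metrics")
    if candidate.shadow_false_positive_rate >= max_shadow_fpr:
        failures.append("shadow_false_positive_rate")
    if candidate.canary_minutes < min_canary_minutes or candidate.canary_regressed:
        failures.append("canary")
    return failures
```

Returning the failed gate names instead of a single boolean makes the registry entry self-explanatory when a version is rejected.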

# Model version metadata in the registry
model_id: "spam-nb-v2.14.1"
type: "naive_bayes"
trained_at: "2026-06-04T03:00:00Z"
training_set:
  start: "2026-05-28"
  end: "2026-06-03"
  size: 48200000
  spam_fraction: 0.52
metrics:
  precision: 0.9982
  recall: 0.9891
  auc_roc: 0.9997
  false_positive_rate: 0.0004   # illustrative value; must be under the 0.05% gate to be approved
shadow_test:
  duration_minutes: 60
  emails_evaluated: 12400000
  divergence_vs_production: 0.003
status: "approved"
deployment:
  rollout_percent: 100
  deployed_at: "2026-06-04T05:45:00Z"
Watch Out

Feedback loops have a dangerous failure mode called feedback poisoning: if spammers discover that their emails are training data, they can inject carefully crafted emails that slowly shift the classifier’s decision boundary. Production spam detection systems add noise to feedback sampling rates and limit how much a single user’s feedback can influence the training corpus.
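Limiting any single user's influence can be as simple as capping how many of their feedback events enter a training window. A minimal sketch of that one defense (it does not cover the sampling noise mentioned above); the function name, the `user_id` key, and the cap of 3 are illustrative assumptions:

```python
from collections import Counter


def cap_per_user(events: list, max_per_user: int = 3) -> list:
    """Keep at most max_per_user feedback events from each user, in arrival order."""
    seen = Counter()
    kept = []
    for event in events:
        user = event["user_id"]
        if seen[user] < max_per_user:
            seen[user] += 1
            kept.append(event)
    return kept


events = (
    [{"user_id": 1, "event_type": "report_spam"}] * 50   # one very active account
    + [{"user_id": 2, "event_type": "not_spam"}]
    + [{"user_id": 3, "event_type": "report_spam"}]
)
capped = cap_per_user(events, max_per_user=3)
```

After capping, the account that submitted 50 reports contributes 3 events, the same ceiling as any other user.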

Shadow Mode Testing

Shadow mode is how you validate a new model against production traffic without risking user experience.

Think of it like a flight simulator - the new model sees the same real inputs as the production model and produces outputs, but those outputs are discarded. You measure the shadow model’s decisions against the ground truth (what the production model decided, corroborated by later user actions) and compute divergence metrics.

# Shadow mode evaluation pipeline
import random
import time

class ShadowModeEvaluator:
    def __init__(self, production_model, shadow_model, sample_rate: float = 0.05):
        self.prod = production_model
        self.shadow = shadow_model
        self.sample_rate = sample_rate
        self.divergences = []

    def log_divergence(self, record: dict) -> None:
        # Kept in memory here; a real deployment would ship records to a log pipeline
        record["logged_at"] = time.time()
        self.divergences.append(record)

    def evaluate(self, email_features: dict) -> dict:
        prod_score = self.prod.score(email_features)
        result = {"score": prod_score, "model": "production"}

        # Shadow evaluation on a sample of emails; its output never affects delivery
        if random.random() < self.sample_rate:
            shadow_score = self.shadow.score(email_features)
            divergence = abs(prod_score - shadow_score)

            # Log divergence for offline analysis
            self.log_divergence({
                "email_id": email_features["id"],
                "prod_score": prod_score,
                "shadow_score": shadow_score,
                "divergence": divergence,
                "would_flip": (prod_score > 0.5) != (shadow_score > 0.5)
            })

        return result

    def compute_shadow_metrics(self, window_hours: int = 1) -> dict:
        cutoff = time.time() - window_hours * 3600
        recent = [d for d in self.divergences if d["logged_at"] >= cutoff]
        if not recent:
            # No samples yet: never report an unevaluated model as safe
            return {"mean_divergence": None, "flip_rate": None, "safe_to_deploy": False}
        flip_rate = sum(1 for d in recent if d["would_flip"]) / len(recent)
        return {
            "mean_divergence": sum(d["divergence"] for d in recent) / len(recent),
            "flip_rate": flip_rate,
            "safe_to_deploy": flip_rate < 0.001  # less than 0.1% of emails change decision
        }

Shadow mode runs for a minimum of 60 minutes before any promotion to production. The safety gate checks that the new model’s flip_rate (fraction of emails that would change classification) is below 0.1%, and that the direction of flips does not increase the false positive rate.
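The direction check can be sketched by splitting flips into the two kinds that matter. This is an illustration with hypothetical names; it compares the shadow model against production decisions, not against ground truth, so a rise in ham-to-spam flips is a warning to investigate, not proof of new false positives.

```python
def flip_direction_summary(pairs, threshold: float = 0.5) -> dict:
    """pairs: iterable of (prod_score, shadow_score). Returns flip rates by direction."""
    total = ham_to_spam = spam_to_ham = 0
    for prod_score, shadow_score in pairs:
        total += 1
        prod_spam = prod_score > threshold
        shadow_spam = shadow_score > threshold
        if shadow_spam and not prod_spam:
            ham_to_spam += 1      # shadow would newly filter this email
        elif prod_spam and not shadow_spam:
            spam_to_ham += 1      # shadow would newly deliver this email
    if total == 0:
        return {"flip_rate": 0.0, "ham_to_spam_rate": 0.0, "spam_to_ham_rate": 0.0}
    return {
        "flip_rate": (ham_to_spam + spam_to_ham) / total,
        "ham_to_spam_rate": ham_to_spam / total,
        "spam_to_ham_rate": spam_to_ham / total,
    }


summary = flip_direction_summary([(0.2, 0.3), (0.4, 0.6), (0.9, 0.1), (0.8, 0.9)])
```

In this toy input, one of four emails flips in each direction, giving an overall flip rate of 0.5.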

Real World

Google’s TFX (TensorFlow Extended) platform provides a related gate as a standard pipeline component, the Evaluator. It compares a candidate model against a baseline on held-out data and marks the candidate as “blessed” only if it meets configured thresholds; the downstream Pusher component deploys only blessed models. That check runs offline, so it complements shadow mode on live traffic rather than replacing it.

Data Model

-- Core email classification record (range-partitioned by received_at)
CREATE TABLE email_classifications (
    email_hash       BYTEA NOT NULL,         -- SHA-256 of raw email
    received_at      TIMESTAMPTZ NOT NULL,
    sender_ip        INET NOT NULL,
    sender_domain    TEXT NOT NULL,
    recipient_count  INT NOT NULL,
    nb_score         FLOAT NOT NULL,         -- Naive Bayes output
    neural_score     FLOAT,                  -- NULL if not invoked
    final_score      FLOAT NOT NULL,
    decision         TEXT NOT NULL CHECK (decision IN ('inbox', 'spam', 'blocked', 'quarantine')),
    model_version_nb TEXT NOT NULL,
    model_version_nn TEXT,
    feature_fingerprint TEXT NOT NULL,       -- hash of feature vector for reproducibility
    latency_ms       INT NOT NULL,
    PRIMARY KEY (email_hash, received_at)    -- a partitioned table's key must include the partition column
) PARTITION BY RANGE (received_at);

CREATE INDEX idx_classifications_sender ON email_classifications(sender_domain, received_at DESC);
CREATE INDEX idx_classifications_decision ON email_classifications(decision, received_at DESC);
CREATE INDEX idx_classifications_score ON email_classifications(final_score) WHERE final_score > 0.3;

-- IP reputation cache (refreshed hourly from global aggregation)
CREATE TABLE ip_reputation (
    ip_address       INET PRIMARY KEY,
    reputation_score FLOAT NOT NULL,         -- 0.0 (clean) to 1.0 (known spam)
    blocklist_hits   INT NOT NULL DEFAULT 0,
    spam_volume_24h  BIGINT NOT NULL DEFAULT 0,
    last_seen_spam   TIMESTAMPTZ,
    updated_at       TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- Domain reputation
CREATE TABLE domain_reputation (
    domain           TEXT PRIMARY KEY,
    age_days         INT,
    is_blocklisted   BOOLEAN NOT NULL DEFAULT FALSE,
    spam_complaint_rate FLOAT NOT NULL DEFAULT 0.0,
    dmarc_policy     TEXT,
    updated_at       TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

The email_classifications table is the audit log. It is append-only and partitioned by received_at with monthly partitions. Partitions older than 90 days are compressed and moved to cold storage (BigQuery or GCS-backed). The ip_reputation and domain_reputation tables are hot lookup tables - small enough to cache entirely in RAM on each feature extraction node.

Key Algorithms and Protocols

Multinomial Naive Bayes with Laplace Smoothing

Naive Bayes models email spam detection as: given the bag-of-words in this email, what is P(spam | words)? By Bayes’ theorem and the naive independence assumption:

P(spam | w1..wn) ∝ P(spam) * ∏ P(wi | spam)

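In log space (to avoid floating point underflow), with repeated words grouped by their count, the proportionality becomes an additive constant that is the same for both classes:

log P(spam | w1..wn) = log P(spam) + Σ count(wi) * log P(wi | spam) + const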
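A short demonstration of why the log form matters. The per-word likelihood and email length below are made-up values chosen only to trigger the problem:

```python
import math

p_word = 1e-4    # hypothetical per-word likelihood
n_words = 200    # hypothetical email length in tokens

# Direct product: 1e-4 ** 200 = 1e-800, far below the smallest positive float64
product = 1.0
for _ in range(n_words):
    product *= p_word

# Log-space sum: the same quantity, represented without underflow
log_sum = n_words * math.log(p_word)

print(product)   # 0.0
print(log_sum)   # about -1842.07
```

Once the product hits zero for both classes, the classifier can no longer compare them; the log sums remain ordinary, comparable numbers.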

Laplace smoothing handles unseen words by adding a pseudocount of 1 to every word count before computing probabilities:

def train_naive_bayes(spam_emails: list, ham_emails: list, vocab_size: int, alpha: float = 1.0):
    # alpha: Laplace smoothing parameter
    spam_word_counts = np.zeros(vocab_size)
    ham_word_counts = np.zeros(vocab_size)

    for email in spam_emails:
        for word_id, count in email.word_counts.items():
            spam_word_counts[word_id] += count

    for email in ham_emails:
        for word_id, count in email.word_counts.items():
            ham_word_counts[word_id] += count

    # Laplace smoothing: add alpha to every count
    spam_word_counts += alpha
    ham_word_counts += alpha

    # Compute log probabilities
    log_prob_spam = np.log(spam_word_counts / spam_word_counts.sum())
    log_prob_ham = np.log(ham_word_counts / ham_word_counts.sum())

    n_spam = len(spam_emails)
    n_ham = len(ham_emails)
    log_prior_spam = np.log(n_spam / (n_spam + n_ham))
    log_prior_ham = np.log(n_ham / (n_spam + n_ham))

    return log_prior_spam, log_prior_ham, log_prob_spam, log_prob_ham

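Time complexity: O(V) per classification for a dense dot product, where V is vocabulary size. At V=100,000 the weights occupy ~800KB, which fits entirely in CPU L3 cache. Reaching the 500,000 classifications/second per CPU core assumed in the capacity estimate requires a sparse representation: an email contains only a small fraction of the vocabulary, so the sum visits just those entries - O(k) for k distinct words - rather than all V.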
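Because an email contains only a small fraction of the vocabulary, scoring only needs to visit the words that actually occur. A minimal sketch of that sparse form, assuming word counts arrive as a `{word_id: count}` dict (the same shape the training code iterates over); the function names are illustrative:

```python
import math


def sparse_log_odds(word_counts: dict, log_prob_spam, log_prob_ham,
                    log_prior_spam: float, log_prior_ham: float) -> float:
    """log P(spam | email) - log P(ham | email), touching only words present."""
    log_odds = log_prior_spam - log_prior_ham
    for word_id, count in word_counts.items():
        log_odds += count * (log_prob_spam[word_id] - log_prob_ham[word_id])
    return log_odds


def spam_probability(log_odds: float) -> float:
    """Logistic function of the log-odds, written to avoid overflow in exp()."""
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    z = math.exp(log_odds)
    return z / (1.0 + z)


# Toy 3-word vocabulary: word 0 is spammy, word 2 is hammy, equal priors
log_spam = [math.log(0.7), math.log(0.2), math.log(0.1)]
log_ham = [math.log(0.1), math.log(0.2), math.log(0.7)]
odds = sparse_log_odds({0: 2}, log_spam, log_ham, math.log(0.5), math.log(0.5))
p = spam_probability(odds)   # two occurrences of word 0 -> odds 49:1 -> 0.98
```

For two classes this gives the same posterior as the softmax over both log-posteriors, at a cost proportional to the number of distinct words in the email.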

Key Insight

The independence assumption in Naive Bayes is provably wrong (words in spam emails are not independent), yet the classifier works extremely well in practice. This is because the ranking of P(spam) vs P(ham) is often correct even when the absolute probabilities are miscalibrated. What makes it work at scale is that the vocabulary is stable - spam phrases change slowly enough for daily retraining to keep up.

Feedback Signal Aggregation

Raw user feedback is noisy. The aggregator applies a minimum confidence threshold before labeling:

from collections import defaultdict
from datetime import datetime, timedelta

class FeedbackSignalAggregator:
    MIN_REPORTS_FOR_LABEL = 5          # need at least 5 reports
    MIN_REPORT_RATIO = 0.7             # at least 70% of spam/not-spam votes must agree
    RECENCY_WINDOW_HOURS = 24

    def aggregate(self, feedback_events: list) -> list:
        # Group by email_hash
        by_email = defaultdict(list)
        for event in feedback_events:
            by_email[event.email_hash].append(event)

        labeled = []
        for email_hash, events in by_email.items():
            spam_reports = sum(1 for e in events if e.event_type == "report_spam")
            not_spam = sum(1 for e in events if e.event_type == "not_spam")
            total = spam_reports + not_spam

            if total < self.MIN_REPORTS_FOR_LABEL:
                continue  # insufficient signal

            spam_ratio = spam_reports / total
            if spam_ratio >= self.MIN_REPORT_RATIO:
                labeled.append({"email_hash": email_hash, "label": "spam", "confidence": spam_ratio})
            elif spam_ratio <= (1 - self.MIN_REPORT_RATIO):
                labeled.append({"email_hash": email_hash, "label": "ham", "confidence": 1 - spam_ratio})
            # else: conflicting signals, do not label

        return labeled

Scaling and Performance

Gmail spam detection scaling diagram showing sharded feature extraction and replicated ML inference
Capacity Estimation:
  - Inbound email rate: 300B/day = 3.47M/second average, ~5M/second peak
  - Feature extraction: stateless, sharded by sender IP hash
    - Each worker handles ~50K emails/second
    - Required workers: 5M / 50K = 100 feature workers
  - Naive Bayes: 500K classifications/second per CPU core
    - Required cores: 5M / 500K = 10 cores for NB alone
    - With headroom: 30 NB nodes (CPU-optimized instances)
  - Neural classifier: needed for ~15% of emails = 750K/second
    - GPU batch inference: 64 emails per 40ms = 1,600 emails/second per GPU
    - Required GPUs: 750K / 1600 = 469 GPUs
    - With batching efficiency at scale: ~300 GPU nodes
  - Feature vector storage: 500 floats * 4 bytes * 300B emails/day = 600TB/day
    - Only features for uncertain+classified-spam emails retained: ~5% = 30TB/day
    - 90-day retention: 2.7PB
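The estimate can be reproduced in a few lines, which makes it easy to rerun with different assumptions. Every input is copied from the list above; the variable names are illustrative.

```python
import math

EMAILS_PER_DAY = 300e9
PEAK_RATE = 5_000_000                        # emails/second at peak
avg_rate = EMAILS_PER_DAY / 86_400           # about 3.47M emails/second

feature_workers = math.ceil(PEAK_RATE / 50_000)     # 50K emails/second per worker
nb_cores = math.ceil(PEAK_RATE / 500_000)           # 500K classifications/second per core

neural_rate = PEAK_RATE * 15 / 100                  # 15% of emails reach the neural tier
per_gpu_rate = 64 * 1000 / 40                       # 64 emails per 40 ms batch
gpus = math.ceil(neural_rate / per_gpu_rate)

bytes_per_day = 500 * 4 * EMAILS_PER_DAY            # 500 float32 features per email
retained_per_day = 0.05 * bytes_per_day             # ~5% of feature vectors kept
retained_90d = 90 * retained_per_day

TB, PB = 1e12, 1e15
print(f"average rate: {avg_rate / 1e6:.2f}M emails/second")
print(f"feature workers: {feature_workers}, NB cores: {nb_cores}, GPUs: {gpus}")
print(f"features: {bytes_per_day / TB:.0f} TB/day raw, "
      f"{retained_per_day / TB:.0f} TB/day retained, {retained_90d / PB:.1f} PB over 90 days")
```

The GPU count is the most sensitive figure: it scales linearly with the share of emails that land in the uncertain band.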

The dominant bottleneck is GPU capacity for the neural classifier. The solution is a combination of: aggressively routing confident cases (score < 0.1 or > 0.9) to NB-only path, batching uncertain emails for GPU efficiency, and running NB model updates nightly to continuously shrink the uncertain zone.

Caching strategy: IP reputation and domain reputation data are pre-loaded into each feature extraction node’s memory at startup and refreshed asynchronously every 10 minutes. URL blocklist data uses a probabilistic Bloom filter (50MB, 1% false positive rate) for first-pass check, followed by hash lookup only for Bloom-positive results.
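A Bloom filter answers "definitely not present" or "possibly present" from a fixed-size bit array. The sketch below is a minimal illustration, not the production structure: the class, the double-hashing scheme over one SHA-256 digest, and the sizing helper are assumptions, with 50MB taken as 50 million bytes.

```python
import hashlib
import math


class BloomFilter:
    def __init__(self, num_bits: int, num_hashes: int):
        self.m = num_bits
        self.k = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive k bit positions from two halves of one digest
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1   # force odd so the stride is never zero
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item: str) -> bool:
        # False means definitely absent; True means "check the exact hash table"
        return all(self.bits[pos >> 3] & (1 << (pos & 7)) for pos in self._positions(item))


def capacity_for(num_bits: int, fp_rate: float) -> int:
    """Items an m-bit filter can hold at the target false-positive rate (optimal k)."""
    return int(num_bits * math.log(2) ** 2 / -math.log(fp_rate))


bits_50mb = 50_000_000 * 8
max_urls = capacity_for(bits_50mb, 0.01)                   # roughly 41.7 million entries
optimal_k = round(bits_50mb / max_urls * math.log(2))      # 7 hash functions
```

Under these assumptions a 50MB filter at a 1% false-positive rate covers on the order of 40 million blocklisted URLs, and only about 1 in 100 clean URLs falls through to the exact hash lookup.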

Real World

Gmail serves well over a billion active users. Google has publicly described using TensorFlow-based models in Gmail’s spam filtering alongside its existing rule-based and machine learning protections. A tiered design like the one described here maps naturally onto that tooling: TensorFlow Serving for the neural models with GPU batching, a high-throughput Naive Bayes implementation running on CPU, and model updates managed through TFX with automated evaluation gates.

Failure Modes and Recovery

Failure | Detection | Impact | Recovery
Feature extraction worker crash | Health check miss within 5s | Emails stuck in ingest queue, latency spike | Kubernetes restarts worker; queue drains automatically
IP reputation cache stale (>30 min) | Metrics alert on cache age | Lower precision on IP-based signals | Fallback to neutral score (0.5) for IP signals; async refresh
Neural classifier GPU OOM | Pod eviction event | Uncertain emails pile up; p99 latency exceeds budget | NB-only fallback mode activates; uncertain emails delivered conservatively
Bad model deployment | False positive rate spike within 10 min | Users complain legitimate email going to spam | Automated rollback to previous model version within 5 min
Kafka feedback topic lag | Consumer group lag metric | Feedback signals delayed, model training delayed | Add consumers; model stays on current version, no service impact
Redis sender-reputation cluster failure | Connection timeout on write | Sender volume signals missing from features | Features default to neutral; classification continues without behavioral signals

Watch Out

The most dangerous operational mistake is using classification errors as the primary alert signal. By the time false positive rate appears in user-facing metrics, millions of emails have already been misclassified. Shadow mode divergence metrics and per-model-version precision/recall computed every 5 minutes should trigger alerts long before users notice a problem.

Comparison of Approaches

Approach | Latency | Recall | Adaptability | Complexity | Best For
Rule-based only | 1ms | Low - spammers adapt | None - static | Low | Simple blocklist enforcement
Naive Bayes only | 5ms | Medium (92-95%) | Daily retraining | Low | High-throughput, cost-constrained
Neural only | 80-150ms | High (99%+) | Weekly retraining | High | Low-volume, accuracy-critical
NB + Neural ensemble | 5-100ms | Very high (99.5%+) | Daily NB, weekly NN | Medium | Production email at scale
LLM-based (GPT-4 class) | 500ms+ | Highest potential | Prompt tuning | Very high | Research / edge case analysis

The NB plus neural ensemble is the production choice because it achieves near-neural accuracy while keeping median latency at 5ms. The cost of running neural inference only on 15% of emails is 85% lower than running it on all emails.

Key Takeaways

  • Two-stage classification - routing easy cases to a fast model and uncertain cases to a slow model is the single most important design decision; it makes the latency SLA achievable without sacrificing accuracy.
  • Feature extraction must be entirely in-memory with locally cached lookup tables - any network call in the classification hot path will blow the latency budget at scale.
  • Naive Bayes baseline achieves ~95% accuracy at near-zero compute cost; its role is to shrink the uncertain zone so the expensive neural model sees as few emails as possible.
  • Neural classifier uses GPU batching with a short wait window to amortize inference cost; batch size 64 on a single GPU achieves ~1,600 emails/second at 40ms latency.
  • Shadow mode testing validates every new model against live traffic before deployment; the flip rate metric detects behavioral changes before users experience them.
  • Feedback signal aggregation requires minimum-confidence thresholds - raw user clicks are too noisy to use directly as training labels.
  • Model versioning enables fast rollback; a bad deployment can be reversed in under 5 minutes by pushing the previous model version through the registry.
  • Feedback poisoning is a real threat - spammers can attempt to manipulate training data, so feedback from any single source should be rate-limited in its influence on the model.
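
The GPU batching takeaway (fill a batch of up to 64, but never wait longer than a short window) can be sketched with a standard queue. The function name `collect_batch` and the 5ms default wait window are assumptions for illustration; only the batch size of 64 comes from the text above.

```python
import queue
import time


def collect_batch(q, max_batch=64, max_wait_s=0.005):
    """Block for the first item, then gather more until the batch is full
    or the wait window closes, whichever comes first."""
    batch = [q.get()]  # wait as long as needed for the first email
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # window closed: run inference on a partial batch
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break
    return batch
```

The wait window bounds the latency added by batching: under heavy load batches fill almost instantly, and under light load an email waits at most `max_wait_s` before inference starts.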

The counter-intuitive lesson is that the mathematically inferior model (Naive Bayes with its independence assumption) does most of the work in a production spam detection system, while the mathematically superior neural model is a specialized resource reserved for the hardest cases. Scale forces you to optimize for the median case, not the edge case.

Frequently Asked Questions

Q: Why not just use a single neural model for everything? A: At 3.5M emails/second, a neural model requiring 40ms per email would consume 140,000 GPU-seconds of compute every second if each email ran on its own. Even with batching, that works out to roughly 4,000 A100 GPUs running continuously. The NB-first approach reduces the neural model’s workload to 15% of emails, cutting GPU requirements by ~85%. Cost and latency both dictate the tiered approach.
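
The arithmetic behind this answer can be checked with a few lines. The sketch uses the batched throughput from the takeaways (64 emails per 40ms, about 1,600 emails/second per GPU) and assumes perfect utilization, so it gives a lower bound; the roughly 4,000 GPUs quoted above presumably includes headroom that is not modeled here.

```python
import math

EMAILS_PER_SECOND = 3_500_000      # from the article
PER_GPU_EMAILS_PER_SECOND = 1_600  # batch of 64 every 40 ms, from the takeaways


def gpus_needed(neural_fraction):
    """GPUs required at perfect utilization for a given share of traffic."""
    neural_emails = EMAILS_PER_SECOND * neural_fraction
    return math.ceil(neural_emails / PER_GPU_EMAILS_PER_SECOND)


all_traffic = gpus_needed(1.0)  # 2188 GPUs if every email hits the neural model
tiered = gpus_needed(0.15)      # 329 GPUs if only 15% are escalated
reduction = 1 - tiered / all_traffic
```

Whatever the absolute fleet size, the ratio is what matters: escalating 15% of traffic needs about 15% of the GPUs, which is the ~85% saving.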

Q: How do you prevent the feedback loop from being gamed? A: Several layers: minimum report count before labeling (5+ users), inter-annotator agreement threshold (70%+), rate limiting on how much any single user’s feedback moves the training distribution, and anomaly detection on feedback patterns that look coordinated. New feedback is also batched into offline training rather than real-time online updates, adding a natural delay that limits real-time gaming.
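
The first two layers, a minimum report count and an agreement threshold, can be sketched as a small labeling function. It assumes feedback arrives as two counts per email cluster ("report spam" and "not spam" actions) and that one report equals one user; the function name and return values are hypothetical.

```python
def label_from_reports(spam_reports, not_spam_reports,
                       min_reports=5, min_agreement=0.70):
    """Return "spam", "ham", or None when the feedback is too thin or too split.

    min_reports and min_agreement mirror the 5+ users and 70%+ agreement
    thresholds described above.
    """
    total = spam_reports + not_spam_reports
    if total < min_reports:
        return None  # not enough independent signals yet
    if spam_reports / total >= min_agreement:
        return "spam"
    if not_spam_reports / total >= min_agreement:
        return "ham"
    return None  # users disagree; do not use as a training label
```

Returning `None` for ambiguous cases keeps noisy or contested feedback out of the training set entirely, rather than guessing a label.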

Q: Why not retrain the neural model in real-time? A: Neural model retraining requires hours of compute even with incremental techniques. More importantly, the model needs proper holdout evaluation before deployment to detect regressions - you cannot safely deploy a model that was trained 5 minutes ago on potentially poisoned data. The NB model can be retrained nightly because its evaluation is fast (minutes) and its surface area for poisoning is smaller.

Q: What happens during the shadow mode period if the new model is worse? A: The shadow mode evaluator checks the flip rate and false positive direction continuously. If the new model would flip more legitimate emails to spam than the production model, the shadow run is terminated and the model is rejected. The model registry records the shadow test results permanently so teams can audit why a version was not promoted.
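
A shadow-mode comparison could be summarized as in the sketch below. It compares the production and shadow verdicts on the same emails and tracks the direction of each flip. The rejection thresholds (1% overall flip rate, 0.1% inbox-to-spam flips) are assumptions for illustration; the article does not state the actual bounds.

```python
def shadow_report(pairs):
    """Summarize disagreements between production and shadow verdicts.

    pairs: iterable of (production_is_spam, shadow_is_spam) per email.
    """
    total = to_spam = to_inbox = 0
    for production, shadow in pairs:
        total += 1
        if production != shadow:
            if shadow:
                to_spam += 1   # shadow would newly block this email
            else:
                to_inbox += 1  # shadow would newly deliver this email
    flips = to_spam + to_inbox
    return {
        "total": total,
        "flip_rate": flips / total if total else 0.0,
        "to_spam_rate": to_spam / total if total else 0.0,
        "to_inbox_rate": to_inbox / total if total else 0.0,
    }


def should_reject(report, max_flip_rate=0.01, max_to_spam_rate=0.001):
    """Reject when behavior changes too much, or shifts toward blocking."""
    return (report["flip_rate"] > max_flip_rate
            or report["to_spam_rate"] > max_to_spam_rate)
```

Tracking flip direction separately matters because inbox-to-spam flips are the ones that risk false positives, which this system treats as the more serious error.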

Q: How do you handle bulk marketing emails that users sometimes want? A: Marketing email classification uses a separate “promotional” category distinct from spam. The classification threshold for promotional is much more permissive - such mail is routed to the Promotions tab rather than the spam folder. User preference signals (did the user open this sender’s previous promotional email?) are strong override signals that prevent promotional classification for desired newsletters.

Q: What is the accuracy impact of the NB-only path for the 85% of confident emails? A: For emails where NB scores below 0.1 or above 0.9, the classifier has very high confidence and the neural model would agree ~99.5% of the time. The 0.5% disagreement cases are tracked via shadow mode, and if the neural model’s overrides correlate with user feedback, that is a signal to tighten the NB confidence thresholds.
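
The confidence-band routing described here can be sketched as follows. The 0.1 and 0.9 thresholds come from the answer above; the function name, the 0.5 spam cutoff for the neural score, and passing the neural model as a zero-argument callable are assumptions made for the example.

```python
def classify(nb_score, neural_score_fn, low=0.1, high=0.9, spam_cutoff=0.5):
    """Return (verdict, path), escalating only uncertain Naive Bayes scores.

    neural_score_fn is called lazily, so confident emails never pay for
    neural inference.
    """
    if nb_score < low:
        return ("ham", "nb")
    if nb_score > high:
        return ("spam", "nb")
    neural_score = neural_score_fn()  # expensive call, uncertain zone only
    verdict = "spam" if neural_score >= spam_cutoff else "ham"
    return (verdict, "neural")
```

Tightening the thresholds (for example 0.05 and 0.95) widens the uncertain zone and sends more traffic to the neural model, which is the lever mentioned at the end of the answer.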

Interview Questions

Q: Design a system to detect spam emails at Gmail’s scale. Expected depth: Cover the tiered classification approach (fast Naive Bayes + on-demand neural), feature extraction with locally cached reputation data, feedback loop architecture, model versioning with shadow mode testing. Discuss the GPU cost problem and how batching solves it. Name the false positive vs false negative tradeoff and why false positives are treated more seriously.

Q: How would you update the spam classifier model without restarting the classification service? Expected depth: Describe hot model swapping - the inference server holds a reference to the current model object, a background thread polls the model registry, and an atomic swap replaces the reference when a new version is available. Discuss the deployment process: shadow mode first, then percentage rollout (1%, 5%, 100%). Explain how rollback works - just push the previous version’s checksum to the registry.
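
A minimal sketch of the hot-swap mechanism is shown below. `ModelHolder`, `poll_once`, and the two registry callables are hypothetical names standing in for the real inference server and registry client; the sketch also relies on attribute assignment being atomic in CPython, which is an implementation detail rather than a language guarantee.

```python
import threading


class ModelHolder:
    """Holds the live model; readers never block on a swap."""

    def __init__(self, model, version):
        self._current = (model, version)
        self._swap_lock = threading.Lock()  # serializes writers only

    def get(self):
        # A single attribute read returns a consistent (model, version) pair.
        return self._current

    def swap(self, model, version):
        with self._swap_lock:
            self._current = (model, version)


def poll_once(holder, latest_version, load_model):
    """Swap in the registry's latest version if it differs from the live one.

    latest_version and load_model are caller-supplied callables standing in
    for the model registry client.
    """
    _, live_version = holder.get()
    wanted = latest_version()
    if wanted == live_version:
        return False
    new_model = load_model(wanted)  # load fully before swapping
    holder.swap(new_model, wanted)
    return True
```

A background thread would call `poll_once` on a timer. Rollback needs no special path: when the registry points back at the previous version, the next poll swaps it in like any other update.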

Q: The false positive rate suddenly spikes to 2% after a model update. Walk me through your diagnosis. Expected depth: Check which model version is deployed and when it was promoted. Pull shadow mode evaluation results from before promotion - was the flip rate within bounds? Compare the training dataset for this version vs previous - was there label noise or a distribution shift? Check if there is a particular sender domain or email type dominating the false positives. Roll back immediately while diagnosing - do not wait for root cause before restoring service.
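
The "is one sender domain dominating?" check is a simple grouping exercise. The sketch below assumes you already have the sender addresses of confirmed false positives; the function name is hypothetical.

```python
from collections import Counter


def dominant_fp_domains(false_positive_senders, top_n=3):
    """Return the top sender domains and their share of all false positives."""
    domains = Counter(
        address.rsplit("@", 1)[-1].lower()
        for address in false_positive_senders
    )
    total = sum(domains.values())
    return [(domain, count / total) for domain, count in domains.most_common(top_n)]
```

If one domain accounts for most of the spike, the regression is likely tied to a specific feature (for example, a reputation entry) rather than to the model's general behavior.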

Q: How would you design the sender reputation system to handle sudden volume spikes? Expected depth: Sender reputation uses sliding window counters in Redis with per-sender keys. The counter key expires after the window. During a spam campaign, a sender’s spam report ratio rises quickly. Discuss rate limiting on reputation score updates to prevent gaming, and the tradeoff between window size (small window reacts faster, large window is more statistically stable). Name the “new sender” cold-start problem - a brand-new domain has no history.
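
The sliding-window idea and the cold-start guard can be sketched in memory. This stands in for the Redis counters described above and is not a Redis client; the class, the `spam_ratio` helper, and the minimum volume of 20 are assumptions for illustration. Timestamps are passed in explicitly and are expected to be non-decreasing.

```python
from collections import defaultdict, deque


class SlidingWindowCounter:
    """Counts events per sender within the last window_s seconds."""

    def __init__(self, window_s):
        self.window_s = window_s
        self._events = defaultdict(deque)

    def record(self, sender, now):
        self._events[sender].append(now)

    def count(self, sender, now):
        events = self._events[sender]
        # Drop timestamps that have aged out of the window.
        while events and events[0] <= now - self.window_s:
            events.popleft()
        return len(events)


def spam_ratio(sent, reported, sender, now, min_volume=20):
    """Return reported/sent within the window, or None for cold-start senders."""
    n_sent = sent.count(sender, now)
    if n_sent < min_volume:
        return None  # too little history to score this sender
    return reported.count(sender, now) / n_sent
```

The window size is the tradeoff named above: a short window lets the ratio react within minutes of a campaign starting, while a long window smooths out noise from a handful of reports.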

Q: How do you measure classifier performance in production without ground truth labels? Expected depth: Ground truth comes from user feedback with the confidence thresholds discussed. For emails with no feedback, you use proxy signals: open rate (opened emails are probably not spam), reply rate, move-from-spam-to-inbox action. Discuss the delayed ground truth problem - feedback for an email may not arrive for hours or days. Offline evaluation on a labeled holdout set gives a leading indicator; user-reported metrics give a lagging but definitive signal.
