Build a Real-Time Fraud Detection Engine


performance scalability distributed-systems

System Design Deep Dive

Real-Time Fraud Detection Engine

Evaluating every payment transaction in under 100ms using behavioral, velocity, and network signals with continuously updated risk models

14 min readAdvancedBlueprint

A user taps “Pay” at a checkout terminal in Mumbai. Within the next 80 milliseconds, your system must decide whether this transaction is genuine or fraudulent - before the payment processor times out and before the customer gets annoyed waiting for the spinner. The decision draws on hundreds of signals: is this the usual device the user shops from? Have they made 12 transactions in the last 5 minutes from three different countries? Does the merchant category match previous spending patterns? Has this card number appeared on a known-compromised list in the last hour?

Fraud is an adversarial cat-and-mouse problem. Fraudsters actively probe your defenses, testing small transactions before large ones, cycling through stolen cards, and mimicking legitimate behavior patterns to avoid velocity triggers. Your detection system must evolve continuously - a static rule set becomes obsolete within days as attackers learn its boundaries. At the same time, every false positive blocks a legitimate customer, damages trust, and costs revenue. A 0.5% false positive rate on a system processing 10,000 transactions per second blocks 50 legitimate payments every second.

The core engineering tension is latency versus accuracy. More sophisticated models - deep neural nets, graph-based fraud network detection, complex ensemble methods - take longer to run. Simple threshold rules are fast but blind to subtle pattern combinations. Waiting for more signals improves accuracy but increases latency. Running every model in parallel wastes compute on transactions that a fast rule would have blocked immediately. The system must be a pipeline that can exit early on high-confidence decisions while escalating uncertain cases through progressively more expensive analysis.

We need to solve for three things simultaneously: (1) sub-100ms scoring that consumes pre-computed features and makes a decision before the payment processor times out, (2) continuous model evolution that updates risk models with new fraud patterns without requiring a full redeployment, and (3) false positive minimization that avoids blocking legitimate users even as fraud patterns shift.

Requirements and Constraints

Functional Requirements

  • Score every incoming payment transaction synchronously before approval, returning APPROVE, DECLINE, or REVIEW within 100ms
  • Compute velocity features: transaction count and volume per user in the last 1 minute, 5 minutes, 1 hour, and 24 hours
  • Evaluate rule-based checks: velocity thresholds, device fingerprint consistency, IP geolocation anomalies, card BIN risk scoring
  • Run ML ensemble scoring combining gradient boosting and a neural network for behavioral pattern detection
  • Support a shadow mode where new models run alongside production without affecting decisions, for validation before promotion
  • Feed approved and declined transaction outcomes back into the feature store for continuous model updates
  • Allow manual case review workflow for REVIEW decisions, with analyst feedback flowing back into training data

Non-Functional Requirements

  • Latency: P99 scoring decision under 100ms end-to-end, P50 under 30ms
  • Throughput: 10,000 transactions per second sustained, 50,000 TPS burst (flash sale periods)
  • Availability: 99.99% uptime for the scoring path (failsafe to APPROVE on system failure to avoid blocking payments)
  • False positive rate: under 0.3% of legitimate transactions incorrectly declined
  • False negative rate: under 0.8% of fraudulent transactions incorrectly approved (tunable via threshold)
  • Feature freshness: velocity counters updated within 500ms of transaction completion
  • Model update latency: new model versions promoted to production within 15 minutes of approval
  • Audit log retention: all scoring decisions with full feature vectors stored for 7 years (regulatory requirement)

Constraints and Assumptions

  • Payment processor has a 120ms timeout - we must respond in under 100ms to leave 20ms buffer
  • We do not store raw card numbers - only tokenized card IDs and card BIN (first 6 digits)
  • PCI-DSS compliance requires all cardholder data to flow through isolated network segments
  • We assume a user identity graph is available as a pre-built service (not built here)
  • Geographic distribution: scoring nodes deployed in 4 regions; features replicated with eventual consistency

High-Level Architecture

The fraud engine has six major components working together to deliver a sub-100ms scoring decision for every payment.

Real-time fraud detection engine architecture showing transaction flow from API gateway through feature extraction, rule engine, ML ensemble, and decision engine

The API Gateway receives the raw transaction payload from the payment processor and immediately strips sensitive fields before forwarding downstream. It enforces the 100ms SLA by setting a deadline on the entire scoring pipeline context.

The Feature Extractor is the synchronous hot path. It executes feature lookups in parallel: querying the online feature store for pre-computed behavioral features, hitting Redis for velocity counters, and calling the device fingerprint service. All lookups run concurrently with a combined timeout of 20ms. Missing features default to safe baseline values rather than failing the transaction.

The Feature Store has two layers: an offline store (columnar database updated by batch jobs) and an online store (Redis cluster) that serves low-latency feature lookups. The online store holds pre-computed user risk scores, historical velocity statistics, and device trust signals updated within seconds of new transactions.

The Rule Engine applies deterministic, interpretable checks that can exit early. Rules catch obvious fraud patterns - impossible travel, velocity spikes, known-bad indicators - in under 5ms. A rule firing with sufficient confidence returns an immediate DECLINE without invoking the ML models.

The ML Ensemble runs gradient boosting and a neural network in parallel, combining their scores with learned weights. This layer handles subtle pattern combinations that rules cannot express. It accounts for 60-80ms of the total budget when invoked, which is why the Rule Engine’s early exit path is critical.

The Decision Engine receives the rule verdict and ML scores, applies the final threshold logic, writes the decision to the audit log, and returns the response to the payment processor. It also enqueues the transaction and decision to the feedback loop pipeline.

Key Insight

The architecture separates fast deterministic rules from slow probabilistic models into sequential stages with early exit. This means 40-60% of transactions never reach the ML ensemble - they are decided by rules in under 10ms. The ML compute budget is spent only on the ambiguous middle ground where rules produce no confident answer.

The Feature Store

The Feature Store’s job is to serve pre-computed features to the scoring pipeline in under 15ms, handling 10,000 reads per second per transaction while staying fresh enough that a fraud pattern detected seconds ago already affects scoring.

The feature store has two distinct halves with very different access patterns. The offline store is a columnar database (Apache Parquet on S3 backed by Hive/Spark) that holds historical aggregations: 30-day spending patterns, merchant category distributions, device usage history. Batch jobs run every 15 minutes to recompute these features and push updated values to the online store. The offline store is queried by training pipelines, not by the scoring path.

The online store is a Redis cluster holding the subset of features that change frequently and need sub-millisecond reads. Each user has a hash key containing their pre-computed feature vector: 30-day average transaction amount, preferred merchant categories, usual transaction hours, device trust score, and current velocity counters. Velocity counters are stored separately as sorted sets for sliding window computation.

import redis
import json
import time
from typing import Optional

# Feature store client for the scoring hot path
class OnlineFeatureStore:
    def __init__(self, redis_cluster):
        self.redis = redis_cluster
        self.default_features = {
            "avg_txn_amount_30d": 500.0,
            "txn_count_30d": 50,
            "device_trust_score": 0.5,
            "preferred_hour_match": 0.5,
            "merchant_category_match": 0.5,
        }

    def get_user_features(self, user_id: str, timeout_ms: int = 15) -> dict:
        """Fetch pre-computed user feature vector from online store."""
        key = f"features:user:{user_id}"
        raw = self.redis.get(key)
        if raw is None:
            # Cold start: new user with no history
            return dict(self.default_features)
        return json.loads(raw)

    def get_velocity_features(self, user_id: str, card_token: str) -> dict:
        """Compute sliding window velocity counters in real time."""
        now_ms = int(time.time() * 1000)
        pipe = self.redis.pipeline(transaction=False)

        # Sorted sets keyed by user and card with score=timestamp
        user_key = f"velocity:user:{user_id}"
        card_key = f"velocity:card:{card_token}"

        # Count transactions in sliding windows (1m, 5m, 1h, 24h)
        windows = [60_000, 300_000, 3_600_000, 86_400_000]
        for window_ms in windows:
            cutoff = now_ms - window_ms
            pipe.zcount(user_key, cutoff, now_ms)
            pipe.zrangebyscore(user_key, cutoff, now_ms, withscores=False)

        results = pipe.execute()

        # Parse counts and amounts from sorted set members
        user_counts = {}
        labels = ["1m", "5m", "1h", "24h"]
        for i, label in enumerate(labels):
            count = results[i * 2]
            members = results[i * 2 + 1]
            total_amount = sum(
                float(m.split(":")[1]) for m in members if ":" in m
            )
            user_counts[f"txn_count_{label}"] = count
            user_counts[f"txn_amount_{label}"] = total_amount

        return user_counts

    def record_transaction(self, user_id: str, card_token: str,
                           amount: float, txn_id: str):
        """Write velocity counter entry after transaction completes."""
        now_ms = int(time.time() * 1000)
        pipe = self.redis.pipeline(transaction=False)

        member = f"{txn_id}:{amount:.2f}"
        user_key = f"velocity:user:{user_id}"
        card_key = f"velocity:card:{card_token}"

        pipe.zadd(user_key, {member: now_ms})
        pipe.zadd(card_key, {member: now_ms})

        # Prune entries older than 25 hours to control memory
        cutoff = now_ms - 90_000_000
        pipe.zremrangebyscore(user_key, 0, cutoff)
        pipe.zremrangebyscore(card_key, 0, cutoff)

        # TTL: expire keys after 48h of inactivity
        pipe.expire(user_key, 172800)
        pipe.expire(card_key, 172800)

        pipe.execute()
Key Insight

Velocity counters use Redis sorted sets with the Unix timestamp in milliseconds as the score and the transaction ID plus amount as the member. A single ZCOUNT call gives the transaction count in any time window without scanning all members, and ZRANGEBYSCORE gives the full list to sum amounts. This is O(log N) per query instead of O(N) for a full scan.

The Rule Engine

The Rule Engine’s job is to apply fast, deterministic checks that catch obvious fraud patterns in under 5ms, returning a confident DECLINE for the clearest cases before wasting ML compute cycles.

Rules are evaluated in priority order. Each rule produces a signal strength from 0.0 to 1.0 and an action: DECLINE (confidence above block threshold), FLAG (adds to the ML feature vector), or PASS. Rules are defined in a versioned configuration file loaded at startup and hot-reloaded without restarts.

from dataclasses import dataclass
from typing import Optional, List
import math

@dataclass
class Transaction:
    txn_id: str
    user_id: str
    card_token: str
    amount: float
    merchant_id: str
    merchant_category: str
    ip_address: str
    device_fingerprint: str
    billing_country: str
    ip_country: str
    timestamp_ms: int

@dataclass
class RuleResult:
    rule_name: str
    signal: float      # 0.0 to 1.0
    action: str        # "DECLINE", "FLAG", "PASS"
    reason: str

class RuleEngine:
    BLOCK_THRESHOLD = 0.85

    def evaluate(self, txn: Transaction, features: dict) -> List[RuleResult]:
        results = []
        results.append(self._velocity_rule(txn, features))
        results.append(self._geo_mismatch_rule(txn, features))
        results.append(self._device_consistency_rule(txn, features))
        results.append(self._amount_spike_rule(txn, features))
        results.append(self._card_bin_risk_rule(txn, features))
        results.append(self._impossible_travel_rule(txn, features))
        return results

    def _velocity_rule(self, txn: Transaction, features: dict) -> RuleResult:
        count_1m = features.get("txn_count_1m", 0)
        count_5m = features.get("txn_count_5m", 0)
        amount_1h = features.get("txn_amount_1h", 0.0)

        # Hard thresholds tuned from historical fraud data
        if count_1m >= 5:
            return RuleResult("velocity_1m", 0.95, "DECLINE",
                              f"5+ transactions in 1 minute: count={count_1m}")
        if count_5m >= 12:
            return RuleResult("velocity_5m", 0.92, "DECLINE",
                              f"12+ transactions in 5 minutes: count={count_5m}")
        if amount_1h > 50000:
            return RuleResult("velocity_amount", 0.88, "DECLINE",
                              f"Amount exceeds hourly limit: {amount_1h:.0f}")

        # Softer signal for elevated velocity
        signal = min(0.6, count_5m / 20.0)
        return RuleResult("velocity_soft", signal, "FLAG",
                          f"Elevated velocity: {count_5m} txns/5m")

    def _geo_mismatch_rule(self, txn: Transaction, features: dict) -> RuleResult:
        if txn.billing_country and txn.ip_country:
            if txn.billing_country != txn.ip_country:
                # High-risk country pair combinations have different weights
                high_risk_pairs = {("US", "NG"), ("GB", "RU"), ("AU", "CN")}
                pair = (txn.billing_country, txn.ip_country)
                if pair in high_risk_pairs:
                    return RuleResult("geo_mismatch_high_risk", 0.82, "DECLINE",
                                      f"High-risk country pair: {pair}")
                return RuleResult("geo_mismatch", 0.55, "FLAG",
                                  f"Billing/IP country mismatch: {txn.billing_country} vs {txn.ip_country}")
        return RuleResult("geo_mismatch", 0.0, "PASS", "geo consistent")

    def _device_consistency_rule(self, txn: Transaction, features: dict) -> RuleResult:
        known_device = features.get("last_known_device_fingerprint")
        device_trust = features.get("device_trust_score", 1.0)

        if known_device and txn.device_fingerprint != known_device:
            if device_trust < 0.2:
                # New device AND low trust score for account
                return RuleResult("device_mismatch_low_trust", 0.78, "FLAG",
                                  "New device on low-trust account")
        return RuleResult("device_consistent", 0.0, "PASS", "device known")

    def _amount_spike_rule(self, txn: Transaction, features: dict) -> RuleResult:
        avg_amount = features.get("avg_txn_amount_30d", 0.0)
        if avg_amount > 0 and txn.amount > avg_amount * 8:
            signal = min(0.9, 0.5 + math.log10(txn.amount / avg_amount) / 4)
            return RuleResult("amount_spike", signal, "FLAG",
                              f"Amount {txn.amount:.0f} vs avg {avg_amount:.0f}")
        return RuleResult("amount_normal", 0.0, "PASS", "amount normal")

    def _card_bin_risk_rule(self, txn: Transaction, features: dict) -> RuleResult:
        bin_risk = features.get("card_bin_risk_score", 0.0)
        if bin_risk > 0.9:
            return RuleResult("bin_high_risk", 0.88, "DECLINE",
                              f"Card BIN on high-risk list: score={bin_risk}")
        return RuleResult("bin_ok", bin_risk * 0.5, "FLAG" if bin_risk > 0.5 else "PASS",
                          f"BIN risk: {bin_risk:.2f}")

    def _impossible_travel_rule(self, txn: Transaction, features: dict) -> RuleResult:
        last_txn_country = features.get("last_txn_country")
        last_txn_ts_ms = features.get("last_txn_timestamp_ms", 0)
        if last_txn_country and last_txn_country != txn.ip_country:
            hours_elapsed = (txn.timestamp_ms - last_txn_ts_ms) / 3_600_000
            if hours_elapsed < 2.0:
                return RuleResult("impossible_travel", 0.91, "DECLINE",
                                  f"Transacted in {last_txn_country} {hours_elapsed:.1f}h ago")
        return RuleResult("travel_ok", 0.0, "PASS", "travel plausible")
Watch Out

Rule thresholds calibrated on historical data will drift as fraud patterns evolve. A velocity threshold of “5 transactions per minute” catches card testing attacks today, but fraudsters adjust to 4 transactions per minute once they discover the threshold. Build an automated threshold review pipeline that re-evaluates rule effectiveness every 30 days using the feedback loop data, and alert when a rule’s true positive rate drops below baseline.

The ML Ensemble

The ML Ensemble’s job is to evaluate the behavioral patterns that rules cannot express - subtle combinations of signals that individually look normal but together indicate fraud - returning a calibrated probability score in under 80ms.

The ensemble combines three models with learned combination weights. The first is a gradient boosting model (XGBoost or LightGBM) trained on 90 days of labeled transactions. It handles tabular features well - amounts, velocity counts, merchant categories, device age - and produces well-calibrated probabilities. Inference takes 10-15ms. The second is a neural network that processes sequential transaction history as a time-series, capturing behavioral drift that gradient boosting misses. It takes 30-50ms to run. The third is a rule signal aggregator that converts the rule engine’s FLAG outputs into a single weighted score, which serves as a regularization term for the other two models.

package ensemble

import (
    "context"
    "sync"
    "time"
)

type ScoringRequest struct {
    TxnID       string
    Features    map[string]float64
    RuleSignals []RuleSignal
}

type ModelScore struct {
    ModelName   string
    Score       float64
    LatencyMs   int64
    Err         error
}

type EnsembleResult struct {
    FinalScore       float64
    GBScore          float64
    NNScore          float64
    RuleScore        float64
    ConfidenceLevel  string  // "HIGH", "MEDIUM", "LOW"
    TotalLatencyMs   int64
}

type MLEnsemble struct {
    gbModel    GradientBoostModel
    nnModel    NeuralNetModel
    weights    EnsembleWeights
    nnTimeout  time.Duration
}

type EnsembleWeights struct {
    GBWeight   float64  // 0.45
    NNWeight   float64  // 0.40
    RuleWeight float64  // 0.15
}

func (e *MLEnsemble) Score(ctx context.Context, req ScoringRequest) EnsembleResult {
    start := time.Now()

    // Aggregate rule signals into a single score
    ruleScore := e.aggregateRuleSignals(req.RuleSignals)

    // High-confidence rule block - skip ML entirely
    if ruleScore > 0.85 {
        return EnsembleResult{
            FinalScore:      ruleScore,
            RuleScore:       ruleScore,
            ConfidenceLevel: "HIGH",
            TotalLatencyMs:  time.Since(start).Milliseconds(),
        }
    }

    // Run GB and NN in parallel with independent timeouts
    var wg sync.WaitGroup
    gbCh := make(chan ModelScore, 1)
    nnCh := make(chan ModelScore, 1)

    wg.Add(2)
    go func() {
        defer wg.Done()
        gbStart := time.Now()
        score, err := e.gbModel.Predict(ctx, req.Features)
        gbCh <- ModelScore{
            ModelName: "gradient_boost",
            Score:     score,
            LatencyMs: time.Since(gbStart).Milliseconds(),
            Err:       err,
        }
    }()

    go func() {
        defer wg.Done()
        // Neural net has its own deadline - 60ms budget
        nnCtx, cancel := context.WithTimeout(ctx, e.nnTimeout)
        defer cancel()
        nnStart := time.Now()
        score, err := e.nnModel.Predict(nnCtx, req.Features)
        nnCh <- ModelScore{
            ModelName: "neural_net",
            Score:     score,
            LatencyMs: time.Since(nnStart).Milliseconds(),
            Err:       err,
        }
    }()

    wg.Wait()
    close(gbCh)
    close(nnCh)

    gbResult := <-gbCh
    nnResult := <-nnCh

    gbScore := gbResult.Score
    nnScore := nnResult.Score

    // Fall back gracefully if NN times out
    if nnResult.Err != nil {
        // Rebalance weights when NN is unavailable
        finalScore := (gbScore * 0.75) + (ruleScore * 0.25)
        return EnsembleResult{
            FinalScore:      finalScore,
            GBScore:         gbScore,
            RuleScore:       ruleScore,
            ConfidenceLevel: "MEDIUM",
            TotalLatencyMs:  time.Since(start).Milliseconds(),
        }
    }

    finalScore := (gbScore * e.weights.GBWeight) +
        (nnScore * e.weights.NNWeight) +
        (ruleScore * e.weights.RuleWeight)

    confidence := "MEDIUM"
    if finalScore > 0.80 || finalScore < 0.15 {
        confidence = "HIGH"
    } else if finalScore > 0.55 || finalScore < 0.35 {
        confidence = "LOW"
    }

    return EnsembleResult{
        FinalScore:      finalScore,
        GBScore:         gbScore,
        NNScore:         nnScore,
        RuleScore:       ruleScore,
        ConfidenceLevel: confidence,
        TotalLatencyMs:  time.Since(start).Milliseconds(),
    }
}

func (e *MLEnsemble) aggregateRuleSignals(signals []RuleSignal) float64 {
    if len(signals) == 0 {
        return 0.0
    }
    // Weight by rule confidence, use max for DECLINE signals
    maxDeclineSignal := 0.0
    flagSum := 0.0
    flagCount := 0
    for _, sig := range signals {
        if sig.Action == "DECLINE" && sig.Signal > maxDeclineSignal {
            maxDeclineSignal = sig.Signal
        } else if sig.Action == "FLAG" {
            flagSum += sig.Signal
            flagCount++
        }
    }
    if maxDeclineSignal > 0 {
        return maxDeclineSignal
    }
    if flagCount == 0 {
        return 0.0
    }
    return (flagSum / float64(flagCount)) * 0.7
}
Three-stage scoring pipeline showing rule engine fast path, gradient boosting, and neural network with early exit on high-confidence decisions
Real World

Stripe Radar uses a similar ensemble architecture, combining thousands of hand-crafted features with gradient boosting and neural network models. Their key insight - published in engineering posts - is that the model architecture matters less than feature quality. A well-featured gradient boosting model consistently outperforms a poorly-featured neural network. Radar evaluates every transaction on Stripe’s platform and has processed over 100 billion transactions, making their fraud signal dataset one of the richest in the industry.

Model Shadow Mode

Shadow mode’s job is to validate new model versions alongside the production model using real traffic without affecting any live decisions, giving engineers statistical confidence before promotion.

When a new model version passes offline evaluation (AUC-ROC, precision-recall curves on held-out test data), it enters shadow mode. In shadow mode, the new model runs on every transaction in parallel with the production model, its score is recorded alongside the production decision, but its score does not influence the outcome. After 24-48 hours of shadow traffic, the team compares the shadow model’s decisions against actual fraud outcomes to measure whether it would have performed better or worse.

The critical design decisions for shadow mode are isolation (shadow model failures must not affect production latency), asynchronous execution (shadow scoring can happen off the critical path after the production decision is returned), and statistical validity (enough traffic volume to achieve significance before promotion).

import asyncio
import logging
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger(__name__)

@dataclass
class ShadowResult:
    model_version: str
    score: float
    latency_ms: int
    error: Optional[str]

class ShadowModeOrchestrator:
    def __init__(self, production_model, shadow_models: list,
                 shadow_store, metrics_client):
        self.production = production_model
        self.shadows = shadow_models
        self.store = shadow_store
        self.metrics = metrics_client

    async def score_with_shadows(self, txn_id: str, features: dict) -> dict:
        """
        Run production model synchronously (on critical path).
        Launch shadow models as fire-and-forget async tasks.
        """
        # Production score is blocking - on the critical path
        prod_result = self.production.score(features)

        # Shadow models run asynchronously - do not await here
        asyncio.create_task(
            self._run_shadows_async(txn_id, features, prod_result)
        )

        # Return immediately with production result only
        return prod_result

    async def _run_shadows_async(self, txn_id: str, features: dict,
                                  prod_result: dict):
        """Shadow scoring runs entirely off the critical path."""
        shadow_results = []
        for model in self.shadows:
            try:
                import time
                start = time.monotonic()
                # Each shadow has its own timeout - failure is silent
                score = await asyncio.wait_for(
                    asyncio.to_thread(model.score, features),
                    timeout=0.5  # 500ms shadow budget - much more lenient
                )
                latency_ms = int((time.monotonic() - start) * 1000)
                shadow_results.append(ShadowResult(
                    model_version=model.version,
                    score=score["final_score"],
                    latency_ms=latency_ms,
                    error=None,
                ))
                self.metrics.histogram(
                    "shadow_model_latency_ms",
                    latency_ms,
                    tags={"version": model.version}
                )
            except asyncio.TimeoutError:
                shadow_results.append(ShadowResult(
                    model_version=model.version,
                    score=-1.0,
                    latency_ms=500,
                    error="timeout",
                ))
                logger.warning("Shadow model %s timed out for txn %s",
                               model.version, txn_id)
            except Exception as e:
                shadow_results.append(ShadowResult(
                    model_version=model.version,
                    score=-1.0,
                    latency_ms=0,
                    error=str(e),
                ))

        # Persist shadow scores for later analysis
        await self.store.save_shadow_comparison(
            txn_id=txn_id,
            production_score=prod_result["final_score"],
            production_decision=prod_result["decision"],
            shadow_results=shadow_results,
        )
Key Insight

Shadow mode scoring runs after the production response is sent - not before. The production decision is returned to the payment processor immediately, then the shadow models run asynchronously. This means shadow mode adds zero latency to the critical path. The shadow results are stored and compared against actual fraud outcomes reported in the next 24-72 hours via chargeback data.

Data Model

-- Core transaction scoring record
-- Written once per transaction, never updated
CREATE TABLE fraud_scoring_decisions (
    txn_id              UUID PRIMARY KEY,
    user_id             UUID NOT NULL,
    card_token          VARCHAR(64) NOT NULL,
    merchant_id         VARCHAR(64) NOT NULL,
    amount_paise        BIGINT NOT NULL,   -- integer paise to avoid float precision
    currency            CHAR(3) NOT NULL,
    decision            VARCHAR(16) NOT NULL CHECK (decision IN ('APPROVE','DECLINE','REVIEW')),
    final_score         NUMERIC(5,4) NOT NULL,  -- 0.0000 to 1.0000
    decision_reason     VARCHAR(128),
    rule_signals        JSONB NOT NULL DEFAULT '[]',
    gb_score            NUMERIC(5,4),
    nn_score            NUMERIC(5,4),
    rule_score          NUMERIC(5,4),
    model_version       VARCHAR(32) NOT NULL,
    feature_vector      JSONB NOT NULL,    -- full features for audit + retraining
    scoring_latency_ms  INT NOT NULL,
    decided_at          TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    ip_address          INET,
    device_fingerprint  VARCHAR(128),
    ip_country          CHAR(2),
    billing_country     CHAR(2)
) PARTITION BY RANGE (decided_at);

-- Monthly partitions for efficient archival
CREATE TABLE fraud_scoring_decisions_2026_06
    PARTITION OF fraud_scoring_decisions
    FOR VALUES FROM ('2026-06-01') TO ('2026-07-01');

-- Fast lookups by user and card for velocity queries
CREATE INDEX idx_fsd_user_decided ON fraud_scoring_decisions (user_id, decided_at DESC);
CREATE INDEX idx_fsd_card_decided ON fraud_scoring_decisions (card_token, decided_at DESC);
CREATE INDEX idx_fsd_decision ON fraud_scoring_decisions (decision, decided_at DESC);

-- Feedback loop: actual fraud outcomes reported via chargeback
CREATE TABLE fraud_outcomes (
    outcome_id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    txn_id              UUID NOT NULL REFERENCES fraud_scoring_decisions(txn_id),
    outcome_type        VARCHAR(32) NOT NULL CHECK (
        outcome_type IN ('confirmed_fraud','confirmed_legitimate',
                         'chargeback','analyst_approved','analyst_declined')
    ),
    reported_by         VARCHAR(64),  -- 'chargeback_processor', 'analyst_id', 'customer'
    reported_at         TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    notes               TEXT,
    -- Derived label for retraining: 1=fraud, 0=legitimate
    training_label      SMALLINT GENERATED ALWAYS AS (
        CASE outcome_type
            WHEN 'confirmed_fraud' THEN 1
            WHEN 'chargeback' THEN 1
            WHEN 'confirmed_legitimate' THEN 0
            WHEN 'analyst_approved' THEN 0
            ELSE NULL
        END
    ) STORED
);

CREATE INDEX idx_outcomes_txn ON fraud_outcomes (txn_id);
CREATE INDEX idx_outcomes_reported ON fraud_outcomes (reported_at DESC);
CREATE INDEX idx_outcomes_label ON fraud_outcomes (training_label, reported_at DESC)
    WHERE training_label IS NOT NULL;

-- Model registry for version management and shadow mode tracking
CREATE TABLE model_versions (
    version_id          VARCHAR(32) PRIMARY KEY,   -- e.g. "gb-v4.2.1"
    model_type          VARCHAR(32) NOT NULL,       -- 'gradient_boost', 'neural_net', 'ensemble'
    status              VARCHAR(16) NOT NULL DEFAULT 'shadow'
        CHECK (status IN ('shadow','production','retired')),
    auc_roc             NUMERIC(6,5),
    false_positive_rate NUMERIC(6,5),
    false_negative_rate NUMERIC(6,5),
    shadow_started_at   TIMESTAMPTZ,
    promoted_at         TIMESTAMPTZ,
    retired_at          TIMESTAMPTZ,
    model_artifact_uri  TEXT NOT NULL,   -- S3 URI to serialized model
    created_by          VARCHAR(128),
    metadata            JSONB DEFAULT '{}'
);
Transaction data flow diagram showing feature extraction, scoring stages, decision output, and feedback loop updating the model and feature store

Key Algorithms and Protocols

Velocity Counter with Sliding Window

Sliding window velocity counters track transaction counts and amounts in moving time windows without the step-function artifacts of fixed tumbling windows. A sorted set in Redis indexed by timestamp provides exact sliding window computation in O(log N) time.

import time
import redis

class SlidingWindowVelocityCounter:
    """
    Exact sliding window counter using Redis sorted sets.
    Members are "{txn_id}:{amount}" with score = Unix timestamp in ms.
    """

    WINDOW_SIZES_MS = {
        "1m":  60_000,
        "5m":  300_000,
        "1h":  3_600_000,
        "24h": 86_400_000,
    }

    def __init__(self, redis_client: redis.Redis, key_prefix: str):
        self.r = redis_client
        self.prefix = key_prefix

    def increment(self, entity_id: str, txn_id: str, amount: float):
        """Record a new transaction event."""
        now_ms = int(time.time() * 1000)
        key = f"{self.prefix}:{entity_id}"
        member = f"{txn_id}:{amount:.4f}"

        pipe = self.r.pipeline(transaction=False)
        pipe.zadd(key, {member: now_ms})
        # Prune events older than 25 hours (slightly beyond max window)
        cutoff = now_ms - 90_000_000
        pipe.zremrangebyscore(key, "-inf", cutoff)
        pipe.expire(key, 90000)  # 25h TTL
        pipe.execute()

    def get_counts(self, entity_id: str) -> dict:
        """Return count and total amount for each window."""
        now_ms = int(time.time() * 1000)
        key = f"{self.prefix}:{entity_id}"
        pipe = self.r.pipeline(transaction=False)

        for window_ms in self.WINDOW_SIZES_MS.values():
            cutoff = now_ms - window_ms
            pipe.zrangebyscore(key, cutoff, now_ms)

        results = pipe.execute()
        output = {}
        for (label, _), members in zip(self.WINDOW_SIZES_MS.items(), results):
            count = len(members)
            total = sum(
                float(m.split(":")[1]) for m in members if ":" in m
            ) if members else 0.0
            output[f"count_{label}"] = count
            output[f"amount_{label}"] = total
        return output
Key Insight

Redis sorted sets provide exact sliding windows without approximation. The alternative - HyperLogLog or count-min sketch - uses less memory but cannot give exact counts or sum amounts, which matters for fraud decisions. At 10,000 TPS with 1 year retention per user key, the sorted set approach uses about 120 bytes per transaction entry and auto-expires entries beyond the maximum window. For a typical user with 5 transactions per day, that is under 600 bytes of Redis memory per user.

False Positive Rate Optimization

False positive optimization requires treating the score threshold as a tunable parameter calibrated against business cost, not as a fixed value. The cost of a false positive (blocking a legitimate user) is not symmetric with the cost of a false negative (approving fraud). Calibrate thresholds using cost-sensitive optimization:

import numpy as np
from sklearn.metrics import confusion_matrix

def optimize_threshold(
    scores: np.ndarray,
    labels: np.ndarray,
    fp_cost: float = 15.0,    # Cost in USD of wrongly blocking a user (lost sale + support cost)
    fn_cost: float = 180.0,   # Cost in USD of approving fraud (chargeback + fees)
    target_fpr_max: float = 0.003  # Hard cap: no more than 0.3% false positive rate
) -> float:
    """
    Find the threshold that minimizes total cost while respecting FPR cap.
    """
    thresholds = np.linspace(0.01, 0.99, 990)
    best_threshold = 0.5
    best_cost = float("inf")

    total_negatives = np.sum(labels == 0)
    total_positives = np.sum(labels == 1)

    for threshold in thresholds:
        predictions = (scores >= threshold).astype(int)
        tn, fp, fn, tp = confusion_matrix(labels, predictions).ravel()

        fpr = fp / total_negatives if total_negatives > 0 else 0
        if fpr > target_fpr_max:
            continue  # Hard constraint: reject thresholds with excessive FPR

        total_cost = (fp * fp_cost) + (fn * fn_cost)
        if total_cost < best_cost:
            best_cost = total_cost
            best_threshold = threshold

    return best_threshold
Key Insight

The optimal threshold shifts with transaction size. A $10 transaction has a lower fraud cost and a higher relative cost of false positives than a $5,000 transaction. Production systems use tiered thresholds: below $100 requires a score above 0.90 to decline, above $1,000 requires only 0.65. The specific numbers come from the feedback loop data where you can measure actual chargeback amounts versus actual abandoned sales revenue.

Feedback Loop Design

The feedback loop closes the system. Chargeback data arrives 30-90 days after a transaction. Analyst review outcomes arrive within hours. Both feed into a continuous training pipeline that retrains the gradient boosting model weekly and the neural network monthly.

from kafka import KafkaConsumer, KafkaProducer
import json

class FeedbackLoopConsumer:
    """
    Consumes fraud outcome events and enriches them with
    the original feature vector for model retraining.
    """

    def __init__(self, kafka_bootstrap, scoring_db, training_store):
        self.consumer = KafkaConsumer(
            "fraud-outcomes",
            bootstrap_servers=kafka_bootstrap,
            group_id="feedback-loop-enricher",
            value_deserializer=lambda m: json.loads(m.decode()),
            auto_offset_reset="earliest",
            enable_auto_commit=False,
        )
        self.db = scoring_db
        self.store = training_store

    def process(self):
        for message in self.consumer:
            outcome = message.value
            txn_id = outcome["txn_id"]
            label = outcome["training_label"]  # 1=fraud, 0=legitimate

            # Fetch original feature vector from audit log
            decision_record = self.db.get_decision(txn_id)
            if not decision_record:
                continue

            training_example = {
                "txn_id": txn_id,
                "features": decision_record["feature_vector"],
                "label": label,
                "original_score": decision_record["final_score"],
                "original_decision": decision_record["decision"],
                "outcome_type": outcome["outcome_type"],
                "outcome_delay_hours": outcome["outcome_delay_hours"],
            }

            # Write to training store partitioned by outcome date
            self.store.append(training_example)

            # Update online feature store with confirmed fraud signal
            if label == 1:
                self._update_fraud_signals(decision_record)

            self.consumer.commit()

    def _update_fraud_signals(self, decision_record: dict):
        """
        Immediately update feature store when fraud is confirmed,
        so future transactions from the same device/IP get elevated risk.
        """
        user_id = decision_record["user_id"]
        device_fp = decision_record["device_fingerprint"]

        # Decay trust scores for involved entities
        self.store.decrement_trust(f"device:{device_fp}", decay=0.4)
        self.store.decrement_trust(f"user:{user_id}", decay=0.2)

Scaling and Performance

Horizontal scaling diagram showing multiple API nodes, sharded Redis velocity counters by user_id, ML inference pool, and model registry

Capacity Estimation

Given:
  - 10,000 TPS sustained, 50,000 TPS peak
  - 100ms P99 latency target
  - Feature store: 100M active users, 20 features each at 200 bytes = 4TB Redis
  - Velocity counters: 100M users, avg 5 txns/day active, 120 bytes/entry
    100M * 5 * 120 = 60GB for 24h window (manageable with sharding)

Scoring service sizing:
  - Each scoring request: ~5ms CPU (rule eval + GB inference)
  - At 10,000 TPS: 50,000 ms CPU/s = 50 CPU cores just for rule+GB
  - NN inference: GPU-based, 1 A100 handles ~800 NN inferences/s at 40ms
  - At 10,000 TPS with 60% reaching NN: 6,000 NN calls/s = 8 A100 GPUs
  - API + rule engine: 16 nodes * 8 cores = 128 cores (comfortable headroom)

Redis velocity cluster:
  - 60GB data + 2x replication = 180GB across cluster
  - At 10,000 TPS, each txn does 4 ZCOUNT + 4 ZRANGEBYSCORE = 8 ops
  - 80,000 Redis ops/s: 4 Redis nodes with 20,000 ops/s each (trivial)
  - Shard by user_id % num_shards for locality

Feature store Redis cluster:
  - 4TB total, 2x replication = 8TB
  - 128GB nodes: 64 Redis nodes
  - 10,000 HGET/s: well within per-node capacity

Decision audit log (PostgreSQL):
  - 10,000 writes/s to fraud_scoring_decisions
  - 1KB per row average = 10MB/s = 864GB/day
  - Partition by month, archive to S3 after 90 days
  - 3 write replicas + 2 read replicas for query path

Kafka feedback pipeline:
  - 10,000 txn events/s + 200 outcome events/s (2% feedback rate)
  - 3 brokers, 24 partitions, RF=3: easily handles this volume

Horizontal scaling for the API nodes is straightforward - the scoring service is stateless. The challenging scaling dimension is the Redis feature store. The velocity counters are sharded by user_id % shard_count, ensuring all velocity operations for a user hit the same shard. The feature store hash keys are sharded by user ID hash for even distribution. When adding shards, consistent hashing minimizes key migration during rebalancing.

The ML inference layer scales differently for the two models. The gradient boosting model is CPU-bound and scales horizontally by adding inference nodes. The neural network is GPU-accelerated and scales by adding GPU instances, which are expensive and require more careful capacity planning. At 60,000 NN calls per second during peak, 80 A100 GPUs are needed - batching multiple requests into a single inference call is essential to amortize GPU utilization.

Real World

PayPal’s fraud detection system processes over 15 million transactions daily across 200+ countries. Their published architecture uses a tiered approach nearly identical to what is described here: fast rule evaluation in under 5ms, followed by gradient boosting in 15ms, with deep learning reserved for high-value or ambiguous transactions. PayPal reports that their ensemble approach reduced false positives by 37% compared to their previous rule-only system, while simultaneously improving fraud detection rate by 12%.

Failure Modes and Recovery

FailureDetectionImpactRecovery
Redis velocity counter node failureRedis Sentinel heartbeat, 5s detectionVelocity features unavailable for users on that shard; counters default to 0Automatic failover to replica in ~30s; missing counters mean slightly lower fraud signal for affected users; acceptable risk versus blocking payments
ML inference node crashHealth check + circuit breaker, 2s detectionNN scoring unavailable; ensemble falls back to GB + rules onlyLoad balancer removes unhealthy node; remaining nodes absorb traffic; auto-scaling adds replacement node within 3 minutes
Feature store staleness (batch job lag)Monitoring on feature freshness timestamp; alert if age > 30 minPre-computed features reflect state from 30+ minutes ago; velocity counters still fresh (real-time)Alert ops; rule engine continues using real-time velocity; ML features degrade gracefully with older inputs
Payment processor timeout (>100ms)Latency P99 monitoring; alert if >95ms P95Processor declines transaction as timeout - worse than a fraud DECLINE because user gets no useful errorCircuit breaker returns APPROVE after 95ms deadline regardless of score; log for audit; tunable via feature flag
Kafka feedback pipeline lagConsumer group lag metric; alert if >100k messages behindFraud outcomes not flowing to training store; model training data becomes staleIncrease consumer parallelism; if persistent, replay from stored offset; model training is batch anyway so 1-2 day lag is acceptable
Database (audit log) write failureWrite error rate monitoringScoring decision not persisted for audit; regulatory compliance riskDual-write: write to Kafka first, PostgreSQL asynchronously; Kafka serves as durable buffer for recovery replay
Watch Out

The most dangerous failure mode is a latency spike that causes the scoring service to return APPROVE as a timeout fallback for thousands of transactions that should have been DECLINE or REVIEW. Fraudsters who discover the timeout behavior can deliberately trigger slow scoring by submitting transactions designed to max out all model features simultaneously. Monitor for correlated latency spikes with elevated approval rates - this pattern indicates either an attack or a cascading performance degradation that is masking as availability.

Comparison of Approaches

ApproachLatencyAccuracyOperational CostBest For
Rule-only engine5ms P99Limited: misses novel patterns, requires manual maintenanceLow: no ML infraLow-volume or tightly-scoped fraud domains with stable patterns
Single ML model (no rules)50-100ms P99Good for trained patterns, weak on novel velocity attacksMedium: model training pipeline requiredBusinesses with clean labeled data and stable fraud distribution
Rule engine + gradient boosting25ms P99High: rules catch obvious fraud fast, GB handles subtletyMedium-high: two systems to maintainMost payment fraud scenarios at scale
Full ensemble (rules + GB + NN)80-100ms P99Highest: NN captures temporal behavioral shiftsHigh: GPU infra, shadow mode, feedback loopHigh-value payment flows where fraud cost justifies model complexity
Graph-based fraud network detection200-500ms P99Very high for organized fraud ringsVery high: graph DB, GNN trainingDetecting coordinated fraud across account networks
External fraud bureau API50-200ms P99 (external)Variable: depends on bureau data freshnessLow in-house, high vendor costStartups or low-volume without training data to build in-house models

The correct choice for a payment platform at 10,000 TPS is the full ensemble for high-value transactions (above $100) and rule engine plus gradient boosting for low-value transactions. The NN adds meaningful accuracy improvement but costs 4-6x more compute than gradient boosting alone. Routing transactions to the appropriate scoring tier by amount captures most of the accuracy benefit at a fraction of the cost: 80% of transactions are low-value and decided by the cheaper tier, while the expensive NN is reserved for the 20% of high-value transactions where the fraud cost justifies the compute spend.

Key Takeaways

  • Online feature freshness is the foundational constraint - a model trained on stale features makes decisions on stale reality; velocity counters must be updated within seconds, not minutes.
  • Early exit architecture multiplies throughput - when 50% of transactions exit at the rule engine in 5ms, the ML ensemble only needs to handle 50% of peak volume, halving the required inference capacity.
  • False positive cost is asymmetric by transaction value - a $10 declined transaction costs more in customer damage relative to fraud risk than a $5,000 declined transaction; thresholds must be tiered by amount.
  • Shadow mode is not optional for model updates - promoting a model directly to production without shadow validation risks a sudden spike in false positives or false negatives that damages both user trust and fraud loss simultaneously.
  • The feedback loop is the moat - models trained on your own labeled fraud outcomes are far superior to generic models because fraud patterns are merchant-category and region-specific; the longer you operate the feedback loop, the harder the system is to replicate.
  • Velocity counters in Redis sorted sets provide exact sliding windows - approximate data structures save memory but lose the precision needed for fraud decisions where 12 transactions vs. 13 transactions changes the verdict.
  • Failsafe to APPROVE, not DECLINE - when the scoring system itself fails, blocking all payments is worse than approving potentially fraudulent ones; design for graceful degradation to a permissive baseline while alerting immediately.
  • Rule thresholds drift with adversarial pressure - fraudsters learn your thresholds and adjust to stay below them; automated threshold review and model retraining is not maintenance, it is survival.

The counter-intuitive lesson in fraud detection is that the hardest problem is not the ML model - it is the feedback loop latency. Chargebacks arrive 30-90 days after the transaction. By then, the fraud pattern has evolved and the training signal is partially stale. The systems that win are the ones that supplement chargeback data with faster signals: analyst review outcomes (hours), customer fraud reports (days), and cross-merchant network signals from payment network partners. Invest in feedback loop latency reduction before model complexity.

Frequently Asked Questions

Q: Why not just use a single neural network instead of a rule engine plus gradient boosting plus neural net ensemble?

A: A single neural network has three problems for this use case. First, inference latency is 40-80ms even on GPU - with no fast-path exit, every transaction pays this cost. Second, neural networks are black boxes that make regulatory compliance difficult; payment regulations in most jurisdictions require explainable fraud decisions that can be audited and challenged. Third, neural networks need large amounts of labeled training data to generalize; gradient boosting works well with thousands of examples while neural networks need millions. The ensemble gets the best of all three: rules provide explainability and fast exit, gradient boosting handles structured tabular features with limited data, and the neural network captures temporal behavioral patterns.

Q: Why not use a graph database to detect fraud rings instead of behavioral features per user?

A: Graph-based fraud ring detection is extremely valuable but too slow for synchronous transaction scoring. Building a transaction graph, running community detection algorithms, and scoring graph centrality takes 200-500ms at minimum - well beyond the 100ms budget. The practical approach is to run graph analysis asynchronously as a batch job every 15-30 minutes, materializing “fraud ring membership score” as a pre-computed feature that the online store serves to the synchronous scoring path. You get the graph intelligence without paying the graph traversal latency on the critical path.

Q: How do you handle a user who legitimately travels internationally and triggers the impossible travel rule?

A: Two mechanisms. First, the impossible travel rule signals a FLAG at 0.91 score but is a DECLINE only if the time gap is under 2 hours - international travel takes longer, so the rule has a built-in grace period. Second, the ML model sees the user’s historical travel patterns as a feature: if the user frequently transacts in multiple countries, the model assigns lower fraud probability to cross-country transactions for that user. The third mechanism is friction: a REVIEW decision can trigger a 3DS challenge or a push notification to the user’s phone asking them to confirm the transaction, converting a potential false positive into an authenticated transaction.

Q: Why use Redis sorted sets for velocity counters instead of a streaming system like Flink with windowed aggregations?

A: Flink windowed aggregations are excellent for analytics but have two problems for synchronous fraud scoring. First, Flink windows produce results at window boundaries - a 1-minute tumbling window tells you the count in the last complete minute, not in the last 60 seconds from now. The fraud detection needs an exact sliding window. Second, Flink result availability has latency - the aggregation result is published after the window closes, which adds seconds of staleness. Redis sorted sets provide exact sliding windows with millisecond-fresh results because you compute the window at query time, not at window close time.

Q: How do you prevent the feedback loop from reinforcing biases - for example, if the model over-declines transactions from certain demographics?

A: This is a real risk called feedback loop bias. The mechanism: model declines user group X more often, meaning X generates fewer positive transaction examples, meaning the next training run sees X as lower risk due to survivorship bias - or higher risk because the declined ones were never tested. Mitigation requires three things. First, maintain a holdout set: a small percentage of high-score transactions are randomly approved regardless of model decision and their outcomes tracked. Second, audit training data for demographic distribution parity before each training run. Third, track approval rates segmented by user cohort and alert when any cohort’s approval rate diverges more than 2 standard deviations from its 90-day baseline.

Q: What happens to scoring during model retraining or deployment?

A: Retraining and deployment are fully decoupled from scoring. The model is trained offline, serialized to S3, and registered in the model registry table with status “shadow”. The scoring service polls the model registry every 60 seconds and loads new shadow models into a separate inference slot. Shadow scoring runs asynchronously. When shadow validation passes and a human approves the promotion, the registry status changes to “production” and the scoring service picks it up on its next poll cycle - typically within 60 seconds - with zero downtime and no restart required.

Interview Questions

Q: Walk me through how a transaction gets scored end-to-end in under 100ms.

Expected depth: Trace the path from API Gateway receiving the payload, to parallel feature lookups (online feature store HGET + Redis velocity ZCOUNT, both within 15ms), to rule evaluation (sequential rule checks in priority order, early exit if any rule fires above block threshold), to ML ensemble (gradient boosting inference in 15ms, neural network in 40ms run in parallel, combine scores with weights), to decision engine (threshold comparison, audit log write, response return). Name the time budget for each stage: 15ms features, 5ms rules, 60ms ML, 10ms decision + write = 90ms with 10ms buffer. Discuss the failsafe: return APPROVE after 95ms regardless of scoring state.

Q: How would you design the velocity counter to handle a Redis node failure without losing accuracy?

Expected depth: Explain that Redis Sentinel provides automatic failover to a replica in 30 seconds. During the 30-second gap, velocity counters for affected shards return 0 (missing key defaults), which means those users appear to have zero transaction history and the fraud score drops. This is the correct trade-off - brief false negatives during a Redis failure are less damaging than blocking all payments. After failover, the replica has up-to-date data (Redis replication is asynchronous with sub-second lag). For the 30-second recovery window, discuss tracking which users had missing counters and applying a temporary elevated-risk flag until their counters are verified fresh.

Q: The model team wants to update the gradient boosting model every week. How do you do this with zero downtime?

Expected depth: Explain the model registry + shadow mode pipeline: train offline, register as shadow, run shadow on 100% of traffic for 24-48 hours, compare shadow decisions against production decisions and against known fraud outcomes, if metrics meet promotion criteria then update registry status to production, scoring service polls registry and hot-swaps model in memory within 60 seconds. Discuss the fallback: model versions are immutable artifacts in S3; if a promoted model shows degradation (false positive rate spiking, latency regression), rollback is changing the registry status back to the previous version - same 60-second hot-swap mechanism. No deployment, no restart.

Q: Your fraud false negative rate is acceptable but false positives are too high - 1.2% of legitimate transactions are being blocked. What do you do?

Expected depth: Start with diagnosis: segment the 1.2% false positives by rule versus ML decision, by user segment, by transaction type. If most false positives come from a single rule, calibrate that rule’s threshold upward. If they come from the ML score, lower the DECLINE threshold (raise the bar for declines) at the cost of accepting more fraud - quantify that trade-off using the cost-optimization function. Introduce a REVIEW tier: transactions between score 0.55 and 0.75 go to REVIEW with a friction step (SMS OTP or 3DS challenge) instead of hard DECLINE, converting false positives into authenticated approvals. Track friction acceptance rate as a proxy for false positive rate. Finally, improve the training data: add more confirmed-legitimate examples to the training set to reduce the model’s tendency to over-classify legitimate patterns as fraud.

Q: How would you extend the system to detect fraud rings - coordinated attacks using many accounts?

Expected depth: Fraud rings require graph analysis that cannot run synchronously. The architecture extension: build a transaction graph with nodes as users, cards, devices, and IP addresses, edges as shared attributes (same device used by multiple accounts, same IP, same billing address). Run community detection (Louvain algorithm or label propagation) on this graph daily or every 4 hours as a batch job. Output a “ring membership score” per user and a “connected account cluster ID.” Materialize these as features in the online store, so the synchronous scoring path can use them without graph traversal cost. Additionally, add a real-time streaming signal: when any account in a cluster is confirmed as fraud, immediately elevate the risk score for all accounts in the same cluster via a Kafka event that updates their online store features.

Premium Content

Unlock the full article along with everything else in the archive — all in one place.

In-depth analysis Expert insights Full archive access
Unlock Full Article