Build a Real-Time Fraud Detection Engine
performance scalability distributed-systems
System Design Deep Dive
Real-Time Fraud Detection Engine
Evaluating every payment transaction in under 100ms using behavioral, velocity, and network signals with continuously updated risk models
A user taps “Pay” at a checkout terminal in Mumbai. Within the next 80 milliseconds, your system must decide whether this transaction is genuine or fraudulent - before the payment processor times out and before the customer gets annoyed waiting for the spinner. The decision draws on hundreds of signals: is this the usual device the user shops from? Have they made 12 transactions in the last 5 minutes from three different countries? Does the merchant category match previous spending patterns? Has this card number appeared on a known-compromised list in the last hour?
Fraud is an adversarial cat-and-mouse problem. Fraudsters actively probe your defenses, testing small transactions before large ones, cycling through stolen cards, and mimicking legitimate behavior patterns to avoid velocity triggers. Your detection system must evolve continuously - a static rule set becomes obsolete within days as attackers learn its boundaries. At the same time, every false positive blocks a legitimate customer, damages trust, and costs revenue. A 0.5% false positive rate on a system processing 10,000 transactions per second blocks 50 legitimate payments every second.
The core engineering tension is latency versus accuracy. More sophisticated models - deep neural nets, graph-based fraud network detection, complex ensemble methods - take longer to run. Simple threshold rules are fast but blind to subtle pattern combinations. Waiting for more signals improves accuracy but increases latency. Running every model in parallel wastes compute on transactions that a fast rule would have blocked immediately. The system must be a pipeline that can exit early on high-confidence decisions while escalating uncertain cases through progressively more expensive analysis.
We need to solve for three things simultaneously: (1) sub-100ms scoring that consumes pre-computed features and makes a decision before the payment processor times out, (2) continuous model evolution that updates risk models with new fraud patterns without requiring a full redeployment, and (3) false positive minimization that avoids blocking legitimate users even as fraud patterns shift.
Requirements and Constraints
Functional Requirements
- Score every incoming payment transaction synchronously before approval, returning APPROVE, DECLINE, or REVIEW within 100ms
- Compute velocity features: transaction count and volume per user in the last 1 minute, 5 minutes, 1 hour, and 24 hours
- Evaluate rule-based checks: velocity thresholds, device fingerprint consistency, IP geolocation anomalies, card BIN risk scoring
- Run ML ensemble scoring combining gradient boosting and a neural network for behavioral pattern detection
- Support a shadow mode where new models run alongside production without affecting decisions, for validation before promotion
- Feed approved and declined transaction outcomes back into the feature store for continuous model updates
- Allow manual case review workflow for REVIEW decisions, with analyst feedback flowing back into training data
Non-Functional Requirements
- Latency: P99 scoring decision under 100ms end-to-end, P50 under 30ms
- Throughput: 10,000 transactions per second sustained, 50,000 TPS burst (flash sale periods)
- Availability: 99.99% uptime for the scoring path (failsafe to APPROVE on system failure to avoid blocking payments)
- False positive rate: under 0.3% of legitimate transactions incorrectly declined
- False negative rate: under 0.8% of fraudulent transactions incorrectly approved (tunable via threshold)
- Feature freshness: velocity counters updated within 500ms of transaction completion
- Model update latency: new model versions promoted to production within 15 minutes of approval
- Audit log retention: all scoring decisions with full feature vectors stored for 7 years (regulatory requirement)
Constraints and Assumptions
- Payment processor has a 120ms timeout - we must respond in under 100ms to leave 20ms buffer
- We do not store raw card numbers - only tokenized card IDs and card BIN (first 6 digits)
- PCI-DSS compliance requires all cardholder data to flow through isolated network segments
- We assume a user identity graph is available as a pre-built service (not built here)
- Geographic distribution: scoring nodes deployed in 4 regions; features replicated with eventual consistency
High-Level Architecture
The fraud engine has six major components working together to deliver a sub-100ms scoring decision for every payment.
The API Gateway receives the raw transaction payload from the payment processor and immediately strips sensitive fields before forwarding downstream. It enforces the 100ms SLA by setting a deadline on the entire scoring pipeline context.
The Feature Extractor is the synchronous hot path. It executes feature lookups in parallel: querying the online feature store for pre-computed behavioral features, hitting Redis for velocity counters, and calling the device fingerprint service. All lookups run concurrently with a combined timeout of 20ms. Missing features default to safe baseline values rather than failing the transaction.
The Feature Store has two layers: an offline store (columnar database updated by batch jobs) and an online store (Redis cluster) that serves low-latency feature lookups. The online store holds pre-computed user risk scores, historical velocity statistics, and device trust signals updated within seconds of new transactions.
The Rule Engine applies deterministic, interpretable checks that can exit early. Rules catch obvious fraud patterns - impossible travel, velocity spikes, known-bad indicators - in under 5ms. A rule firing with sufficient confidence returns an immediate DECLINE without invoking the ML models.
The ML Ensemble runs gradient boosting and a neural network in parallel, combining their scores with learned weights. This layer handles subtle pattern combinations that rules cannot express. It accounts for 60-80ms of the total budget when invoked, which is why the Rule Engine’s early exit path is critical.
The Decision Engine receives the rule verdict and ML scores, applies the final threshold logic, writes the decision to the audit log, and returns the response to the payment processor. It also enqueues the transaction and decision to the feedback loop pipeline.
The architecture separates fast deterministic rules from slow probabilistic models into sequential stages with early exit. This means 40-60% of transactions never reach the ML ensemble - they are decided by rules in under 10ms. The ML compute budget is spent only on the ambiguous middle ground where rules produce no confident answer.
The Feature Store
The Feature Store’s job is to serve pre-computed features to the scoring pipeline in under 15ms, handling 10,000 reads per second per transaction while staying fresh enough that a fraud pattern detected seconds ago already affects scoring.
The feature store has two distinct halves with very different access patterns. The offline store is a columnar database (Apache Parquet on S3 backed by Hive/Spark) that holds historical aggregations: 30-day spending patterns, merchant category distributions, device usage history. Batch jobs run every 15 minutes to recompute these features and push updated values to the online store. The offline store is queried by training pipelines, not by the scoring path.
The online store is a Redis cluster holding the subset of features that change frequently and need sub-millisecond reads. Each user has a hash key containing their pre-computed feature vector: 30-day average transaction amount, preferred merchant categories, usual transaction hours, device trust score, and current velocity counters. Velocity counters are stored separately as sorted sets for sliding window computation.
import redis
import json
import time
from typing import Optional
# Feature store client for the scoring hot path
class OnlineFeatureStore:
def __init__(self, redis_cluster):
self.redis = redis_cluster
self.default_features = {
"avg_txn_amount_30d": 500.0,
"txn_count_30d": 50,
"device_trust_score": 0.5,
"preferred_hour_match": 0.5,
"merchant_category_match": 0.5,
}
def get_user_features(self, user_id: str, timeout_ms: int = 15) -> dict:
"""Fetch pre-computed user feature vector from online store."""
key = f"features:user:{user_id}"
raw = self.redis.get(key)
if raw is None:
# Cold start: new user with no history
return dict(self.default_features)
return json.loads(raw)
def get_velocity_features(self, user_id: str, card_token: str) -> dict:
"""Compute sliding window velocity counters in real time."""
now_ms = int(time.time() * 1000)
pipe = self.redis.pipeline(transaction=False)
# Sorted sets keyed by user and card with score=timestamp
user_key = f"velocity:user:{user_id}"
card_key = f"velocity:card:{card_token}"
# Count transactions in sliding windows (1m, 5m, 1h, 24h)
windows = [60_000, 300_000, 3_600_000, 86_400_000]
for window_ms in windows:
cutoff = now_ms - window_ms
pipe.zcount(user_key, cutoff, now_ms)
pipe.zrangebyscore(user_key, cutoff, now_ms, withscores=False)
results = pipe.execute()
# Parse counts and amounts from sorted set members
user_counts = {}
labels = ["1m", "5m", "1h", "24h"]
for i, label in enumerate(labels):
count = results[i * 2]
members = results[i * 2 + 1]
total_amount = sum(
float(m.split(":")[1]) for m in members if ":" in m
)
user_counts[f"txn_count_{label}"] = count
user_counts[f"txn_amount_{label}"] = total_amount
return user_counts
def record_transaction(self, user_id: str, card_token: str,
amount: float, txn_id: str):
"""Write velocity counter entry after transaction completes."""
now_ms = int(time.time() * 1000)
pipe = self.redis.pipeline(transaction=False)
member = f"{txn_id}:{amount:.2f}"
user_key = f"velocity:user:{user_id}"
card_key = f"velocity:card:{card_token}"
pipe.zadd(user_key, {member: now_ms})
pipe.zadd(card_key, {member: now_ms})
# Prune entries older than 25 hours to control memory
cutoff = now_ms - 90_000_000
pipe.zremrangebyscore(user_key, 0, cutoff)
pipe.zremrangebyscore(card_key, 0, cutoff)
# TTL: expire keys after 48h of inactivity
pipe.expire(user_key, 172800)
pipe.expire(card_key, 172800)
pipe.execute()
Velocity counters use Redis sorted sets with the Unix timestamp in milliseconds as the score and the transaction ID plus amount as the member. A single ZCOUNT call gives the transaction count in any time window without scanning all members, and ZRANGEBYSCORE gives the full list to sum amounts. This is O(log N) per query instead of O(N) for a full scan.
The Rule Engine
The Rule Engine’s job is to apply fast, deterministic checks that catch obvious fraud patterns in under 5ms, returning a confident DECLINE for the clearest cases before wasting ML compute cycles.
Rules are evaluated in priority order. Each rule produces a signal strength from 0.0 to 1.0 and an action: DECLINE (confidence above block threshold), FLAG (adds to the ML feature vector), or PASS. Rules are defined in a versioned configuration file loaded at startup and hot-reloaded without restarts.
from dataclasses import dataclass
from typing import Optional, List
import math
@dataclass
class Transaction:
txn_id: str
user_id: str
card_token: str
amount: float
merchant_id: str
merchant_category: str
ip_address: str
device_fingerprint: str
billing_country: str
ip_country: str
timestamp_ms: int
@dataclass
class RuleResult:
rule_name: str
signal: float # 0.0 to 1.0
action: str # "DECLINE", "FLAG", "PASS"
reason: str
class RuleEngine:
BLOCK_THRESHOLD = 0.85
def evaluate(self, txn: Transaction, features: dict) -> List[RuleResult]:
results = []
results.append(self._velocity_rule(txn, features))
results.append(self._geo_mismatch_rule(txn, features))
results.append(self._device_consistency_rule(txn, features))
results.append(self._amount_spike_rule(txn, features))
results.append(self._card_bin_risk_rule(txn, features))
results.append(self._impossible_travel_rule(txn, features))
return results
def _velocity_rule(self, txn: Transaction, features: dict) -> RuleResult:
count_1m = features.get("txn_count_1m", 0)
count_5m = features.get("txn_count_5m", 0)
amount_1h = features.get("txn_amount_1h", 0.0)
# Hard thresholds tuned from historical fraud data
if count_1m >= 5:
return RuleResult("velocity_1m", 0.95, "DECLINE",
f"5+ transactions in 1 minute: count={count_1m}")
if count_5m >= 12:
return RuleResult("velocity_5m", 0.92, "DECLINE",
f"12+ transactions in 5 minutes: count={count_5m}")
if amount_1h > 50000:
return RuleResult("velocity_amount", 0.88, "DECLINE",
f"Amount exceeds hourly limit: {amount_1h:.0f}")
# Softer signal for elevated velocity
signal = min(0.6, count_5m / 20.0)
return RuleResult("velocity_soft", signal, "FLAG",
f"Elevated velocity: {count_5m} txns/5m")
def _geo_mismatch_rule(self, txn: Transaction, features: dict) -> RuleResult:
if txn.billing_country and txn.ip_country:
if txn.billing_country != txn.ip_country:
# High-risk country pair combinations have different weights
high_risk_pairs = {("US", "NG"), ("GB", "RU"), ("AU", "CN")}
pair = (txn.billing_country, txn.ip_country)
if pair in high_risk_pairs:
return RuleResult("geo_mismatch_high_risk", 0.82, "DECLINE",
f"High-risk country pair: {pair}")
return RuleResult("geo_mismatch", 0.55, "FLAG",
f"Billing/IP country mismatch: {txn.billing_country} vs {txn.ip_country}")
return RuleResult("geo_mismatch", 0.0, "PASS", "geo consistent")
def _device_consistency_rule(self, txn: Transaction, features: dict) -> RuleResult:
known_device = features.get("last_known_device_fingerprint")
device_trust = features.get("device_trust_score", 1.0)
if known_device and txn.device_fingerprint != known_device:
if device_trust < 0.2:
# New device AND low trust score for account
return RuleResult("device_mismatch_low_trust", 0.78, "FLAG",
"New device on low-trust account")
return RuleResult("device_consistent", 0.0, "PASS", "device known")
def _amount_spike_rule(self, txn: Transaction, features: dict) -> RuleResult:
avg_amount = features.get("avg_txn_amount_30d", 0.0)
if avg_amount > 0 and txn.amount > avg_amount * 8:
signal = min(0.9, 0.5 + math.log10(txn.amount / avg_amount) / 4)
return RuleResult("amount_spike", signal, "FLAG",
f"Amount {txn.amount:.0f} vs avg {avg_amount:.0f}")
return RuleResult("amount_normal", 0.0, "PASS", "amount normal")
def _card_bin_risk_rule(self, txn: Transaction, features: dict) -> RuleResult:
bin_risk = features.get("card_bin_risk_score", 0.0)
if bin_risk > 0.9:
return RuleResult("bin_high_risk", 0.88, "DECLINE",
f"Card BIN on high-risk list: score={bin_risk}")
return RuleResult("bin_ok", bin_risk * 0.5, "FLAG" if bin_risk > 0.5 else "PASS",
f"BIN risk: {bin_risk:.2f}")
def _impossible_travel_rule(self, txn: Transaction, features: dict) -> RuleResult:
last_txn_country = features.get("last_txn_country")
last_txn_ts_ms = features.get("last_txn_timestamp_ms", 0)
if last_txn_country and last_txn_country != txn.ip_country:
hours_elapsed = (txn.timestamp_ms - last_txn_ts_ms) / 3_600_000
if hours_elapsed < 2.0:
return RuleResult("impossible_travel", 0.91, "DECLINE",
f"Transacted in {last_txn_country} {hours_elapsed:.1f}h ago")
return RuleResult("travel_ok", 0.0, "PASS", "travel plausible")
Rule thresholds calibrated on historical data will drift as fraud patterns evolve. A velocity threshold of “5 transactions per minute” catches card testing attacks today, but fraudsters adjust to 4 transactions per minute once they discover the threshold. Build an automated threshold review pipeline that re-evaluates rule effectiveness every 30 days using the feedback loop data, and alert when a rule’s true positive rate drops below baseline.
The ML Ensemble
The ML Ensemble’s job is to evaluate the behavioral patterns that rules cannot express - subtle combinations of signals that individually look normal but together indicate fraud - returning a calibrated probability score in under 80ms.
The ensemble combines three models with learned combination weights. The first is a gradient boosting model (XGBoost or LightGBM) trained on 90 days of labeled transactions. It handles tabular features well - amounts, velocity counts, merchant categories, device age - and produces well-calibrated probabilities. Inference takes 10-15ms. The second is a neural network that processes sequential transaction history as a time-series, capturing behavioral drift that gradient boosting misses. It takes 30-50ms to run. The third is a rule signal aggregator that converts the rule engine’s FLAG outputs into a single weighted score, which serves as a regularization term for the other two models.
package ensemble
import (
"context"
"sync"
"time"
)
type ScoringRequest struct {
TxnID string
Features map[string]float64
RuleSignals []RuleSignal
}
type ModelScore struct {
ModelName string
Score float64
LatencyMs int64
Err error
}
type EnsembleResult struct {
FinalScore float64
GBScore float64
NNScore float64
RuleScore float64
ConfidenceLevel string // "HIGH", "MEDIUM", "LOW"
TotalLatencyMs int64
}
type MLEnsemble struct {
gbModel GradientBoostModel
nnModel NeuralNetModel
weights EnsembleWeights
nnTimeout time.Duration
}
type EnsembleWeights struct {
GBWeight float64 // 0.45
NNWeight float64 // 0.40
RuleWeight float64 // 0.15
}
func (e *MLEnsemble) Score(ctx context.Context, req ScoringRequest) EnsembleResult {
start := time.Now()
// Aggregate rule signals into a single score
ruleScore := e.aggregateRuleSignals(req.RuleSignals)
// High-confidence rule block - skip ML entirely
if ruleScore > 0.85 {
return EnsembleResult{
FinalScore: ruleScore,
RuleScore: ruleScore,
ConfidenceLevel: "HIGH",
TotalLatencyMs: time.Since(start).Milliseconds(),
}
}
// Run GB and NN in parallel with independent timeouts
var wg sync.WaitGroup
gbCh := make(chan ModelScore, 1)
nnCh := make(chan ModelScore, 1)
wg.Add(2)
go func() {
defer wg.Done()
gbStart := time.Now()
score, err := e.gbModel.Predict(ctx, req.Features)
gbCh <- ModelScore{
ModelName: "gradient_boost",
Score: score,
LatencyMs: time.Since(gbStart).Milliseconds(),
Err: err,
}
}()
go func() {
defer wg.Done()
// Neural net has its own deadline - 60ms budget
nnCtx, cancel := context.WithTimeout(ctx, e.nnTimeout)
defer cancel()
nnStart := time.Now()
score, err := e.nnModel.Predict(nnCtx, req.Features)
nnCh <- ModelScore{
ModelName: "neural_net",
Score: score,
LatencyMs: time.Since(nnStart).Milliseconds(),
Err: err,
}
}()
wg.Wait()
close(gbCh)
close(nnCh)
gbResult := <-gbCh
nnResult := <-nnCh
gbScore := gbResult.Score
nnScore := nnResult.Score
// Fall back gracefully if NN times out
if nnResult.Err != nil {
// Rebalance weights when NN is unavailable
finalScore := (gbScore * 0.75) + (ruleScore * 0.25)
return EnsembleResult{
FinalScore: finalScore,
GBScore: gbScore,
RuleScore: ruleScore,
ConfidenceLevel: "MEDIUM",
TotalLatencyMs: time.Since(start).Milliseconds(),
}
}
finalScore := (gbScore * e.weights.GBWeight) +
(nnScore * e.weights.NNWeight) +
(ruleScore * e.weights.RuleWeight)
confidence := "MEDIUM"
if finalScore > 0.80 || finalScore < 0.15 {
confidence = "HIGH"
} else if finalScore > 0.55 || finalScore < 0.35 {
confidence = "LOW"
}
return EnsembleResult{
FinalScore: finalScore,
GBScore: gbScore,
NNScore: nnScore,
RuleScore: ruleScore,
ConfidenceLevel: confidence,
TotalLatencyMs: time.Since(start).Milliseconds(),
}
}
func (e *MLEnsemble) aggregateRuleSignals(signals []RuleSignal) float64 {
if len(signals) == 0 {
return 0.0
}
// Weight by rule confidence, use max for DECLINE signals
maxDeclineSignal := 0.0
flagSum := 0.0
flagCount := 0
for _, sig := range signals {
if sig.Action == "DECLINE" && sig.Signal > maxDeclineSignal {
maxDeclineSignal = sig.Signal
} else if sig.Action == "FLAG" {
flagSum += sig.Signal
flagCount++
}
}
if maxDeclineSignal > 0 {
return maxDeclineSignal
}
if flagCount == 0 {
return 0.0
}
return (flagSum / float64(flagCount)) * 0.7
}
Stripe Radar uses a similar ensemble architecture, combining thousands of hand-crafted features with gradient boosting and neural network models. Their key insight - published in engineering posts - is that the model architecture matters less than feature quality. A well-featured gradient boosting model consistently outperforms a poorly-featured neural network. Radar evaluates every transaction on Stripe’s platform and has processed over 100 billion transactions, making their fraud signal dataset one of the richest in the industry.
Model Shadow Mode
Shadow mode’s job is to validate new model versions alongside the production model using real traffic without affecting any live decisions, giving engineers statistical confidence before promotion.
When a new model version passes offline evaluation (AUC-ROC, precision-recall curves on held-out test data), it enters shadow mode. In shadow mode, the new model runs on every transaction in parallel with the production model, its score is recorded alongside the production decision, but its score does not influence the outcome. After 24-48 hours of shadow traffic, the team compares the shadow model’s decisions against actual fraud outcomes to measure whether it would have performed better or worse.
The critical design decisions for shadow mode are isolation (shadow model failures must not affect production latency), asynchronous execution (shadow scoring can happen off the critical path after the production decision is returned), and statistical validity (enough traffic volume to achieve significance before promotion).
import asyncio
import logging
from dataclasses import dataclass
from typing import Optional
logger = logging.getLogger(__name__)
@dataclass
class ShadowResult:
model_version: str
score: float
latency_ms: int
error: Optional[str]
class ShadowModeOrchestrator:
def __init__(self, production_model, shadow_models: list,
shadow_store, metrics_client):
self.production = production_model
self.shadows = shadow_models
self.store = shadow_store
self.metrics = metrics_client
async def score_with_shadows(self, txn_id: str, features: dict) -> dict:
"""
Run production model synchronously (on critical path).
Launch shadow models as fire-and-forget async tasks.
"""
# Production score is blocking - on the critical path
prod_result = self.production.score(features)
# Shadow models run asynchronously - do not await here
asyncio.create_task(
self._run_shadows_async(txn_id, features, prod_result)
)
# Return immediately with production result only
return prod_result
async def _run_shadows_async(self, txn_id: str, features: dict,
prod_result: dict):
"""Shadow scoring runs entirely off the critical path."""
shadow_results = []
for model in self.shadows:
try:
import time
start = time.monotonic()
# Each shadow has its own timeout - failure is silent
score = await asyncio.wait_for(
asyncio.to_thread(model.score, features),
timeout=0.5 # 500ms shadow budget - much more lenient
)
latency_ms = int((time.monotonic() - start) * 1000)
shadow_results.append(ShadowResult(
model_version=model.version,
score=score["final_score"],
latency_ms=latency_ms,
error=None,
))
self.metrics.histogram(
"shadow_model_latency_ms",
latency_ms,
tags={"version": model.version}
)
except asyncio.TimeoutError:
shadow_results.append(ShadowResult(
model_version=model.version,
score=-1.0,
latency_ms=500,
error="timeout",
))
logger.warning("Shadow model %s timed out for txn %s",
model.version, txn_id)
except Exception as e:
shadow_results.append(ShadowResult(
model_version=model.version,
score=-1.0,
latency_ms=0,
error=str(e),
))
# Persist shadow scores for later analysis
await self.store.save_shadow_comparison(
txn_id=txn_id,
production_score=prod_result["final_score"],
production_decision=prod_result["decision"],
shadow_results=shadow_results,
)
Shadow mode scoring runs after the production response is sent - not before. The production decision is returned to the payment processor immediately, then the shadow models run asynchronously. This means shadow mode adds zero latency to the critical path. The shadow results are stored and compared against actual fraud outcomes reported in the next 24-72 hours via chargeback data.
Data Model
-- Core transaction scoring record
-- Written once per transaction, never updated
CREATE TABLE fraud_scoring_decisions (
txn_id UUID PRIMARY KEY,
user_id UUID NOT NULL,
card_token VARCHAR(64) NOT NULL,
merchant_id VARCHAR(64) NOT NULL,
amount_paise BIGINT NOT NULL, -- integer paise to avoid float precision
currency CHAR(3) NOT NULL,
decision VARCHAR(16) NOT NULL CHECK (decision IN ('APPROVE','DECLINE','REVIEW')),
final_score NUMERIC(5,4) NOT NULL, -- 0.0000 to 1.0000
decision_reason VARCHAR(128),
rule_signals JSONB NOT NULL DEFAULT '[]',
gb_score NUMERIC(5,4),
nn_score NUMERIC(5,4),
rule_score NUMERIC(5,4),
model_version VARCHAR(32) NOT NULL,
feature_vector JSONB NOT NULL, -- full features for audit + retraining
scoring_latency_ms INT NOT NULL,
decided_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
ip_address INET,
device_fingerprint VARCHAR(128),
ip_country CHAR(2),
billing_country CHAR(2)
) PARTITION BY RANGE (decided_at);
-- Monthly partitions for efficient archival
CREATE TABLE fraud_scoring_decisions_2026_06
PARTITION OF fraud_scoring_decisions
FOR VALUES FROM ('2026-06-01') TO ('2026-07-01');
-- Fast lookups by user and card for velocity queries
CREATE INDEX idx_fsd_user_decided ON fraud_scoring_decisions (user_id, decided_at DESC);
CREATE INDEX idx_fsd_card_decided ON fraud_scoring_decisions (card_token, decided_at DESC);
CREATE INDEX idx_fsd_decision ON fraud_scoring_decisions (decision, decided_at DESC);
-- Feedback loop: actual fraud outcomes reported via chargeback
CREATE TABLE fraud_outcomes (
outcome_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
txn_id UUID NOT NULL REFERENCES fraud_scoring_decisions(txn_id),
outcome_type VARCHAR(32) NOT NULL CHECK (
outcome_type IN ('confirmed_fraud','confirmed_legitimate',
'chargeback','analyst_approved','analyst_declined')
),
reported_by VARCHAR(64), -- 'chargeback_processor', 'analyst_id', 'customer'
reported_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
notes TEXT,
-- Derived label for retraining: 1=fraud, 0=legitimate
training_label SMALLINT GENERATED ALWAYS AS (
CASE outcome_type
WHEN 'confirmed_fraud' THEN 1
WHEN 'chargeback' THEN 1
WHEN 'confirmed_legitimate' THEN 0
WHEN 'analyst_approved' THEN 0
ELSE NULL
END
) STORED
);
CREATE INDEX idx_outcomes_txn ON fraud_outcomes (txn_id);
CREATE INDEX idx_outcomes_reported ON fraud_outcomes (reported_at DESC);
CREATE INDEX idx_outcomes_label ON fraud_outcomes (training_label, reported_at DESC)
WHERE training_label IS NOT NULL;
-- Model registry for version management and shadow mode tracking
CREATE TABLE model_versions (
version_id VARCHAR(32) PRIMARY KEY, -- e.g. "gb-v4.2.1"
model_type VARCHAR(32) NOT NULL, -- 'gradient_boost', 'neural_net', 'ensemble'
status VARCHAR(16) NOT NULL DEFAULT 'shadow'
CHECK (status IN ('shadow','production','retired')),
auc_roc NUMERIC(6,5),
false_positive_rate NUMERIC(6,5),
false_negative_rate NUMERIC(6,5),
shadow_started_at TIMESTAMPTZ,
promoted_at TIMESTAMPTZ,
retired_at TIMESTAMPTZ,
model_artifact_uri TEXT NOT NULL, -- S3 URI to serialized model
created_by VARCHAR(128),
metadata JSONB DEFAULT '{}'
);
Key Algorithms and Protocols
Velocity Counter with Sliding Window
Sliding window velocity counters track transaction counts and amounts in moving time windows without the step-function artifacts of fixed tumbling windows. A sorted set in Redis indexed by timestamp provides exact sliding window computation in O(log N) time.
import time
import redis
class SlidingWindowVelocityCounter:
"""
Exact sliding window counter using Redis sorted sets.
Members are "{txn_id}:{amount}" with score = Unix timestamp in ms.
"""
WINDOW_SIZES_MS = {
"1m": 60_000,
"5m": 300_000,
"1h": 3_600_000,
"24h": 86_400_000,
}
def __init__(self, redis_client: redis.Redis, key_prefix: str):
self.r = redis_client
self.prefix = key_prefix
def increment(self, entity_id: str, txn_id: str, amount: float):
"""Record a new transaction event."""
now_ms = int(time.time() * 1000)
key = f"{self.prefix}:{entity_id}"
member = f"{txn_id}:{amount:.4f}"
pipe = self.r.pipeline(transaction=False)
pipe.zadd(key, {member: now_ms})
# Prune events older than 25 hours (slightly beyond max window)
cutoff = now_ms - 90_000_000
pipe.zremrangebyscore(key, "-inf", cutoff)
pipe.expire(key, 90000) # 25h TTL
pipe.execute()
def get_counts(self, entity_id: str) -> dict:
"""Return count and total amount for each window."""
now_ms = int(time.time() * 1000)
key = f"{self.prefix}:{entity_id}"
pipe = self.r.pipeline(transaction=False)
for window_ms in self.WINDOW_SIZES_MS.values():
cutoff = now_ms - window_ms
pipe.zrangebyscore(key, cutoff, now_ms)
results = pipe.execute()
output = {}
for (label, _), members in zip(self.WINDOW_SIZES_MS.items(), results):
count = len(members)
total = sum(
float(m.split(":")[1]) for m in members if ":" in m
) if members else 0.0
output[f"count_{label}"] = count
output[f"amount_{label}"] = total
return output
Redis sorted sets provide exact sliding windows without approximation. The alternative - HyperLogLog or count-min sketch - uses less memory but cannot give exact counts or sum amounts, which matters for fraud decisions. At 10,000 TPS with 1 year retention per user key, the sorted set approach uses about 120 bytes per transaction entry and auto-expires entries beyond the maximum window. For a typical user with 5 transactions per day, that is under 600 bytes of Redis memory per user.
False Positive Rate Optimization
False positive optimization requires treating the score threshold as a tunable parameter calibrated against business cost, not as a fixed value. The cost of a false positive (blocking a legitimate user) is not symmetric with the cost of a false negative (approving fraud). Calibrate thresholds using cost-sensitive optimization:
import numpy as np
from sklearn.metrics import confusion_matrix
def optimize_threshold(
scores: np.ndarray,
labels: np.ndarray,
fp_cost: float = 15.0, # Cost in USD of wrongly blocking a user (lost sale + support cost)
fn_cost: float = 180.0, # Cost in USD of approving fraud (chargeback + fees)
target_fpr_max: float = 0.003 # Hard cap: no more than 0.3% false positive rate
) -> float:
"""
Find the threshold that minimizes total cost while respecting FPR cap.
"""
thresholds = np.linspace(0.01, 0.99, 990)
best_threshold = 0.5
best_cost = float("inf")
total_negatives = np.sum(labels == 0)
total_positives = np.sum(labels == 1)
for threshold in thresholds:
predictions = (scores >= threshold).astype(int)
tn, fp, fn, tp = confusion_matrix(labels, predictions).ravel()
fpr = fp / total_negatives if total_negatives > 0 else 0
if fpr > target_fpr_max:
continue # Hard constraint: reject thresholds with excessive FPR
total_cost = (fp * fp_cost) + (fn * fn_cost)
if total_cost < best_cost:
best_cost = total_cost
best_threshold = threshold
return best_threshold
The optimal threshold shifts with transaction size. A $10 transaction has a lower fraud cost and a higher relative cost of false positives than a $5,000 transaction. Production systems use tiered thresholds: below $100 requires a score above 0.90 to decline, above $1,000 requires only 0.65. The specific numbers come from the feedback loop data where you can measure actual chargeback amounts versus actual abandoned sales revenue.
Feedback Loop Design
The feedback loop closes the system. Chargeback data arrives 30-90 days after a transaction. Analyst review outcomes arrive within hours. Both feed into a continuous training pipeline that retrains the gradient boosting model weekly and the neural network monthly.
from kafka import KafkaConsumer, KafkaProducer
import json
class FeedbackLoopConsumer:
"""
Consumes fraud outcome events and enriches them with
the original feature vector for model retraining.
"""
def __init__(self, kafka_bootstrap, scoring_db, training_store):
self.consumer = KafkaConsumer(
"fraud-outcomes",
bootstrap_servers=kafka_bootstrap,
group_id="feedback-loop-enricher",
value_deserializer=lambda m: json.loads(m.decode()),
auto_offset_reset="earliest",
enable_auto_commit=False,
)
self.db = scoring_db
self.store = training_store
def process(self):
for message in self.consumer:
outcome = message.value
txn_id = outcome["txn_id"]
label = outcome["training_label"] # 1=fraud, 0=legitimate
# Fetch original feature vector from audit log
decision_record = self.db.get_decision(txn_id)
if not decision_record:
continue
training_example = {
"txn_id": txn_id,
"features": decision_record["feature_vector"],
"label": label,
"original_score": decision_record["final_score"],
"original_decision": decision_record["decision"],
"outcome_type": outcome["outcome_type"],
"outcome_delay_hours": outcome["outcome_delay_hours"],
}
# Write to training store partitioned by outcome date
self.store.append(training_example)
# Update online feature store with confirmed fraud signal
if label == 1:
self._update_fraud_signals(decision_record)
self.consumer.commit()
def _update_fraud_signals(self, decision_record: dict):
"""
Immediately update feature store when fraud is confirmed,
so future transactions from the same device/IP get elevated risk.
"""
user_id = decision_record["user_id"]
device_fp = decision_record["device_fingerprint"]
# Decay trust scores for involved entities
self.store.decrement_trust(f"device:{device_fp}", decay=0.4)
self.store.decrement_trust(f"user:{user_id}", decay=0.2)
Scaling and Performance
Capacity Estimation
Given:
- 10,000 TPS sustained, 50,000 TPS peak
- 100ms P99 latency target
- Feature store: 100M active users, 20 features each at 200 bytes = 4TB Redis
- Velocity counters: 100M users, avg 5 txns/day active, 120 bytes/entry
100M * 5 * 120 = 60GB for 24h window (manageable with sharding)
Scoring service sizing:
- Each scoring request: ~5ms CPU (rule eval + GB inference)
- At 10,000 TPS: 50,000 ms CPU/s = 50 CPU cores just for rule+GB
- NN inference: GPU-based, 1 A100 handles ~800 NN inferences/s at 40ms
- At 10,000 TPS with 60% reaching NN: 6,000 NN calls/s = 8 A100 GPUs
- API + rule engine: 16 nodes * 8 cores = 128 cores (comfortable headroom)
Redis velocity cluster:
- 60GB data + 2x replication = 180GB across cluster
- At 10,000 TPS, each txn does 4 ZCOUNT + 4 ZRANGEBYSCORE = 8 ops
- 80,000 Redis ops/s: 4 Redis nodes with 20,000 ops/s each (trivial)
- Shard by user_id % num_shards for locality
Feature store Redis cluster:
- 4TB total, 2x replication = 8TB
- 128GB nodes: 64 Redis nodes
- 10,000 HGET/s: well within per-node capacity
Decision audit log (PostgreSQL):
- 10,000 writes/s to fraud_scoring_decisions
- 1KB per row average = 10MB/s = 864GB/day
- Partition by month, archive to S3 after 90 days
- 3 write replicas + 2 read replicas for query path
Kafka feedback pipeline:
- 10,000 txn events/s + 200 outcome events/s (2% feedback rate)
- 3 brokers, 24 partitions, RF=3: easily handles this volume
Horizontal scaling for the API nodes is straightforward - the scoring service is stateless. The challenging scaling dimension is the Redis feature store. The velocity counters are sharded by user_id % shard_count, ensuring all velocity operations for a user hit the same shard. The feature store hash keys are sharded by user ID hash for even distribution. When adding shards, consistent hashing minimizes key migration during rebalancing.
The ML inference layer scales differently for the two models. The gradient boosting model is CPU-bound and scales horizontally by adding inference nodes. The neural network is GPU-accelerated and scales by adding GPU instances, which are expensive and require more careful capacity planning. At 60,000 NN calls per second during peak, 80 A100 GPUs are needed - batching multiple requests into a single inference call is essential to amortize GPU utilization.
PayPal’s fraud detection system processes over 15 million transactions daily across 200+ countries. Their published architecture uses a tiered approach nearly identical to what is described here: fast rule evaluation in under 5ms, followed by gradient boosting in 15ms, with deep learning reserved for high-value or ambiguous transactions. PayPal reports that their ensemble approach reduced false positives by 37% compared to their previous rule-only system, while simultaneously improving fraud detection rate by 12%.
Failure Modes and Recovery
| Failure | Detection | Impact | Recovery |
|---|---|---|---|
| Redis velocity counter node failure | Redis Sentinel heartbeat, 5s detection | Velocity features unavailable for users on that shard; counters default to 0 | Automatic failover to replica in ~30s; missing counters mean slightly lower fraud signal for affected users; acceptable risk versus blocking payments |
| ML inference node crash | Health check + circuit breaker, 2s detection | NN scoring unavailable; ensemble falls back to GB + rules only | Load balancer removes unhealthy node; remaining nodes absorb traffic; auto-scaling adds replacement node within 3 minutes |
| Feature store staleness (batch job lag) | Monitoring on feature freshness timestamp; alert if age > 30 min | Pre-computed features reflect state from 30+ minutes ago; velocity counters still fresh (real-time) | Alert ops; rule engine continues using real-time velocity; ML features degrade gracefully with older inputs |
| Payment processor timeout (>100ms) | Latency P99 monitoring; alert if >95ms P95 | Processor declines transaction as timeout - worse than a fraud DECLINE because user gets no useful error | Circuit breaker returns APPROVE after 95ms deadline regardless of score; log for audit; tunable via feature flag |
| Kafka feedback pipeline lag | Consumer group lag metric; alert if >100k messages behind | Fraud outcomes not flowing to training store; model training data becomes stale | Increase consumer parallelism; if persistent, replay from stored offset; model training is batch anyway so 1-2 day lag is acceptable |
| Database (audit log) write failure | Write error rate monitoring | Scoring decision not persisted for audit; regulatory compliance risk | Dual-write: write to Kafka first, PostgreSQL asynchronously; Kafka serves as durable buffer for recovery replay |
The most dangerous failure mode is a latency spike that causes the scoring service to return APPROVE as a timeout fallback for thousands of transactions that should have been DECLINE or REVIEW. Fraudsters who discover the timeout behavior can deliberately trigger slow scoring by submitting transactions designed to max out all model features simultaneously. Monitor for correlated latency spikes with elevated approval rates - this pattern indicates either an attack or a cascading performance degradation that is masking as availability.
Comparison of Approaches
| Approach | Latency | Accuracy | Operational Cost | Best For |
|---|---|---|---|---|
| Rule-only engine | 5ms P99 | Limited: misses novel patterns, requires manual maintenance | Low: no ML infra | Low-volume or tightly-scoped fraud domains with stable patterns |
| Single ML model (no rules) | 50-100ms P99 | Good for trained patterns, weak on novel velocity attacks | Medium: model training pipeline required | Businesses with clean labeled data and stable fraud distribution |
| Rule engine + gradient boosting | 25ms P99 | High: rules catch obvious fraud fast, GB handles subtlety | Medium-high: two systems to maintain | Most payment fraud scenarios at scale |
| Full ensemble (rules + GB + NN) | 80-100ms P99 | Highest: NN captures temporal behavioral shifts | High: GPU infra, shadow mode, feedback loop | High-value payment flows where fraud cost justifies model complexity |
| Graph-based fraud network detection | 200-500ms P99 | Very high for organized fraud rings | Very high: graph DB, GNN training | Detecting coordinated fraud across account networks |
| External fraud bureau API | 50-200ms P99 (external) | Variable: depends on bureau data freshness | Low in-house, high vendor cost | Startups or low-volume without training data to build in-house models |
The correct choice for a payment platform at 10,000 TPS is the full ensemble for high-value transactions (above $100) and rule engine plus gradient boosting for low-value transactions. The NN adds meaningful accuracy improvement but costs 4-6x more compute than gradient boosting alone. Routing transactions to the appropriate scoring tier by amount captures most of the accuracy benefit at a fraction of the cost: 80% of transactions are low-value and decided by the cheaper tier, while the expensive NN is reserved for the 20% of high-value transactions where the fraud cost justifies the compute spend.
Key Takeaways
- Online feature freshness is the foundational constraint - a model trained on stale features makes decisions on stale reality; velocity counters must be updated within seconds, not minutes.
- Early exit architecture multiplies throughput - when 50% of transactions exit at the rule engine in 5ms, the ML ensemble only needs to handle 50% of peak volume, halving the required inference capacity.
- False positive cost is asymmetric by transaction value - a $10 declined transaction costs more in customer damage relative to fraud risk than a $5,000 declined transaction; thresholds must be tiered by amount.
- Shadow mode is not optional for model updates - promoting a model directly to production without shadow validation risks a sudden spike in false positives or false negatives that damages both user trust and fraud loss simultaneously.
- The feedback loop is the moat - models trained on your own labeled fraud outcomes are far superior to generic models because fraud patterns are merchant-category and region-specific; the longer you operate the feedback loop, the harder the system is to replicate.
- Velocity counters in Redis sorted sets provide exact sliding windows - approximate data structures save memory but lose the precision needed for fraud decisions where 12 transactions vs. 13 transactions changes the verdict.
- Failsafe to APPROVE, not DECLINE - when the scoring system itself fails, blocking all payments is worse than approving potentially fraudulent ones; design for graceful degradation to a permissive baseline while alerting immediately.
- Rule thresholds drift with adversarial pressure - fraudsters learn your thresholds and adjust to stay below them; automated threshold review and model retraining is not maintenance, it is survival.
The counter-intuitive lesson in fraud detection is that the hardest problem is not the ML model - it is the feedback loop latency. Chargebacks arrive 30-90 days after the transaction. By then, the fraud pattern has evolved and the training signal is partially stale. The systems that win are the ones that supplement chargeback data with faster signals: analyst review outcomes (hours), customer fraud reports (days), and cross-merchant network signals from payment network partners. Invest in feedback loop latency reduction before model complexity.
Frequently Asked Questions
Q: Why not just use a single neural network instead of a rule engine plus gradient boosting plus neural net ensemble?
A: A single neural network has three problems for this use case. First, inference latency is 40-80ms even on GPU - with no fast-path exit, every transaction pays this cost. Second, neural networks are black boxes that make regulatory compliance difficult; payment regulations in most jurisdictions require explainable fraud decisions that can be audited and challenged. Third, neural networks need large amounts of labeled training data to generalize; gradient boosting works well with thousands of examples while neural networks need millions. The ensemble gets the best of all three: rules provide explainability and fast exit, gradient boosting handles structured tabular features with limited data, and the neural network captures temporal behavioral patterns.
Q: Why not use a graph database to detect fraud rings instead of behavioral features per user?
A: Graph-based fraud ring detection is extremely valuable but too slow for synchronous transaction scoring. Building a transaction graph, running community detection algorithms, and scoring graph centrality takes 200-500ms at minimum - well beyond the 100ms budget. The practical approach is to run graph analysis asynchronously as a batch job every 15-30 minutes, materializing “fraud ring membership score” as a pre-computed feature that the online store serves to the synchronous scoring path. You get the graph intelligence without paying the graph traversal latency on the critical path.
Q: How do you handle a user who legitimately travels internationally and triggers the impossible travel rule?
A: Two mechanisms. First, the impossible travel rule signals a FLAG at 0.91 score but is a DECLINE only if the time gap is under 2 hours - international travel takes longer, so the rule has a built-in grace period. Second, the ML model sees the user’s historical travel patterns as a feature: if the user frequently transacts in multiple countries, the model assigns lower fraud probability to cross-country transactions for that user. The third mechanism is friction: a REVIEW decision can trigger a 3DS challenge or a push notification to the user’s phone asking them to confirm the transaction, converting a potential false positive into an authenticated transaction.
Q: Why use Redis sorted sets for velocity counters instead of a streaming system like Flink with windowed aggregations?
A: Flink windowed aggregations are excellent for analytics but have two problems for synchronous fraud scoring. First, Flink windows produce results at window boundaries - a 1-minute tumbling window tells you the count in the last complete minute, not in the last 60 seconds from now. The fraud detection needs an exact sliding window. Second, Flink result availability has latency - the aggregation result is published after the window closes, which adds seconds of staleness. Redis sorted sets provide exact sliding windows with millisecond-fresh results because you compute the window at query time, not at window close time.
Q: How do you prevent the feedback loop from reinforcing biases - for example, if the model over-declines transactions from certain demographics?
A: This is a real risk called feedback loop bias. The mechanism: model declines user group X more often, meaning X generates fewer positive transaction examples, meaning the next training run sees X as lower risk due to survivorship bias - or higher risk because the declined ones were never tested. Mitigation requires three things. First, maintain a holdout set: a small percentage of high-score transactions are randomly approved regardless of model decision and their outcomes tracked. Second, audit training data for demographic distribution parity before each training run. Third, track approval rates segmented by user cohort and alert when any cohort’s approval rate diverges more than 2 standard deviations from its 90-day baseline.
Q: What happens to scoring during model retraining or deployment?
A: Retraining and deployment are fully decoupled from scoring. The model is trained offline, serialized to S3, and registered in the model registry table with status “shadow”. The scoring service polls the model registry every 60 seconds and loads new shadow models into a separate inference slot. Shadow scoring runs asynchronously. When shadow validation passes and a human approves the promotion, the registry status changes to “production” and the scoring service picks it up on its next poll cycle - typically within 60 seconds - with zero downtime and no restart required.
Interview Questions
Q: Walk me through how a transaction gets scored end-to-end in under 100ms.
Expected depth: Trace the path from API Gateway receiving the payload, to parallel feature lookups (online feature store HGET + Redis velocity ZCOUNT, both within 15ms), to rule evaluation (sequential rule checks in priority order, early exit if any rule fires above block threshold), to ML ensemble (gradient boosting inference in 15ms, neural network in 40ms run in parallel, combine scores with weights), to decision engine (threshold comparison, audit log write, response return). Name the time budget for each stage: 15ms features, 5ms rules, 60ms ML, 10ms decision + write = 90ms with 10ms buffer. Discuss the failsafe: return APPROVE after 95ms regardless of scoring state.
Q: How would you design the velocity counter to handle a Redis node failure without losing accuracy?
Expected depth: Explain that Redis Sentinel provides automatic failover to a replica in 30 seconds. During the 30-second gap, velocity counters for affected shards return 0 (missing key defaults), which means those users appear to have zero transaction history and the fraud score drops. This is the correct trade-off - brief false negatives during a Redis failure are less damaging than blocking all payments. After failover, the replica has up-to-date data (Redis replication is asynchronous with sub-second lag). For the 30-second recovery window, discuss tracking which users had missing counters and applying a temporary elevated-risk flag until their counters are verified fresh.
Q: The model team wants to update the gradient boosting model every week. How do you do this with zero downtime?
Expected depth: Explain the model registry + shadow mode pipeline: train offline, register as shadow, run shadow on 100% of traffic for 24-48 hours, compare shadow decisions against production decisions and against known fraud outcomes, if metrics meet promotion criteria then update registry status to production, scoring service polls registry and hot-swaps model in memory within 60 seconds. Discuss the fallback: model versions are immutable artifacts in S3; if a promoted model shows degradation (false positive rate spiking, latency regression), rollback is changing the registry status back to the previous version - same 60-second hot-swap mechanism. No deployment, no restart.
Q: Your fraud false negative rate is acceptable but false positives are too high - 1.2% of legitimate transactions are being blocked. What do you do?
Expected depth: Start with diagnosis: segment the 1.2% false positives by rule versus ML decision, by user segment, by transaction type. If most false positives come from a single rule, calibrate that rule’s threshold upward. If they come from the ML score, lower the DECLINE threshold (raise the bar for declines) at the cost of accepting more fraud - quantify that trade-off using the cost-optimization function. Introduce a REVIEW tier: transactions between score 0.55 and 0.75 go to REVIEW with a friction step (SMS OTP or 3DS challenge) instead of hard DECLINE, converting false positives into authenticated approvals. Track friction acceptance rate as a proxy for false positive rate. Finally, improve the training data: add more confirmed-legitimate examples to the training set to reduce the model’s tendency to over-classify legitimate patterns as fraud.
Q: How would you extend the system to detect fraud rings - coordinated attacks using many accounts?
Expected depth: Fraud rings require graph analysis that cannot run synchronously. The architecture extension: build a transaction graph with nodes as users, cards, devices, and IP addresses, edges as shared attributes (same device used by multiple accounts, same IP, same billing address). Run community detection (Louvain algorithm or label propagation) on this graph daily or every 4 hours as a batch job. Output a “ring membership score” per user and a “connected account cluster ID.” Materialize these as features in the online store, so the synchronous scoring path can use them without graph traversal cost. Additionally, add a real-time streaming signal: when any account in a cluster is confirmed as fraud, immediately elevate the risk score for all accounts in the same cluster via a Kafka event that updates their online store features.
Premium Content
Unlock the full article along with everything else in the archive — all in one place.