Build a Feature Flag Service


Ship code to 1% of users, roll it back in milliseconds - without a single deploy.

Imagine you ship a new checkout flow and within two minutes your error rate climbs from 0.1% to 8%. With a traditional deploy, your only option is a rollback - a process that takes five to fifteen minutes of CI/CD pipeline time, during which thousands of customers keep hitting the broken code. A feature flag turns those fifteen minutes into well under a second: you flip a switch, and every application instance in every region stops serving the new flow within a few hundred milliseconds.

Think of a feature flag service as a distributed circuit breaker. A circuit breaker in electrical engineering protects a system by cutting current instantly when a fault is detected - no waiting for a fuse to melt, no cascading damage. A feature flag does the same thing for software: it sits between your code and your users, ready to interrupt a bad rollout without touching a single line of deployed code. Unlike a circuit breaker, it also lets you dial up exposure gradually, from 0% to 1% to 10% to 100%, watching your error rates at each step.

The naive approaches fail in predictable ways. Environment variables require a full redeploy to change - useless for a kill switch. A database column queried on every request works until you have 10,000 requests per second, at which point flag evaluation adds 10,000 extra DB queries per second and your flag store becomes a throughput ceiling. Hardcoded if feature_enabled: checks scattered across a codebase have no propagation mechanism at all - you are back to redeploys. Config files cached in memory are fast to evaluate but have no live-update path; stale configs can persist for minutes or hours depending on your restart cadence.

The requirements pull in three directions simultaneously. Evaluation must be sub-millisecond because flags live on every hot path. Propagation must be sub-500ms because a kill switch that takes a minute to propagate is not a kill switch. Availability must be near-100% because every service in your stack depends on the SDK to evaluate correctly. We need to solve for in-process evaluation, streaming propagation, and offline resilience simultaneously.

Requirements and Constraints

Functional requirements

  • Create, read, update, and delete feature flags via a Management API
  • Target flags to specific user IDs, user attributes (plan, country, device type), or percentage buckets
  • Instant kill switch: disable any flag globally within 500ms of the API call completing
  • Client SDK that performs flag evaluation entirely in-process with zero network calls on the evaluation path
  • Audit trail: every flag change logged with actor identity, timestamp, old value, and new value
  • Support boolean, string, number, and JSON flag value types

Non-functional requirements

  • Flag evaluation latency: p99 < 1ms (all local, no network round-trip)
  • Flag propagation latency: < 500ms globally after any flag change is committed
  • SDK availability: 99.99% (must evaluate flags even if all flag servers are unreachable)
  • Flag store capacity: 10,000+ flags per account
  • Propagation throughput: 50,000+ connected SDK instances per region

Constraints

  • SDK must operate offline using last-known flag configuration
  • No synchronous network calls on the flag evaluation path
  • Eventual consistency is acceptable: a window of a few hundred milliseconds of stale values is fine

High-Level Architecture

Feature Flag Service Architecture - Management API, Propagation Service, and SDK instances

The system has five major components. The Management API is a REST service that handles all flag CRUD operations, validates targeting rules, and persists changes to PostgreSQL. The Flag Config Store (PostgreSQL) is the single source of truth for all flag configurations, rules, and overrides. The Propagation Service is a stateful fan-out service that maintains long-lived SSE connections to every SDK instance and pushes flag deltas in real time. The Audit Store is an append-only PostgreSQL table (partitioned by month) that records every flag mutation for compliance and debugging. The Client SDK is a library embedded directly in application processes - it bootstraps by fetching all flag configs, then holds an open SSE connection to receive deltas, and evaluates flags entirely in local memory.

Data flows in two directions. When an engineer changes a flag via the Management API, the change is written to PostgreSQL, an event is published to Redis pub/sub, and the Propagation Service picks up that event and pushes a JSON delta to every connected SDK instance within the 500ms target. When application code evaluates a flag, there is no data flow at all - the SDK reads from its in-memory HashMap, a pure local operation that completes in microseconds.

Key Insight

The most important architectural decision is putting flag evaluation in-process: the SDK never makes a network call to evaluate a flag, which gives you sub-millisecond latency and near-perfect availability even if your servers are down.

Component Deep Dives

The Client SDK

The SDK is the most critical component in the system, and it is also the most invisible. Application engineers interact with it through a two-line API, but underneath it is managing a bootstrap fetch, an SSE connection with reconnect logic, a version-tracked in-memory flag store, and a graceful degradation path for when the server is unreachable.

On startup, the SDK performs a synchronous HTTP GET to /flags?api_key=... and receives a JSON payload containing every flag configuration for that account. This payload is parsed into an in-memory HashMap<String, FlagConfig>. Lookups are O(1). The entire flag state for an account with 10,000 flags fits comfortably in a few megabytes of RAM.

Immediately after bootstrapping, the SDK opens a persistent SSE connection to the Propagation Service. Every time a flag changes, the SDK receives a JSON message with the flag key, the new configuration, and a version number. The SDK updates its local HashMap atomically and increments the version counter for that flag. The SSE connection is unidirectional - the SDK never sends anything back - which makes reconnection simple and stateless from the server’s perspective.
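
The version-tracked update described above could look roughly like the sketch below. The names `FlagStore` and `apply_delta` are hypothetical, and the sketch assumes the version number arrives in the delta message (rather than being incremented locally), so that duplicate or out-of-order deliveries can be ignored.

```python
import threading
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class FlagConfig:
    key: str
    version: int
    config: Dict[str, Any]


class FlagStore:
    """In-memory flag map; readers never see a half-applied update."""

    def __init__(self) -> None:
        self._flags: Dict[str, FlagConfig] = {}
        self._lock = threading.Lock()

    def apply_delta(self, delta: Dict[str, Any]) -> bool:
        """Apply one delta; return False if it is not newer than what we hold."""
        incoming = FlagConfig(delta["flag_key"], delta["version"], delta["config"])
        with self._lock:
            current = self._flags.get(incoming.key)
            if current is not None and incoming.version <= current.version:
                return False  # duplicate or out-of-order delivery
            # Swapping in a whole new immutable entry is a single assignment
            self._flags[incoming.key] = incoming
            return True

    def get(self, key: str) -> Optional[FlagConfig]:
        return self._flags.get(key)
```

Replacing the entire entry, instead of mutating fields in place, is what lets evaluation threads read without taking the lock.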

If the SSE connection drops (network interruption, server restart, load balancer timeout), the SDK continues evaluating flags against its last-known values and retries the connection with exponential backoff starting at 1 second. The Last-Event-ID header, sent automatically by the browser SSE API and implemented manually in server-side SDKs, tells the Propagation Service which event the SDK last received, allowing missed events to be replayed on reconnect.

Feature Flag Service data flow - SDK evaluation path and SSE update path

# Initialize once at app startup
client = FeatureFlagClient(api_key="...", sse_url="https://flags.example.com/stream")
client.wait_for_ready(timeout_ms=2000)

# Evaluate anywhere, zero network calls
def handle_checkout(user_id: str, order: Order):
    if client.bool("new-checkout-flow", user_id, default=False):
        return new_checkout_handler(order)
    return legacy_checkout_handler(order)

Real World

LaunchDarkly and Split.io both use this same in-process evaluation pattern. LaunchDarkly’s SDKs maintain a persistent streaming connection and evaluate flags locally, achieving sub-100ms propagation in practice - well within our 500ms target. Their architecture documentation explicitly calls out that “the SDK does not make any network requests to evaluate a feature flag.”

The Flag Evaluation Engine

The evaluation engine is a pure function: given a flag configuration and a user context, it returns a value. No I/O, no side effects, no mutable state outside the flag HashMap it reads from. This property makes the engine trivially testable and safe to call from any thread.

Flag Evaluation Engine internals - targeting rules and percentage bucketing

The engine evaluates rules in strict priority order. Individual user overrides are checked first - these are exact-match lookups by user ID and are useful for giving specific users (like beta testers or your own account) a fixed flag value regardless of rollout state. Next, the engine walks through the ordered list of targeting rules. Each rule specifies an attribute name, an operator, and a value to match against - for example, plan in ["enterprise", "pro"] or country == "US". If a rule matches, the engine returns that rule’s designated value immediately. If no rules match, the engine falls back to percentage rollout logic. Finally, if the user falls outside the rollout percentage, the engine returns the flag’s configured default value.

import mmh3  # murmurhash3

def bucket_user(user_id: str, flag_key: str) -> int:
    # Combine user_id + flag_key to make buckets flag-specific
    hash_key = f"{flag_key}.{user_id}"
    # murmurhash3 gives a consistent, well-distributed hash and is fast to compute
    hash_value = mmh3.hash(hash_key, signed=False)
    return hash_value % 100  # returns 0-99

def evaluate_flag(flag: FlagConfig, user: UserContext) -> FlagValue:
    # 0. Kill switch: a disabled flag always serves its default value
    if not flag.enabled:
        return flag.default_value

    # 1. Check individual user overrides first
    if user.user_id in flag.overrides:
        return flag.overrides[user.user_id]

    # 2. Evaluate targeting rules in order (segment membership)
    for rule in flag.rules:
        if matches_rule(rule, user):
            # Return the value to serve, not the value the rule matched against
            return rule.serve_value

    # 3. Percentage rollout
    bucket = bucket_user(user.user_id, flag.key)
    if bucket < flag.rollout_percentage:
        return flag.rollout_value

    # 4. Default value
    return flag.default_value

The reason murmurhash3 is used instead of Python’s built-in hash() or a random number is consistency. Given the same user_id and flag_key, murmurhash3 always produces the same bucket number - across every process, every language, every restart - as long as every SDK uses the same murmurhash3 variant, seed, and string encoding. Python’s built-in hash(), by contrast, is randomized per process for strings by default. This means user 12345 always sees the same variant of new-checkout-flow on every request, which is essential for A/B test validity and for preventing the jarring experience of users seeing a feature appear and disappear between page loads.
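
The evaluation code calls a `matches_rule` helper that the article does not show. A minimal sketch of what it might look like follows, assuming the three operators mentioned in the data model (`in`, `equals`, `starts_with`). The `Rule` and `UserContext` shapes here are illustrative assumptions, not a fixed API.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Rule:
    attribute: str
    operator: str      # 'in', 'equals', or 'starts_with'
    value: Any         # value to match against
    serve_value: Any   # value returned when the rule matches


@dataclass
class UserContext:
    user_id: str
    attributes: Dict[str, Any] = field(default_factory=dict)


def matches_rule(rule: Rule, user: UserContext) -> bool:
    """Return True if the user's attribute satisfies the rule."""
    if rule.attribute == "user_id":
        actual = user.user_id
    else:
        actual = user.attributes.get(rule.attribute)
    if actual is None:
        return False  # a missing attribute never matches
    if rule.operator == "equals":
        return actual == rule.value
    if rule.operator == "in":
        return actual in rule.value
    if rule.operator == "starts_with":
        return (isinstance(actual, str) and isinstance(rule.value, str)
                and actual.startswith(rule.value))
    return False  # unknown operators fail closed
```

Failing closed on unknown operators means an SDK that is older than the rule format serves the next rule or the default, rather than raising on the hot path.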

Watch Out

Bucket assignment must include the flag key in the hash input. If you hash only the user ID, all of your 1%-rollout flags hit the exact same 1% of users simultaneously. Those users become accidental guinea pigs for every experiment, and your A/B test results are completely confounded. Including the flag key gives each flag its own independent, uniformly distributed user population.
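
The effect is easy to demonstrate. The sketch below compares two 1% rollouts with and without the flag key in the hash input. To stay within the standard library it uses SHA-256 as a stand-in for murmurhash3 (an assumption made only for this illustration); the user IDs and flag names are made up.

```python
import hashlib


def bucket(hash_input: str) -> int:
    """Stable bucket 0-99. SHA-256 stands in for murmurhash3 here."""
    digest = hashlib.sha256(hash_input.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def rollout_population(users, flag_key, percentage, include_flag_key):
    """Return the set of users who fall inside the rollout."""
    selected = set()
    for user_id in users:
        key = f"{flag_key}.{user_id}" if include_flag_key else user_id
        if bucket(key) < percentage:
            selected.add(user_id)
    return selected


users = [f"user-{i}" for i in range(20_000)]

# Hashing only the user ID: both flags pick exactly the same users.
naive_a = rollout_population(users, "flag-a", 1, include_flag_key=False)
naive_b = rollout_population(users, "flag-b", 1, include_flag_key=False)

# Hashing flag key + user ID: the two populations are nearly disjoint.
keyed_a = rollout_population(users, "flag-a", 1, include_flag_key=True)
keyed_b = rollout_population(users, "flag-b", 1, include_flag_key=True)

print(len(naive_a & naive_b), "shared users without the flag key")
print(len(keyed_a & keyed_b), "shared users with the flag key")
```

With independent buckets, two 1% rollouts should share only about 0.01% of users, which is what the second count reflects.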

The Kill Switch and Flag Propagation

A kill switch is not a special feature - it is enabled: false propagated quickly. The same pipeline that handles a routine percentage increase from 10% to 20% also handles an emergency kill. This simplicity is not accidental: when something is on fire, you do not want to reason about two separate code paths.

When an engineer toggles a flag off in the Management API, the following sequence happens: the API writes the updated flag record to PostgreSQL with enabled = false; it then publishes a change event to a Redis pub/sub channel keyed by account ID; the Propagation Service, subscribed to that channel, receives the event; it serializes the new flag configuration into a JSON delta; and it writes that delta to every SSE connection belonging to that account. Each SDK receives the delta within its SSE stream, updates its local HashMap, and returns false for all subsequent evaluations of that flag. The end-to-end latency from the database write completing to the last SDK updating is under 500ms in a single region with a healthy Redis connection.
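
The sequence above can be sketched end to end with in-memory stand-ins. Everything here is hypothetical scaffolding: `InMemoryPubSub` plays the role of Redis pub/sub, a dict plays the role of PostgreSQL, and each "SSE connection" is reduced to an SDK's local flag map.

```python
import json
from collections import defaultdict
from typing import Callable, Dict, List


class InMemoryPubSub:
    """Stand-in for Redis pub/sub: delivers each message to all subscribers."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, channel: str, handler: Callable[[str], None]) -> None:
        self._subscribers[channel].append(handler)

    def publish(self, channel: str, message: str) -> int:
        for handler in self._subscribers[channel]:
            handler(message)
        return len(self._subscribers[channel])


def disable_flag(db: Dict[str, dict], pubsub: InMemoryPubSub,
                 account_id: str, flag_key: str) -> None:
    """Management API side: persist first, then publish the change event."""
    flag = db[flag_key]
    flag["enabled"] = False
    flag["version"] += 1
    event = {"flag_key": flag_key, "version": flag["version"],
             "config": {"enabled": False, "default_value": flag["default_value"]}}
    pubsub.publish(f"flags:{account_id}", json.dumps(event))


# Propagation side: each "connection" here is just an SDK's local flag map.
db = {"new-checkout-flow": {"enabled": True, "version": 41, "default_value": False}}
sdk_maps = [dict(), dict(), dict()]
bus = InMemoryPubSub()
for sdk_map in sdk_maps:
    def on_event(raw: str, target: dict = sdk_map) -> None:
        delta = json.loads(raw)
        target[delta["flag_key"]] = delta["config"]
    bus.subscribe("flags:acct-1", on_event)

disable_flag(db, bus, "acct-1", "new-checkout-flow")
```

Note the ordering: the write is persisted before the event is published, so a subscriber that reacts to the event never reads a state older than the one it was told about.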

The Propagation Service manages SSE connections using consistent hashing on SDK instance ID. Each instance always connects to the same node, which means the node can efficiently track “which flags has this instance already seen” using a per-connection version map. When a connection drops and reconnects (even to a different node after a failover), the Last-Event-ID header tells the new node which event to replay from.
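
One common way to get this "same instance, same node" behaviour is a hash ring with virtual nodes. The sketch below is an illustration rather than the article's implementation: the `HashRing` class, the node names, and the choice of SHA-256 for ring positions are all assumptions.

```python
import bisect
import hashlib
from typing import Dict, List


class HashRing:
    """Consistent-hash ring with virtual nodes for smoother balance."""

    def __init__(self, nodes: List[str], vnodes: int = 100) -> None:
        self._vnodes = vnodes
        self._ring: Dict[int, str] = {}
        self._sorted_keys: List[int] = []
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(value: str) -> int:
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add_node(self, node: str) -> None:
        for i in range(self._vnodes):
            point = self._hash(f"{node}#{i}")
            self._ring[point] = node
            bisect.insort(self._sorted_keys, point)

    def remove_node(self, node: str) -> None:
        for i in range(self._vnodes):
            point = self._hash(f"{node}#{i}")
            del self._ring[point]
            self._sorted_keys.remove(point)

    def node_for(self, instance_id: str) -> str:
        """Walk clockwise to the first virtual node at or after the hash."""
        point = self._hash(instance_id)
        index = bisect.bisect_left(self._sorted_keys, point) % len(self._sorted_keys)
        return self._ring[self._sorted_keys[index]]
```

When a node is removed, only the SDK instances that were mapped to it move elsewhere; every other instance keeps its node and its per-connection state.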

Feature Flag Service scaling - multi-region propagation

SSE vs WebSocket vs Polling

SSE is ideal for flag propagation. It is unidirectional (server to client only), uses plain HTTP/1.1 so it works through corporate proxies and load balancers that inspect traffic, has built-in reconnection semantics via Last-Event-ID, and is natively supported in every browser and most HTTP client libraries. WebSocket adds bidirectional overhead you do not need - the SDK never sends anything to the server on the propagation channel. Polling (every 30 seconds) is simpler to implement but gives you up to 30 seconds of stale values, which makes it useless for kill switches.

Audit Logging

Every mutation to the flag store - creates, updates, deletes, enables, disables - is recorded in an append-only audit log table. The audit record captures the flag key, the change type, the identity of the actor who made the change, a timestamp, the full old configuration as JSONB, and the full new configuration as JSONB. Nothing is ever updated or deleted from this table; it is a permanent record.

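The audit table is partitioned by month using PostgreSQL declarative partitioning. This means you can drop a partition to purge records older than your retention policy (say, 24 months) as a single DDL statement that takes milliseconds, rather than a DELETE that scans millions of rows, generates heavy write-ahead-log traffic, and leaves dead tuples for vacuum to clean up. Each month’s partition must exist before the first row for that month arrives, so create new partitions ahead of time (for example, with a scheduled job that creates next month’s partition).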
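
A scheduled job could generate the partition DDL with a few lines of date arithmetic. This is a sketch under assumptions: the helper names are hypothetical, and the naming scheme `flag_audit_log_YYYY_MM` follows the example partition in the data model.

```python
from datetime import date


def month_bounds(day: date):
    """Return (first day of this month, first day of next month)."""
    start = day.replace(day=1)
    if start.month == 12:
        end = date(start.year + 1, 1, 1)
    else:
        end = date(start.year, start.month + 1, 1)
    return start, end


def create_partition_sql(day: date, table: str = "flag_audit_log") -> str:
    """DDL that creates the monthly partition containing `day`."""
    start, end = month_bounds(day)
    name = f"{table}_{start:%Y_%m}"
    return (
        f"CREATE TABLE IF NOT EXISTS {name} PARTITION OF {table}\n"
        f"    FOR VALUES FROM ('{start.isoformat()}') TO ('{end.isoformat()}');"
    )


def drop_partition_sql(day: date, table: str = "flag_audit_log") -> str:
    """DDL that drops the monthly partition containing `day`."""
    start, _ = month_bounds(day)
    return f"DROP TABLE IF EXISTS {table}_{start:%Y_%m};"


print(create_partition_sql(date(2026, 6, 15)))
print(drop_partition_sql(date(2024, 6, 15)))
```

Running the create step a month early and making it idempotent (`IF NOT EXISTS`) means a missed or repeated job run does not cause inserts to fail.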

-- Audit log write on every flag mutation
INSERT INTO flag_audit_log (
    flag_key,
    event_type,
    changed_by,
    old_value,
    new_value,
    created_at
) VALUES (
    $1,          -- 'new-checkout-flow'
    $2,          -- 'updated'
    $3,          -- 'alice@example.com'
    $4::jsonb,   -- previous flag config snapshot
    $5::jsonb,   -- new flag config snapshot
    now()
);

The audit log is distinct from an evaluation log. The audit log records human-driven changes - it is low volume (maybe thousands of rows per day) and must be durable. An evaluation log, which would record every SDK evaluation for analytics, is much higher volume (billions of rows per day) and warrants a separate pipeline (Kafka to a columnar store like ClickHouse or BigQuery), not a PostgreSQL table.

Data Model

-- Core flag storage
CREATE TABLE feature_flags (
    id          BIGSERIAL PRIMARY KEY,
    flag_key    TEXT NOT NULL UNIQUE,
    name        TEXT NOT NULL,
    description TEXT,
    flag_type   TEXT NOT NULL CHECK (flag_type IN ('boolean', 'string', 'number', 'json')),
    enabled     BOOLEAN NOT NULL DEFAULT false,
    default_value JSONB NOT NULL,
    rollout_percentage SMALLINT NOT NULL DEFAULT 0 CHECK (rollout_percentage BETWEEN 0 AND 100),
    rollout_value JSONB,
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Targeting rules (ordered evaluation)
CREATE TABLE flag_rules (
    id          BIGSERIAL PRIMARY KEY,
    flag_id     BIGINT NOT NULL REFERENCES feature_flags(id) ON DELETE CASCADE,
    rule_order  SMALLINT NOT NULL,
    attribute   TEXT NOT NULL,   -- e.g. 'plan', 'country', 'user_id'
    operator    TEXT NOT NULL,   -- e.g. 'in', 'equals', 'starts_with'
    value       JSONB NOT NULL,  -- e.g. ["enterprise", "pro"]
    serve_value JSONB NOT NULL,
    UNIQUE (flag_id, rule_order)
);

-- User-level overrides
CREATE TABLE flag_overrides (
    id          BIGSERIAL PRIMARY KEY,
    flag_id     BIGINT NOT NULL REFERENCES feature_flags(id) ON DELETE CASCADE,
    user_id     TEXT NOT NULL,
    value       JSONB NOT NULL,
    UNIQUE (flag_id, user_id)
);

-- Append-only audit log
-- A primary key on a partitioned table must include the partition key (created_at)
CREATE TABLE flag_audit_log (
    id          BIGSERIAL,
    flag_key    TEXT NOT NULL,
    event_type  TEXT NOT NULL,   -- 'created', 'updated', 'deleted', 'enabled', 'disabled'
    changed_by  TEXT NOT NULL,
    old_value   JSONB,
    new_value   JSONB,
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
    PRIMARY KEY (id, created_at)
) PARTITION BY RANGE (created_at);

CREATE TABLE flag_audit_log_2026_06 PARTITION OF flag_audit_log
    FOR VALUES FROM ('2026-06-01') TO ('2026-07-01');

-- Indexes
-- The UNIQUE constraints on flag_rules(flag_id, rule_order) and
-- flag_overrides(flag_id, user_id) already create the indexes those lookups need.
CREATE INDEX idx_audit_log_flag_key ON flag_audit_log(flag_key, created_at DESC);

The flag_key column has a UNIQUE constraint, which creates a B-tree index automatically. Lookups by flag key (the primary access pattern from the Management API) are O(log n) against the index, effectively constant for any practical number of flags. The rules table’s composite index on (flag_id, rule_order) means that fetching all rules for a flag in evaluation order is a single index scan. The audit log’s index on (flag_key, created_at DESC) supports the common query pattern “show me the last 20 changes to this specific flag.”

Partitioning the audit log by month also means that queries scoped to a recent time range (the most common case - “what changed this week?”) only scan the current month’s partition, not the entire table. PostgreSQL’s partition pruning eliminates irrelevant partitions at query planning time.

Key Algorithms and Protocols

Consistent Percentage Bucketing

The bucketing algorithm is the heart of every gradual rollout. It must be deterministic (same user always gets same bucket), uniformly distributed (buckets 0-99 each contain approximately 1% of users), fast (called on every evaluation), and flag-specific (different flags target independent user populations).

import mmh3

def bucket_user(user_id: str, flag_key: str) -> int:
    """
    Returns a stable bucket 0-99 for this (user, flag) pair.
    Same inputs always produce the same output.
    Different flag_keys produce different bucket distributions.
    """
    if not user_id:
        # Anonymous users: use a stable session ID or return -1
        # to force fallback to default value
        return -1
    hash_key = f"{flag_key}.{user_id}"
    hash_value = mmh3.hash(hash_key, signed=False)
    return hash_value % 100

def is_user_in_rollout(user_id: str, flag_key: str, rollout_percentage: int) -> bool:
    """
    Returns True if this user is within the rollout percentage for this flag.
    rollout_percentage=0 means nobody sees it; rollout_percentage=100 means everyone does.
    """
    if rollout_percentage <= 0:
        return False
    if rollout_percentage >= 100:
        return True
    bucket = bucket_user(user_id, flag_key)
    if bucket < 0:
        return False
    return bucket < rollout_percentage

The algorithm runs in O(len(hash_key)) time, which is effectively O(1) since flag keys and user IDs are bounded in length. murmurhash3 is not a cryptographic hash - it makes no security guarantees - but it provides excellent distribution uniformity and is considerably cheaper to compute than a cryptographic hash such as SHA-256, which matters for a function called on every evaluation. The non-cryptographic property is fine here: users cannot meaningfully manipulate their ID to land in a specific bucket, and even if they could, the consequence would be seeing a slightly different UI, not a security vulnerability.
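
Two of the properties listed for the bucketing algorithm, uniformity and stability as the percentage is dialed up, can be checked empirically. The sketch below again substitutes SHA-256 for murmurhash3 so that it runs with only the standard library; that substitution and the synthetic user IDs are assumptions for illustration.

```python
import hashlib
from collections import Counter


def bucket_user(user_id: str, flag_key: str) -> int:
    """Stable bucket 0-99; SHA-256 stands in for murmurhash3 (assumption)."""
    digest = hashlib.sha256(f"{flag_key}.{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


users = [f"user-{i}" for i in range(100_000)]
buckets = {u: bucket_user(u, "new-checkout-flow") for u in users}

# Uniformity: every bucket should hold roughly 1,000 of the 100,000 users.
counts = Counter(buckets.values())
print("smallest bucket:", min(counts.values()), "largest:", max(counts.values()))

# Monotonic ramp: everyone exposed at 10% is still exposed at 20%.
at_10 = {u for u, b in buckets.items() if b < 10}
at_20 = {u for u, b in buckets.items() if b < 20}
print("10% cohort kept at 20%:", at_10 <= at_20)
```

The monotonic property follows directly from comparing a fixed bucket against a growing threshold: raising the percentage only ever adds users.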

SSE Protocol for Flag Propagation

Server-Sent Events uses a simple text protocol over a persistent HTTP connection. Each event is a block of field: value lines terminated by a blank line. The Propagation Service sends events in this format:

id: 4291
data: {"flag_key":"new-checkout-flow","version":42,"config":{"enabled":true,"rollout_percentage":10,"default_value":false}}

id: 4292
data: {"flag_key":"dark-mode","version":7,"config":{"enabled":false,"default_value":false}}

The id field is a monotonically increasing sequence number assigned by the Propagation Service (not a timestamp, to avoid clock skew issues). For replay to work when an SDK reconnects to a different node, every node must draw IDs from the same sequence, for example one allocated when the change is committed. When the SDK reconnects after a disconnect, it sends the Last-Event-ID header with the highest event ID it has processed. The Propagation Service reads this value and replays any events with IDs greater than the provided value before resuming live streaming. This replay mechanism ensures that flag changes are not missed across multi-second disconnections, as long as the missed events are still within the server’s replay buffer.

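# SSE endpoint - one long-lived response per SDK instance
# (Request/StreamingResponse as provided by Starlette/FastAPI; get_events_since
# and pubsub are application helpers not shown here)
from starlette.requests import Request
from starlette.responses import StreamingResponse

async def sse_stream(request: Request, api_key: str):
    try:
        last_event_id = int(request.headers.get("Last-Event-ID", "0"))
    except ValueError:
        last_event_id = 0  # malformed header: replay whatever is buffered

    async def generate():
        # Subscribe before replaying, so an event published while the
        # replay query runs is not lost
        async with pubsub.subscribe(channel=f"flags:{api_key}") as channel:
            last_sent = last_event_id
            # Send any events missed since last_event_id
            for event in await get_events_since(api_key, last_event_id):
                yield f"id: {event.id}\ndata: {event.to_json()}\n\n"
                last_sent = event.id

            # Stream live updates, skipping anything already replayed
            async for message in channel:
                if message.id <= last_sent:
                    continue
                yield f"id: {message.id}\ndata: {message.to_json()}\n\n"
                last_sent = message.id

    return StreamingResponse(generate(), media_type="text/event-stream",
        headers={"Cache-Control": "no-cache", "X-Accel-Buffering": "no"})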

The X-Accel-Buffering: no header tells Nginx not to buffer the response body - without this, Nginx will accumulate SSE events in its buffer and deliver them in batches rather than immediately.
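
On the receiving side, a server-side SDK has to parse the event stream and remember the last ID itself. The sketch below shows one way that might look for the id and data fields used in this design; `parse_sse` and `DeltaConsumer` are hypothetical names, and other SSE fields (event, retry) are deliberately ignored.

```python
import json
from typing import Dict, Iterable, Iterator, Optional, Tuple


def parse_sse(lines: Iterable[str]) -> Iterator[Tuple[Optional[str], str]]:
    """Yield (event_id, data) for each event in a stream of text lines."""
    event_id: Optional[str] = None
    data_parts = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the event collected so far
            if data_parts:
                yield event_id, "\n".join(data_parts)
            data_parts = []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a heartbeat
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # one leading space after the colon is dropped
        if field == "id":
            event_id = value
        elif field == "data":
            data_parts.append(value)


class DeltaConsumer:
    """Applies deltas and tracks the ID to send as Last-Event-ID."""

    def __init__(self) -> None:
        self.last_event_id = "0"
        self.flags: Dict[str, dict] = {}

    def consume(self, lines: Iterable[str]) -> None:
        for event_id, data in parse_sse(lines):
            delta = json.loads(data)
            self.flags[delta["flag_key"]] = delta["config"]
            if event_id is not None:
                self.last_event_id = event_id

    def reconnect_headers(self) -> Dict[str, str]:
        return {"Last-Event-ID": self.last_event_id}
```

Updating `last_event_id` only after the delta has been applied means a crash mid-event leads to a replay of that event rather than a silent gap.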

Scaling and Performance

Flag evaluation scales entirely with application CPU - once the SDK is bootstrapped, the flag server has no involvement in evaluations. The scaling challenge lives entirely in the Propagation Service, which must maintain 50,000 concurrent long-lived SSE connections per region and fan out flag changes to all of them quickly.

A single Propagation Service node running Python asyncio or Go goroutines can comfortably handle 5,000 concurrent SSE connections, limited primarily by memory (each connection holds a small amount of state) and the kernel’s file descriptor limit (set ulimit -n 65536). Ten nodes per region handle 50,000 connections. For redundancy, run twelve nodes and use consistent hashing so the load balancer routes each SDK instance to a specific node.

Capacity estimation:

Given:
  - 10,000 flag changes per day
  - 50,000 SDK instances connected per region
  - Average flag config size: 2KB per delta event
  - Propagation: each change fans out to 50,000 connections

Bandwidth per flag change: 50,000 * 2KB = 100MB per change event
Peak bandwidth (10 changes/sec burst): 1 GB/sec egress per region

SSE connections: 50,000 * ~1KB RAM per connection = 50MB RAM for connection state
Storage: 10,000 flags * 2KB average = 20MB total (trivially small)
Audit log: 10,000 changes/day * ~4.5KB per row (old and new 2KB JSONB snapshots plus metadata, before compression) = 45MB/day = ~16.4GB/year
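
The fan-out and connection figures above can be reproduced with a few lines of arithmetic, which also makes it easy to plug in different assumptions. The sketch uses decimal units (1KB = 1,000 bytes) to match the estimate; the variable names are illustrative.

```python
import math

KB = 1_000          # decimal units, matching the estimate above
MB = 1_000 * KB
GB = 1_000 * MB

sdk_instances = 50_000
delta_size_bytes = 2 * KB
burst_changes_per_sec = 10
connections_per_node = 5_000
ram_per_connection_bytes = 1 * KB

bytes_per_change = sdk_instances * delta_size_bytes
peak_egress_bytes_per_sec = bytes_per_change * burst_changes_per_sec
connection_state_bytes = sdk_instances * ram_per_connection_bytes
nodes_needed = math.ceil(sdk_instances / connections_per_node)

print(f"fan-out per change: {bytes_per_change / MB:.0f} MB")
print(f"peak egress:        {peak_egress_bytes_per_sec / GB:.1f} GB/s")
print(f"connection state:   {connection_state_bytes / MB:.0f} MB")
print(f"nodes needed:       {nodes_needed}")
```

The fan-out term dominates: halving the delta size (for example by sending only changed fields) halves peak egress, while connection state stays negligible either way.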

The Redis pub/sub layer decouples Management API writes from the Propagation Service fan-out. Without Redis, every Management API instance would need direct connections to all Propagation Service nodes, creating a topology that grows as O(api_nodes * propagation_nodes). With Redis as the intermediary, each Management API instance publishes to one channel and each Propagation node subscribes to that channel - a much simpler topology.

The thundering herd problem occurs when many SDK instances reconnect simultaneously - after a Propagation Service restart, for example. If all 50,000 SDK instances reconnect at once and each makes an HTTP request to bootstrap flag state, the result is 50,000 concurrent database reads. The mitigation is two-fold: first, the Propagation Service caches the complete flag configuration snapshot in Redis so reconnecting SDKs read from Redis (memory, microseconds) not PostgreSQL (disk, milliseconds); second, SDKs use jittered exponential backoff starting at a random time within a 1-5 second window, spreading the reconnect storm over several seconds.
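
A reconnect delay with this shape might be computed as follows. This is one possible interpretation, not the article's exact policy: the first retry is drawn uniformly from the 1-5 second window, and later retries use an exponentially growing ceiling with random jitter, capped at an assumed 60 seconds.

```python
import random
from typing import Optional


def reconnect_delay(attempt: int, rng: Optional[random.Random] = None,
                    first_window=(1.0, 5.0), cap: float = 60.0) -> float:
    """Seconds to wait before reconnect attempt `attempt` (0-based)."""
    rng = rng or random
    low, high = first_window
    if attempt == 0:
        # First retry: spread instances across the initial window
        return rng.uniform(low, high)
    # Later retries: exponential ceiling with jitter, bounded by `cap`
    ceiling = min(cap, high * (2 ** attempt))
    return rng.uniform(low, ceiling)


# Example: the first four delays for one SDK instance
rng = random.Random()
print([round(reconnect_delay(n, rng), 2) for n in range(4)])
```

Because each instance draws its own random delay, 50,000 reconnects arrive as a smeared-out ramp instead of a single spike.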

Real World

LaunchDarkly uses a streaming architecture nearly identical to this - their SDK maintains a persistent connection and receives flag deltas in real time, achieving sub-100ms propagation in practice. Their engineering blog describes handling millions of concurrent SDK connections globally using this pattern, with Redis as the pub/sub backbone for inter-node communication.

Failure Modes and Recovery

Failure | Detection | Impact | Recovery
Propagation Service crash | SDK SSE disconnect | SDKs use stale flag values | SDK auto-reconnects with exponential backoff; new node picks up from Redis
PostgreSQL write fails | Management API returns 5xx | Flag change not persisted | Client retries; change never propagated
Redis pub/sub partition | Propagation Service stops receiving change events | SDKs do not receive updates | Propagation Service polls Postgres as fallback every 5 seconds
SDK never connected | Timeout on startup | Uses default values | SDK falls back to bundled defaults; configurable behavior
Clock skew on audit log | Timestamp ordering incorrect | Audit log shows wrong order | Use monotonic sequence IDs for ordering, not timestamps
Network partition (client side) | SSE heartbeat missing | SDK operates offline | SDK uses last-known values; acceptable for most flag types

Watch Out

The most common operational mistake is not setting a meaningful default_value on each flag. When the SDK has not yet received configs (cold start) or the SSE connection is severed for an extended period, the SDK falls back to the default. A new dark_mode flag with default: null will crash code that expects a boolean. A kill-switch flag with default: true defeats the entire purpose. Treat the default value as part of the flag’s API contract, defined at creation time, not changed casually.
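
One way to enforce that contract in the SDK is a typed accessor that never lets a null or wrongly typed value reach calling code. The sketch below is a simplified illustration (it ignores targeting rules and percentage rollout); `bool_flag` and the config keys it reads are assumptions modelled on the flag fields used elsewhere in this article.

```python
from typing import Any, Dict


def bool_flag(flags: Dict[str, Dict[str, Any]], key: str, default: bool) -> bool:
    """Evaluate a boolean flag, falling back to the caller's default."""
    if not isinstance(default, bool):
        raise TypeError("default for a boolean flag must be True or False")
    config = flags.get(key)
    if config is None:
        return default            # cold start: no config received yet
    if not config.get("enabled", False):
        value = config.get("default_value")
    else:
        value = config.get("rollout_value", config.get("default_value"))
    # A null or wrongly typed value must never reach calling code
    return value if isinstance(value, bool) else default


flags = {"dark_mode": {"enabled": True, "default_value": None}}
print(bool_flag(flags, "dark_mode", default=False))
print(bool_flag({}, "new-checkout-flow", default=False))
```

Requiring the caller to pass a default at every call site keeps the fallback behaviour visible in code review, right next to the code it protects.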

Comparison of Approaches

Approach | Propagation Latency | Evaluation Latency | Client Complexity | Failure Mode
SDK + SSE (this design) | < 500ms | < 1ms (local) | Medium - manages SSE conn | Stale values on disconnect
Polling (SDK polls every 30s) | ~15s average | < 1ms (local) | Low | Up to 30s stale values
Request-time fetch (no SDK) | Instant | 10-50ms (network) | None | Unavailable if server down
Database flag column | Instant (same DB write) | 1-5ms (DB query) | None | DB becomes bottleneck
Environment variables | On deploy only | < 0.1ms | None | Requires redeploy to change

The SDK + SSE approach wins for any system with more than a handful of services. The 15-second propagation lag of polling is unacceptable for kill switches - a service can emit hundreds of thousands of errors in 15 seconds. Request-time fetching adds network latency to every request and creates a hard dependency on the flag server’s availability: if the flag server is slow or down, your application slows down or breaks. Environment variables are fine for static configuration but useless for runtime flag changes. The database column approach collapses under load once your request rate exceeds what your flag-query-per-request budget permits.

Key Takeaways

  • In-process evaluation: putting evaluation in the SDK (not the server) eliminates network latency from the hot path and decouples flag availability from server availability.
  • Consistent bucketing: using murmurhash3(flagKey + "." + userId) % 100 ensures users see stable feature assignments across requests, preventing “flag flickering” that destroys A/B test validity.
  • Flag-specific buckets: including the flag key in the bucket hash ensures different experiments target independent user populations; without it, all 1%-rollout flags hit the same 1% of users.
  • SSE over WebSocket: SSE is simpler, works through HTTP proxies, auto-reconnects, and is unidirectional - exactly what flag propagation requires.
  • Append-only audit log: flag changes are irreversible history; an append-only table with monthly partitioning gives you compliance-grade auditability with cheap retention management.
  • Offline-first SDK: designing the SDK to operate on last-known values (rather than failing open or closed) is the most important reliability decision in the whole system.
  • Kill switch is just a flag: a kill switch is not a special feature - it is enabled: false propagated in < 500ms; the same propagation path handles both gradual rollouts and emergency kills.
  • Default values are a contract: every flag’s default value defines the system’s behavior before any config loads; treat defaults as part of the API contract, not an afterthought.

The surprising lesson from building a feature flag service is that the flag server itself is not on the critical path for your application. Once the SDK bootstraps, your app can evaluate flags millions of times per second with zero network calls, and will keep serving last-known values even if every flag server in every region goes dark. The flag service becomes critical only for making changes - and you need it to be fast and reliable there. This inversion - where the “service” is mostly irrelevant to the service’s primary job - is the core architectural insight that separates a toy implementation from a production system.

Frequently Asked Questions

Q: Why not use a database row per flag and query it on each request?

At low scale this works fine, but you are adding a synchronous DB query to every request path. At 10,000 requests per second, that is 10,000 extra DB queries per second just for flag evaluation - your database becomes the bottleneck for feature rollouts. The in-process SDK approach costs zero extra queries after bootstrap.

Q: How do you handle a flag targeted to a user segment (e.g., “all enterprise users”)?

The SDK receives the full rule set, including segment membership rules. Segments are either pre-computed (user attribute plan == "enterprise") or you include the user’s segment IDs in the evaluation context. The rules engine evaluates these attributes locally without any network call - the segment definition lives in the flag config alongside the rules.

Q: What happens if two engineers change the same flag at the same time?

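Optimistic concurrency control, so the first write wins and the second is rejected rather than silently overwriting it. Each update is an atomic compare-and-swap: UPDATE feature_flags SET ... WHERE id = $1 AND updated_at = $2, where $2 is the updated_at value the engineer’s UI originally loaded. If the update matches zero rows because someone else changed the flag first, the Management API returns a 409 Conflict and the UI prompts the engineer to reload. The audit log records the change that succeeded; logging the rejected attempt as well gives operators a complete picture of what happened.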
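
The conditional UPDATE can be exercised with an in-memory SQLite database standing in for PostgreSQL (an assumption made so the sketch runs anywhere). The table is reduced to three columns, updated_at is an integer token rather than a timestamp, and `update_flag` is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE feature_flags ("
    " id INTEGER PRIMARY KEY, enabled INTEGER NOT NULL, updated_at INTEGER NOT NULL)"
)
conn.execute("INSERT INTO feature_flags (id, enabled, updated_at) VALUES (1, 1, 100)")


def update_flag(conn, flag_id: int, enabled: bool,
                expected_updated_at: int, now: int) -> bool:
    """Return True on success, False if someone else changed the row first."""
    cursor = conn.execute(
        "UPDATE feature_flags SET enabled = ?, updated_at = ? "
        "WHERE id = ? AND updated_at = ?",
        (int(enabled), now, flag_id, expected_updated_at),
    )
    conn.commit()
    return cursor.rowcount == 1   # 0 rows touched -> caller returns 409 Conflict


# Both engineers loaded the flag when updated_at was 100.
alice_ok = update_flag(conn, 1, enabled=False, expected_updated_at=100, now=101)
bob_ok = update_flag(conn, 1, enabled=True, expected_updated_at=100, now=102)
print(alice_ok, bob_ok)  # True False
```

The second update matches no rows because the first one already moved updated_at forward, which is exactly the signal the API turns into a 409.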

Q: Why not use Redis as the primary flag store instead of PostgreSQL?

Redis is excellent for the propagation cache, but PostgreSQL gives you ACID semantics, foreign key constraints for rule integrity, and point-in-time recovery for the audit log. Losing the flag database means losing all your rollout configurations - that warrants durable storage with backup guarantees. Redis is in the read path, not the source of truth.

Q: How do you handle the SDK’s initial load (cold start latency)?

The SDK fetches all flags synchronously on startup, before your application starts handling requests. client.wait_for_ready(timeout_ms=2000) blocks until the initial fetch completes. If the flag server is unreachable at startup, the SDK uses bundled defaults (configurable at init time) and retries the fetch in the background. Most deployments see cold-start latency under 100ms on a healthy network.

Q: Why SSE instead of gRPC streaming for flag propagation?

gRPC streaming requires HTTP/2 end-to-end, which can be blocked by some load balancers and corporate proxies. SSE runs over plain HTTP/1.1, passes through most proxies (provided response buffering is disabled), and has built-in reconnection semantics via Last-Event-ID. For a client library that needs to work in every environment - browsers, mobile apps, legacy enterprise networks - SSE’s universality beats gRPC’s efficiency.

Interview Questions

Q: You are designing a feature flag service from scratch. How would you ensure flag evaluations never add latency to the request path?

Expected depth: explain the in-process SDK pattern - SDK bootstraps by fetching all flags, caches them in memory, evaluates locally. Discuss SSE or WebSocket for live updates. Cover graceful degradation (last-known values when connection drops). Mention that the evaluation function is a pure function of (flag_config, user_context) - no side effects, no I/O. A strong answer connects this to the CAP theorem tradeoff: you are choosing availability over consistency, accepting that some SDKs may be a few hundred milliseconds behind in exchange for zero-latency evaluation and resilience to server failures.

Q: How would you implement a “1% rollout” that ensures the same users always see the experiment?

Expected depth: cover deterministic hash bucketing with murmurhash3(flagKey + "." + userId) % 100. Explain why you must include the flag key in the hash (otherwise all 1%-rollout experiments hit the same users). Discuss edge cases: new users with no history, users with no ID (use anonymous session ID from a cookie), flag key changes invalidating existing bucket assignments. A strong answer also mentions that bucket assignments should be computed inside the SDK rather than by calling out to a service, preserving the zero-latency evaluation property.

Q: The feature flag service itself goes down. What happens to your application?

Expected depth: with in-process SDK - nothing changes immediately. SDK uses last-known values from its local HashMap. Discuss the offline-first design. Cover cold-start failure specifically: an app that cannot start because the flag server is unreachable is a problem - mitigate with bundled defaults and a non-blocking startup path. Discuss what “good” default values look like for kill switches (default off) vs feature flags (default to existing behavior). A strong answer distinguishes between the Propagation Service being down (SDK evaluates stale values indefinitely) and the Management API being down (no new flag changes can be made, but evaluation is unaffected).

Q: How would you scale the flag propagation system to 1 million connected SDK instances globally?

Expected depth: horizontal scaling of Propagation Service nodes using async I/O, each handling 5,000-10,000 SSE connections. Multi-region deployment with regional propagation clusters. Redis pub/sub for intra-region fanout. Cross-region replication via PostgreSQL WAL streaming or Kafka. Discuss the thundering herd problem (mass reconnects after an outage) and mitigation: jitter on reconnect delay, Redis snapshot for initial load instead of hitting PostgreSQL, circuit breaker on the bootstrap fetch to avoid overloading the DB.

Q: How do you implement an audit log that satisfies compliance requirements (SOC 2, HIPAA)?

Expected depth: append-only table with no UPDATEs or DELETEs permitted. Capture: who, what, when, old value, new value. Use monotonic sequence IDs not just timestamps (clock skew breaks ordering). Immutable log means even admins cannot alter history. Discuss partitioning for retention management - drop old partitions rather than running DELETE. Cover the difference between audit log (human-driven flag changes, low volume, must be durable, PostgreSQL) and evaluation log (SDK evaluations, billions of events per day, different storage tier like ClickHouse or BigQuery).
