Build a Distributed Tracing System


observability microservices distributed-systems

System Design Deep Dive

Distributed Tracing System

Stitching billions of spans across 50+ microservices into a single queryable request timeline

⏱ 14 min read📐 Advanced🏗️ Observability

A user clicks “checkout” in your e-commerce app. The request fans out to 50+ microservices - payment validation, inventory reservation, fraud scoring, shipping cost calculation, tax computation, loyalty point deduction, notification dispatch. The response comes back in 340ms. Two weeks later, a customer reports their order never arrived and your logs show nothing obviously wrong. You need to reconstruct exactly what happened to that specific request across 50 services, in the exact order it happened, with timing.

That reconstruction problem is what distributed tracing solves. Think of it like a shipping manifest that follows a package through every warehouse, truck, and sorting facility - except your package is an HTTP request, your facilities are microservices, and the manifest must be assembled after the fact from notes left by each facility. At small scale, you can manually correlate log lines by timestamp and request ID. At 50,000 requests per second across 50 services, that produces 2.5 million log entries per second, and finding any single request’s journey requires scanning all of them.

The challenge compounds because each service only sees its own slice. Service A knows it called Service B and got a response in 12ms. Service B knows it was called by something and that it called Service C. Neither service has the full picture. Trace context propagation - threading a common identifier through every hop - is the mechanism that makes assembly possible. But then you face a storage problem: at 50,000 req/s with 20 spans per request, that is 1 billion spans per day. Storing all of them costs a fortune. Dropping spans randomly destroys the coherence that makes traces useful.

We need to solve for three things simultaneously: (1) context propagation that works across languages, protocols, and async boundaries without developer friction, (2) sampling that retains interesting traces while discarding boring ones, and (3) storage and indexing that enables sub-second lookup of any trace by ID across billions of stored spans.

Requirements and Constraints

Functional Requirements

  • Instrument services with minimal code changes - ideally automatic instrumentation via library injection
  • Propagate trace_id and span_id through HTTP headers, gRPC metadata, and message queue message attributes
  • Collect spans from 50+ microservices and correlate them into a single trace tree
  • Support both head-based sampling (decide at trace start) and tail-based sampling (decide after trace completes)
  • Allow engineers to look up any trace by trace_id and get the full span tree in under 1 second
  • Support querying traces by service name, operation name, error flag, and duration range
  • Display a waterfall timeline of spans with parent-child relationships
  • Retain trace data for 30 days (hot) and 1 year (cold archive)

Non-Functional Requirements

  • Ingest: 50,000 requests/second, ~20 spans per request = 1 billion spans per day
  • Write latency: span ingestion P99 under 10ms (tracing must not slow down the application)
  • Query latency: trace lookup by ID under 1 second P95
  • Availability: 99.9% for ingest path (span loss is acceptable), 99.95% for query path
  • Durability: sampled spans must not be lost once the sampling decision is made
  • Sampling rate: configurable per-service, per-operation, globally (target 1-10% for head-based)
  • SDK overhead: under 1ms CPU per span, under 2KB memory overhead per active span

Constraints and Assumptions

  • We are not building a full APM - metrics and logs are out of scope
  • Service mesh integration (Envoy sidecar auto-instrumentation) is assumed for greenfield services
  • Clock skew across nodes up to 50ms is expected - we handle it in the query layer, not at ingest
  • We assume OpenTelemetry wire format (OTLP/gRPC) as the standard SDK protocol

High-Level Architecture

The tracing system has six major components that handle the journey from span generation to query response.

Distributed tracing system architecture overview showing SDK, collector, sampler, storage, and query layers

The Instrumentation SDK runs inside each service process. It creates spans, injects trace context into outgoing calls, extracts context from incoming calls, and batches completed spans for export. The SDK is the only component that touches the hot path of user requests - everything else is off the critical path.

The Collector Fleet is a pool of stateless agents that receive spans over OTLP/gRPC, perform light validation and enrichment (adding host metadata, normalizing service names), and forward spans to the sampling layer. Collectors are deployed as a sidecar or as a shared service per availability zone.

The Tail Sampler is the most interesting component. It buffers spans for a configurable window (typically 30 seconds) while waiting for the trace to complete, then evaluates sampling rules against the full trace. Traces with errors, high latency, or explicit user tagging are kept; the rest are dropped based on a configured rate.

The Span Store is a time-series-aware columnar store (Cassandra or a purpose-built system like Jaeger’s Cassandra backend or Tempo). Spans are written partitioned by (trace_id, service) with a secondary index on trace_id alone for fast full-trace assembly.

The Trace Index is a separate inverted index (Elasticsearch or ClickHouse) that maps searchable attributes - service name, operation, error code, duration bucket, custom tags - to trace_id values. Queries against the index return a list of trace_ids which are then fetched from the Span Store.

The Query API receives frontend requests, executes the two-phase lookup (index then store), assembles the span tree, and returns the waterfall JSON.

Key Insight

The architecture separates the write path (ingest and store spans fast) from the read path (assemble traces on demand) because these have opposite optimization pressures - writes need to be append-only and partition-local, while reads need cross-partition joins that would destroy write performance if done inline.

The Instrumentation SDK

The SDK’s job is to make span creation invisible to application code while propagating trace context reliably across every communication boundary.

Most engineers assume the hardest part of instrumentation is creating spans. It is not. Creating spans is three lines of code. The hard part is propagating context across async boundaries - thread pools, event loops, message queues, and async/await chains where the call stack is not continuous.

Trace context propagation data flow through services showing header injection and span collection

The SDK uses a context object stored in thread-local storage (for synchronous code) or in an async context variable (Python’s contextvars, Go’s context.Context). When a new goroutine or async task is spawned, the SDK wraps the spawn point to propagate the current context into the child execution unit.

# OpenTelemetry Python SDK - context propagation through thread pools
import contextvars
from opentelemetry import trace, context
from opentelemetry.propagate import inject, extract
from concurrent.futures import ThreadPoolExecutor

tracer = trace.get_tracer("payment-service", "1.0.0")

def handle_checkout(request):
    # Extract incoming trace context from HTTP headers
    ctx = extract(request.headers)
    with tracer.start_as_current_span(
        "checkout.handle",
        context=ctx,
        kind=trace.SpanKind.SERVER
    ) as span:
        span.set_attribute("user.id", request.user_id)
        span.set_attribute("cart.item_count", len(request.items))

        # Propagate context into thread pool - SDK wraps the executor
        with ThreadPoolExecutor() as pool:
            # contextvars.copy_context() captures current trace context
            current_ctx = contextvars.copy_context()
            future = pool.submit(current_ctx.run, validate_payment, request)
            result = future.result()

        if result.error:
            span.set_status(trace.StatusCode.ERROR, result.error)
            span.record_exception(result.exception)
        return result

def validate_payment(request):
    # This runs in a thread pool but has access to the parent span context
    with tracer.start_as_current_span("payment.validate") as span:
        # Inject context into outgoing HTTP call to fraud service
        headers = {}
        inject(headers)  # Adds traceparent, tracestate headers
        response = http_client.post(
            "http://fraud-service/check",
            json={"amount": request.amount},
            headers=headers
        )
        span.set_attribute("fraud.score", response.json()["score"])
        return response

The SDK buffers completed spans in memory and exports them in batches every 1 second or when the buffer reaches 512 spans, whichever comes first. This batching is critical - individual span RPCs would add 2-5ms of network latency per span.

Watch Out

Context propagation through message queues is the most common gap. Engineers instrument HTTP calls correctly but forget to inject trace headers into Kafka message headers or SQS message attributes. The result is traces that appear to terminate at the producer and a separate orphaned trace starting at the consumer - two incomplete traces instead of one complete one.

Real World

Google Dapper, the original distributed tracing system described in the 2010 paper, solved the thread pool propagation problem by instrumenting the thread creation primitives at the JVM level. Application code needed zero changes - the JVM itself propagated the trace context into every spawned thread automatically. OpenTelemetry’s Java agent takes the same approach with bytecode instrumentation via the Java agent mechanism.

The Collector Fleet

The Collector’s job is to receive spans from SDKs across all services and route them to the right downstream component without becoming a bottleneck or a single point of failure.

A naive approach uses a single centralized collector. At 1 billion spans per day (roughly 11,600 spans/second sustained with peaks at 3-5x), a single collector becomes a bottleneck and a reliability liability. Instead, we deploy collectors in a two-tier topology.

Tier 1 agents run as sidecars or per-host daemons. They receive spans from the local SDK over a loopback connection (no network round-trip), do minimal processing (format validation, timestamp normalization), and forward to a Tier 2 collector gateway over OTLP/gRPC. The loopback connection means the SDK export operation never fails due to network issues between services.

Tier 2 gateway collectors are stateless, load-balanced, and deployed in each availability zone. They perform richer processing: attribute enrichment (adding Kubernetes pod labels, datacenter, deploy version from the host metadata), routing to the tail sampler, and fanout to the span store for unsampled-but-indexed spans.

# OpenTelemetry Collector config - gateway tier
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        max_recv_msg_size_mib: 4

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
    send_batch_max_size: 2048
  resource:
    attributes:
      - key: deployment.environment
        from_attribute: k8s.namespace.name
        action: insert
  memory_limiter:
    check_interval: 1s
    limit_mib: 2048
    spike_limit_mib: 512

exporters:
  otlp/tail_sampler:
    endpoint: tail-sampler.internal:4317
    tls:
      insecure: true
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s
  kafka/raw_spans:
    brokers: ["kafka-broker-1:9092", "kafka-broker-2:9092", "kafka-broker-3:9092"]
    topic: raw-spans
    encoding: otlp_proto

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [otlp/tail_sampler, kafka/raw_spans]
Key Insight

The Collector writes to both the tail sampler and a Kafka raw-spans topic simultaneously. The Kafka topic serves as a replay buffer - if the tail sampler falls behind or restarts, it can replay spans from Kafka rather than losing them. This decouples sampler throughput from ingest throughput.

The Tail Sampler

The Tail Sampler is the most architecturally interesting component - it must buffer all spans for a trace long enough to make an intelligent keep/drop decision, then discard the buffer without losing any spans that were decided to keep.

Head-based sampling decides whether to keep a trace at the first service, before any downstream spans exist. It is cheap (the decision travels with the trace context) but blind - it cannot keep a trace because it had an error or was slow, since those facts are not known at trace start. A 1% head-based sample rate misses 99% of interesting slow or broken requests.

Tail-based sampling buffers all spans until the trace appears complete (a configurable timeout, typically 30 seconds), then evaluates rules against the complete trace. It can keep every trace with an error, every trace slower than 2 seconds, and sample the rest at 0.1%.

Tail sampler component internals showing span buffer, policy evaluator, and routing logic

The tail sampler has a fundamental tension: to make a decision on a trace, it needs all spans for that trace to arrive at the same sampler instance. But spans arrive at any collector, which forwards to any sampler in the pool. We solve this with consistent hashing on trace_id - every span for a given trace_id is routed to the same sampler instance.

// Tail sampler - consistent hash routing and policy evaluation
package sampler

import (
    "sync"
    "time"
    "github.com/spaolacci/murmur3"
)

type TailSampler struct {
    ring        *ConsistentHashRing
    buffers     sync.Map // trace_id -> *TraceBuffer
    policies    []SamplingPolicy
    bufferTTL   time.Duration
}

type TraceBuffer struct {
    mu        sync.Mutex
    traceID   string
    spans     []*Span
    rootSeen  bool
    expiresAt time.Time
}

type SamplingPolicy interface {
    Evaluate(trace *TraceBuffer) SamplingDecision
}

type ErrorPolicy struct{}
func (p *ErrorPolicy) Evaluate(buf *TraceBuffer) SamplingDecision {
    buf.mu.Lock()
    defer buf.mu.Unlock()
    for _, span := range buf.spans {
        if span.Status == StatusError {
            return SamplingDecision{Keep: true, Reason: "error"}
        }
    }
    return SamplingDecision{Keep: false}
}

type LatencyPolicy struct {
    ThresholdMs int64
}
func (p *LatencyPolicy) Evaluate(buf *TraceBuffer) SamplingDecision {
    buf.mu.Lock()
    defer buf.mu.Unlock()
    // Root span duration represents end-to-end latency
    for _, span := range buf.spans {
        if span.ParentSpanID == "" {
            durationMs := (span.EndTimeUnixNano - span.StartTimeUnixNano) / 1_000_000
            if durationMs > p.ThresholdMs {
                return SamplingDecision{Keep: true, Reason: "high_latency"}
            }
        }
    }
    return SamplingDecision{Keep: false}
}

type ProbabilisticPolicy struct {
    Rate float64 // e.g. 0.01 for 1%
}
func (p *ProbabilisticPolicy) Evaluate(buf *TraceBuffer) SamplingDecision {
    // Deterministic: same trace_id always makes same decision
    // Use upper 64 bits of trace_id as the random seed
    h := murmur3.Sum64([]byte(buf.traceID))
    threshold := uint64(p.Rate * float64(^uint64(0)))
    return SamplingDecision{Keep: h < threshold, Reason: "probabilistic"}
}

func (s *TailSampler) AddSpan(span *Span) {
    val, _ := s.buffers.LoadOrStore(span.TraceID, &TraceBuffer{
        traceID:   span.TraceID,
        expiresAt: time.Now().Add(s.bufferTTL),
    })
    buf := val.(*TraceBuffer)
    buf.mu.Lock()
    buf.spans = append(buf.spans, span)
    if span.ParentSpanID == "" {
        buf.rootSeen = true
    }
    buf.mu.Unlock()
}

func (s *TailSampler) Flush(traceID string) {
    val, ok := s.buffers.LoadAndDelete(traceID)
    if !ok {
        return
    }
    buf := val.(*TraceBuffer)
    // Evaluate policies in priority order - first Keep wins
    for _, policy := range s.policies {
        decision := policy.Evaluate(buf)
        if decision.Keep {
            s.forwardToStore(buf.spans, decision.Reason)
            return
        }
    }
    // Dropped - buffer is discarded
}
Watch Out

Tail sampler memory is proportional to the number of in-flight traces times average spans per trace. At 50,000 req/s with 30-second buffers, you have 1.5 million concurrent trace buffers. At 20 spans per trace averaging 200 bytes each, that is 6GB of in-memory span data per sampler instance. Size your sampler fleet with memory as the primary constraint, not CPU.

Data Model

The span schema must satisfy two conflicting needs: fast append writes (every span write is a new row, never an update) and fast full-trace reads (assembling 200+ spans by trace_id in a single query).

-- Cassandra CQL schema for span storage
-- Partitioned by trace_id so all spans for a trace live on the same node
CREATE TABLE spans (
    trace_id     text,
    span_id      text,
    parent_span_id text,
    service_name text,
    operation_name text,
    start_time_us bigint,   -- microseconds since epoch
    duration_us  bigint,
    status_code  smallint,  -- 0=unset, 1=ok, 2=error
    status_message text,
    attributes   map<text, text>,
    events       list<frozen<span_event>>,
    resource_attributes map<text, text>,
    sampled_reason text,
    ingested_at  timestamp,
    PRIMARY KEY (trace_id, span_id)
) WITH CLUSTERING ORDER BY (span_id ASC)
  AND compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_unit': 'HOURS',
    'compaction_window_size': 1
  }
  AND default_time_to_live = 2592000;  -- 30 days

CREATE TYPE span_event (
    time_us bigint,
    name text,
    attributes map<text, text>
);

-- Secondary index table for searchable attributes
-- Written to ClickHouse for analytical queries
CREATE TABLE span_index (
    trace_id        String,
    root_service    String,
    root_operation  String,
    duration_ms     UInt32,
    has_error       UInt8,
    span_count      UInt16,
    start_time      DateTime,
    tags            Map(String, String),  -- top-level tags for filtering
    ingested_at     DateTime
)
ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(start_time)
ORDER BY (root_service, start_time, trace_id)
TTL ingested_at + INTERVAL 30 DAY;

-- Fast lookup index in ClickHouse
CREATE INDEX idx_trace_lookup ON span_index(trace_id) TYPE minmax GRANULARITY 1;
CREATE INDEX idx_error_traces ON span_index(has_error, start_time) TYPE minmax GRANULARITY 8192;

Partitioning key choice: trace_id as the Cassandra partition key means all spans for a trace live on the same replica set. A full-trace read is a single-partition query - no scatter-gather across multiple nodes. The tradeoff is that trace_id has no temporal locality, so Cassandra’s time-window compaction still runs correctly but cannot merge spans for the same trace across time windows (they all arrive within the trace TTL window anyway).

Index design: We maintain a separate index in ClickHouse that stores one row per trace (not per span). This row captures the root service, total duration, error flag, and span count. The query path does a ClickHouse scan to find matching trace_ids, then fetches the full spans from Cassandra by those IDs. ClickHouse’s columnar storage makes the analytical queries (find all traces from payment-service with error and duration > 2s in the last hour) extremely fast.

Key Insight

Two separate stores with different access patterns beat one store trying to do both. Cassandra excels at single-partition reads by primary key (the trace lookup). ClickHouse excels at analytical scans over columns (the search query). Trying to do attribute search in Cassandra requires full secondary indexes that do not scale, and trying to do primary-key lookups in ClickHouse is slower than Cassandra.

Key Algorithms and Protocols

Trace Context Propagation (W3C TraceContext)

Trace context propagation is the mechanism that gives every span in a distributed trace a shared trace_id. Think of it like a wristband at a concert venue - you get it at the entrance and every staff member who sees it knows you belong to the same group.

The W3C TraceContext specification defines two HTTP headers: traceparent carries the trace_id, span_id, and sample flag; tracestate carries vendor-specific data.

# W3C TraceContext header parsing and injection
import re
from dataclasses import dataclass
from typing import Optional
import secrets

TRACEPARENT_PATTERN = re.compile(
    r'^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$'
)

@dataclass
class TraceContext:
    trace_id: str       # 128-bit hex, e.g. "4bf92f3577b34da6a3ce929d0e0e4736"
    span_id: str        # 64-bit hex, e.g. "00f067aa0ba902b7"
    sampled: bool       # whether this trace is sampled

    @classmethod
    def from_headers(cls, headers: dict) -> Optional['TraceContext']:
        header = headers.get('traceparent', '')
        m = TRACEPARENT_PATTERN.match(header.lower())
        if not m:
            return None
        trace_id, parent_span_id, flags = m.group(1), m.group(2), m.group(3)
        sampled = (int(flags, 16) & 0x01) == 1
        return cls(trace_id=trace_id, span_id=parent_span_id, sampled=sampled)

    @classmethod
    def new_root(cls, sampled: bool = False) -> 'TraceContext':
        # Generate cryptographically random IDs
        return cls(
            trace_id=secrets.token_hex(16),  # 128-bit
            span_id=secrets.token_hex(8),    # 64-bit
            sampled=sampled,
        )

    def child_context(self, new_span_id: str) -> 'TraceContext':
        """Create context for a child span - same trace_id, new span_id"""
        return TraceContext(
            trace_id=self.trace_id,
            span_id=new_span_id,
            sampled=self.sampled,
        )

    def to_header(self) -> str:
        flags = "01" if self.sampled else "00"
        return f"00-{self.trace_id}-{self.span_id}-{flags}"

Span Tree Assembly

When retrieving a trace, the query layer receives an unordered list of spans from Cassandra and must assemble them into a tree using parent-child relationships. This is a standard tree reconstruction problem but with two edge cases: orphaned spans (parent span arrived at a different sampler and was dropped) and clock-skew-adjusted timelines.

// Span tree assembly - O(n) using a map-based approach
package query

import "sort"

type Span struct {
    SpanID       string
    ParentSpanID string
    StartTimeUs  int64
    DurationUs   int64
    ServiceName  string
    OperationName string
    StatusCode   int
    Attributes   map[string]string
    Children     []*Span
}

func AssembleTree(spans []*Span) *Span {
    // Index all spans by ID for O(1) parent lookup
    byID := make(map[string]*Span, len(spans))
    for _, s := range spans {
        byID[s.SpanID] = s
    }

    var root *Span
    var orphans []*Span

    for _, span := range spans {
        if span.ParentSpanID == "" {
            // Explicit root span
            root = span
            continue
        }
        parent, ok := byID[span.ParentSpanID]
        if !ok {
            // Orphaned span - parent was dropped by sampler or lost
            orphans = append(orphans, span)
            continue
        }
        parent.Children = append(parent.Children, span)
    }

    // Sort children by start time for correct waterfall display
    var sortChildren func(s *Span)
    sortChildren = func(s *Span) {
        sort.Slice(s.Children, func(i, j int) bool {
            return s.Children[i].StartTimeUs < s.Children[j].StartTimeUs
        })
        for _, child := range s.Children {
            sortChildren(child)
        }
    }
    if root != nil {
        sortChildren(root)
    }

    // Attach orphans to root if root exists, otherwise elect earliest span as root
    if root == nil && len(spans) > 0 {
        // Clock skew or missing root - elect earliest span
        root = spans[0]
        for _, s := range spans[1:] {
            if s.StartTimeUs < root.StartTimeUs {
                root = s
            }
        }
    }
    for _, orphan := range orphans {
        if root != nil {
            root.Children = append(root.Children, orphan)
        }
    }
    return root
}

Consistent Hashing for Sampler Routing

The tail sampler pool uses consistent hashing so spans for the same trace_id always route to the same sampler instance.

# Consistent hash ring for tail sampler routing
import hashlib
import bisect
from typing import List

class ConsistentHashRing:
    def __init__(self, nodes: List[str], virtual_nodes: int = 150):
        self.virtual_nodes = virtual_nodes
        self._ring: dict[int, str] = {}
        self._sorted_keys: list[int] = []
        for node in nodes:
            self.add_node(node)

    def _hash(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node: str):
        for i in range(self.virtual_nodes):
            vnode_key = f"{node}:vn{i}"
            h = self._hash(vnode_key)
            self._ring[h] = node
            bisect.insort(self._sorted_keys, h)

    def remove_node(self, node: str):
        for i in range(self.virtual_nodes):
            vnode_key = f"{node}:vn{i}"
            h = self._hash(vnode_key)
            del self._ring[h]
            idx = bisect.bisect_left(self._sorted_keys, h)
            self._sorted_keys.pop(idx)

    def get_node(self, trace_id: str) -> str:
        if not self._ring:
            raise ValueError("Ring is empty")
        h = self._hash(trace_id)
        idx = bisect.bisect_right(self._sorted_keys, h) % len(self._sorted_keys)
        return self._ring[self._sorted_keys[idx]]
Key Insight

Using 150 virtual nodes per physical sampler is the sweet spot for load balance vs. memory overhead. With fewer virtual nodes, removing one physical node causes its entire keyspace to shift to one neighbor rather than distributing evenly. The 150x factor ensures that when any sampler instance fails, its load distributes roughly evenly across the remaining pool.

Scaling and Performance

Distributed tracing scaling diagram showing horizontal partitioning of collector, sampler, and storage tiers

Capacity Estimation

Given:
  - 50,000 requests/second peak
  - 20 spans per request average
  - 1,000 bytes per span (attributes, events, metadata)
  - 5% tail-based sampling rate (keep errors + slow + 0.1% random)
  - 30 days hot retention

Raw span ingest rate:
  50,000 req/s * 20 spans = 1,000,000 spans/second
  1,000,000 * 1,000 bytes = 1 GB/second raw throughput

Sampled spans stored:
  1,000,000 spans/s * 5% = 50,000 spans/second
  50,000 * 1,000 bytes = 50 MB/second
  50 MB/s * 86,400 s/day * 30 days = 129.6 TB hot storage

Index storage (1 row per trace in ClickHouse):
  50,000 req/s * 5% = 2,500 traces/second sampled
  2,500 * 200 bytes per index row = 500 KB/second
  500 KB/s * 86,400 * 30 = 1.3 TB index storage

Tail sampler memory (30s buffer window):
  1,000,000 spans/s * 30s in-flight = 30,000,000 spans buffered
  30,000,000 * 200 bytes (in-memory compact) = 6 GB per sampler instance
  With 3x headroom: 18 GB instance RAM, 4-6 sampler instances

Cassandra cluster:
  50 MB/s write throughput
  With RF=3: 150 MB/s effective write bandwidth
  At 500 MB/s per Cassandra node write capacity: 1 node sufficient for writes
  At 129.6 TB / 10 TB per node: 14 nodes for storage
  Cassandra cluster: 14-20 nodes (storage bound, not write bound)

Collector scaling is straightforward - collectors are stateless, so horizontal scaling behind a load balancer handles any ingest rate. The Tier 1 per-host agents mean collector capacity scales automatically with the service fleet.

Sampler scaling is constrained by memory. Each sampler holds 30 seconds of unsampled spans. Adding sampler nodes shrinks each node’s trace_id keyspace on the consistent hash ring, but migrating in-flight spans during scale-out is complex. Instead, we over-provision samplers at 2x expected peak and scale slowly (hourly granularity vs. minute granularity for collectors).

Query path scaling uses read replicas for Cassandra and ClickHouse. The query API is stateless and scales horizontally. Trace ID lookups in Cassandra are O(1) partition reads - they do not get slower as data grows, as long as the cluster adds nodes proportionally to data volume.

Real World

Grafana Tempo solves the Cassandra cost problem by using object storage (S3 or GCS) as the primary span store and a small local index for recent data. Spans are written to local disk, flushed as parquet files to S3 every 5 minutes, and the index is maintained in memory. This cuts hot storage cost by 10x versus Cassandra while keeping query latency under 2 seconds for recent traces.

Failure Modes and Recovery

FailureDetectionImpactRecovery
Collector node crashHealth check fails in 10s, load balancer removes nodeSpans from services routing to that collector are lost until re-routedStateless: SDK retries export with exponential backoff to remaining collectors; Tier 1 agent buffers 60s of spans
Tail sampler instance crashHealth check + consistent ring detects node removalAll in-flight traces routed to that instance are lost (buffer not persisted)Consistent ring rebalances; replay from Kafka raw-spans topic within TTL window
Cassandra node failureCassandra gossip protocol detects within 5sWrites and reads for partitions on that node degrade to RF-1Hinted handoff stores writes; node replacement restores RF within hours
ClickHouse index lagQuery returns stale results or misses recent tracesEngineers cannot find traces from last few minutesIndex is eventually consistent by design; UI shows “index lag” badge if ingested_at delta exceeds threshold
Clock skew between servicesSpan end_time before start_time, child starts before parentWaterfall display looks wrongQuery layer normalizes: clamp child start to parent start, flag skewed spans with attribute clock.skew_ms
Kafka partition leader failoverProducer gets LeaderNotAvailable errorUp to 30s of span buffer builds up in collectorsKafka automatically elects new leader; collector retries with exponential backoff; no data loss if retry window covers election time
Watch Out

The most common operational mistake is treating sampler buffer loss as equivalent to span store loss. When a sampler crashes, only un-flushed in-flight traces are lost - already-committed traces in Cassandra are safe. Engineers often alert on sampler restarts as if they indicate data loss, but the actual loss rate (in-flight traces at restart time) is typically under 0.1% of daily traces. Alert on actual Cassandra write failures, not sampler pod restarts.

Comparison of Approaches

ApproachWrite LatencyQuery LatencyOperational ComplexityBest For
Head-based sampling onlySub-millisecond (no buffering)Fast (smaller dataset)Low (no sampler fleet)Low-traffic systems where 1-5% random sample is acceptable
Tail-based sampling (in-process)Low (local buffer)FastMedium (per-service buffer config)Monoliths or small service counts where all spans arrive in one process
Tail-based sampling (collector fleet)Low (async export)FastHigh (sampler cluster, consistent hashing)Microservices at scale where error/latency-based retention is required
Streaming sampling with FlinkLow (Kafka buffered)Depends on storeVery high (Flink cluster)Orgs that already operate Flink and need complex sampling policies
100% sampling with tiered storageSub-millisecondFast hot, slow coldMedium (S3 lifecycle rules)Systems where storage cost is less important than complete trace coverage
Probabilistic sampling with bloom filterSub-millisecondFastLowSystems that need deduplication guarantees across retry storms

The correct choice for a 50+ microservice system at 50,000 req/s is tail-based sampling with a collector fleet. The in-process approach breaks down because a request’s spans are spread across 50 services - no single process has them all. Head-based sampling misses exactly the traces you care most about (errors, slow requests). 100% sampling is viable if you use S3/GCS as the primary store (Tempo-style), but query latency for old traces degrades to 5-30 seconds. We pick tail-based collector fleet with Cassandra hot storage and S3 cold archive as the pragmatic balance.

Key Takeaways

  • Trace context propagation is the foundational primitive - without trace_id threaded through every hop, no amount of clever storage design can reconstruct a trace.
  • Head-based vs tail-based sampling is the most important architectural decision: head-based is simple but blind, tail-based retains the interesting traces but requires stateful buffering.
  • Consistent hashing on trace_id routes all spans for a trace to the same tail sampler instance, enabling complete-trace decisions without cross-instance coordination.
  • Two-store architecture (Cassandra for span retrieval by ID, ClickHouse for attribute search) is the standard pattern because the two access patterns have opposite optimization pressures.
  • Clock skew handling belongs in the query layer, not the ingest layer - normalizing timestamps at write time loses the ability to detect and report the original skew.
  • The SDK is the only latency-critical component - everything else runs async and off the hot path, so the SDK must prioritize low overhead above all else.
  • Kafka as the replay buffer between collectors and samplers decouples ingest throughput from sampler throughput, allowing the sampler to process at its own pace without back-pressuring the application.
  • Orphaned spans (parent dropped, child retained) are inevitable in tail-based sampling - the query layer must handle them gracefully rather than hiding them.

The counter-intuitive lesson: the hardest problem in distributed tracing is not storage or query - it is getting engineers to instrument correctly. The most sophisticated tracing infrastructure is useless if half the services forget to propagate traceparent through Kafka messages. Invest in automated instrumentation (OpenTelemetry Java agent, Go auto-instrumentation) and validate propagation in CI before worrying about sampler fleet design.

Frequently Asked Questions

Q: Why not just use structured logs with a request ID instead of building a full tracing system?

A: Structured logs with a request ID solve the correlation problem but not the hierarchy problem. You can find all log lines for a request, but you cannot reconstruct the parent-child call tree, measure per-service latency, or identify which specific service in the fan-out caused the slowdown. Tracing adds span nesting, timing, and span-level attributes that make root cause analysis possible without reading thousands of log lines.

Q: Why not store all spans and skip sampling entirely?

A: At 1 billion spans per day, 100% retention costs roughly $4,000-$8,000/month in Cassandra storage alone, plus query degradation from scanning the full dataset. More importantly, most traces are identical - 95% of successful requests to the same endpoint have nearly identical span trees. Sampling at 5% while retaining all errors captures the interesting 5% at 1/20th the cost.

Q: Why Cassandra over a purpose-built time-series store like InfluxDB or TimescaleDB?

A: Spans are not time-series data - they are not queried by time range across all spans, they are queried by trace_id across a fixed set of spans. TimescaleDB and InfluxDB optimize for “give me all metrics from service X between time A and B,” which is the wrong access pattern. Cassandra optimizes for “give me all rows in partition trace_id=abc123” which is exactly what trace retrieval needs. Jaeger, Zipkin, and Tempo all use Cassandra or object storage for the same reason.

Q: How do you handle sampling across a retry storm? If a request retries 5 times, do you get 5 separate traces?

A: Yes, by design. Each retry generates a new root span with a new trace_id because the retry is a new attempt, not a continuation of the original. This is correct - you want to see each attempt separately to understand which retry succeeded and how the latency differed. You can correlate retries by adding a custom attribute request.idempotency_key to all spans and querying by that attribute in the trace index.

Q: Why not use a service mesh (Envoy/Istio) to handle all instrumentation automatically?

A: Service mesh handles inbound and outbound HTTP/gRPC propagation automatically, which covers 70-80% of the instrumentation work. But it cannot instrument database calls, cache operations, internal business logic spans, or message queue producers/consumers - those require SDK instrumentation. The correct approach is service mesh for service-to-service propagation plus OpenTelemetry SDK for internal spans, not one or the other.

Q: How do you handle trace context across async boundaries like background jobs that run minutes after the original request?

A: Background jobs should start a new trace linked to the original via a SpanLink (a W3C concept for non-parent-child relationships). The job records the original trace_id as a link attribute, which lets the UI show “this job was triggered by trace abc123” without making the job’s spans children of the original root. Treating the background job as a child span causes the original trace to appear to run for minutes even though the user request completed in milliseconds.

Interview Questions

Q: Walk me through how a trace is assembled when a request fans out to 5 services simultaneously.

Expected depth: Explain trace_id propagation in concurrent fork-join patterns. Discuss how the parent span creates child contexts for each parallel call, how each child service creates its own span with that trace_id and a new span_id, and how the parent sets its own span_id as the parent_span_id for each outgoing call. Explain that assembly uses a map lookup by parent_span_id to rebuild the tree, and discuss what happens when one of the 5 parallel calls is dropped by the sampler.

Q: Your tail sampler is running out of memory. What are your options?

Expected depth: Identify that memory is proportional to (trace buffer TTL * ingest rate * span size). Options: reduce buffer TTL (less time to collect all spans, more orphans), add sampler nodes (shrinks each node’s keyspace but requires span migration), reduce average span size (trim attributes in the collector), or switch to a hybrid approach where tail sampling only runs for traces flagged by head-based pre-selection. Discuss the tradeoff between memory and orphan rate.

Q: How do you ensure the sampling decision is consistent when the same trace has spans arriving on different collector instances?

Expected depth: Explain consistent hashing on trace_id as the routing mechanism. Discuss what happens when a sampler node joins or leaves the ring (in-flight traces may split across two sampler instances, leading to partial traces in the buffer). Explain the Kafka replay buffer as the recovery mechanism. Mention that the alternative - broadcasting every span to all samplers and deduplicating - is too expensive at scale.

Q: A request takes 5 seconds but you cannot identify which service caused the slowdown in your waterfall view. What would you investigate?

Expected depth: Discuss the gap problem - time between the parent span creating a child and the child span being recorded can indicate serialization time, thread scheduling delay, or connection pool wait. Explain that adding span.addEvent("connection_acquired") style events within a span makes these gaps visible. Discuss clock skew as a false positive (child appears to start after parent ends due to clock difference). Mention that “missing time” in a waterfall is often serialization or queuing not instrumented as spans.

Q: How would you design the sampling system to guarantee that every trace containing a specific user ID is retained, regardless of the sampling rate?

Expected depth: Explain user-targeted sampling as a hybrid: at trace start, the SDK checks if the user ID matches a targeted list (stored in a distributed bloom filter or feature flag service). If matched, the root span is marked with sampling.priority=1 in the tracestate header. The tail sampler policy evaluates this attribute as a highest-priority keep rule. Discuss the bloom filter approach for large user lists (millions of users) vs. exact set for small lists. Mention that the priority flag must propagate through all downstream spans via tracestate or all tail sampling decisions become incoherent.

Premium Content

Unlock the full article along with everything else in the archive — all in one place.

In-depth analysis Expert insights Full archive access
Unlock Full Article