Logging, Metrics, and Tracing: The Three Pillars of Observability
Your API is slow. P99 latency is 3 seconds. You check the logs. Thousands of lines. You grep for errors. Nothing obvious. You check the metrics. CPU is fine. Memory is fine. Database connections are fine. You have no idea where the 3 seconds are going.
This is the observability gap. You have data but not understanding. Observability is not about collecting more data - it is about collecting the right data and being able to ask arbitrary questions about your system’s behavior.
The three pillars
Logs: Discrete events with context. “User 123 logged in at 10:30:01.” “Payment failed: insufficient funds.” Logs are the most detailed but also the most expensive to store and query.
Metrics: Numeric measurements over time. Request rate, error rate, latency percentiles, CPU usage. Metrics are cheap to store and fast to query, but they aggregate away detail.
Traces: The path of a request through a distributed system. A trace shows which services were called, in what order, and how long each took. Traces connect the dots between logs and metrics.
The three pillars are complementary. Metrics alert you that something is wrong. Traces show you where. Logs tell you why.
Logging
Structured logging
Unstructured logs are hard to query:
2024-01-15 10:30:01 ERROR Payment failed for user 123: insufficient funds
Structured logs are JSON (or another structured format):
{"timestamp": "2024-01-15T10:30:01Z", "level": "ERROR", "event": "payment_failed", "user_id": 123, "reason": "insufficient_funds", "amount": 50.00}
Structured logs can be queried, filtered, and aggregated. You can find all payment failures for a specific user, or count failures by reason.
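As a rough illustration of how an application might emit lines like the one above, here is a minimal sketch using only Python's standard logging module. The `JsonFormatter` class, the `make_logger` helper, and the convention of passing fields through `extra={"fields": ...}` are assumptions made for this example, not a standard API.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "timestamp": timestamp.isoformat(timespec="seconds").replace("+00:00", "Z"),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # Hypothetical convention for this sketch: structured fields are
        # passed as extra={"fields": {...}} and merged into the output.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def make_logger(name: str) -> logging.Logger:
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


log = make_logger("payments")
log.error(
    "payment_failed",
    extra={"fields": {"user_id": 123, "reason": "insufficient_funds", "amount": 50.00}},
)
```

Because every field is a JSON key, a log backend can filter on `user_id` or group by `reason` without parsing free text.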
Log levels
- DEBUG: Detailed diagnostic information. Normally enabled only in development.
- INFO: Normal operation events. User logged in, order created.
- WARN: Something unexpected but not an error. Retry succeeded, request served from a fallback.
- ERROR: An error occurred. Request failed, database query failed.
- FATAL/CRITICAL: System cannot continue. Used sparingly.
In production, log at INFO level by default. Enable DEBUG logging temporarily for specific services when debugging.
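One way to get "INFO by default, DEBUG for a few services" is to set the root level and then raise verbosity on named loggers. The sketch below assumes Python's standard logging module; `configure_logging` and the logger names are hypothetical.

```python
import logging


def configure_logging(default_level: str = "INFO", debug_loggers: str = "") -> None:
    """Set the root level, then enable DEBUG for a comma-separated list of loggers."""
    logging.getLogger().setLevel(getattr(logging, default_level.upper()))
    for name in (part.strip() for part in debug_loggers.split(",")):
        if name:
            logging.getLogger(name).setLevel(logging.DEBUG)


# Hypothetical logger names: everything logs at INFO except the order
# service's database layer, which is temporarily switched to DEBUG.
configure_logging("INFO", "orders.db")
```

In a real deployment the two arguments would typically come from configuration or environment variables, so DEBUG can be switched on and off without a code change.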
What to log
Log:
- Request start and end (with duration)
- Errors and exceptions (with stack traces)
- Business events (order created, payment processed)
- Security events (login, logout, permission denied)
- External service calls (with duration and result)
Do not log:
- Passwords, tokens, credit card numbers
- PII without proper handling (GDPR)
- High-frequency events that would overwhelm storage (every cache hit)
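A common safeguard for the "do not log" list is to mask sensitive fields before an event reaches the logger. The following is a minimal sketch; the key names in `SENSITIVE_KEYS` are illustrative assumptions, and a real system would need a list that matches its own data model.

```python
from typing import Any

# Illustrative key names only; adjust to the fields your events actually carry.
SENSITIVE_KEYS = {"password", "token", "authorization", "card_number", "cvv"}


def redact(event: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `event` with sensitive values masked, recursing into nested dicts."""
    clean: dict[str, Any] = {}
    for key, value in event.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, dict):
            clean[key] = redact(value)
        else:
            clean[key] = value
    return clean
```

Key-based redaction only catches fields you know about; it does not find secrets embedded in free-text messages.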
graph LR
  subgraph logging["Logging Pipeline"]
    APP["Application"] -->|"structured logs"| AGENT["Log agent (Fluentd, Filebeat)"]
    AGENT -->|"ship logs"| AGG["Log aggregator (Elasticsearch, Loki)"]
    AGG -->|"query"| DASH["Dashboard (Kibana, Grafana)"]
    AGG -->|"alert"| ALERT["Alerting (PagerDuty)"]
  end
  style APP fill:#EEEDFE,stroke:#534AB7,color:#3C3489
  style AGG fill:#E1F5EE,stroke:#0F6E56,color:#085041
  style DASH fill:#FAEEDA,stroke:#854F0B,color:#633806
Metrics
Types of metrics
Counter: Monotonically increasing value. Total requests, total errors. Use rate() to get per-second rate.
Gauge: Current value that can go up or down. Active connections, queue depth, memory usage.
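Histogram: Distribution of values, recorded as counts per bucket. Request latency, response size. Allows estimating percentiles (p50, p95, p99) at query time, and buckets can be aggregated across instances.
Summary: Percentiles pre-calculated in the client. Cheaper to query than histograms but less flexible: the percentiles are fixed at instrumentation time and cannot be meaningfully aggregated across instances.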
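To make the histogram idea concrete, here is a small in-memory sketch of a fixed-bucket histogram with a percentile estimate based on linear interpolation inside the matching bucket. It is similar in spirit to how bucket-based percentiles are estimated in metrics systems, but the `Histogram` class and its bucket bounds are assumptions for illustration, not any library's implementation.

```python
from bisect import bisect_left


class Histogram:
    """Fixed-bucket histogram; `bounds` are inclusive upper bounds (e.g. seconds)."""

    def __init__(self, bounds: list[float]) -> None:
        self.bounds = sorted(bounds)
        self.counts = [0] * (len(self.bounds) + 1)  # last slot is the overflow bucket
        self.total = 0

    def observe(self, value: float) -> None:
        self.counts[bisect_left(self.bounds, value)] += 1
        self.total += 1

    def quantile(self, q: float) -> float:
        """Estimate the q-quantile (0 <= q <= 1) by interpolating within a bucket."""
        if not 0.0 <= q <= 1.0:
            raise ValueError("q must be between 0 and 1")
        if self.total == 0:
            raise ValueError("no observations")
        rank = q * self.total
        seen = 0
        for i, count in enumerate(self.counts):
            if count and seen + count >= rank:
                if i == len(self.bounds):
                    # Overflow bucket has no upper bound; report the highest finite bound.
                    return self.bounds[-1]
                lower = self.bounds[i - 1] if i > 0 else 0.0
                upper = self.bounds[i]
                return lower + (upper - lower) * (rank - seen) / count
            seen += count
        return self.bounds[-1]
```

The estimate is only as precise as the bucket layout: two latencies in the same bucket are indistinguishable, which is the detail that metrics aggregate away.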
The RED method
For services, track:
- Rate: Requests per second
- Errors: Error rate (percentage of requests that fail)
- Duration: Latency distribution (p50, p95, p99)
These three metrics tell you if a service is healthy. If rate drops, errors increase, or latency spikes, something is wrong.
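The three RED numbers can be derived from raw request records. The sketch below computes them for one time window; `RequestRecord`, `red_summary`, and the choice to count only 5xx responses as errors are assumptions for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestRecord:
    duration_ms: float
    status: int


def percentile(sorted_values: list[float], p: float) -> float:
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def red_summary(records: list[RequestRecord], window_seconds: float) -> dict[str, float]:
    """Rate, error rate, and latency percentiles for the requests in one window."""
    if not records:
        raise ValueError("no requests in window")
    durations = sorted(r.duration_ms for r in records)
    # Assumption: only server-side failures (5xx) count as errors.
    errors = sum(1 for r in records if r.status >= 500)
    return {
        "rate_per_s": len(records) / window_seconds,
        "error_rate": errors / len(records),
        "p50_ms": percentile(durations, 50),
        "p95_ms": percentile(durations, 95),
        "p99_ms": percentile(durations, 99),
    }
```

In production these values usually come from counters and histograms in the metrics system rather than from raw records, but the definitions are the same.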
The USE method
For resources (CPU, memory, disk, network), track:
- Utilization: Percentage of time the resource is busy
- Saturation: Amount of work waiting (queue depth)
- Errors: Error rate
High utilization + high saturation = bottleneck. High errors = failure.
Cardinality
Metrics have labels (dimensions), for example http_requests_total{method="GET", status="200", endpoint="/users"}. Each unique combination of label values is a separate time series.
High cardinality (many unique label values) causes performance problems in metrics systems, because the number of series can grow as the product of the distinct values of each label. Do not use high-cardinality labels like user_id, request_id, or IP address. Use low-cardinality labels: method, status code, endpoint (grouped by route template, such as /users/{id}, rather than raw path), service name.
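The arithmetic behind the cardinality warning, and one way to group endpoints, can be sketched as follows. The label counts used in the usage note are made-up illustrative numbers, and `normalize_endpoint` is a hypothetical helper that only recognizes numeric and UUID-shaped path segments.

```python
import re
from math import prod


def series_count(distinct_values_per_label: dict[str, int]) -> int:
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(distinct_values_per_label.values())


# Matches a path segment that is all digits or shaped like a UUID.
_ID_SEGMENT = re.compile(r"/(\d+|[0-9a-fA-F]{8}-[0-9a-fA-F-]{27})(?=/|$)")


def normalize_endpoint(path: str) -> str:
    """Collapse ID-like path segments so the label stays low-cardinality."""
    return _ID_SEGMENT.sub("/{id}", path)
```

With 5 methods, 10 status codes, and 50 grouped endpoints the bound is 2,500 series; adding a user_id label with 100,000 distinct users raises it to 250,000,000.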
Distributed tracing
A trace represents the end-to-end journey of a request through a distributed system. It consists of spans - individual units of work.
Trace: The entire request journey. Has a unique trace ID.
Span: A single operation within the trace. Has a span ID, parent span ID, start time, duration, and tags.
Context propagation: The trace ID and span ID are passed between services via HTTP headers (traceparent in W3C Trace Context, X-B3-TraceId and X-B3-SpanId in Zipkin’s B3 format). Each service creates a child span that reuses the received trace ID and records the received span ID as its parent.
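To show what propagation looks like on the wire, here is a minimal sketch that parses a W3C `traceparent` header (version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags) and builds the header a service would send downstream. The helper names are hypothetical, and the sketch handles only the four-field layout; in practice a tracing SDK does this for you.

```python
import re
import secrets
from typing import NamedTuple, Optional

_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class SpanContext(NamedTuple):
    trace_id: str
    span_id: str
    flags: str


def parse_traceparent(header: str) -> Optional[SpanContext]:
    """Return the incoming span context, or None if the header is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if not match or match.group(1) == "ff":  # version ff is not allowed
        return None
    _, trace_id, span_id, flags = match.groups()
    if trace_id == "0" * 32 or span_id == "0" * 16:  # all-zero IDs are invalid
        return None
    return SpanContext(trace_id, span_id, flags)


def child_traceparent(parent: Optional[SpanContext]) -> tuple[str, SpanContext]:
    """Start a span: keep the incoming trace ID (or start a new trace), mint a new span ID."""
    trace_id = parent.trace_id if parent else secrets.token_hex(16)
    flags = parent.flags if parent else "01"  # 01 = sampled
    context = SpanContext(trace_id, secrets.token_hex(8), flags)
    return f"00-{context.trace_id}-{context.span_id}-{context.flags}", context
```

The trace ID stays constant across every hop, while the span ID changes at each service; that pairing is what lets the backend reassemble the parent-child tree.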
graph TB
  subgraph trace["Distributed Trace - Single Request"]
    ROOT["Span: API Gateway 0ms - 250ms trace_id: abc123"]
    S1["Span: User service 5ms - 15ms"]
    S2["Span: Order service 20ms - 200ms"]
    S3["Span: DB query 25ms - 180ms"]
    S4["Span: Cache lookup 20ms - 22ms"]
    ROOT --> S1
    ROOT --> S2
    S2 --> S4
    S2 --> S3
  end
  style ROOT fill:#EEEDFE,stroke:#534AB7,color:#3C3489
  style S1 fill:#E1F5EE,stroke:#0F6E56,color:#085041
  style S2 fill:#E1F5EE,stroke:#0F6E56,color:#085041
  style S3 fill:#FAEEDA,stroke:#854F0B,color:#633806
  style S4 fill:#F1EFE8,stroke:#888780,color:#444441
The trace shows that the order service’s database query took 155ms out of the total 250ms. That is where to focus optimization.
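The same conclusion can be reached programmatically by computing each span's self time: its duration minus the time spent in its direct children. The sketch below uses the span timings from the diagram; the `Span` type and helper names are assumptions, and the calculation assumes sibling spans do not overlap.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    name: str
    start_ms: int
    end_ms: int
    parent: Optional[str] = None

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


# Timings taken from the example trace above.
SPANS = [
    Span("API Gateway", 0, 250),
    Span("User service", 5, 15, parent="API Gateway"),
    Span("Order service", 20, 200, parent="API Gateway"),
    Span("Cache lookup", 20, 22, parent="Order service"),
    Span("DB query", 25, 180, parent="Order service"),
]


def self_times(spans: list[Span]) -> dict[str, int]:
    """Duration of each span minus its direct children (assumes children do not overlap)."""
    result = {span.name: span.duration_ms for span in spans}
    for span in spans:
        if span.parent is not None:
            result[span.parent] -= span.duration_ms
    return result


def slowest_operation(spans: list[Span]) -> str:
    times = self_times(spans)
    return max(times, key=lambda name: times[name])
```

Self time matters because the order service span lasts 180ms, but almost all of that is the database query it is waiting on.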
Where it breaks or gets interesting
Sampling
Tracing every request is expensive. For high-traffic services, sample a percentage of requests (1%, 0.1%). Head-based sampling decides at the start of the request whether to trace it, so the decision is cheap but made before you know whether the request will be slow or fail. Tail-based sampling collects all spans and decides after the request completes whether to keep the trace (keep slow or failed requests, discard most fast ones). Tail-based sampling is more useful but requires buffering every span until the trace completes.
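The two sampling strategies can be sketched as two small decision functions. The head-based version derives its decision from the trace ID so that every service reaches the same verdict for a given trace; the tail-based version runs once the outcome is known. The function names, the 1000ms threshold, and the 1% baseline are illustrative assumptions.

```python
def head_sample(trace_id: str, rate: float) -> bool:
    """Head-based: decide from the trace ID alone, before the request runs.

    Uses the last 8 hex digits of the trace ID as a value in [0, 1), which is
    roughly uniform if trace IDs are random.
    """
    bucket = int(trace_id[-8:], 16) / 0x1_0000_0000
    return bucket < rate


def tail_keep(
    duration_ms: float,
    had_error: bool,
    trace_id: str,
    slow_threshold_ms: float = 1000.0,  # assumed threshold for "slow"
    baseline_rate: float = 0.01,        # assumed share of ordinary traces to keep
) -> bool:
    """Tail-based: decide after the request completes, when the outcome is known."""
    if had_error or duration_ms >= slow_threshold_ms:
        return True
    return head_sample(trace_id, baseline_rate)
```

The tail-based function is trivial; the operational cost lies in holding all spans of every in-flight trace until this decision can be made.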
Log correlation
Logs and traces are most useful when correlated. Include the trace ID in every log message. When you see an error in logs, you can find the corresponding trace to see the full request context.
{"timestamp": "...", "level": "ERROR", "trace_id": "abc123", "span_id": "def456", "message": "Database query failed"}
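One way to get the trace ID into every log line without passing it through each function call is to keep it in a context variable and attach it with a logging filter. This is a minimal sketch using Python's standard library; the class and variable names are assumptions, and the timestamp is omitted for brevity. Tracing SDKs typically offer their own logging integration that does the equivalent.

```python
import contextvars
import io
import json
import logging

# Set once per request, for example by middleware that parsed the incoming trace headers.
trace_id_var = contextvars.ContextVar("trace_id", default=None)
span_id_var = contextvars.ContextVar("span_id", default=None)


class TraceContextFilter(logging.Filter):
    """Copy the current trace context onto every record passing through the handler."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = trace_id_var.get()
        record.span_id = span_id_var.get()
        return True


class CorrelatedJsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
            "message": record.getMessage(),
        })


def build_logger(name: str, stream) -> logging.Logger:
    handler = logging.StreamHandler(stream)
    handler.setFormatter(CorrelatedJsonFormatter())
    handler.addFilter(TraceContextFilter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


# Usage: the call site does not mention the trace at all.
buffer = io.StringIO()
log = build_logger("orders", buffer)
trace_id_var.set("abc123")
span_id_var.set("def456")
log.error("Database query failed")
entry = json.loads(buffer.getvalue())
```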
Metrics vs logs for alerting
Alert on metrics, not logs. Metrics are aggregated and fast to query. Alerting on log patterns (grep for “ERROR”) is slow and unreliable. Use metrics for alerting (error rate > 1%), use logs for investigation.
The cost of observability
Observability has a cost: storage, compute, and network. Logs are the most expensive (high volume, large size). Metrics are cheap. Traces are in between.
Optimize: use sampling for traces, set appropriate log retention (7-30 days for most logs), use log compression, use efficient metrics storage (Prometheus, Thanos).
Real-world systems
ELK Stack - Elasticsearch, Logstash, Kibana. The classic log aggregation stack. Elasticsearch stores logs, Logstash processes them, Kibana visualizes them.
Grafana Loki - Log aggregation system designed to work with Grafana. Cheaper than Elasticsearch (indexes only metadata, not full text). Good for Kubernetes environments.
Prometheus - Pull-based metrics collection. PromQL for querying. Grafana for visualization. The standard for Kubernetes monitoring.
Datadog - Commercial observability platform. Logs, metrics, and traces in one platform. APM (Application Performance Monitoring) with automatic instrumentation.
Jaeger - Open-source distributed tracing. Compatible with OpenTelemetry. Originally built at Uber and now a CNCF project.
OpenTelemetry - Vendor-neutral observability framework. Standardizes instrumentation for logs, metrics, and traces. Supported by all major observability vendors.
How to apply it in practice
The observability stack
A typical production observability stack:
- Instrumentation: OpenTelemetry SDK in your application
- Collection: OpenTelemetry Collector (receives, processes, exports)
- Storage: Prometheus (metrics), Loki or Elasticsearch (logs), Jaeger or Tempo (traces)
- Visualization: Grafana (dashboards for all three)
- Alerting: Prometheus Alertmanager or Grafana Alerting
Starting with observability
If you are starting from scratch:
- Add structured logging first (highest value, lowest cost)
- Add the RED metrics (request rate, error rate, duration)
- Add distributed tracing for your most complex flows
- Build dashboards and alerts based on what you actually need to know
Correlation IDs
Generate a unique request ID at the entry point (API gateway or first service). Pass it through all services as a header (X-Request-ID). Include it in all logs and spans. This lets you find all logs and spans for a specific request.
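A minimal sketch of that flow: reuse the incoming ID if a caller already supplied one, otherwise mint one at the entry point, and attach it to every outgoing call. The helper names are hypothetical; the header name comes from the text above.

```python
import uuid
from typing import Mapping, Optional

REQUEST_ID_HEADER = "X-Request-ID"


def ensure_request_id(headers: Mapping[str, str]) -> str:
    """Return the caller's request ID if present, otherwise generate a new one."""
    for name, value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == REQUEST_ID_HEADER.lower() and value.strip():
            return value.strip()
    return str(uuid.uuid4())


def outgoing_headers(request_id: str, extra: Optional[Mapping[str, str]] = None) -> dict[str, str]:
    """Headers for a downstream call, always carrying the request ID."""
    headers = dict(extra or {})
    headers[REQUEST_ID_HEADER] = request_id
    return headers
```

If you already propagate a trace ID, it can often double as the correlation ID, which avoids maintaining two identifiers per request.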
FAQ
Q: What is the difference between monitoring and observability?
Monitoring is checking known metrics against thresholds. “Alert if CPU > 80%.” It answers questions you thought to ask in advance. Observability is the ability to understand the internal state of a system from its external outputs. It answers questions you did not think to ask. A highly observable system lets you debug novel failures without adding new instrumentation. Monitoring is a subset of observability.
Q: How do you choose between Prometheus and Datadog?
Prometheus is open-source and self-hosted, with no license cost. It requires operational expertise to run at scale (Thanos or Cortex for long-term storage), so the real cost is infrastructure and engineering time. Datadog is a managed service, priced largely per host with usage-based charges for some products such as logs. It is easier to set up and bundles more features (APM, log management, synthetics). For small teams or startups, Datadog’s simplicity is often worth the cost. For large teams with Kubernetes expertise, Prometheus + Grafana is usually more cost-effective at scale.
Q: How much should you sample traces?
It depends on your traffic volume and budget. For low-traffic services (under 100 req/s), trace 100% of requests. For medium traffic (100-10,000 req/s), trace 10-100%. For high traffic (over 10,000 req/s), trace 1% or use tail-based sampling (keep all slow and error traces, sample fast successful ones). The goal is to have enough traces to debug issues without overwhelming your trace storage.
Interview questions
Q1: Your API’s P99 latency is 3 seconds. You have logs, metrics, and traces. Walk through your investigation.
Strong answer: Start with metrics. Check the RED metrics: is the error rate elevated? Is the request rate normal? Look at latency by endpoint - is it all endpoints or specific ones? Check resource metrics: CPU, memory, database connections, cache hit rate. If you find a specific endpoint or resource that is the bottleneck, look at traces for that endpoint. Find a slow trace (P99 latency). The trace shows the breakdown: which service or operation is taking the most time. If the database query is slow, check the slow query log. If a downstream service is slow, check its metrics and traces. Use the trace ID to find the corresponding logs for more context. The combination of metrics (where to look), traces (what is slow), and logs (why it is slow) gives you the full picture.
Q2: How do you implement distributed tracing in a microservices architecture?
Strong answer: Use OpenTelemetry for vendor-neutral instrumentation. Add the OpenTelemetry SDK to each service. Configure automatic instrumentation for HTTP clients and servers (most frameworks have auto-instrumentation). For custom spans, use the SDK to create spans around important operations (database queries, external API calls). Configure context propagation: the SDK automatically propagates trace context via HTTP headers (W3C Trace Context format). Deploy an OpenTelemetry Collector to receive spans from all services and export to your tracing backend (Jaeger, Tempo, Datadog). Include the trace ID in all log messages for correlation. Set up sampling: 100% for development, 1-10% for production (or tail-based sampling to keep slow traces).
Q3: What is the difference between a counter and a gauge in Prometheus, and when do you use each?
Strong answer: A counter is a monotonically increasing value that only goes up (or resets to zero on restart). Use counters for things that accumulate: total requests, total errors, total bytes sent. To get the rate, use rate(counter[5m]) in PromQL. A gauge is a value that can go up or down. Use gauges for current state: active connections, queue depth, memory usage, temperature. You can read the current value directly without rate(). The key distinction: if the value can decrease, use a gauge. If it only increases, use a counter. Common mistake: using a gauge for total request count. rate() assumes counter semantics and treats any decrease as a reset, so it gives misleading results on a gauge. Use a counter for total requests and derive the rate; use a gauge for in-flight requests, which does go down when requests complete.
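To illustrate why counters and rate calculations go together, here is a simplified sketch of a per-second rate over counter samples that treats any drop in value as a restart. It is not PromQL's actual algorithm (which also extrapolates to the window boundaries); `counter_rate` and the sample data in the usage note are assumptions for illustration.

```python
def counter_rate(samples: list[tuple[float, float]]) -> float:
    """Per-second increase over (timestamp_seconds, value) samples.

    Any decrease between consecutive samples is treated as a counter reset,
    so the post-reset value is counted as new increase from zero.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("samples must span a positive time range")
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        increase += (current - previous) if current >= previous else current
    return increase / elapsed
```

For samples `[(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]`, the counter restarted between the third and fourth scrape. A naive last-minus-first calculation would give a negative rate, while the reset-aware version counts 100 requests over 60 seconds.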