Circuit Breaker: Stopping Cascading Failures Before They Spread

Your recommendation service is slow. Every request to it times out after 5 seconds. Your product page calls the recommendation service. Now every product page takes 5 seconds to load. Users abandon. Your web servers accumulate threads waiting for timeouts. Thread pools exhaust. Your entire application becomes unresponsive. The recommendation service’s failure has cascaded into a full outage.

This is a cascading failure. The circuit breaker pattern prevents it.

What a circuit breaker is

A circuit breaker is a proxy that monitors calls to a downstream service. When failures exceed a threshold, the circuit “opens”: subsequent calls fail immediately without attempting the downstream call. After a configured wait period, the circuit enters a “half-open” state and allows a limited number of test requests through. If they succeed, the circuit closes. If they fail, it opens again.

The name comes from electrical circuit breakers: when current exceeds a safe level, the breaker trips and cuts the circuit to prevent damage.

Three states:

Closed (normal): Requests pass through. Failures are counted. If failures exceed the threshold within a time window, the circuit opens.

Open (failing): Requests fail immediately without calling the downstream service. A fallback response is returned. After a wait duration (e.g., 30 seconds), the circuit moves to half-open.

Half-open (testing): A limited number of test requests are allowed through. If they succeed, the circuit closes. If they fail, the circuit opens again.

graph LR
subgraph states["Circuit Breaker States"]
  CLOSED["CLOSED<br/>Normal operation<br/>Count failures"] -->|"failure threshold exceeded"| OPEN["OPEN<br/>Fail fast<br/>Return fallback"]
  OPEN -->|"timeout expires"| HALF["HALF-OPEN<br/>Allow test requests"]
  HALF -->|"test succeeds"| CLOSED
  HALF -->|"test fails"| OPEN
end

style CLOSED fill:#E1F5EE,stroke:#0F6E56,color:#085041
style OPEN fill:#FCEBEB,stroke:#A32D2D,color:#791F1F
style HALF fill:#FAEEDA,stroke:#854F0B,color:#633806
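
The three states can be sketched as a small state machine. This is a minimal illustration, not a production implementation: the names `CircuitBreaker`, `CircuitOpenError`, `failure_threshold`, and `wait_seconds` are hypothetical, it counts consecutive failures rather than failures in a time window, it lets a single test call through in half-open, and it is not thread-safe.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, wait_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.wait_seconds = wait_seconds
        self.clock = clock  # injectable so tests do not need to sleep
        self.state = State.CLOSED
        self.failures = 0
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.wait_seconds:
                # Fail fast: the downstream service is never contacted.
                raise CircuitOpenError("circuit is open")
            # Wait duration has elapsed: let this call through as a test.
            self.state = State.HALF_OPEN

        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        # A success in CLOSED resets the count; in HALF_OPEN it closes the circuit.
        self.state = State.CLOSED
        self.failures = 0

    def _on_failure(self):
        self.failures += 1
        # A failed test call reopens immediately; otherwise wait for the threshold.
        if self.state is State.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = self.clock()
```

The caller wraps each downstream call in `breaker.call(...)` and treats `CircuitOpenError` as the signal to serve a fallback.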

How it prevents cascading failures

Without a circuit breaker:

Recommendation service becomes slow
Product page waits 5 seconds for each recommendation request
Web server threads are blocked waiting
Thread pool exhausts
New requests cannot be handled
Entire application becomes unresponsive

With a circuit breaker:

Recommendation service becomes slow
First few requests time out, incrementing the failure counter
Circuit opens after threshold is exceeded
Subsequent requests fail immediately (no 5-second wait)
Fallback response is returned (no recommendations, or cached recommendations)
Web server threads are not blocked
Application continues serving other requests normally
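
A rough back-of-the-envelope calculation shows why the wait time matters so much. The numbers here are assumptions chosen for illustration (a 200-thread pool, 100 requests per second, a 50 ms healthy call, a 1 ms fail-fast path), not measurements from a real system. The estimate uses Little's law: average busy threads equal the arrival rate times the time each request holds a thread.

```python
def threads_in_use(requests_per_second: float, seconds_per_request: float) -> float:
    """Little's law: average busy threads = arrival rate x time a request holds a thread."""
    return requests_per_second * seconds_per_request


POOL_SIZE = 200        # assumed web server thread pool size
ARRIVAL_RATE = 100.0   # assumed incoming requests per second

healthy = threads_in_use(ARRIVAL_RATE, 0.05)     # downstream answers in 50 ms
blocked = threads_in_use(ARRIVAL_RATE, 5.0)      # every call waits out a 5 s timeout
fail_fast = threads_in_use(ARRIVAL_RATE, 0.001)  # circuit open, fallback in about 1 ms

for label, busy in [("healthy", healthy), ("blocked", blocked), ("fail fast", fail_fast)]:
    status = "pool exhausted" if busy > POOL_SIZE else "ok"
    print(f"{label}: about {busy:.1f} busy threads ({status})")
```

Under these assumptions the blocked case needs far more threads than the pool has, while the fail-fast case needs almost none.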

Implementing a circuit breaker

Failure detection

What counts as a failure?

HTTP 5xx responses
Timeouts
Connection refused
Exceptions

What does not count as a failure?

HTTP 4xx responses (client errors, not service failures)
Successful responses
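
These rules fit in one small predicate. The sketch below assumes the HTTP client reports timeouts and refused connections as exceptions; the function name `is_failure` and its parameters are hypothetical.

```python
from typing import Optional


def is_failure(status_code: Optional[int] = None,
               error: Optional[Exception] = None) -> bool:
    """Decide whether a call outcome should count against the circuit breaker."""
    if error is not None:
        # Timeouts, refused connections, and other exceptions all count.
        return True
    # Only 5xx responses count. 2xx is success, and 4xx means the caller
    # sent a bad request, not that the service is unhealthy.
    return status_code is not None and 500 <= status_code <= 599
```

For example, `is_failure(503)` and `is_failure(error=TimeoutError())` count as failures, while `is_failure(404)` does not.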

Threshold configuration

Count-based: Open after N failures in a row. Simple, but a short burst of errors can trip it even when the service is otherwise healthy.

Rate-based: Open when the failure rate exceeds X% in a window of recent calls. More robust. Resilience4j uses this approach, computing the failure rate over a count-based or time-based sliding window.

Example configuration:

Sliding window: last 10 requests
Failure rate threshold: 50% (5 of 10 requests fail)
Wait duration in open state: 30 seconds
Permitted calls in half-open state: 3
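
The sliding-window part of this configuration can be sketched with a fixed-size deque. The class name `SlidingWindow` is hypothetical. Two details are assumptions: the circuit trips when the rate reaches the threshold (so 5 of 10 is enough, matching the example above), and no decision is made until the window is full.

```python
from collections import deque


class SlidingWindow:
    """Tracks the outcomes of the last `size` calls and reports the failure rate."""

    def __init__(self, size: int = 10, failure_rate_threshold: float = 0.5):
        self.size = size
        self.failure_rate_threshold = failure_rate_threshold
        # True means the call failed; the deque drops the oldest entry automatically.
        self.outcomes = deque(maxlen=size)

    def record(self, failed: bool) -> None:
        self.outcomes.append(failed)

    def failure_rate(self) -> float:
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def should_open(self) -> bool:
        # Wait for a full window so one early failure is not read as a 100% rate.
        window_full = len(self.outcomes) == self.size
        return window_full and self.failure_rate() >= self.failure_rate_threshold
```

Requiring a full window is one way to avoid tripping on the very first failed call after startup.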

Fallback responses

When the circuit is open, return a fallback:

Cached data (last known good response)
Default response (empty recommendations list)
Error response with a user-friendly message
Response from a secondary service

The fallback should be fast and not depend on the failing service.
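
A cached-data fallback with a default can be sketched as follows. The helper `with_fallback` and its parameters are hypothetical, and the cache is a plain in-process dict, which keeps the fallback independent of the failing service.

```python
def with_fallback(primary, cache: dict, key, default):
    """Call `primary`; on any failure serve the last known good value, else `default`."""
    try:
        value = primary()
    except Exception:
        # Covers downstream errors as well as a breaker rejecting the call.
        return cache.get(key, default)
    cache[key] = value  # remember the last known good response
    return value


def fetch_recommendations():
    # Hypothetical downstream call, hard-coded to fail for this example.
    raise ConnectionError("recommendation service unavailable")


cache = {"user-1": ["book-42", "book-7"]}
print(with_fallback(fetch_recommendations, cache, "user-1", default=[]))
print(with_fallback(fetch_recommendations, cache, "user-2", default=[]))
```

The first call serves the cached list for `user-1`; the second has nothing cached and serves the empty default.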

graph TB
subgraph flow["Circuit Breaker Request Flow"]
  REQ["Incoming request"] --> CB["Circuit Breaker"]
  CB -->|"CLOSED: pass through"| SVC["Downstream service"]
  SVC -->|"success"| RESP["Response to client"]
  SVC -->|"failure/timeout"| CB
  CB -->|"OPEN: fail fast"| FALL["Fallback response<br/>(cached data or default)"]
  FALL --> RESP
end

style CB fill:#EEEDFE,stroke:#534AB7,color:#3C3489
style SVC fill:#E1F5EE,stroke:#0F6E56,color:#085041
style FALL fill:#FAEEDA,stroke:#854F0B,color:#633806

Where it breaks or gets interesting

Choosing the right threshold

Too sensitive: the circuit opens on transient failures (a brief network hiccup). The service is healthy but the circuit is open. Users get fallback responses unnecessarily.

Too insensitive: the circuit does not open until many requests have failed. Users experience slow responses for too long before the circuit opens.

Start with a 50% failure rate over a 10-request window. Tune based on your service’s normal error rate and acceptable failure duration.

The half-open state is critical

The half-open state is where recovery happens. If you allow too many test requests, you might overwhelm a recovering service. If you allow too few, recovery detection is slow.

Start with 3-5 test requests in half-open state. If all succeed, close the circuit. If any fail, open again.
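
The "all succeed, close; any fail, reopen" rule can be sketched as a small helper that budgets the test requests. The class `HalfOpenProbe` is hypothetical and not thread-safe.

```python
class HalfOpenProbe:
    """Budgets test calls in half-open and decides the next state."""

    def __init__(self, permitted_calls: int = 3):
        self.permitted_calls = permitted_calls
        self.started = 0
        self.succeeded = 0

    def try_acquire(self) -> bool:
        """Admit a test call only while the budget has not been used up."""
        if self.started >= self.permitted_calls:
            return False  # the caller should serve the fallback instead
        self.started += 1
        return True

    def record(self, success: bool) -> str:
        """Return the next state: 'open', 'closed', or 'half_open' (still probing)."""
        if not success:
            return "open"  # any failed test reopens the circuit
        self.succeeded += 1
        if self.succeeded == self.permitted_calls:
            return "closed"  # every permitted test succeeded
        return "half_open"
```

Requests that arrive after the budget is spent are rejected, which is what keeps a recovering service from being flooded.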

Circuit breakers and retries

Circuit breakers and retries work together but can conflict. Retries multiply the number of requests sent to a failing service. If 10 concurrent requests each retry 3 times, the failing service receives up to 40 calls: 10 original attempts plus 30 retries. This can overwhelm it further.

Use retries for transient failures (network hiccup, brief timeout). Use circuit breakers for sustained failures (service is down). Configure retries to not retry on circuit-open errors.
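
A retry wrapper that respects the circuit breaker might look like this sketch. `CircuitOpenError` stands in for whatever exception your breaker raises when it rejects a call, and `call_with_retries` and its parameters are hypothetical.

```python
import time


class CircuitOpenError(Exception):
    """Stand-in for the error a breaker raises when it rejects a call."""


def call_with_retries(func, max_retries=3, base_delay=0.1, sleep=time.sleep):
    """Retry transient failures with exponential backoff; never retry an open circuit."""
    attempt = 0
    while True:
        try:
            return func()
        except CircuitOpenError:
            # The breaker already decided the service is down; do not add load.
            raise
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted: surface the last error
            sleep(base_delay * (2 ** attempt))  # 0.1 s, 0.2 s, 0.4 s by default
            attempt += 1
```

With `max_retries=3`, a call that keeps failing is attempted four times in total: the original attempt plus three retries.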

Bulkhead pattern

The bulkhead pattern complements circuit breakers. Instead of one thread pool for all downstream calls, use separate thread pools for each downstream service. If the recommendation service is slow and exhausts its thread pool, the payment service’s thread pool is unaffected.

The pattern is named after ship bulkheads, which prevent flooding from spreading between compartments.
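
The isolation idea can be sketched with one concurrency limit per dependency. Note that this uses a semaphore rather than the separate thread pools described above: it is a lighter-weight variant of the same idea that caps how many caller threads one dependency can hold. The names `Bulkhead` and `BulkheadFullError` and the limits are hypothetical.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a dependency already has its maximum number of calls in flight."""


class Bulkhead:
    def __init__(self, max_concurrent: int):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        # Reject instead of queueing so a slow dependency cannot pile up waiters.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("no free slot for this dependency")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()


# One bulkhead per downstream service: a slow recommendation service can
# occupy at most 10 threads and never touches the payment budget.
recommendations = Bulkhead(max_concurrent=10)
payments = Bulkhead(max_concurrent=20)
```

A rejected call is handled like an open circuit: serve the fallback immediately.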

Monitoring circuit breaker state

Circuit breaker state changes are important events. Log and alert on:

Circuit opening (downstream service is failing)
Circuit closing (downstream service has recovered)
Fallback rate (how often fallbacks are being served)
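
These three signals can be sketched with the standard `logging` module and an in-process counter. The hook names `on_state_change` and `on_call` are hypothetical; a real setup would export the counters to your metrics system instead of keeping them in memory.

```python
import logging
from collections import Counter

logger = logging.getLogger("circuit_breaker")
metrics = Counter()


def on_state_change(name: str, old: str, new: str) -> None:
    """Record and log every transition; opening is logged at WARNING so it can alert."""
    metrics[f"{name}.transition.{old}_to_{new}"] += 1
    level = logging.WARNING if new == "open" else logging.INFO
    logger.log(level, "circuit %s: %s -> %s", name, old, new)


def on_call(name: str, used_fallback: bool) -> None:
    metrics[f"{name}.calls"] += 1
    if used_fallback:
        metrics[f"{name}.fallbacks"] += 1


def fallback_rate(name: str) -> float:
    """Fraction of calls that were answered by the fallback."""
    calls = metrics[f"{name}.calls"]
    return metrics[f"{name}.fallbacks"] / calls if calls else 0.0
```

A rising fallback rate while the circuit stays closed is worth a look too: it means calls are failing without the threshold being reached.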

Real-world systems

Netflix Hystrix - The best-known early circuit breaker library for Java. Popularized the pattern. Now in maintenance mode; Netflix recommends Resilience4j for new projects.

Resilience4j - Modern Java circuit breaker library. Supports circuit breaker, rate limiter, retry, bulkhead, and time limiter. Integrates with Spring Boot.

Polly - .NET resilience library. Supports circuit breaker, retry, timeout, bulkhead, and fallback.

Envoy - Service mesh proxy. Built-in circuit breaking, outlier detection, and retry policies. Configured via Envoy’s cluster configuration.

Istio - Service mesh that uses Envoy as the data plane. Circuit breaking configured via DestinationRule resources.

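AWS SDK - Built-in retry logic for AWS service calls. The standard retry mode includes a retry quota that limits retries when failures persist, which gives circuit-breaker-like protection against retry storms.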

How to apply it in practice

Where to put circuit breakers

Put circuit breakers on every call to an external service or a service that can fail independently:

HTTP calls to other microservices
Database calls (if the database can be slow or unavailable)
Third-party API calls (payment processors, email services)
Cache calls (if the cache can be unavailable)

Do not put circuit breakers on:

In-process function calls
Calls to local resources that are effectively always available (e.g., the local filesystem)

Configuration guidelines

Start with these defaults and tune based on observation:

Sliding window size: 10 requests
Failure rate threshold: 50%
Wait duration in open state: 30 seconds
Permitted calls in half-open: 3
Timeout per request: 2-5 seconds (well below the 30-second wait duration, so slow calls are recorded as failures quickly)
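
Keeping these defaults in one validated object makes them easier to tune per dependency. A minimal sketch follows; the class and field names are hypothetical and do not correspond to any particular library's configuration keys.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BreakerConfig:
    sliding_window_size: int = 10            # number of recent calls considered
    failure_rate_threshold: float = 0.5      # 50%
    wait_seconds_in_open: float = 30.0
    permitted_calls_in_half_open: int = 3
    request_timeout_seconds: float = 2.0     # low end of the 2-5 second range

    def __post_init__(self):
        if self.sliding_window_size < 1:
            raise ValueError("sliding_window_size must be at least 1")
        if not 0.0 < self.failure_rate_threshold <= 1.0:
            raise ValueError("failure_rate_threshold must be in (0, 1]")
        if self.wait_seconds_in_open <= 0 or self.request_timeout_seconds <= 0:
            raise ValueError("durations must be positive")
        if self.permitted_calls_in_half_open < 1:
            raise ValueError("permitted_calls_in_half_open must be at least 1")


defaults = BreakerConfig()
# Example override for a dependency where a tighter threshold is wanted.
payments = BreakerConfig(sliding_window_size=20, failure_rate_threshold=0.3,
                         wait_seconds_in_open=60.0)
```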

Testing circuit breakers

Test the circuit breaker explicitly:

Inject failures into the downstream service and verify the circuit opens
Verify the fallback response is correct
Verify the circuit closes after the service recovers
Test the half-open state behavior

Use chaos engineering tools (Chaos Monkey, Gremlin) to inject failures in production and verify circuit breakers work correctly.
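
Before reaching for chaos tooling, the same checks can run as a fast unit test with a fake clock and a downstream stub that fails on demand. In this sketch `TinyBreaker` is only a stand-in for whatever breaker you are testing, and `FakeClock` and `FlakyService` are hypothetical test doubles.

```python
class FakeClock:
    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


class FlakyService:
    """Downstream stub whose failures can be switched on and off."""

    def __init__(self):
        self.failing = False
        self.calls = 0

    def get(self):
        self.calls += 1
        if self.failing:
            raise ConnectionError("injected failure")
        return "ok"


class TinyBreaker:
    """Stand-in breaker: opens after `threshold` consecutive failures."""

    def __init__(self, threshold, wait_seconds, clock):
        self.threshold, self.wait_seconds, self.clock = threshold, wait_seconds, clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None and self.clock() - self.opened_at < self.wait_seconds:
            return fallback  # open: fail fast
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.threshold:
                self.opened_at = self.clock()  # open, or reopen after a failed test
            return fallback
        self.failures, self.opened_at = 0, None
        return result


def test_breaker_lifecycle():
    clock, service = FakeClock(), FlakyService()
    breaker = TinyBreaker(threshold=3, wait_seconds=30.0, clock=clock)

    service.failing = True  # inject failures: the circuit opens
    for _ in range(3):
        assert breaker.call(service.get, fallback="cached") == "cached"
    assert breaker.call(service.get, fallback="cached") == "cached"
    assert service.calls == 3  # the fourth call never reached the service

    clock.advance(31)  # half-open: a failed test call reopens the circuit
    assert breaker.call(service.get, fallback="cached") == "cached"
    assert service.calls == 4

    service.failing = False  # recovery: the next test call closes the circuit
    clock.advance(31)
    assert breaker.call(service.get, fallback="cached") == "ok"
    assert breaker.opened_at is None


test_breaker_lifecycle()
```

The fake clock is what keeps the test fast: the 30-second wait is simulated rather than slept.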

FAQ

Q: What is the difference between a circuit breaker and a timeout?

A timeout limits how long you wait for a single request. A circuit breaker stops making requests entirely when failures accumulate. They are complementary: use timeouts to prevent individual requests from hanging indefinitely, and circuit breakers to stop making requests when the downstream service is consistently failing. Without a timeout, a hung request never registers as a failure, so the circuit breaker might never open. Without a circuit breaker, every request still waits out the full timeout, so threads can still be exhausted under sustained failures.

Q: Should you use a circuit breaker for database calls?

Yes, if the database can be slow or unavailable. A circuit breaker on database calls prevents thread exhaustion when the database is overloaded. The fallback might be to return cached data or a degraded response. However, for most applications, the database is a critical dependency and there is no meaningful fallback. In that case, the circuit breaker still helps by failing fast (returning an error immediately) rather than waiting for timeouts, which prevents thread exhaustion.

Q: How do you handle circuit breakers in a distributed system where multiple instances of a service each have their own circuit breaker?

Each instance has its own circuit breaker state. If one instance sees many failures, its circuit opens. Other instances might still have their circuits closed. This is fine - each instance independently protects itself. The aggregate effect is that as more instances open their circuits, less traffic reaches the failing downstream service. For centralized circuit breaker state (all instances share the same state), use a distributed circuit breaker backed by a shared store such as Redis. This is more complex and adds a dependency on that store, but it provides consistent behavior across all instances.

Interview questions

Q1: Your product page calls 5 microservices. One of them (recommendations) is slow. How do you prevent this from taking down the entire product page?

Strong answer: Add a circuit breaker on the recommendations service call. Configure a short timeout (500ms) and a circuit breaker that opens after 50% failure rate. When the circuit opens, return a fallback (empty recommendations or cached recommendations). The product page renders without recommendations rather than timing out. Also add a bulkhead: use a separate thread pool for recommendation calls so they cannot exhaust the main thread pool. Monitor the circuit breaker state and alert when it opens. Investigate the root cause of the recommendation service slowness. The circuit breaker buys time to fix the underlying issue without causing a full outage.

Q2: Design the resilience strategy for a payment service that calls an external payment processor.

Strong answer: Multiple layers of resilience. First, timeout: set a 3-second timeout on payment processor calls. Second, retry: retry on transient failures (network timeout, 503) with exponential backoff (1s, 2s, 4s). Do not retry on 4xx errors (invalid card, insufficient funds). Third, circuit breaker: open after 30% failure rate over 20 requests. Wait 60 seconds before half-open. Fallback: queue the payment for retry rather than failing immediately (if the business allows async payment processing). Fourth, bulkhead: separate thread pool for payment processor calls. Fifth, idempotency: include an idempotency key in every payment request so retries do not cause double charges. Monitor all of these with metrics and alerts.
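
The retry and idempotency layers of that answer can be sketched together. Everything here is hypothetical: `send` stands for a client function that takes an amount and an idempotency key and returns an HTTP status code, raising `TimeoutError` on a network timeout.

```python
import time
import uuid


def backoff_delays(base_seconds=1.0, retries=3):
    """Delays before each retry: 1 s, 2 s, 4 s with the defaults."""
    return [base_seconds * (2 ** i) for i in range(retries)]


def charge_with_retries(send, amount_cents, max_retries=3, sleep=time.sleep):
    """Retry only transient failures, reusing one idempotency key for every attempt."""
    idempotency_key = str(uuid.uuid4())  # generated once per payment, not per attempt
    delays = backoff_delays(retries=max_retries)
    for attempt in range(max_retries + 1):
        try:
            status = send(amount_cents, idempotency_key)
        except TimeoutError:
            status = None  # no response: treat as transient
        transient = status is None or status == 503
        if not transient or attempt == max_retries:
            # Success, a 4xx such as a declined card, or retries exhausted.
            return status
        sleep(delays[attempt])
```

Because the key is the same on every attempt, a processor that supports idempotency keys can recognize a retry of a charge it already applied and avoid charging twice.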

Q3: A circuit breaker is open. The downstream service has recovered. How does the circuit know to close?

Strong answer: After the configured wait duration (e.g., 30 seconds), the circuit transitions to half-open state. In half-open state, a limited number of test requests (e.g., 3) are allowed through to the downstream service. If all test requests succeed (within the timeout), the circuit closes and normal operation resumes. If any test request fails, the circuit opens again and the wait duration resets. The half-open state is the recovery detection mechanism. Without it, the circuit would stay open forever even after the downstream service recovers. The wait duration should be long enough for the downstream service to recover (typically 30-60 seconds) but short enough that recovery is detected quickly.