Inbox/Outbox Pattern: Reliable Messaging Without Distributed Transactions


Your order service creates an order in the database and publishes an “order created” event to Kafka. Most of the time, both succeed. But sometimes the database write succeeds and the Kafka publish fails. Now the order exists but no downstream service knows about it. Inventory is not updated. No confirmation email is sent. The order is in a ghost state.

You could use a distributed transaction (two-phase commit) to make both operations atomic. But 2PC is slow and complex, and its coordinator is a single point of failure. There is a better way.

The dual-write problem

The core problem: you need to update a database and publish a message atomically. Either both happen or neither happens. Two independent writes to two different systems cannot guarantee this on their own, whichever order you perform them in.

Option 1: Write to DB first, then publish

  • DB write succeeds, Kafka publish fails: event is lost
  • DB write succeeds, service crashes before Kafka publish: event is lost

Option 2: Publish first, then write to DB

  • Kafka publish succeeds, DB write fails: event published for a transaction that did not happen
  • Kafka publish succeeds, service crashes before DB write: same problem

Neither option is safe. The outbox pattern solves this.
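To make the failure mode of Option 1 concrete, here is a minimal Python sketch. It assumes SQLite as a stand-in for the order database, and `flaky_publish` and `BrokerDown` are hypothetical names that simulate a Kafka outage; no real Kafka client is involved.

```python
import sqlite3

class BrokerDown(Exception):
    """Raised by the stand-in publisher to simulate a Kafka outage."""

def flaky_publish(topic, event):
    # Hypothetical stand-in for a Kafka producer call that fails.
    raise BrokerDown("broker unavailable")

def create_order_dual_write(conn, order_id, publish):
    # Step 1: commit the order to the database.
    with conn:
        conn.execute(
            "INSERT INTO orders (id, status) VALUES (?, 'created')", (order_id,)
        )
    # Step 2: publish afterwards. A failure or crash at this point leaves the
    # order committed with no event: the ghost state described earlier.
    publish("orders", {"type": "order.created", "order_id": order_id})

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT)")

try:
    create_order_dual_write(conn, "o-1", flaky_publish)
except BrokerDown:
    pass

# The order row is committed even though nothing was published.
orders_in_db = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(orders_in_db)
```

Catching the exception and retrying the publish does not help if the process itself dies between the two steps, which is why the fix has to live in the database.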

The outbox pattern

Instead of writing to the database and publishing to Kafka separately, write both to the database in a single transaction. A separate process reads from the database and publishes to Kafka.

How it works:

  1. The service writes the business data (order) and an outbox record (the event to publish) in the same database transaction.
  2. A separate “relay” process reads unprocessed outbox records and publishes them to Kafka.
  3. After successful publishing, the relay marks the outbox record as processed (or deletes it).

The database transaction guarantees atomicity: either both the order and the outbox record are written, or neither is. The relay handles the Kafka publishing separately, with retries.

graph TB
subgraph transaction["Single Database Transaction"]
  APP["Order service"] -->|"BEGIN TRANSACTION"| DB["Database"]
  DB --> OR["orders table<br/>INSERT order"]
  DB --> OB["outbox table<br/>INSERT event"]
  DB -->|"COMMIT"| APP
end

subgraph relay["Relay Process"]
  REL["Outbox relay<br/>(polls or CDC)"] -->|"read unprocessed"| OB2["outbox table"]
  REL -->|"publish"| KAFKA["Kafka"]
  KAFKA -->|"success"| REL
  REL -->|"mark processed"| OB2
end

style DB fill:#EEEDFE,stroke:#534AB7,color:#3C3489
style KAFKA fill:#E1F5EE,stroke:#0F6E56,color:#085041
style REL fill:#FAEEDA,stroke:#854F0B,color:#633806

Implementing the outbox

The outbox table

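CREATE TABLE outbox (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),  -- doubles as the event ID; built in since PostgreSQL 13
  aggregate_type VARCHAR(255) NOT NULL,  -- 'order', 'user', etc.
  aggregate_id VARCHAR(255) NOT NULL,    -- the entity ID
  event_type VARCHAR(255) NOT NULL,      -- 'order.created', 'order.shipped'
  payload JSONB NOT NULL,                -- the event data
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  processed_at TIMESTAMPTZ,              -- NULL until the relay has published the event
  status VARCHAR(50) NOT NULL DEFAULT 'pending'   -- 'pending', 'processed', 'failed'
);

-- Keeps the relay's "pending" query cheap once most rows are processed.
CREATE INDEX outbox_pending_idx ON outbox (created_at) WHERE status = 'pending';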
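The write side can be sketched in Python as follows. This is an illustration under assumptions, not a reference implementation: SQLite stands in for PostgreSQL (so the column types are simplified), and `init_schema`, `create_order`, and the `total_cents` column are hypothetical names chosen for the example.

```python
import json
import sqlite3
import uuid

def init_schema(conn):
    # SQLite stand-ins for the orders and outbox tables (types simplified).
    conn.execute(
        "CREATE TABLE orders (id TEXT PRIMARY KEY, total_cents INTEGER NOT NULL)"
    )
    conn.execute(
        """CREATE TABLE outbox (
               id TEXT PRIMARY KEY,
               aggregate_type TEXT NOT NULL,
               aggregate_id TEXT NOT NULL,
               event_type TEXT NOT NULL,
               payload TEXT NOT NULL,
               created_at TEXT DEFAULT CURRENT_TIMESTAMP,
               processed_at TEXT,
               status TEXT DEFAULT 'pending'
           )"""
    )

def create_order(conn, order_id, total_cents, event_id=None):
    """Insert the order and its outbox event in one local transaction."""
    event_id = event_id or str(uuid.uuid4())
    payload = json.dumps({"order_id": order_id, "total_cents": total_cents})
    # "with conn" commits both inserts together, or rolls both back on error.
    with conn:
        conn.execute(
            "INSERT INTO orders (id, total_cents) VALUES (?, ?)",
            (order_id, total_cents),
        )
        conn.execute(
            "INSERT INTO outbox (id, aggregate_type, aggregate_id, event_type, payload) "
            "VALUES (?, 'order', ?, 'order.created', ?)",
            (event_id, order_id, payload),
        )
    return event_id

conn = sqlite3.connect(":memory:")
init_schema(conn)
first_event_id = create_order(conn, "o-1", 4200)

# Force the outbox insert to fail after the order insert succeeded by reusing
# an event ID. The order "o-2" is rolled back along with it.
try:
    create_order(conn, "o-2", 1500, event_id=first_event_id)
except sqlite3.IntegrityError:
    pass

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
outbox_count = conn.execute("SELECT COUNT(*) FROM outbox").fetchone()[0]
print(order_count, outbox_count)
```

The point of the second call is that there is never an order without its event, or an event without its order.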

The relay: polling vs CDC

Polling relay: A background job queries the outbox table for unprocessed records every few seconds. Simple to implement. Adds latency (up to the poll interval). Adds load to the database.
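A polling relay can be quite small. The sketch below assumes SQLite in place of PostgreSQL, orders rows by SQLite's `rowid` as a stand-in for insertion order, and takes a `publish` callable as a hypothetical placeholder for a Kafka producer. A real relay would call `relay_once` in a loop with a sleep between polls.

```python
import sqlite3

def relay_once(conn, publish, batch_size=100):
    """Publish one batch of pending outbox rows in insertion order.

    Returns the number of rows published and marked as processed.
    """
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox "
        "WHERE status = 'pending' ORDER BY rowid LIMIT ?",
        (batch_size,),
    ).fetchall()
    sent = 0
    for event_id, event_type, payload in rows:
        try:
            publish(event_type, event_id, payload)
        except Exception:
            # Stop here so later events are not published ahead of this one;
            # the row stays 'pending' and the next poll retries it.
            break
        with conn:
            conn.execute(
                "UPDATE outbox SET status = 'processed', "
                "processed_at = CURRENT_TIMESTAMP WHERE id = ?",
                (event_id,),
            )
        sent += 1
    return sent

# Demo with an in-memory table and a list standing in for Kafka.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE outbox (id TEXT PRIMARY KEY, event_type TEXT, payload TEXT, "
    "status TEXT DEFAULT 'pending', processed_at TEXT)"
)
with conn:
    conn.executemany(
        "INSERT INTO outbox (id, event_type, payload) VALUES (?, ?, ?)",
        [("e1", "order.created", "{}"), ("e2", "order.shipped", "{}")],
    )

published = []
relay_once(conn, lambda event_type, event_id, payload: published.append(event_id))
print(published)
```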

Change Data Capture (CDC): Use Debezium to stream changes from the database’s WAL (write-ahead log) directly to Kafka. Near-real-time (milliseconds). No polling overhead. More complex to set up.

CDC is often the better fit for high-volume production systems. Debezium reads the PostgreSQL WAL and publishes changes to Kafka. Its outbox event router can be configured to forward each captured outbox row to a Kafka topic chosen from the row's aggregate type.

At-least-once delivery

The relay publishes events at-least-once. If the relay crashes after publishing but before marking the record as processed, it will publish again on restart. Consumers must be idempotent (handle duplicate events).

Include a unique event ID in every event. Consumers store processed event IDs and skip duplicates.
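A minimal sketch of that deduplication step, assuming the consumer keeps its own SQLite table of seen event IDs. The table names `processed_events` and `emails_sent` and the helper `handle_once` are hypothetical.

```python
import sqlite3

def handle_once(conn, event_id, apply_change):
    """Run apply_change(conn) only the first time event_id is seen."""
    with conn:
        try:
            conn.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)", (event_id,)
            )
        except sqlite3.IntegrityError:
            return False  # duplicate delivery: skip it
        # Runs in the same transaction as the ID insert, so a failure here
        # rolls the ID back and the event can be retried later.
        apply_change(conn)
    return True

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE emails_sent (order_id TEXT)")

def record_email(c):
    c.execute("INSERT INTO emails_sent (order_id) VALUES ('o-1')")

first = handle_once(conn, "evt-1", record_email)
second = handle_once(conn, "evt-1", record_email)  # redelivery of the same event
print(first, second)
```

Recording the ID and applying the change in one transaction is a design choice of this sketch; it only works when the side effect is itself a database write in the same database.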

The inbox pattern

The inbox pattern is the consumer-side complement to the outbox. When a service receives an event, it writes it to an inbox table before processing. This ensures the event is not lost if the service crashes during processing, and the stored event ID gives the consumer a place to detect duplicates.

How it works:

  1. Consumer receives an event from Kafka
  2. Consumer writes the event to the inbox table (with the event ID)
  3. Consumer processes the event (updates business data)
  4. Consumer marks the inbox record as processed

If the consumer crashes between steps 2 and 3, it restarts and processes the event from the inbox. If it crashes between steps 3 and 4, it restarts and tries to process the event again, so the business logic must be idempotent (check whether the event was already applied before making changes). When the inbox table and the business data live in the same database, running steps 3 and 4 in one transaction closes this gap.
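The steps above can be sketched in Python. This is one possible variant, assuming the inbox and the business table share a database so that steps 3 and 4 can run in a single transaction. SQLite stands in for the real database, and the `stock` table and both function names are hypothetical.

```python
import json
import sqlite3

def receive(conn, event_id, payload):
    """Step 2: persist the event; a redelivered event ID is ignored."""
    with conn:
        conn.execute(
            "INSERT OR IGNORE INTO inbox (event_id, payload) VALUES (?, ?)",
            (event_id, json.dumps(payload)),
        )

def process_pending(conn):
    """Steps 3 and 4, one transaction per event; returns how many were handled."""
    rows = conn.execute(
        "SELECT event_id, payload FROM inbox "
        "WHERE processed_at IS NULL ORDER BY rowid"
    ).fetchall()
    for event_id, payload in rows:
        data = json.loads(payload)
        with conn:
            # Business update and "mark processed" commit or roll back together.
            conn.execute(
                "UPDATE stock SET quantity = quantity - ? WHERE sku = ?",
                (data["quantity"], data["sku"]),
            )
            conn.execute(
                "UPDATE inbox SET processed_at = CURRENT_TIMESTAMP WHERE event_id = ?",
                (event_id,),
            )
    return len(rows)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inbox (event_id TEXT PRIMARY KEY, payload TEXT NOT NULL, processed_at TEXT)"
)
conn.execute("CREATE TABLE stock (sku TEXT PRIMARY KEY, quantity INTEGER NOT NULL)")
conn.execute("INSERT INTO stock VALUES ('widget', 10)")
conn.commit()

event = {"sku": "widget", "quantity": 2}
receive(conn, "evt-1", event)
receive(conn, "evt-1", event)  # duplicate delivery is ignored
handled_first = process_pending(conn)
handled_second = process_pending(conn)  # nothing left to do
remaining = conn.execute("SELECT quantity FROM stock WHERE sku = 'widget'").fetchone()[0]
print(handled_first, handled_second, remaining)
```

After a crash, calling `process_pending` on startup picks up any inbox rows that were stored but never applied.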

graph LR
subgraph inbox["Inbox Pattern"]
  KAFKA2["Kafka"] -->|"event"| CONS["Consumer"]
  CONS -->|"1. Write to inbox"| IB["inbox table"]
  CONS -->|"2. Process event"| BIZ["Business logic<br/>Update orders table"]
  CONS -->|"3. Mark processed"| IB
  IB -->|"idempotency check<br/>before processing"| CONS
end

style KAFKA2 fill:#E1F5EE,stroke:#0F6E56,color:#085041
style IB fill:#EEEDFE,stroke:#534AB7,color:#3C3489
style BIZ fill:#FAEEDA,stroke:#854F0B,color:#633806

Where it breaks or gets interesting

Outbox table growth

The outbox table grows as events are added. If the relay falls behind, the table can grow large. Regularly delete processed records. Use a background job to clean up records older than N days.
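A cleanup job along those lines might look like this. The sketch assumes SQLite date functions and a hypothetical `purge_processed` helper; deleting in batches keeps each transaction short, which matters more on a busy production database than in this toy setup.

```python
import sqlite3

def purge_processed(conn, older_than_days, batch_size=1000):
    """Delete processed rows older than the cutoff, batch by batch.

    Returns the total number of rows deleted.
    """
    total = 0
    while True:
        with conn:
            cur = conn.execute(
                "DELETE FROM outbox WHERE id IN ("
                "  SELECT id FROM outbox WHERE status = 'processed'"
                "  AND processed_at < datetime('now', ?) LIMIT ?)",
                (f"-{older_than_days} days", batch_size),
            )
        total += cur.rowcount
        if cur.rowcount < batch_size:
            return total

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE outbox (id TEXT PRIMARY KEY, status TEXT, processed_at TEXT)")
with conn:
    conn.execute("INSERT INTO outbox VALUES ('old', 'processed', datetime('now', '-30 days'))")
    conn.execute("INSERT INTO outbox VALUES ('new', 'processed', datetime('now'))")
    conn.execute("INSERT INTO outbox VALUES ('waiting', 'pending', NULL)")

deleted = purge_processed(conn, older_than_days=7)
print(deleted)
```

Pending and failed rows are never touched, so the job is safe to run while the relay is behind.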

Ordering guarantees

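A single-threaded relay that reads outbox records in insertion order publishes them in that order. With multiple threads or relay instances, events can be published out of order. Use a single-threaded relay, or partition events by aggregate ID (for example, by using the aggregate ID as the Kafka message key) so that all events for one entity stay in order.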
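Partitioning by aggregate ID can be illustrated with a small sketch. It assumes a fixed number of relay workers and uses `zlib.crc32` as a stable hash; `worker_for` and `split_by_worker` are hypothetical helpers, not part of any library.

```python
import zlib
from collections import defaultdict

def worker_for(aggregate_id, worker_count):
    # crc32 is stable across processes, unlike Python's built-in hash() for str.
    return zlib.crc32(aggregate_id.encode("utf-8")) % worker_count

def split_by_worker(events, worker_count):
    """Group (aggregate_id, event_type) pairs per worker, keeping input order."""
    lanes = defaultdict(list)
    for aggregate_id, event_type in events:
        lanes[worker_for(aggregate_id, worker_count)].append((aggregate_id, event_type))
    return dict(lanes)

events = [
    ("o-1", "order.created"),
    ("o-2", "order.created"),
    ("o-1", "order.paid"),
    ("o-1", "order.shipped"),
    ("o-2", "order.cancelled"),
]
lanes = split_by_worker(events, worker_count=3)
for worker, lane in sorted(lanes.items()):
    print(worker, lane)
```

Every event for a given order lands in the same lane, in its original order, so each worker can publish its lane sequentially without coordinating with the others. Ordering across different orders is not preserved, which is usually acceptable.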

The relay as a single point of failure

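If the relay crashes, events stop being published until it restarts; they accumulate in the outbox rather than being lost. Use multiple relay instances with leader election (only one publishes at a time) or use CDC (Debezium records its position in the log and resumes from there after a restart).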
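One simple way to approximate leader election is a lease row in the database. The sketch below is an assumption-heavy illustration: the `relay_lease` table, the `try_acquire_lease` helper, and the 30-second TTL are all hypothetical, and the caller passes in the current time to keep the example deterministic.

```python
import sqlite3

def try_acquire_lease(conn, instance_id, now, ttl_seconds=30):
    """Take or renew the relay lease; return True if this instance holds it."""
    with conn:
        cur = conn.execute(
            "UPDATE relay_lease SET holder = ?, expires_at = ? "
            "WHERE name = 'outbox-relay' AND (holder = ? OR expires_at <= ?)",
            (instance_id, now + ttl_seconds, instance_id, now),
        )
    # Exactly one row is updated when the lease was free, expired, or already ours.
    return cur.rowcount == 1

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE relay_lease (name TEXT PRIMARY KEY, holder TEXT, expires_at REAL)")
conn.execute("INSERT INTO relay_lease VALUES ('outbox-relay', NULL, 0)")
conn.commit()

a_leads = try_acquire_lease(conn, "relay-a", now=100)         # lease is free
b_leads = try_acquire_lease(conn, "relay-b", now=110)         # relay-a still holds it
b_leads_later = try_acquire_lease(conn, "relay-b", now=131)   # relay-a's lease expired
print(a_leads, b_leads, b_leads_later)
```

A lease depends on reasonably synchronized clocks and can allow a brief overlap between two leaders, so it complements idempotent consumers rather than replacing them.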

Transactional outbox with NoSQL

The outbox pattern requires atomic writes to two tables. This is easy with SQL databases (single transaction). With NoSQL databases, you need a different approach: use a document database that supports multi-document transactions (MongoDB 4.0+), or use a single document that contains both the business data and the outbox event.

Real-world systems

Debezium - Open-source CDC platform. Reads the database's transaction log (the WAL in PostgreSQL, the binlog in MySQL) and publishes changes to Kafka. Supports PostgreSQL, MySQL, MongoDB, and others. The standard tool for implementing the outbox pattern with CDC.

Eventuate Tram - Framework for transactional messaging in microservices. Implements the outbox pattern with polling and CDC options.

Axon Framework - Java framework for event-driven microservices. Built-in support for the outbox pattern.

Stripe - Uses a similar pattern for reliable event publishing. Events are stored in the database before being published to their event bus.

How to apply it in practice

When to use the outbox pattern

Use the outbox pattern when:

  • You need to update a database and publish an event atomically
  • You cannot afford to lose events (payment events, order events)
  • You want to avoid distributed transactions (2PC)

Do not use it when:

  • The event can be lost (analytics, metrics)
  • The operation is idempotent and retrying is safe
  • You are using a database that offers neither transactions nor atomic single-document writes

Choosing between polling and CDC

Use polling when:

  • Simplicity is more important than latency
  • You do not want to set up Debezium
  • Event volume is low (under 1,000 events per second)

Use CDC when:

  • Low latency is important (events published within milliseconds)
  • High event volume
  • You want to minimize database load (no polling queries)

Monitoring

Monitor:

  • Outbox table size (growing = relay is falling behind)
  • Relay lag (time between event creation and publication)
  • Failed events (events that could not be published after N retries)
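These three signals can come from a single query against the outbox table. The sketch below assumes SQLite syntax and a hypothetical `outbox_metrics` helper; the age of the oldest pending row is used as a proxy for relay lag.

```python
import sqlite3

def outbox_metrics(conn):
    """Return pending count, failed count, and age of the oldest pending row."""
    pending, failed, oldest_age = conn.execute(
        "SELECT "
        "  COALESCE(SUM(status = 'pending'), 0), "
        "  COALESCE(SUM(status = 'failed'), 0), "
        "  COALESCE(MAX(CASE WHEN status = 'pending' "
        "      THEN (julianday('now') - julianday(created_at)) * 86400.0 END), 0) "
        "FROM outbox"
    ).fetchone()
    return {
        "pending": pending,
        "failed": failed,
        "oldest_pending_age_seconds": oldest_age,
    }

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE outbox (id TEXT PRIMARY KEY, status TEXT, created_at TEXT)")
with conn:
    conn.execute("INSERT INTO outbox VALUES ('e1', 'processed', datetime('now', '-120 seconds'))")
    conn.execute("INSERT INTO outbox VALUES ('e2', 'pending', datetime('now', '-60 seconds'))")
    conn.execute("INSERT INTO outbox VALUES ('e3', 'failed', datetime('now', '-90 seconds'))")

metrics = outbox_metrics(conn)
print(metrics)
```

Exporting these values to whatever metrics system you already use, and alerting when the oldest pending age exceeds a threshold, catches a stalled relay early.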

FAQ

Q: Is the outbox pattern the same as event sourcing?

No. Event sourcing stores all state changes as events and derives current state by replaying them. The outbox pattern is a reliability mechanism for publishing events to a message broker. They are complementary: an event-sourced system might use the outbox pattern to reliably publish events to Kafka.

Q: What if the Kafka publish fails repeatedly?

Move the event to a dead letter queue (or mark it as failed in the outbox table). Alert on failed events. Investigate the cause (Kafka unavailable, schema mismatch, etc.). Fix the issue and replay the failed events. The outbox table serves as a durable buffer - events are not lost even if Kafka is unavailable for hours.
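A retry budget with a "failed" state can be sketched as follows. The `attempts` and `last_error` columns, the `MAX_ATTEMPTS` value, and both helper names are assumptions added for illustration; they are not part of the outbox table defined earlier.

```python
import sqlite3

MAX_ATTEMPTS = 5  # assumed retry budget

def publish_with_retry_budget(conn, event_id, payload, publish):
    """Try to publish once; mark the row failed when the budget is used up."""
    try:
        publish(payload)
    except Exception as exc:
        with conn:
            # SET expressions see the old row, so "attempts + 1" is the new count.
            conn.execute(
                "UPDATE outbox SET attempts = attempts + 1, last_error = ?, "
                "status = CASE WHEN attempts + 1 >= ? THEN 'failed' ELSE 'pending' END "
                "WHERE id = ?",
                (str(exc), MAX_ATTEMPTS, event_id),
            )
        return False
    with conn:
        conn.execute("UPDATE outbox SET status = 'processed' WHERE id = ?", (event_id,))
    return True

def replay_failed(conn):
    """Put failed events back in the queue once the root cause is fixed."""
    with conn:
        cur = conn.execute(
            "UPDATE outbox SET status = 'pending', attempts = 0 WHERE status = 'failed'"
        )
    return cur.rowcount

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE outbox (id TEXT PRIMARY KEY, payload TEXT, "
    "status TEXT DEFAULT 'pending', attempts INTEGER DEFAULT 0, last_error TEXT)"
)
conn.execute("INSERT INTO outbox (id, payload) VALUES ('e1', '{}')")
conn.commit()

def broken_publish(payload):
    raise ConnectionError("kafka unavailable")

for _ in range(MAX_ATTEMPTS):
    publish_with_retry_budget(conn, "e1", "{}", broken_publish)

status_after_failures = conn.execute("SELECT status FROM outbox WHERE id = 'e1'").fetchone()[0]
print(status_after_failures)
```

Failed rows stay in the table with their last error, so alerting and replay can both work from the same source.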

Q: Can you use the outbox pattern with a NoSQL database?

Yes, if the database supports atomic multi-document writes. MongoDB 4.0+ supports multi-document transactions: write the business document and the outbox document in the same transaction. DynamoDB offers transactional writes across items (TransactWriteItems). For databases without general multi-record transactions (Cassandra, for example), use a single document or row that contains both the business data and the outbox event, so that one atomic write covers both, combined with a conditional write where you need to guard against concurrent updates.

Interview questions

Q1: You are building an order service. When an order is created, you need to publish an event to Kafka. How do you ensure the event is never lost?

Strong answer: Use the outbox pattern. In the same database transaction that creates the order, insert a record into the outbox table with the event payload. A separate relay process (using Debezium CDC or polling) reads the outbox table and publishes events to Kafka. After successful publishing, the relay marks the record as processed. If the order service crashes after the transaction commits but before the relay publishes, the relay will publish on its next run. If the relay crashes after publishing but before marking as processed, it will publish again - so consumers must be idempotent. This guarantees at-least-once delivery without distributed transactions.

Q2: Your outbox relay is falling behind. The outbox table has 1 million unprocessed records. How do you catch up?

Strong answer: First, identify why the relay is falling behind: is Kafka slow? Is the relay CPU-bound? Is there a network issue? Fix the root cause. To catch up: scale the relay horizontally (multiple relay instances, each handling a subset of records, partitioned by aggregate ID to maintain ordering). Increase the batch size (publish more events per Kafka batch). If Kafka is the bottleneck, increase Kafka partitions and producer throughput; note that adding partitions to an existing topic changes which partition a key maps to, so per-aggregate ordering can be disturbed during the change. Monitor the outbox table size and relay lag continuously. Set up alerts so you catch this before it becomes a 1 million record backlog. Consider using CDC (Debezium) instead of polling: CDC has lower latency and puts less load on the database.

Q3: How does the outbox pattern compare to using a distributed transaction (2PC) for the same problem?

Strong answer: Two-phase commit (2PC) would make the database write and the Kafka publish atomic by coordinating both systems through a coordinator. The coordinator asks both to “prepare,” then tells both to “commit.” Problems: the coordinator is a single point of failure, 2PC is slow (multiple round trips), and if the coordinator crashes after “prepare” but before “commit,” both participants hold their locks until the coordinator recovers. Kafka also does not take part in XA transactions, so 2PC is rarely an option here in practice. The outbox pattern avoids 2PC by using the database as the source of truth. The database transaction is local (fast, reliable). The Kafka publish is a separate, retryable operation. The tradeoff: eventual consistency (events are published after the transaction, not atomically with it) and at-least-once delivery, so consumers must be idempotent. For most use cases, this is acceptable. The outbox pattern is simpler, faster, and more reliable than 2PC.