Two Truths and One Database


distributed-systems microservices databases

System Design Scenario

Two Truths and One Database

When distributed systems create multiple versions of reality - and they’re all correct

⏱ 12 min read📐 Intermediate🔒 Distributed Systems

It’s Monday morning when the support ticket arrives: “Customer says they’re logged in but can’t access premium features.” The user service shows the account as active with a premium subscription. The billing service shows the same account as suspended for non-payment. Both systems are running perfectly. Both databases are consistent. Both answers are correct.

The customer paid their bill on Friday at 4:58 PM. The payment service processed it successfully and marked the account as paid. But the billing service was down for maintenance. When it came back online at 6 PM, it didn’t know about the payment. It suspended the account for non-payment at midnight, exactly as designed. The user service never received the suspension notice because the billing service was busy catching up on weekend processing.

This is distributed systems reality - multiple sources of truth that diverge over time, each making locally correct decisions based on incomplete information. It’s like two security guards at different entrances to a building, each with their own visitor log, making access decisions independently. The building has two different realities of who’s allowed inside, and both guards are following their procedures perfectly.

Why This Happens

The instinct when building microservices is to give each service its own database to ensure independence and avoid the coordination overhead of shared storage. Each service becomes the authoritative source for its domain - user service owns user data, billing service owns payment data, subscription service owns subscription states.

But business operations span domains. A payment affects billing status, which affects subscription access, which affects user permissions. When services make decisions based on their local state without coordination, they create divergent views of reality.

Payment processed successfully (payment service)
  -> Billing service offline during update
    -> User service never receives payment notification
      -> Billing service suspends for non-payment  
        -> User service still shows active
          -> Conflicting states across services

The problem is amplified by event processing delays. Modern systems use message queues and event streams to communicate between services. When one service is slow to process events or goes offline temporarily, the event lag creates temporary inconsistency that can become permanent if not handled properly.

Key Insight

Each microservice optimizes for local consistency at the cost of global consistency - the CAP theorem in action.

The Naive Solution (and where it breaks)

Most teams reach for distributed transactions or synchronous API calls between services to maintain consistency. The thinking is that if all services agree on changes before committing them, there can be no divergent state.

Distributed transactions use two-phase commit (2PC) to ensure all services either commit together or rollback together. But 2PC requires all participating services to be available and responsive. If the billing service is down when a payment is processed, the entire transaction blocks until billing comes back online.

Naive approach showing distributed transactions blocking on service failures

Synchronous API calls between services create similar problems. When the payment service tries to notify the billing service about a successful payment, it must wait for billing to respond. If billing is slow or down, the payment processing slows or fails entirely.

Watch Out

Distributed transactions trade availability for consistency - one slow service makes all services slow.

Small scale: 3 services -> 2PC coordination manageable
Large scale: 12 services -> any service outage blocks entire system

The Better Solution - Saga Pattern

Here’s what actually fixes this: use the saga pattern to coordinate long-running business transactions across services while maintaining service independence. A saga is like a choreographed dance where each service performs its part and signals the next service to begin, rather than a conductor requiring all musicians to play in perfect synchronization.

Sagas break complex business transactions into a series of local transactions, each managed by a single service. If any step fails, compensation actions undo the previous steps. Services remain loosely coupled and available.

# Payment saga orchestrator
class PaymentSaga:
    def __init__(self):
        self.steps = [
            {"service": "payment", "action": "charge_card", "compensation": "refund_charge"},
            {"service": "billing", "action": "mark_paid", "compensation": "mark_unpaid"},
            {"service": "subscription", "action": "activate", "compensation": "deactivate"},
            {"service": "notification", "action": "send_confirmation", "compensation": "send_cancellation"}
        ]
    
    async def execute_saga(self, payment_data):
        completed_steps = []
        
        try:
            for step in self.steps:
                result = await self.execute_step(step, payment_data)
                completed_steps.append({"step": step, "result": result})
                
                # Persist saga state after each step
                await self.save_saga_state(payment_data.saga_id, completed_steps)
                
        except Exception as error:
            # Compensate all completed steps in reverse order
            await self.compensate(completed_steps.reverse(), error)
            raise SagaFailedException(error)
    
    async def compensate(self, completed_steps, original_error):
        for step_record in completed_steps:
            try:
                await self.execute_compensation(step_record["step"], step_record["result"])
            except CompensationException as comp_error:
                # Log compensation failure but continue with other compensations
                logger.error(f"Compensation failed: {comp_error}")
Real World

Uber uses sagas extensively for trip booking - driver assignment, route planning, payment authorization, and notification all run as coordinated but independent transactions.

The Better Solution - Outbox Pattern

For reliable event publication, implement the outbox pattern to ensure events are published exactly once, even if the service crashes between processing and publishing.

The outbox pattern stores business data and events in the same database transaction. A separate process reads from the outbox table and publishes events to the message queue, ensuring no events are lost due to service failures.

-- Outbox table for reliable event publishing
CREATE TABLE event_outbox (
    id SERIAL PRIMARY KEY,
    aggregate_id VARCHAR NOT NULL,
    event_type VARCHAR NOT NULL, 
    event_payload JSONB NOT NULL,
    created_at TIMESTAMP DEFAULT NOW(),
    published_at TIMESTAMP NULL,
    published BOOLEAN DEFAULT FALSE,
    
    INDEX idx_unpublished (published, created_at)
);

-- Business transaction with outbox event
BEGIN TRANSACTION;

-- Update business data
UPDATE accounts SET status = 'active', updated_at = NOW() 
WHERE user_id = $1;

-- Insert event into outbox
INSERT INTO event_outbox (aggregate_id, event_type, event_payload)
VALUES ($1, 'AccountActivated', '{"user_id": "123", "activated_at": "2024-12-19T10:00:00Z"}');

COMMIT;

The outbox publisher runs independently and handles failures gracefully:

// Outbox event publisher
type OutboxPublisher struct {
    db        *sql.DB
    publisher EventPublisher
    batchSize int
}

func (p *OutboxPublisher) PublishPendingEvents(ctx context.Context) error {
    for {
        // Get batch of unpublished events
        events, err := p.getUnpublishedEvents(p.batchSize)
        if err != nil {
            return err
        }
        
        if len(events) == 0 {
            time.Sleep(5 * time.Second)
            continue
        }
        
        // Publish events and mark as published
        for _, event := range events {
            if err := p.publisher.Publish(event.EventType, event.Payload); err != nil {
                // Skip this event, will retry on next iteration
                continue
            }
            
            // Mark as published
            p.markEventPublished(event.ID)
        }
    }
}
Key Insight

The outbox pattern decouples business logic from event publishing, ensuring events are never lost due to infrastructure failures.

The Better Solution - Compensating Transactions

For handling saga step failures, implement compensating transactions that semantically undo the effects of completed steps. Compensation doesn’t require perfect technical rollback - it requires business-level correctness.

# Compensating transaction implementation
class BillingService:
    async def mark_paid(self, account_id, payment_id):
        # Business transaction
        await self.db.execute(
            "UPDATE billing_accounts SET status = 'paid', last_payment_id = %s WHERE id = %s",
            (payment_id, account_id)
        )
        
        return {"account_id": account_id, "payment_id": payment_id, "marked_paid_at": datetime.utcnow()}
    
    async def compensate_mark_paid(self, compensation_data):
        # Compensation doesn't just reverse - it records what happened
        account_id = compensation_data["account_id"]
        payment_id = compensation_data["payment_id"]
        
        # Mark account as unpaid again
        await self.db.execute(
            "UPDATE billing_accounts SET status = 'unpaid' WHERE id = %s",
            (account_id,)
        )
        
        # Record the compensation for audit trail
        await self.db.execute(
            """INSERT INTO billing_compensations 
               (account_id, payment_id, compensated_at, reason)
               VALUES (%s, %s, %s, %s)""",
            (account_id, payment_id, datetime.utcnow(), "saga_rollback")
        )

Compensating transactions handle idempotency automatically - if the same compensation is requested twice, it should be safe to execute twice without causing inconsistency.

Saga pattern showing compensating transactions and event flow

The Full Architecture

Complete eventual consistency architecture with sagas, outbox pattern, and event sourcing

The complete system manages eventual consistency through multiple coordinated patterns. The outbox pattern ensures reliable event publishing from each service. The saga orchestrator coordinates multi-service transactions with proper compensation logic. Event stores provide audit trails and enable replay for consistency recovery. The consistency checker runs periodic reconciliation jobs to detect and resolve data divergence that may occur despite the safety mechanisms.

Each service publishes events through its outbox, which guarantees exactly-once delivery to the event stream. The saga orchestrator subscribes to relevant events and manages multi-step business processes. When inconsistencies are detected through monitoring or reconciliation jobs, compensation sagas can be triggered to restore consistency without requiring distributed rollbacks.

This architecture accepts that temporary inconsistency will occur, but provides mechanisms to detect it quickly and resolve it automatically. The system optimizes for availability and partition tolerance while eventually achieving consistency.

Key Insight

Eventual consistency isn’t about accepting broken data - it’s about building systems that self-heal from temporary inconsistencies.

Component Deep Dives

Saga Orchestrator

The saga orchestrator’s job is to coordinate multi-service transactions while handling partial failures, retries, and compensation logic. It maintains saga state persistently to recover from crashes.

# Persistent saga state management
class SagaState:
    def __init__(self, saga_id: str):
        self.saga_id = saga_id
        self.current_step = 0
        self.completed_steps = []
        self.compensation_steps = []
        self.status = SagaStatus.RUNNING
        self.created_at = datetime.utcnow()
        self.updated_at = datetime.utcnow()
    
    def complete_step(self, step_result):
        self.completed_steps.append({
            'step_index': self.current_step,
            'result': step_result,
            'completed_at': datetime.utcnow()
        })
        self.current_step += 1
        self.updated_at = datetime.utcnow()
    
    def add_compensation(self, compensation_action):
        self.compensation_steps.insert(0, {  # LIFO for compensation
            'action': compensation_action,
            'for_step': self.current_step - 1
        })

class SagaOrchestrator:
    def __init__(self, saga_repository, event_publisher):
        self.saga_repo = saga_repository
        self.event_publisher = event_publisher
    
    async def execute_saga(self, saga_definition, initial_data):
        saga_state = SagaState(saga_id=str(uuid4()))
        
        try:
            for step in saga_definition.steps[saga_state.current_step:]:
                # Execute step with retry logic
                step_result = await self.execute_step_with_retry(step, initial_data)
                saga_state.complete_step(step_result)
                
                # Persist state after each step
                await self.saga_repo.save_state(saga_state)
                
                # Publish step completion event
                await self.event_publisher.publish(
                    f"saga.{saga_definition.name}.step.completed",
                    {"saga_id": saga_state.saga_id, "step": step.name, "result": step_result}
                )
                
        except StepExecutionException as e:
            saga_state.status = SagaStatus.COMPENSATING
            await self.execute_compensation(saga_state, e)

The orchestrator persists state after every step, enabling recovery from failures without losing transaction progress or executing steps multiple times.

Outbox Event Publisher

The outbox publisher’s job is to reliably deliver events from the outbox table to message queues while handling duplicate detection, ordering, and error recovery.

// Reliable outbox event publishing with deduplication
type OutboxEvent struct {
    ID           int64     `db:"id"`
    AggregateID  string    `db:"aggregate_id"`  
    EventType    string    `db:"event_type"`
    Payload      string    `db:"event_payload"`
    CreatedAt    time.Time `db:"created_at"`
    Published    bool      `db:"published"`
}

type OutboxPublisher struct {
    db            *sqlx.DB
    eventBus      EventBus
    publishedSet  *redis.Client  // For deduplication
    batchSize     int
    maxRetries    int
}

func (p *OutboxPublisher) PublishBatch(ctx context.Context) error {
    // Get oldest unpublished events
    events, err := p.getUnpublishedEventsBatch()
    if err != nil {
        return err
    }
    
    for _, event := range events {
        // Check if already published (handle crashes during publishing)
        published, err := p.publishedSet.Exists(ctx, p.eventKey(event)).Result()
        if err != nil {
            // Redis error - proceed with publishing, accept potential duplicate
        } else if published == 1 {
            // Mark as published in DB and continue
            p.markEventPublished(event.ID)
            continue
        }
        
        // Publish event
        if err := p.eventBus.Publish(event.EventType, event.Payload); err != nil {
            // Log error but continue with batch - will retry later
            p.logPublishError(event, err)
            continue
        }
        
        // Mark as published in both Redis and DB
        pipe := p.publishedSet.Pipeline()
        pipe.Set(ctx, p.eventKey(event), "1", 24*time.Hour)
        pipe.Exec(ctx)
        
        p.markEventPublished(event.ID)
    }
    
    return nil
}

The publisher uses Redis to track recently published events, preventing duplicates during recovery scenarios while allowing the database to be the source of truth for unpublished events.

Consistency Reconciliation

The reconciliation service’s job is to periodically detect and resolve data inconsistencies that may arise despite saga coordination and event publishing safeguards.

# Consistency reconciliation service
class ConsistencyReconciler:
    def __init__(self, services: Dict[str, ServiceClient]):
        self.services = services
        self.inconsistency_handlers = {
            'account_billing_mismatch': self.handle_account_billing_mismatch,
            'subscription_payment_mismatch': self.handle_subscription_payment_mismatch,
        }
    
    async def run_reconciliation(self):
        # Check account status consistency across services
        inconsistencies = []
        
        # Get all active accounts from user service
        active_accounts = await self.services['user'].get_active_accounts()
        
        for account in active_accounts:
            # Check billing status
            billing_status = await self.services['billing'].get_account_status(account.id)
            
            # Check subscription status  
            subscription_status = await self.services['subscription'].get_subscription_status(account.id)
            
            # Detect inconsistencies
            if account.status == 'active' and billing_status.status == 'suspended':
                inconsistencies.append({
                    'type': 'account_billing_mismatch',
                    'account_id': account.id,
                    'user_status': account.status,
                    'billing_status': billing_status.status,
                    'detected_at': datetime.utcnow()
                })
        
        # Resolve inconsistencies
        for inconsistency in inconsistencies:
            await self.resolve_inconsistency(inconsistency)
    
    async def resolve_inconsistency(self, inconsistency):
        handler = self.inconsistency_handlers.get(inconsistency['type'])
        if handler:
            await handler(inconsistency)
        else:
            # Log unhandled inconsistency type
            logger.warning(f"No handler for inconsistency type: {inconsistency['type']}")

The reconciler runs on a schedule (hourly or daily) and can also be triggered manually when inconsistencies are detected through monitoring alerts.

Event Store

The event store’s job is to provide an immutable audit trail of all state changes, enabling replay for consistency recovery and debugging of distributed transaction issues.

-- Event store schema
CREATE TABLE event_store (
    event_id BIGSERIAL PRIMARY KEY,
    stream_id VARCHAR NOT NULL,
    event_number INTEGER NOT NULL,
    event_type VARCHAR NOT NULL,
    event_data JSONB NOT NULL,
    metadata JSONB,
    occurred_at TIMESTAMP DEFAULT NOW(),
    
    UNIQUE (stream_id, event_number),
    INDEX (stream_id, event_number),
    INDEX (event_type, occurred_at)
);

-- Saga event projection
CREATE TABLE saga_projections (
    saga_id VARCHAR PRIMARY KEY,
    saga_type VARCHAR NOT NULL,
    status VARCHAR NOT NULL,
    current_step INTEGER NOT NULL,
    created_at TIMESTAMP NOT NULL,
    updated_at TIMESTAMP NOT NULL,
    data JSONB,
    
    INDEX (status, updated_at)
);

The event store enables rebuilding saga state from events, debugging transaction flows, and implementing temporal queries to understand how data evolved over time.

Comparison Table

ApproachConsistency LevelAvailabilityPerformanceComplexityFailure RecoveryBest Use Case
Distributed TXStrongLowSlowMediumAutomatic rollbackFinancial transactions
Synchronous callsStrongLowMediumLowManual retrySimple workflows
Async eventsEventualHighFastMediumManual reconciliationSocial media updates
Saga patternEventualHighFastHighAutomatic compensationE-commerce checkout
Outbox + EventsEventualHighFastHighAutomatic retryOrder processing

Saga pattern with outbox publishing provides the best balance for most business workflows, achieving high availability while ensuring business-level consistency through compensation logic.

Key Takeaways

  • CAP theorem forces a choice between consistency and availability - distributed systems must choose eventual consistency for high availability
  • Saga pattern coordinates long-running business transactions without distributed locks or blocking operations
  • Outbox pattern ensures events are published exactly once, preventing data loss during service failures
  • Compensating transactions provide business-level rollback without requiring technical transaction rollback
  • Event sourcing creates an audit trail that enables debugging and recovery from consistency failures
  • Reconciliation processes detect and resolve inconsistencies that occur despite coordination mechanisms
  • Idempotency is essential for all distributed operations to handle retries and duplicate processing safely
  • Monitoring and alerting on consistency violations enables rapid detection and resolution of data divergence

The counterintuitive lesson: strong consistency in distributed systems often leads to system unavailability, while eventual consistency with proper coordination mechanisms provides both reliability and performance. The key is building systems that detect and automatically resolve temporary inconsistencies rather than preventing them entirely.

Frequently Asked Questions

Q: How do you handle cases where compensating transactions are not possible?
A: Design business processes to avoid non-compensatable operations, or use reserved/pending states. For example, instead of immediately charging a credit card, authorize the payment and capture it only after all saga steps complete successfully.

Q: What happens if the saga orchestrator itself fails during execution?
A: Saga state is persisted after each step, so recovery processes can resume execution from the last completed step. Multiple orchestrator instances can handle different sagas for horizontal scaling and failure recovery.

Q: How do you prevent saga execution from taking too long and holding resources?
A: Implement saga timeouts with automatic compensation triggers. Set reasonable step timeouts and overall saga timeouts. Use exponential backoff for retries and dead letter queues for persistently failing sagas.

Q: Can sagas handle concurrent modifications to the same data?
A: Sagas use optimistic locking or versioning to detect concurrent modifications. If conflicts are detected, the saga can retry with updated data or trigger compensation depending on the business requirements.

Q: How do you test saga workflows and compensation logic?
A: Use chaos engineering to inject failures at different steps. Implement saga simulation environments that can trigger compensation scenarios. Test compensation idempotency by executing compensations multiple times.

Q: What’s the difference between choreography and orchestration in sagas?
A: Orchestration uses a central coordinator to manage the saga flow. Choreography lets services coordinate through events without a central coordinator. Orchestration provides better visibility and control; choreography provides better service independence.

Interview Questions

Q: Design a distributed order processing system that maintains data consistency across payment, inventory, and shipping services.
Expected depth: Discuss saga patterns, event sourcing, compensation logic, and the tradeoffs between orchestration vs choreography. Cover failure scenarios and recovery mechanisms.

Q: How would you handle a scenario where the payment service succeeds but the inventory service fails during checkout?
Expected depth: Explain compensation strategies, payment authorization vs capture, reservation patterns, and designing for compensatable operations.

Q: Your system shows different user subscription statuses across three microservices. How do you detect and resolve this automatically?
Expected depth: Cover consistency monitoring, reconciliation processes, event replay, and determining authoritative data sources during conflicts.

Q: Design an event-driven architecture that guarantees no events are lost even during service failures and network partitions.
Expected depth: Discuss outbox pattern, event store design, exactly-once delivery, duplicate detection, and event ordering guarantees.

Q: How do you implement a saga that spans multiple bounded contexts with different data consistency requirements?
Expected depth: Explain domain boundaries, compensation design across contexts, event schema evolution, and coordinating between teams with different consistency needs.

Premium Content

Unlock the full article along with everything else in the archive — all in one place.

In-depth analysis Expert insights Full archive access
Unlock Full Article