Kafka Internals: How the World's Most Popular Event Stream Works


Your team decides to use Kafka. You create a topic, publish some messages, consume them. It works. Then you add more consumers. Some messages are processed twice. Some are missed. You increase partitions. Now messages for the same user arrive out of order. You set acks=all. Throughput drops by 80%.

Kafka is powerful but has a learning curve. The defaults are not always right. Understanding the internals - partitions, consumer groups, offsets, replication - tells you exactly why these things happen and how to configure Kafka correctly for your use case.

What Kafka actually is

Kafka is a distributed event streaming platform. It stores events (messages) in a durable, ordered log. Producers write events. Consumers read events. Unlike a traditional message queue, events are not deleted after consumption - they are retained for a configurable period (days, weeks, forever).

This makes Kafka fundamentally different from SQS or RabbitMQ:

  • Multiple consumer groups can read the same events independently
  • Consumers can replay events from any point in history
  • Events are ordered within a partition
  • Kafka is optimized for high throughput, not low latency

The log: Kafka’s core data structure

Kafka stores events in an append-only log. New events are always added to the end. Events are never modified or deleted (until retention expires). Each event has an offset - its position in the log.

This is why Kafka is fast: sequential disk writes are much faster than random writes. A modern SSD can sustain 500MB/s of sequential writes. Kafka exploits this by writing events sequentially.

graph LR
subgraph log["Kafka Topic - Append-Only Log"]
  E0["Offset 0
Event A"]
  E1["Offset 1
Event B"]
  E2["Offset 2
Event C"]
  E3["Offset 3
Event D"]
  E4["Offset 4
Event E (latest)"]
  E0 --> E1 --> E2 --> E3 --> E4
  NEW["New event F"] -->|"append"| E4
end

subgraph consumers["Consumer Positions"]
  C1["Consumer 1
at offset 4
(caught up)"]
  C2["Consumer 2
at offset 2
(2 behind)"]
  C3["Consumer 3
at offset 0
(replaying from start)"]
end

style E4 fill:#EEEDFE,stroke:#534AB7,color:#3C3489
style C1 fill:#E1F5EE,stroke:#0F6E56,color:#085041
style C2 fill:#FAEEDA,stroke:#854F0B,color:#633806
style C3 fill:#F1EFE8,stroke:#888780,color:#444441

Partitions: the unit of parallelism

A Kafka topic is divided into partitions. Each partition is an independent ordered log. Events within a partition are ordered. Events across partitions are not ordered relative to each other.

Why partitions matter:

  • Parallelism - Each partition can be consumed by one consumer in a consumer group. More partitions = more parallel consumers = higher throughput.
  • Ordering - If you need events for a specific entity (user, order) to be processed in order, put them in the same partition using a partition key.
  • Scalability - Partitions are distributed across brokers. More partitions = more brokers can share the load.

Partition key: When producing a message, you specify a key. Kafka hashes the key to determine the partition. All messages with the same key go to the same partition, in order.

key="user:123" -> hash -> partition 2
key="user:456" -> hash -> partition 0
key="user:789" -> hash -> partition 2

All events for user 123 are in partition 2, in order. Events for different users may be in different partitions.

Consumer groups: the unit of consumption

A consumer group is a set of consumers that collectively consume a topic. Each partition is assigned to exactly one consumer in the group. If you have 4 partitions and 4 consumers, each consumer handles one partition. If you have 4 partitions and 2 consumers, each consumer handles 2 partitions.

Key rule: You cannot have more consumers in a group than partitions. Extra consumers sit idle.

Multiple consumer groups: Each consumer group maintains its own offset. Group A and Group B can both consume the same topic independently. Group A might be at offset 1000, Group B at offset 500. They do not interfere with each other.

graph TB
subgraph topic["Topic: orders (4 partitions)"]
  P0["Partition 0"]
  P1["Partition 1"]
  P2["Partition 2"]
  P3["Partition 3"]
end

subgraph cg1["Consumer Group: inventory-service"]
  C1A["Consumer 1
P0, P1"]
  C1B["Consumer 2
P2, P3"]
end

subgraph cg2["Consumer Group: notification-service"]
  C2A["Consumer 1
P0, P1, P2, P3"]
end

P0 --> C1A
P1 --> C1A
P2 --> C1B
P3 --> C1B
P0 --> C2A
P1 --> C2A
P2 --> C2A
P3 --> C2A

style P0 fill:#EEEDFE,stroke:#534AB7,color:#3C3489
style P1 fill:#EEEDFE,stroke:#534AB7,color:#3C3489
style P2 fill:#EEEDFE,stroke:#534AB7,color:#3C3489
style P3 fill:#EEEDFE,stroke:#534AB7,color:#3C3489
style C1A fill:#E1F5EE,stroke:#0F6E56,color:#085041
style C1B fill:#E1F5EE,stroke:#0F6E56,color:#085041
style C2A fill:#FAEEDA,stroke:#854F0B,color:#633806

Replication: durability and availability

Each partition has a leader and N-1 followers (replicas). Producers write to the leader. Followers replicate from the leader. If the leader fails, a follower is elected as the new leader.

Replication factor: How many copies of each partition exist. Replication factor 3 means 3 copies (1 leader + 2 followers). Can tolerate 2 broker failures.

acks setting: Controls when the producer considers a write successful:

  • acks=0 - Fire and forget. No acknowledgment. Fastest, but messages can be lost.
  • acks=1 - Leader acknowledges. Fast, but if the leader fails before replication, messages are lost.
  • acks=all (or acks=-1) - All in-sync replicas acknowledge. Slowest, but no data loss. Use for critical data.

min.insync.replicas: Minimum number of replicas that must acknowledge a write. With replication factor 3 and min.insync.replicas=2, at least 2 replicas must acknowledge. If only 1 is available, writes fail. This prevents data loss at the cost of availability.

Where it breaks or gets interesting

Rebalancing

When a consumer joins or leaves a consumer group, Kafka reassigns partitions. During rebalancing, consumption stops. All consumers in the group pause, partitions are reassigned, and consumption resumes.

Rebalancing is triggered by:

  • A consumer joining the group (new deployment)
  • A consumer leaving (crash, shutdown)
  • A consumer failing to send a heartbeat within session.timeout.ms

Minimize rebalancing: use static group membership (group.instance.id), increase session.timeout.ms, use incremental cooperative rebalancing (Kafka 2.4+) which only reassigns the partitions that need to move.

Offset management

Consumers commit their offset to Kafka (or ZooKeeper in older versions) to track their position. If a consumer crashes and restarts, it resumes from the last committed offset.

Auto-commit: Kafka automatically commits offsets every auto.commit.interval.ms. Risk: if the consumer processes a message but crashes before the auto-commit, the message is reprocessed on restart (at-least-once).

Manual commit: The consumer explicitly commits after processing. More control, more complexity. Commit after processing, not before, to avoid losing messages.

Exactly-once: Use Kafka transactions to atomically process a message and commit the offset. Complex but guarantees exactly-once processing.

The consumer lag problem

Consumer lag is the difference between the latest offset and the consumer’s current offset. High lag means the consumer is falling behind.

Monitor lag with kafka-consumer-groups.sh --describe or tools like Burrow, Datadog, or Confluent Control Center. Alert when lag exceeds a threshold. Scale consumers (add more instances, up to the partition count) to reduce lag.

Partition count: choose carefully

You cannot decrease the partition count of a topic without deleting and recreating it. Increasing partitions is possible but causes rebalancing and may break ordering guarantees (messages that were in the same partition may end up in different partitions after the increase).

Choose partition count based on expected throughput: partitions = max_throughput / throughput_per_consumer. Start with more partitions than you need (you can always add consumers later). A common starting point: 3-6 partitions for low-throughput topics, 12-24 for high-throughput.

Real-world systems

LinkedIn - Kafka was created at LinkedIn. Used for activity tracking, metrics, and log aggregation. Handles trillions of messages per day.

Uber - Uses Kafka for real-time analytics, driver location updates, and event sourcing. Multiple Kafka clusters for different use cases.

Netflix - Uses Kafka for real-time monitoring, event sourcing, and data pipeline. Keystone is Netflix’s Kafka-based data pipeline.

Airbnb - Uses Kafka for event streaming between microservices. Replaced direct service-to-service calls with Kafka events.

Confluent - Commercial Kafka distribution with additional features: Schema Registry (Avro/Protobuf schema management), Kafka Connect (connectors to databases and other systems), ksqlDB (SQL on Kafka streams).

How to apply it in practice

When to use Kafka vs SQS/RabbitMQ

Use Kafka when:

  • Multiple consumer groups need to read the same events
  • You need event replay (reprocess historical events)
  • High throughput (millions of events per second)
  • Event sourcing or CQRS architecture
  • Building a data pipeline

Use SQS/RabbitMQ when:

  • Simple task queue (one consumer group)
  • You do not need event replay
  • Lower throughput
  • Simpler operational requirements
  • Managed service with less configuration

Producer configuration

acks=all                    # No data loss
retries=Integer.MAX_VALUE   # Retry on failure
max.in.flight.requests.per.connection=1  # Preserve ordering with retries
enable.idempotence=true     # Exactly-once producer semantics
compression.type=lz4        # Compress messages for throughput
batch.size=16384            # Batch messages for throughput
linger.ms=5                 # Wait 5ms to fill batch

Consumer configuration

group.id=my-service         # Consumer group name
auto.offset.reset=earliest  # Start from beginning if no committed offset
enable.auto.commit=false    # Manual commit for reliability
max.poll.records=500        # Process 500 records per poll
session.timeout.ms=30000    # 30 second session timeout

FAQ

Q: How does Kafka achieve such high throughput?

Several design decisions: sequential disk writes (much faster than random writes), zero-copy data transfer (sendfile syscall bypasses user space), batching (producers batch messages, consumers fetch batches), compression (LZ4, Snappy, GZIP reduce network and disk I/O), and partitioning (parallel processing across multiple consumers). The combination of these optimizations allows Kafka to sustain millions of messages per second on commodity hardware.

Q: What is the difference between Kafka and a database?

Kafka is an event log, not a database. It stores events in time order. You cannot query Kafka like a database (no SQL, no indexes, no joins). You can only read events sequentially from a partition. Kafka is optimized for high-throughput sequential writes and reads. Databases are optimized for random access, complex queries, and transactions. Use Kafka for event streaming and data pipelines. Use a database for storing and querying application state.

Q: How do you handle schema evolution in Kafka?

Use a schema registry (Confluent Schema Registry or AWS Glue Schema Registry). Producers register their schema. Consumers validate messages against the schema. The registry enforces compatibility rules: backward compatible (new schema can read old messages), forward compatible (old schema can read new messages), or full compatible (both). Use Avro or Protobuf for efficient binary serialization with schema evolution support. Avoid JSON without a schema registry - schema changes are invisible and break consumers silently.

Interview questions

Q1: You have a Kafka topic with 4 partitions and 6 consumers in a consumer group. What happens?

Strong answer: Only 4 consumers are active. Each active consumer handles one partition. The remaining 2 consumers sit idle - they are assigned no partitions. This is a waste of resources. To fix: either reduce the consumer group to 4 consumers, or increase the partition count to 6 (or more). The rule is: you cannot have more active consumers than partitions. The idle consumers are not wasted entirely - they serve as hot standbys. If an active consumer fails, one of the idle consumers takes over its partition during rebalancing.

Q2: You need to process payment events in order per user. How do you configure Kafka?

Strong answer: Use the user ID as the partition key. All events for the same user go to the same partition, in order. The consumer processes events from each partition sequentially. Set acks=all and min.insync.replicas=2 to prevent data loss. Use enable.idempotence=true on the producer to prevent duplicate messages on retry. Use manual offset commit on the consumer: commit after successfully processing each event, not before. This ensures at-least-once delivery with ordering per user. For exactly-once, use Kafka transactions: wrap the processing and offset commit in a transaction. Choose the partition count based on the number of concurrent users you expect to process simultaneously.

Q3: Your Kafka consumer lag is growing. The consumer is processing 1,000 messages per second but the producer is publishing 5,000 per second. How do you fix this?

Strong answer: Scale the consumers. If the topic has 10 partitions, you can have up to 10 consumers in the group. Add more consumer instances (up to 10) to increase throughput. Each consumer handles a subset of partitions. If you already have 10 consumers and still cannot keep up, the bottleneck is per-consumer throughput. Optimize the consumer: batch database writes (instead of one write per message, batch 100 messages and write them together), use async processing (process messages concurrently within each consumer), or optimize the processing logic. If the bottleneck is the downstream database, add read replicas or a cache. If the lag is temporary (traffic spike), the queue is doing its job - it will catch up after the spike. If it is permanent, you need more capacity.