Lesson 41 made the conceptual move from bounded to unbounded data. The natural processing shape for unbounded data is streaming, and streaming needs a substrate: a place where producers write events and consumers read them, durable enough to survive failures, fast enough to keep up with the firehose, and replayable so consumers can rewind when they get something wrong. The dominant answer in 2026, by a wide margin, is Apache Kafka. This lesson covers what Kafka is, what it guarantees, and why it ended up everywhere.
The TLDR fits in one sentence. Kafka is a distributed, durable, partitioned, append-only log that producers write to and consumers read from at their own pace.
Each of those words is doing real work, and the rest of the lesson unpacks them.
The mental model
A Kafka cluster has six concepts. Get these right and the rest is detail.
Topic. A named log. Conceptually like a database table, except records are appended rather than updated, and consumers read from a position rather than running queries. A Kafka deployment usually has tens to thousands of topics, one per logical event stream: orders.placed, payments.completed, users.signed-up, inventory.updated, and so on.
Partition. A sub-log within a topic. A topic with twelve partitions is twelve separate append-only logs that share a name. Each partition is strictly ordered: records appended to partition 3 will be read in the same order they were written. Across partitions there is no order guarantee. Partitions are the unit of parallelism: more partitions, more parallel consumers, more parallel writers.
Producer. A client that writes records to a topic. The producer either picks the partition explicitly, hashes a key to pick deterministically (so all events for user-42 go to the same partition and stay in order), or lets Kafka round-robin across partitions when key order does not matter.
Consumer. A client that reads records from one or more partitions. The consumer tracks its position in each partition via an offset: a monotonically increasing integer that names a specific record. “I have read up to offset 1,847,392 of partition 5 of topic orders.placed” is a complete description of a consumer’s state. Offsets are stored by Kafka itself in a special __consumer_offsets topic, not by the consumer.
Consumer group. A coordinated set of consumers that share the read load. Each partition of a subscribed topic is consumed by exactly one member of the group at a time. If you have twelve partitions and four consumers in the group, each consumer reads three partitions. If you scale the group to twelve consumers, each reads one. If you scale to twenty, eight sit idle (you cannot have more consumers than partitions in a group). When a consumer dies, the cluster rebalances and another consumer in the group picks up its partitions.
Broker. A Kafka server. A cluster has many brokers (a small deployment is three, a large one is hundreds). Each partition has multiple replicas spread across brokers, with one replica designated the leader (handles writes and reads) and the others followers (replicate from the leader). If a broker dies, the cluster elects a new leader for the partitions whose leaders died, and producers and consumers fail over without manual intervention.
flowchart LR
P1[Producer 1] --> T
P2[Producer 2] --> T
subgraph T[Topic: orders.placed]
direction TB
PA[Partition 0]
PB[Partition 1]
PC[Partition 2]
end
subgraph CG1[Consumer group: billing]
C1[Consumer 1]
C2[Consumer 2]
end
subgraph CG2[Consumer group: analytics]
C3[Consumer 3]
end
PA --> C1
PB --> C2
PC --> C2
PA --> C3
PB --> C3
PC --> C3
Diagram to create: a polished version showing producers on the left writing to a topic with three partitions in the middle, brokers replicating the partitions implicitly, and two consumer groups on the right reading the same topic independently. The key visual point is that partitions are the unit of parallelism within a group, but each consumer group reads the full topic independently.
The diagram captures the most important property of the model: two consumer groups reading the same topic are independent. The billing group reads its own copy at its own pace; the analytics group reads its own copy at its own pace. They share the data; they do not share the position. This is what “decouples producers from consumers” means in practice.
The guarantees
Kafka makes a small, specific set of guarantees, and a lot of mental confusion goes away once you know exactly which ones.
Within a partition, strict order. If producer A writes record X and then record Y to the same partition, every consumer reads X before Y. This is the strongest ordering guarantee Kafka provides, and it is the reason key-based partitioning matters: if you want events for user-42 to be processed in order, you key by user ID so they land in the same partition.
Across partitions, no order guarantee. If producer A writes record X to partition 0 and record Y to partition 1, consumers may read them in either order. This is not a bug; it is the price of parallelism. If you need global ordering, you have one partition, which means one consumer, which means no parallelism. Most teams pick partition keys carefully and live with per-key ordering.
Durability via replication. A record is durable once it has been replicated to N nodes, where N is the configured acks level. With acks=all and replication factor 3, a record is acknowledged to the producer only after all three replicas have it on disk. If a broker dies after the ack, the record survives on the remaining replicas. If all three replicas die, the record is lost; this is rare but possible, and it is why Kafka is configured carefully in regulated industries.
At-least-once delivery by default. A record is delivered at least once, possibly more. The “more than once” case happens when a consumer processes a record, dies before committing the offset, and a replacement reads from the last committed offset (before the record). The record gets processed twice. Idempotent consumers (lesson 38 covered the pattern in batch; the streaming version is identical) are how you live with at-least-once.
Exactly-once within Kafka via transactions. Since Kafka 0.11 (2017), transactions let a producer write atomically to multiple partitions and let a consumer read records inside committed transactions only. Combined with the idempotent producer, this gives exactly-once semantics for read-from-Kafka, process, write-to-Kafka workflows. Exactly-once that crosses the boundary out of Kafka is harder, and lesson 45 covers it.
The summary that fits on a sticky note: within a partition, strict order; across partitions, none; at-least-once unless you opt into transactions; durability tunable via acks and replication factor.
Why Kafka became dominant
Kafka started at LinkedIn in 2010, was open-sourced in 2011, became an Apache top-level project in 2012, and by 2018 was the default streaming substrate at most companies that had one. The reasons are structural rather than marketing.
Producers and consumers are decoupled. A producer writes; the record sits in the log; consumers read whenever they get to it. The producer does not need to know who is reading; the consumer does not need to be online when the producer writes. Compare this to RPC, where caller and callee need to be online simultaneously, and you see why event-driven architectures (Module 7) ride on Kafka rather than on direct service-to-service calls.
Replays are cheap. Kafka retains records for a configurable period (often seven days, sometimes weeks, sometimes forever for compacted topics). A consumer can rewind to any offset and reprocess; a new consumer can start at offset zero and read the entire history. This is the property that makes Kappa architecture (lesson 46) work.
Scaling by partitions is mechanical. A topic with twelve partitions can be consumed by up to twelve parallel consumers in a group. Need more throughput? Add partitions, add consumers. Compare this to scaling a relational database, where adding capacity requires real architectural work.
It is the integration spine. A microservices system has an order service, a payment service, an inventory service, a notification service, an analytics service. Without Kafka, each pair of services that needs to communicate has to define an API: an N-by-M problem. With Kafka, every service writes events to topics and reads the topics it cares about. The integration shape collapses from a mesh of HTTP calls to a star around Kafka. This is the single biggest reason Kafka spread from “data engineering tool” to “general infrastructure” between 2015 and 2020.
Strong client ecosystem. Producers and consumers exist for every language a real company uses: Java, Python (confluent-kafka, kafka-python), Go, Rust, Node.js, .NET, Scala. Schema registries integrate cleanly. Connect, the Kafka-to-other-systems framework, has hundreds of connectors. The ecosystem is itself a moat.
The 2026 alternatives
Kafka is dominant; it is not alone. Four contenders matter enough to name.
Redpanda. A C++ rewrite of the Kafka protocol, stable enough for production by 2022. Speaks the Kafka wire protocol so existing clients work, ships as a single binary with no JVM and no ZooKeeper. Operationally simpler, lower tail latency, fewer moving parts. The trade-off is a smaller ecosystem and a single vendor behind the open-source distribution.
Apache Pulsar. Started at Yahoo, now an Apache project. Separates storage (BookKeeper) from serving, giving better elasticity and tiered storage out of the box. Adoption is real but smaller than Kafka.
AWS Kinesis, Google Pub/Sub, Azure Event Hubs. The cloud-native managed alternatives. Each operationally trivial compared to running Kafka yourself, each locked to its cloud. Many teams pick one of these and never look back. Many others run managed Kafka (Confluent Cloud, AWS MSK, Aiven) precisely because they want Kafka semantics with cloud operations.
The honest reading. If you are starting fresh and committed to one cloud, the cloud’s native option is fine and probably easier. If you want portability and ecosystem leverage, Kafka (managed or self-hosted, or Redpanda as a drop-in) is still the default.
When NOT to use Kafka
Kafka is a powerful tool. It is also a heavy one, and not every workload that involves “events” needs it.
Low-volume queueing. If you have a few thousand async jobs per day (send an email, generate a PDF, kick off a recalculation), you do not need Kafka. A Postgres table with a pending/done column and a worker that polls it works fine to surprisingly high volumes. SQS, RabbitMQ, and Redis-backed queues (Sidekiq, Bull, Celery) are simpler operationally and a better fit.
Truly real-time control loops. Kafka latency is in the millisecond range. Some workloads need microsecond latency: high-frequency trading, industrial control, certain robotics. Kafka is not the right tool for those.
Small datasets where you do not need a log. If your “stream” is a few hundred events per day from a single producer to a single consumer, a webhook plus a database table beats a Kafka cluster.
The rule of thumb: when you have multiple services producing events, multiple consuming, a need to replay history, or throughput that exceeds tens of thousands of events per second, Kafka starts paying for itself. Below that, simpler tools win.
Where this leads
Kafka is the substrate. Lesson 43 covers the engines that read from it and write back to it: Apache Flink, Kafka Streams, Spark Structured Streaming. The choice of engine is the next architectural decision, and the trade-offs are different from the Kafka-versus-alternatives choice. Lesson 44 covers event time and watermarks, which is where the engines start looking different from each other. Lesson 47 covers Change Data Capture, the pattern that turns relational databases into Kafka producers and lets streaming systems integrate cleanly with the OLTP world.
The recurring theme through the rest of Module 6 is that Kafka is the easy part. The hard part is what you do with the records once they are flowing.
Citations and further reading
- Apache Kafka documentation,
https://kafka.apache.org/documentation/(retrieved 2026-05-01). The canonical reference. The “Design” section is short and worth reading end to end; it explains the partitioning, replication, and consumer-group model in the original authors’ own words. - Jay Kreps, Neha Narkhede, Jun Rao, “Kafka: a Distributed Messaging System for Log Processing”, NetDB workshop, 2011. The original paper from the LinkedIn team. Short and clear; the design has not changed in its essentials.
- Jay Kreps, “The Log: What every software engineer should know about real-time data’s unifying abstraction”, LinkedIn Engineering, 2013,
https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying(retrieved 2026-05-01). The essay that explained why “the log” is the right abstraction for distributed data, and why Kafka shaped itself around it. - “Kafka: The Definitive Guide” (Gwen Shapira, Todd Palino, Rajini Sivaram, Krit Petty, 2nd edition, O’Reilly, 2021). The standard reference book.
- Redpanda documentation,
https://docs.redpanda.com/(retrieved 2026-05-01). For the C++ Kafka-compatible alternative. - “Designing Data-Intensive Applications” (Martin Kleppmann, O’Reilly, 2017), chapter 11. The systems-context framing of log-based messaging, against which Kafka is the canonical example.