Stream processing: Flink, Kafka Streams, Spark Structured Streaming

Lesson 42 covered Kafka, the log that records sit in. Records sitting in a log do not do anything by themselves. Something has to read them, transform them, aggregate them, join them, and write the results somewhere. That something is a stream processor, and in 2026 there are three options that matter for most teams: Apache Flink, Kafka Streams, and Spark Structured Streaming. The choice between them is the second-biggest architectural decision in a streaming system after the choice of log. This lesson walks through the comparison.

The short version, before the detail. Flink is the heavyweight: a standalone stream processing engine with the richest event-time and state APIs, the lowest latency, and the steepest operational curve. Kafka Streams is the library: a Java JAR you embed in your application, simplest deployment, tightly bound to Kafka. Spark Structured Streaming is the compromise: micro-batch streaming on top of Spark, the right answer when you already run Spark for batch and want unified code.

The three are not interchangeable. Picking the wrong one is recoverable but painful, and the cost of the migration is high enough that the choice is worth taking seriously.

Apache Flink: the stream-native heavyweight

Flink started at the Technical University of Berlin as the Stratosphere project around 2010, became an Apache project in 2014, and has been the reference implementation of stream processing since around 2017. Stream processing here is meant in the literal sense: Flink processes records one at a time as they arrive, with no batching at the engine level.

The state model is what makes Flink stand out. Each operator maintains its own keyed state, partitioned by the same key as the input stream. Large state lives in RocksDB on local disk; small state can live as objects on the JVM heap. Either way, periodic checkpoints to durable storage (S3, HDFS, GCS) provide recovery. Savepoints are user-triggered, self-contained snapshots that let you stop a job, upgrade the code, and resume from the same point with the same state. This makes upgrade-without-data-loss possible for long-running jobs, which is rare among streaming engines.
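The checkpoint-and-resume idea can be sketched without Flink at all. The toy below is plain Python, not Flink's API: `KeyedCounter` and `run_with_failure` are hypothetical names, and the "checkpoint" is just a dictionary held in memory. What it illustrates is the key property: a snapshot stores the state together with the input position, so replaying from that position after a crash produces the same result as a run with no crash.

```python
from collections import defaultdict


class KeyedCounter:
    """Toy operator: counts events per key and can snapshot its state."""

    def __init__(self):
        self.state = defaultdict(int)  # keyed state: one counter per key
        self.offset = 0                # how many input records were consumed

    def process(self, key):
        self.state[key] += 1
        self.offset += 1

    def snapshot(self):
        # A checkpoint captures the state AND the input position together.
        return {"state": dict(self.state), "offset": self.offset}

    @classmethod
    def restore(cls, snap):
        op = cls()
        op.state.update(snap["state"])
        op.offset = snap["offset"]
        return op


def run_with_failure(events, checkpoint_every, fail_at):
    """Process events, crash at index fail_at, recover from the last checkpoint."""
    op = KeyedCounter()
    last_checkpoint = op.snapshot()
    for i, key in enumerate(events):
        if i == fail_at:
            # Crash: everything since the last checkpoint is lost.
            op = KeyedCounter.restore(last_checkpoint)
            break
        op.process(key)
        if op.offset % checkpoint_every == 0:
            last_checkpoint = op.snapshot()
    # Replay the input from the restored offset to the end.
    for key in events[op.offset:]:
        op.process(key)
    return dict(op.state)
```

A savepoint follows the same pattern, except that the restore step may load the snapshot into a newer version of the operator code.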

Event-time semantics are first class. Flink’s watermarks (lesson 44 covers them in depth) are part of the core API. Windowing operations (tumbling, sliding, session) are well-understood primitives. Late-arriving events have explicit handling: side-output streams, allowed lateness, custom triggers. The expressive power is closer to a language than a framework.
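To make the vocabulary concrete, here is a minimal sketch of a tumbling event-time window driven by a watermark. It is a toy model in plain Python, not Flink code: `tumbling_sums` is a hypothetical function, the watermark is assumed to trail the largest timestamp seen by a fixed `max_out_of_orderness`, and allowed lateness and custom triggers are left out. Events that arrive after their window has fired go to a side output rather than being dropped.

```python
from collections import defaultdict


def tumbling_sums(events, size, max_out_of_orderness):
    """Sum values per tumbling event-time window.

    events: iterable of (event_time, value) in ARRIVAL order.
    Returns (results, late): results is a list of (window_start, total)
    in firing order; late is the side output of events whose window had
    already fired when they arrived.
    """
    open_windows = defaultdict(int)  # window_start -> running sum
    results, late = [], []
    watermark = float("-inf")

    for event_time, value in events:
        start = (event_time // size) * size
        if start + size <= watermark:
            # Window already fired: route to the side output.
            late.append((event_time, value))
        else:
            open_windows[start] += value

        # The watermark trails the largest timestamp seen so far.
        watermark = max(watermark, event_time - max_out_of_orderness)

        # Fire every window whose end the watermark has passed.
        for w in sorted(w for w in open_windows if w + size <= watermark):
            results.append((w, open_windows.pop(w)))

    # End of input: flush whatever is still open.
    for w in sorted(open_windows):
        results.append((w, open_windows.pop(w)))
    return results, late
```

With `size=10` and `max_out_of_orderness=2`, an event stamped 4 that arrives after one stamped 11 is still counted in window 0, because the watermark (9) has not yet passed the window end. An event stamped 3 that arrives after one stamped 13 is late, because the watermark (11) has.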

The cost is operational complexity. A Flink deployment has a JobManager (coordinator), TaskManagers (workers), and a state backend, all of which need to be running and configured correctly. The JobManager is a single point of failure unless you run it in HA mode (ZooKeeper or Kubernetes leader election). RocksDB tuning matters at scale. Checkpoint intervals and timeouts need sizing against state size and S3 throughput. Upgrading Flink versions requires care because savepoint compatibility is not guaranteed across major versions.

Teams that pick Flink usually do so because they need its capabilities: complex event-time processing, exactly-once with external sinks, large state (gigabytes per operator), or millisecond latency. Teams that pick Flink and do not need its capabilities tend to spend a lot of operations time learning to keep it healthy. The “you might not need Flink” case is real.

Kafka Streams: the library that runs in your app

Kafka Streams is structurally different from the other two: it is not a cluster you deploy. It is a Java library you import, alongside your other application code, that runs streaming pipelines inside a regular Java process. The JAR is part of your application; the deployment is your application’s deployment; if you already run Java microservices on Kubernetes, you already know how to run Kafka Streams.

The model is tightly bound to Kafka. Every Kafka Streams application reads from Kafka topics, writes to Kafka topics, and backs its state with Kafka topics: state stores live locally in RocksDB for fast reads and writes, and every update is also written to a compacted changelog topic that serves as the durable copy. There is no separate state backend, no separate job submission, no cluster. The application is the streaming engine, and Kafka is everything around it.
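Why a compacted topic works as a backing store can be shown with a toy model. The sketch below is plain Python, not the Kafka Streams API: `compact` and `restore` are hypothetical helpers, a topic is a list of `(key, value)` records, and a value of `None` is treated as a delete marker. Compaction keeps only the latest record per key, and replaying the compacted log rebuilds the same store as replaying the full log.

```python
def compact(changelog):
    """Keep only the latest record per key, in offset order."""
    latest = {}
    for offset, (key, value) in enumerate(changelog):
        latest[key] = (offset, value)
    survivors = sorted(latest.items(), key=lambda item: item[1][0])
    return [(key, value) for key, (_, value) in survivors]


def restore(changelog):
    """Rebuild a local key-value store by replaying a changelog."""
    store = {}
    for key, value in changelog:
        if value is None:
            store.pop(key, None)  # delete marker removes the key
        else:
            store[key] = value
    return store


log = [("a", 1), ("b", 1), ("a", 2), ("b", None), ("c", 5), ("a", 3)]
compacted = compact(log)
```

Here `compacted` holds three records instead of six, and `restore(compacted)` equals `restore(log)`. The size of the changelog, and therefore the restore time, tracks the number of live keys rather than the number of updates ever made.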

The API has two layers. The lower-level Processor API gives you control over individual record processing, state stores, and punctuation (timer-driven callbacks). The higher-level Streams DSL gives you a fluent set of operations: map, filter, groupByKey, aggregate, join, windowedBy. The DSL covers most workloads cleanly, and the Processor API is the escape hatch when the DSL falls short.

Scaling is by Kafka’s partition model. A Kafka Streams application reading a topic with twelve partitions can run on up to twelve instances in parallel; the library coordinates partition assignment through the consumer-group mechanism Kafka already has. State is colocated with the partition: each instance owns a subset of the partitions and the corresponding state stores. A rebalance moves partitions between instances, and the new owner rebuilds the matching state stores by replaying their changelog topics, which costs network bandwidth and time but is automatic.
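The partition ceiling on parallelism is easy to see in a toy assignment. The sketch below is plain Python and uses simple round-robin, which is a simplification and not the assignor Kafka Streams actually uses; `assign` and `idle_instances` are hypothetical names. The point it shows is only that instances beyond the partition count receive nothing to do.

```python
def assign(num_partitions, instances):
    """Round-robin partitions over instances (a deliberate simplification)."""
    assignment = {name: [] for name in instances}
    for partition in range(num_partitions):
        owner = instances[partition % len(instances)]
        assignment[owner].append(partition)
    return assignment


def idle_instances(assignment):
    """Instances that received no partition and therefore do no work."""
    return [name for name, parts in assignment.items() if not parts]


five = assign(12, [f"app-{i}" for i in range(5)])
fourteen = assign(12, [f"app-{i}" for i in range(14)])
```

With five instances, every instance owns two or three of the twelve partitions. With fourteen, `idle_instances(fourteen)` lists two instances that sit idle, which is why the partition count chosen for the input topic caps how far the application can scale out.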

The trade-offs are the price of the simplicity. State management is good but not as flexible as Flink’s. Event-time semantics are present but less complete. There is no SQL frontend in the core library (ksqlDB exists, but as a separate Confluent product on top). Latency is low (comparable to Flink) but the upper bound on state size is RAM-plus-RocksDB on a single instance, smaller than Flink’s distributed state model.

Teams that pick Kafka Streams usually have a Java or Scala codebase, run microservices on Kafka already, and want to add streaming logic to existing services without standing up a new platform. The fit is excellent for that. For teams without a JVM stack, the choice is awkward (Python and Go ports are unofficial and incomplete), and that pushes them toward Flink or Spark.

Spark Structured Streaming: micro-batch on Spark

Spark Structured Streaming sits in the middle. It is a streaming API on top of the Spark batch engine, processing records in small batches (typically 100 to 500ms) rather than one at a time. The same DataFrame API works for batch and streaming: if you already use Spark for batch ETL, your streaming code looks almost identical, runs on the same cluster, uses the same libraries.

The architecture is the standard Spark one: a driver coordinating executors, with the streaming runtime adding a trigger that fires every batch interval. Each trigger reads new records from the source (Kafka, Kinesis, files), runs the DataFrame query, and writes the results. State is held in executor memory by default (a RocksDB-backed state store is available since Spark 3.2) and is made fault tolerant by checkpointing to HDFS or S3. Watermarks and windowing are supported. The set of supported operations is smaller than Flink’s but larger than what most teams need.
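The trigger loop described above can be sketched as a toy model. This is plain Python, not the Spark API: `run_micro_batches` is a hypothetical function, the source is a list, and `end_offsets` stands in for how far the source has grown by the time each trigger fires. Each trigger reads everything between the committed offset and the current end, updates the running state, writes a result, and commits the new offset.

```python
from collections import Counter


def run_micro_batches(log, end_offsets):
    """Run one micro-batch per trigger over a growing source log.

    log: list of keys (the source).
    end_offsets: for each trigger, how many records the source holds
    at that moment (assumed non-decreasing).
    Returns (sink, committed): the results written per batch and the
    final committed offset.
    """
    state = Counter()  # running counts carried across batches
    committed = 0      # offset up to which input has been processed
    sink = []
    for batch_id, end in enumerate(end_offsets):
        batch = log[committed:end]
        if not batch:
            continue   # nothing new arrived: skip this trigger
        state.update(batch)
        sink.append((batch_id, dict(state)))
        committed = end
    return sink, committed


sink, committed = run_micro_batches(["x", "y", "x", "z", "x"], [2, 2, 5])
```

The second trigger finds no new records and writes nothing. The third picks up three records at once, so the sink only ever sees state as of a batch boundary, never in between.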

Latency is the most visible difference from Flink and Kafka Streams. Micro-batch has a structural floor: even with 100ms batches, end-to-end latency is typically several hundred milliseconds to a second, not the low single-digit milliseconds Flink can hit. For dashboards, alerting, and ETL this is fine. For sub-100ms requirements it is not. Continuous Processing mode (Spark 2.3+) attempts true streaming on the Spark engine; in 2026 it is still less mature than Flink’s native streaming and most teams stay on micro-batch.
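The structural floor can be put in numbers. The sketch below is a back-of-the-envelope model in plain Python, not a measurement: `micro_batch_latency_ms` is a hypothetical helper, the trigger interval is fixed, and the per-batch cost of 80 ms is an illustrative assumption that is taken to be shorter than the interval. A record waits for the next trigger and then for the batch to finish.

```python
def micro_batch_latency_ms(arrival_ms, interval_ms, batch_cost_ms):
    """Latency of one record: wait for the next trigger, then the batch cost."""
    # Round the arrival time up to the next multiple of the interval.
    next_trigger = -(-arrival_ms // interval_ms) * interval_ms
    return (next_trigger - arrival_ms) + batch_cost_ms


# One record arriving at every millisecond of a 100 ms interval.
latencies = [micro_batch_latency_ms(t, 100, 80) for t in range(100)]
best = min(latencies)
worst = max(latencies)
mean = sum(latencies) / len(latencies)
```

Under these assumptions the best case equals the batch cost, the worst case is nearly one full interval more, and the mean sits about half an interval above the batch cost. Real pipelines add scheduling, source fetch, and sink commit time on top, which is how a 100 ms trigger turns into several hundred milliseconds end to end.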

Operational complexity is in the middle. A Spark cluster is a known thing in 2026: most data teams have one or have access through Databricks, EMR, Dataproc, or Synapse. Adding streaming jobs to an existing deployment is mostly writing the code and pointing it at the cluster. Compared to Flink, this is smoother because Spark has had more eyes on it for longer; compared to Kafka Streams, it is heavier because there is a separate cluster to keep alive.

Teams that pick Structured Streaming usually have Spark for batch and want to consolidate on one engine. The unified API is real value: code can be shared between batch and streaming pipelines. PySpark course Module 9 covers Structured Streaming end to end; the architectural decision in this lesson is when to reach for it versus Flink or Kafka Streams.

The comparison

Putting the three side by side on the dimensions that matter for the choice.

Latency. Flink and Kafka Streams are both in the low-millisecond range for typical workloads. Spark Structured Streaming is several hundred milliseconds to a second in micro-batch mode. For most analytical and operational workloads, all three are fast enough. For sub-100ms requirements, Flink or Kafka Streams.

State management. Flink wins: RocksDB-backed state, savepoints, large-state support, exactly-once across operators. Kafka Streams second: RocksDB stores per task, durable backing in compacted topics, smaller scale. Spark third: in-memory state with checkpoint-based fault tolerance, simpler model, smaller scale.

Operational complexity. Kafka Streams wins (just a library). Spark second (an existing known cluster). Flink third (a separate cluster with its own learning curve). Flink moves from first in the state-management ranking to last here: the engine that gives you the most pays for it with the largest operational footprint.

Ecosystem fit. Spark for SQL-and-batch shops, especially anyone on Databricks or running PySpark. Kafka Streams for Java/Scala microservice shops on Kafka. Flink for the “we need real real-time” case, the regulated-finance “exactly-once across systems” case, the gaming/IoT/adtech case where state and latency both matter.

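Language support. Spark has full Python, Scala, and Java APIs for Structured Streaming. Flink is Java-first with decent Python (PyFlink); its Scala APIs were deprecated in Flink 1.17 and removed in 2.0, though Scala code can still call the Java API, and there is no Go or other language support. Kafka Streams is Java/Scala only; the unofficial ports for Python and Go are not production-grade.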

SQL frontends. Flink has Flink SQL, which is competent and getting better. Spark has Spark SQL, which is mature and excellent. Kafka Streams has ksqlDB (technically a separate product, but the de-facto SQL story).

flowchart LR
    subgraph Sources[Sources]
      K1[(Kafka)]
      K2[(Kinesis)]
      F1[(Files / S3)]
    end
    subgraph Engines[Engines]
      FL[Flink<br/>standalone cluster]
      KS[Kafka Streams<br/>library in app]
      SS[Spark Structured Streaming<br/>on Spark cluster]
    end
    subgraph Sinks[Sinks]
      OK[(Kafka topics)]
      DB[(Databases)]
      WH[(Warehouses)]
      DL[(Data lake)]
    end
    K1 --> FL
    K1 --> KS
    K1 --> SS
    K2 --> FL
    K2 --> SS
    F1 --> SS
    FL --> OK
    FL --> DB
    FL --> WH
    KS --> OK
    KS --> DB
    SS --> OK
    SS --> WH
    SS --> DL

Diagram: a side-by-side of the three engines, with their typical input sources on the left and output sinks on the right. The visual point is that the source side overlaps almost completely (any of the three can read from Kafka) but the sink side and the deployment shape differ. Flink is a standalone cluster with broad sink support. Kafka Streams is a library inside an app, primarily writing back to Kafka. Spark Structured Streaming is on a Spark cluster with strong warehouse and data-lake support.

When each is the right answer

Three questions usually clarify the choice.

What does your team already run? If you have Spark, Structured Streaming is the path of least resistance. If you have Kafka and a Java/Scala stack, Kafka Streams. If neither, the question shifts to the workload.

What are your latency requirements? Sub-100ms end-to-end, Flink or Kafka Streams. A second or two, any of the three.

How much state do you need to manage? Gigabytes or more per operator, Flink. A few hundred megabytes per partition with no cross-operator transactions, Kafka Streams or Spark are fine.
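The three questions can be written down as a small rule set. The sketch below is one possible encoding in plain Python, not an authoritative decision procedure: `candidates` is a hypothetical function, the 100 ms cut-off comes from the latency question above, and `large_state` is a yes/no simplification of the state question. Workload constraints narrow the field first; what the team already runs breaks the tie.

```python
def candidates(has_spark, has_kafka_jvm, latency_budget_ms, large_state):
    """Return the engines worth shortlisting, given the three questions."""
    engines = ["Flink", "Kafka Streams", "Spark Structured Streaming"]

    # Latency: micro-batch cannot meet a sub-100ms end-to-end budget.
    if latency_budget_ms < 100:
        engines.remove("Spark Structured Streaming")

    # State: very large state points to Flink regardless of the rest.
    if large_state:
        engines = ["Flink"]

    # Existing stack: prefer what the team already runs, if it survived.
    preferred = []
    if has_spark and "Spark Structured Streaming" in engines:
        preferred.append("Spark Structured Streaming")
    if has_kafka_jvm and "Kafka Streams" in engines:
        preferred.append("Kafka Streams")

    return preferred or engines
```

For example, a Spark shop with a two-second budget gets `["Spark Structured Streaming"]`, while the same shop with a 50 ms budget gets `["Flink", "Kafka Streams"]`, because the engine it already runs no longer fits the workload.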

The 2026 reality is that most teams pick one and run with it for years. Flink has the strongest case for greenfield streaming-heavy systems. Kafka Streams for existing Java microservice systems where streaming is one capability among many. Spark Structured Streaming for analytics-and-data-engineering shops where the same team writes batch and streaming pipelines.

The wrong-pick failure modes differ. Picking Flink when Kafka Streams would have been enough buys you a streaming platform team you did not need. Picking Kafka Streams when Flink would have been right gives you a pipeline that hits a state-management wall a year in. Picking Spark Structured Streaming when you needed sub-second latency gives you a working pipeline that misses its SLO. None catastrophic; all months of work to fix.

Cross-references

PySpark course Module 9 covers Structured Streaming in depth: API, watermarking and windowing, Kafka integration, writeStream sinks for Delta and Iceberg, production operational patterns.
Lesson 44 covers event time, watermarks, and windowing as a cross-engine concept. Vocabulary is shared even though implementations differ.
Lesson 45 covers exactly-once processing, where Flink’s transactional sinks, Kafka Streams’ transactional producer, and Spark’s idempotent writes play out differently.
Lesson 47 covers Change Data Capture, where the choice of stream processor interacts with the choice of CDC tool (Debezium most often).

Citations and further reading

Apache Flink documentation, https://flink.apache.org/ (retrieved 2026-05-01). The canonical reference. The “Concepts” section, especially the parts on state and time, repays careful reading.
Kafka Streams documentation, https://kafka.apache.org/documentation/streams/ (retrieved 2026-05-01). Part of the Apache Kafka project; concise and well-structured.
Spark Structured Streaming Programming Guide, https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html (retrieved 2026-05-01). The standard reference; the sections on output modes and watermarking are essential.
Tyler Akidau, Slava Chernyak, Reuven Lax, “Streaming Systems” (O’Reilly, 2018). The cross-engine conceptual reference; the model it describes maps cleanly onto Flink and reasonably onto Spark and Kafka Streams.
Fabian Hueske, Vasiliki Kalavri, “Stream Processing with Apache Flink” (O’Reilly, 2019). The standard Flink book.
Bill Bejeck, “Kafka Streams in Action” (Manning, 2nd edition, 2024). The standard Kafka Streams book.
Jules S. Damji et al, “Learning Spark” (O’Reilly, 2nd edition, 2020). The standard Spark book; the Structured Streaming chapters cover the basics that Module 9 of the PySpark course expands on.