Kafka Streams vs Spark
Kafka Streams vs Spark Structured Streaming, decided honestly: a per-record library vs a micro-batch cluster engine. With a side-by-side table.
Pick between a per-record library and a cluster engine.
Most "Kafka Streams vs Spark" comparisons drown you in a feature grid and miss the one split that actually decides it. Kafka Streams is a library you embed in your own app to process Kafka data one record at a time. Spark Structured Streaming is a distributed cluster engine whose heritage is micro-batch processing, sitting inside a platform that also does batch, SQL, and machine learning at scale. Those are different tools for different shapes of problem, and the decision falls out of that, not from counting connectors.
We sell neither, so there's no thumb on the scale. This is the senior-architect version: the deployment and latency models that actually decide it, and an honest account of where Spark wins outright.
What you'll learn:
- Why "library vs cluster engine" decides this, not a feature list
- The per-record vs micro-batch latency model (and Spark's continuous mode)
- A side-by-side of state, scaling, language, and sources
- When Kafka Streams wins, and when Spark clearly wins
Library vs cluster engine
Kafka Streams is an org.apache.kafka:kafka-streams JAR on your classpath. You write a topology, your service becomes the stream processor, and it scales like a consumer group: no separate cluster, no job to submit. It reads and writes Kafka, and only Kafka. The application team owns its memory, its state, and its restarts.
Spark is a distributed compute engine. You stand up a cluster, a driver plus executors, on YARN, Kubernetes, or a managed platform, and submit a job that the cluster schedules across executors, checkpointing state to durable storage. It reads from and writes to almost anything: Kafka, object storage, files, JDBC databases, and more, often several in one job. It scales far past your Kafka partition count, and it usually comes with a data or platform team operating it as shared infrastructure.
That single split, code you ship inside an app versus jobs you submit to a cluster, predicts most of the rest: who operates it, how it scales, what it can read, and which kind of workload it's built for. Start there; the feature comparison is a tiebreaker.
The latency model: per-record vs micro-batch
This is the difference people reach for first, and it's real, with a nuance worth getting right.
Kafka Streams processes one record at a time. A record arrives, the topology runs, the result is emitted. There's no batching delay inherent to the model, but the defaults add two: Streams overrides the producer's linger.ms to 100, so out-of-the-box end-to-end latency sits around 100 ms (set it back to 0 and it's genuinely single-digit milliseconds), and cached KTable aggregations coalesce downstream updates until the commit interval (30 s by default) unless you disable caching.
Spark Structured Streaming's heritage is micro-batch. It collects records into small batches, one per trigger (by default a new micro-batch starts as soon as the previous one finishes; a fixed interval is opt-in), and processes each batch as a tiny Spark job. That adds latency on the order of hundreds of milliseconds to seconds, in exchange for very high throughput and the full power of Spark's batch engine on each batch.
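The two models can be compared with a toy latency calculation. This is a sketch in Python, not either framework's API: the function names, the 5 ms and 50 ms processing costs, and the fixed 500 ms trigger are illustrative assumptions. It treats linger.ms as a worst-case wait before a result leaves the producer, and it assumes micro-batches fire on a fixed interval (opt-in in Spark), with every record waiting for the next trigger boundary plus the batch's processing time.

```python
def per_record_latency_ms(process_ms, linger_ms=0):
    """Worst-case latency for one record in a per-record model.

    The record is processed on arrival; the producer may then hold the
    result for up to linger_ms before sending it.
    """
    return process_ms + linger_ms


def micro_batch_latencies_ms(arrivals_ms, trigger_ms, batch_process_ms):
    """Latency per record when batches fire on a fixed trigger interval.

    Each record waits for the next trigger boundary at or after its
    arrival, then for the whole batch to finish processing.
    """
    latencies = []
    for arrival in arrivals_ms:
        # Round the arrival time up to the next multiple of trigger_ms.
        next_trigger = -(-arrival // trigger_ms) * trigger_ms
        latencies.append(next_trigger + batch_process_ms - arrival)
    return latencies


# Hypothetical numbers: 5 ms of topology work per record.
default_streams = per_record_latency_ms(process_ms=5, linger_ms=100)
tuned_streams = per_record_latency_ms(process_ms=5, linger_ms=0)

# Hypothetical numbers: 500 ms trigger, 50 ms to process each batch.
spark_batches = micro_batch_latencies_ms(
    [0, 10, 250, 499], trigger_ms=500, batch_process_ms=50
)
```

In this model a record that arrives just after a trigger fires pays almost the whole interval, while one that arrives just before it pays little more than the batch's processing time. That spread is the unpredictability the per-record model avoids.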
🚫 "Spark does real-time streaming, so its latency is the same as Kafka Streams."
Spark added a low-latency continuous processing mode (in 2.3) that drops per-record latency toward the millisecond range, but it's still experimental, gives only at-least-once (not exactly-once), and supports just map-like operations, no aggregations, so it narrows this gap for only a thin slice of jobs. Spark 4.1 shipped a separate Real-Time Mode, a third execution engine targeting sub-100 ms latency, currently limited to stateless queries. Micro-batch remains its center of gravity, the default, the best-supported path, and where its throughput and unified-batch strengths live. If single-record, millisecond-tail latency inside an application is the hard requirement, Kafka Streams meets it natively; with Spark you're choosing a mode and accepting its constraints.
Side by side
A map of where each is at home, not a scoreboard. The "when it wins" row is the one that matters.
| Dimension | Kafka Streams | Spark Structured Streaming |
|---|---|---|
| What it is | A JVM library embedded in your app | A distributed cluster engine (driver + executors) |
| Processing model | Per-record (event-at-a-time) | Micro-batch by default; low-latency continuous mode available |
| Deployment | A JAR, a normal service, no extra cluster | A cluster you operate (YARN / K8s / managed) |
| Languages | Java + Scala (official APIs); any JVM language | Scala, Python, Java, SQL + a large ML ecosystem |
| Sources / sinks | Kafka only, in and out | Kafka, object storage, files, JDBC, many more |
| Scaling ceiling | ≤ partition count of the input | Far beyond partition count (executor parallelism) |
| State & recovery | Local store + changelog replay | State backend + checkpoint to durable storage |
| Batch + ML | Streaming only | Unified batch, streaming, SQL, and ML in one engine |
| Operational owner | The application team | A data / platform team |
| When it wins | In-app stateful event processing, microservices | Heavy batch+stream unification, ML pipelines, multi-source ETL at scale |
Scaling ceiling. Kafka Streams parallelism tops out at the partition count of the busiest sub-topology. Extra instances past that sit idle, or serve as standbys if you set num.standby.replicas above its default of 0. Spark decouples parallelism from input partitions and scales across executors, so it handles genuinely huge jobs and wide fan-out that Kafka Streams can't. Most in-app workloads never approach the Streams ceiling, so this dimension only decides the large, cluster-scale jobs. See scaling Kafka Streams.
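The Streams ceiling is simple arithmetic, sketched here in Python. The function name and parameters are hypothetical, and the model is deliberately narrow: a single sub-topology, one task per input partition, and no standby tasks.

```python
def streams_thread_usage(input_partitions, instances, threads_per_instance=1):
    """Return (busy, idle) stream-thread counts for a single sub-topology.

    One task is created per input partition and each task runs on exactly
    one thread, so threads beyond the partition count have nothing to do.
    """
    total_threads = instances * threads_per_instance
    busy = min(input_partitions, total_threads)
    idle = total_threads - busy
    return busy, idle


# Hypothetical deployment: a 12-partition input topic.
under_ceiling = streams_thread_usage(12, instances=3, threads_per_instance=2)
over_ceiling = streams_thread_usage(12, instances=4, threads_per_instance=4)
```

With 12 partitions, 6 threads are all busy, but 16 threads leave 4 idle. Past that point the only ways up are more partitions or a different engine.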
State & recovery. Both checkpoint state durably, but the models differ in who holds it and how a restart behaves. Kafka Streams replays a changelog into a local store on the instance, which puts restore time on the app team. Spark checkpoints to durable storage (HDFS/S3/DBFS) that the cluster manages, so recovery is the cluster's job, not the app's.
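Changelog replay can be pictured as folding an ordered log into a key-value map. The Python sketch below is a toy model, not Kafka Streams code: it assumes changelog records are (key, value) pairs in offset order and that a None value is a tombstone that deletes the key.

```python
def restore_from_changelog(changelog):
    """Rebuild a local key-value store by replaying changelog records in order."""
    store = {}
    for key, value in changelog:
        if value is None:
            # Tombstone: the key was deleted upstream.
            store.pop(key, None)
        else:
            # Later records overwrite earlier ones for the same key.
            store[key] = value
    return store


# Hypothetical changelog for a small counting store.
changelog = [("a", 1), ("b", 2), ("a", 3), ("b", None)]
restored = restore_from_changelog(changelog)
```

Restore time in this picture grows with the number of records to replay, not with the size of the final store, which is why the restart cost lands on whoever owns the instance.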
When each wins
Choose by deployment model and workload shape, then confirm the capability fits.
Choose Kafka Streams when:
- The processing belongs inside an application your team owns and deploys (an event-driven microservice), not a data-platform job.
- Your sources and sinks are Kafka, and you're a JVM shop.
- You need per-record, low-millisecond latency without choosing a special engine mode.
- You want stream processing to scale and deploy like any other service, with no cluster to operate.
- The state is manageable, or you're prepared to own restore time and RocksDB memory.
Choose Spark when:
- You're unifying batch and streaming in one engine, with backfills and live processing sharing logic.
- You're pulling from non-Kafka sources (object storage, files, JDBC), often several systems in one job, for multi-source ETL.
- You need machine learning alongside streaming (feature pipelines, MLlib, model scoring at scale).
- You must scale past your Kafka partition count or run large shared jobs as platform infrastructure.
- Your authors live in Python, Scala, or SQL rather than the JVM DSL.
Be honest about the boundary: a heavy multi-source ETL or ML pipeline crammed into a Kafka-only library fights you forever, and so does an in-app, low-latency microservice forced onto a cluster you have to operate. Spark and Kafka Streams coexist in plenty of shops: Spark for the data-platform batch+stream+ML work, Kafka Streams for the stateful logic living inside services the app team owns. They're complements more often than rivals.
What is the difference between Kafka Streams and Spark Structured Streaming?
Kafka Streams is a JVM library you embed in your own application to process Kafka data per-record, scaling like a consumer group with no separate cluster. Spark Structured Streaming is a distributed cluster engine, micro-batch at its core, that reads many sources, scales past partition count, and unifies batch, streaming, SQL, and ML, operated as shared infrastructure.
Is Kafka Streams micro-batch?
No. Kafka Streams processes one record at a time (event-at-a-time), so there's no inherent batching delay. Spark Structured Streaming is the micro-batch engine: by default it groups records into small batches, one per trigger, though it also offers a low-latency continuous mode.
Which has lower latency, Kafka Streams or Spark?
Kafka Streams generally has lower and more predictable per-event latency because its model is per-record: around 100 ms out of the box, because Streams defaults the producer's linger.ms to 100, and single-digit milliseconds once you set it to 0. Spark's default micro-batch adds latency on the order of hundreds of milliseconds to seconds; its continuous mode narrows the gap, but micro-batch remains its primary, best-supported path.
When should I use Spark over Kafka Streams?
Choose Spark when you need to unify batch and streaming, read non-Kafka sources, run machine-learning pipelines, scale far past your Kafka partition count, or write in Python/Scala/SQL. Spark is a data-platform engine; Kafka Streams is for stateful logic embedded in a Kafka-native application your team owns.
Can Kafka Streams replace Spark for ETL?
Only for Kafka-to-Kafka, JVM-based, in-app transformations. Kafka Streams reads and writes Kafka only and is a library, not a multi-source ETL or batch engine. Multi-source ingestion, batch backfills, and ML feature pipelines at scale are Spark's territory, not Kafka Streams'.
See it in practice with Conduktor
Whichever engine you pick, the Kafka side of it shows up as consumer groups, lag, and topics. A Kafka Streams app is a consumer group named after its application.id, plus changelog topics. A Spark job reading Kafka appears as an auto-named consumer group too, but it doesn't commit offsets to Kafka by default (progress lives in its checkpoint), so set kafka.group.id and commit offsets yourself if you want Kafka-side lag. Conduktor Console lets you watch consumer group lag, inspect partition assignment, and see the topics each one reads and writes, so you can tell whether a Streams app is keeping up with the partitions it owns, independent of which framework does the processing.
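Consumer group lag itself is a per-partition subtraction: the log end offset minus the group's committed offset. The Python sketch below is an illustration of that arithmetic, not Conduktor's or Kafka's implementation. The function name and the dictionaries keyed by partition number are assumptions; a partition with no committed offset is reported as None, which is roughly how a job that tracks progress only in its own checkpoint would look from the Kafka side.

```python
def consumer_group_lag(end_offsets, committed_offsets):
    """Per-partition lag: log end offset minus the group's committed offset.

    Partitions with no committed offset are reported as None (unknown).
    """
    lag = {}
    for partition, end in end_offsets.items():
        committed = committed_offsets.get(partition)
        lag[partition] = None if committed is None else max(0, end - committed)
    return lag


# Hypothetical offsets for a three-partition topic.
end_offsets = {0: 100, 1: 250, 2: 80}
committed = {0: 100, 1: 200}  # nothing committed for partition 2
lag = consumer_group_lag(end_offsets, committed)
```

Here partition 0 is caught up, partition 1 is 50 records behind, and partition 2 has no Kafka-side progress to measure.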
Next steps
- Kafka Streams vs Flink, the other cluster-engine comparison, by deployment model
- What is Kafka Streams?, the library model, in depth
- Scaling Kafka Streams, why parallelism caps at your partition count