Flink vs Kafka Streams: When to Choose

Stéphane Derosiaux July 2, 2026 10 min read

Apache Flink is a distributed stream processing framework with its own cluster runtime — you submit jobs to a Flink cluster (JobManager + TaskManagers). Kafka Streams is a Java library you embed directly in your application — no separate cluster, no job submission, just a JAR. The core difference: Flink is infrastructure you operate; Kafka Streams is code you ship.

TL;DR

Dimension	Apache Flink	Kafka Streams
Deployment model	Cluster runtime (JobManager + TaskManagers)	Library embedded in your application
Language support	Java, Scala, Python, SQL	Java (Scala wrapper deprecated since 4.3)
Kafka dependency	Optional (any source/sink)	Mandatory (reads/writes Kafka only)
State backend	RocksDB, HashMap, custom	RocksDB (local), changelog topics for durability
Fault tolerance	Checkpoints to durable storage (S3, HDFS)	Changelog topics in Kafka
Event-time semantics	Full: watermarks, late data handling, timers	Full: event-time windowing, grace periods
Windowing	Tumbling, sliding, session, global, custom	Tumbling, hopping, session, sliding (KS 3.x)
SQL / Table API	Yes (Flink SQL, Table API)	No native SQL (ksqlDB is separate)
Batch processing	Unified (Flink handles batch and stream)	No — streaming only
Scaling model	Task-level parallelism set at submission	Partition-based, scales with app instances
Operational complexity	High (cluster management, checkpoints)	Low (standard JVM service operations)
Ecosystem	Broad (non-Kafka sources/sinks)	Kafka-centric

What is Apache Flink?

Apache Flink is a distributed stateful stream processing framework. You write a processing job (Java, Scala, Python, or SQL), package it, and submit it to a running Flink cluster. The cluster consists of a JobManager (coordinator: scheduling, checkpointing, failure recovery) and one or more TaskManagers (workers: execute processing tasks). State is stored in a configurable backend — typically RocksDB for large state, HashMap for small in-memory state. Fault tolerance uses periodic checkpoints: Flink snapshots state to durable external storage (S3, HDFS, GCS) so jobs can recover from exactly where they left off without reprocessing.

See What is Apache Flink? Stateful Stream Processing for a deeper treatment.

What is Kafka Streams?

Kafka Streams is a Java library (part of the Apache Kafka project since 0.10) for building stateful stream processing applications that read from and write to Kafka topics. A Scala wrapper exists but was deprecated in Kafka 4.3 and is targeted for removal in Kafka 5.0. There is no separate cluster: you add the kafka-streams dependency to your application and deploy instances like any other microservice. Kafka handles partition assignment and rebalancing across running instances. Local state is stored in RocksDB; durability comes from changelog topics in Kafka — if an instance crashes and restarts (possibly on a different host), it rebuilds local state by replaying the changelog.

See Introduction to Kafka Streams and State Stores in Kafka Streams for detailed coverage.

Architecture compared

Cluster-managed vs embedded library

Flink: Requires a running Flink cluster before any job can execute. In Kubernetes environments, this typically means a Flink Operator (e.g., Apache Flink Kubernetes Operator) managing JobManager and TaskManager pods. Jobs are submitted via REST API or CLI. Multiple jobs can share a cluster (session mode) or each job can have a dedicated cluster (application mode). You get centralized job management, a web UI, and cluster-level metrics.

Kafka Streams: No cluster to manage. Your application IS the processing cluster. Deploy multiple instances of your app; Kafka Streams uses Kafka's consumer group protocol to distribute partitions across instances automatically. Scale out: run more instances. Scale in: stop some instances; Kafka rebalances within minutes. No separate web UI — observability is through your application's metrics and Kafka's consumer group metrics.

State management

Flink state backends:

RocksDBStateBackend: state stored on TaskManager local disk, checkpointed to remote storage. Handles very large state (terabytes) that doesn't fit in memory. Suited for joins with large lookup tables, long-window aggregations.
HashMapStateBackend: state stored in JVM heap, checkpointed to remote storage. Fast for small-to-medium state but limited by TaskManager memory.
Checkpoints are full or incremental snapshots to S3/HDFS/GCS. On failure, Flink restores from the last successful checkpoint and reprocesses events since that checkpoint.

Kafka Streams state stores:

RocksDB by default, stored on the application instance's local disk.
Each state store has a corresponding changelog topic in Kafka. Every write to local state is mirrored to the changelog topic.
On failure/restart, Kafka Streams rebuilds state by replaying the changelog. For large state, this standby replica concept (configurable via num.standby.replicas) pre-warms state on standby instances to minimize recovery time.
State is co-partitioned: state store partitions align with input topic partitions, which simplifies joins but constrains topology design.

Event-time semantics

Both support full event-time processing with watermarks:

Flink: WatermarkStrategy assigns watermarks to events; the JobManager propagates watermarks across operators. Flink's event-time model handles out-of-order events, configurable late data strategies (drop, side output, or allow), and triggers. See Event Time and Watermarks in Flink and Flink State Management and Checkpointing.

Kafka Streams: Timestamps come from Kafka record metadata (CreateTime or LogAppendTime). Stream time advances as records arrive; late records are handled with configurable grace periods per window. Kafka Streams does not use Flink-style explicit watermarks — windowing advances based on stream time derived from record timestamps, with punctuation callbacks (wall-clock or stream-time) for processing-time triggers.

SQL and batch

Flink provides a unified Table API and Flink SQL that works on both streaming and batch data. You can write a single SQL query that runs on a bounded dataset (batch) or an unbounded stream. This is powerful for teams with SQL expertise and mixed batch/stream workloads. See Flink SQL and Table API.

Kafka Streams has no built-in SQL. ksqlDB (a separate product from Confluent) adds SQL over Kafka Streams, but it is its own deployment and operational surface.

Sources and sinks

Flink: Source/sink agnostic. Flink connectors exist for Kafka, Kinesis, Pulsar, Cassandra, JDBC, Elasticsearch, S3, Hudi, Iceberg, Delta Lake, and many others. Flink is commonly used in non-Kafka architectures.

Kafka Streams: Kafka-only. Input must come from Kafka topics; output must go to Kafka topics. External lookups are possible via GlobalKTable (backed by a topic) or custom state store implementations, but the primary data flow is Kafka ↔ Kafka Streams ↔ Kafka.

Operational trade-offs

Flink advantages:

Handles very large state (terabytes) via RocksDB + checkpoints to object storage
Cluster-level resource management and multi-job orchestration
SQL support for analyst-friendly pipeline development
Source/sink agnostic — works with any data system, not just Kafka
Unified batch and stream processing in one framework

Flink disadvantages:

Requires cluster infrastructure (JobManager, TaskManagers, checkpoint storage, cluster monitoring)
Higher operational burden: job versioning, checkpoint management, TaskManager memory tuning
Savepoints (for schema evolution or job upgrades) add operational ceremony
Harder to debug locally — requires a local Flink cluster or MiniCluster test harness

Kafka Streams advantages:

Zero additional infrastructure — deploy like any microservice
Fault tolerance via Kafka changelog — no external storage dependency
Simple horizontal scaling: add instances, Kafka rebalances
Easier local development and testing (EmbeddedKafka or TestTopologyDriver)
Natural fit for microservices architectures where each service owns its processing

Kafka Streams disadvantages:

Kafka-only: cannot consume from or produce to non-Kafka systems natively
No SQL interface — topology must be defined in Java/Scala code
State store rebuild from changelog can be slow for large state on cold start (mitigated by standbys)
No batch processing support

When to choose Flink

Your pipeline consumes from or writes to non-Kafka sources/sinks (JDBC, S3, Kinesis, Iceberg, etc.)
You need very large state that exceeds what comfortably fits on a Kafka Streams instance's disk
You require SQL-based pipeline authoring for analyst-driven transformations
You have batch and stream workloads that benefit from a unified execution engine
Your organization already runs Flink for other workloads (shared cluster amortizes cost)
You need complex event patterns (CEP) via Flink's CEP library

When to choose Kafka Streams

Your entire pipeline is Kafka ↔ Kafka — no external sources or sinks needed
You want zero infrastructure overhead — embed processing in your microservice
Your team is Java-first and prefers library APIs over framework deployments
Your state is moderate in size (up to tens of GB per partition, manageable on instance disk)
You are building a microservice that owns both its business logic and its stream processing
Local development simplicity is important — TestTopologyDriver makes unit testing easy

Migration considerations

Porting a Kafka Streams topology to Flink (or vice versa) requires rewriting the processing code — the APIs are different (Flink DataStream API vs Kafka Streams DSL).
State migration is the hard part: Kafka Streams state lives in changelog topics (accessible via Kafka); Flink state lives in checkpoints on object storage (opaque binary format). There is no tooling for direct state migration between the two.
If you need SQL, migrating from Kafka Streams to Flink SQL is the more common direction — Kafka Streams topology is translated into Flink SQL Table API definitions.

See also: the existing Kafka Streams vs Apache Flink page for additional architectural depth. For a hands-on Kafka Streams walkthrough, see the Learn Kafka Streams course.

Do I need a Kafka cluster to use Flink?

No. Flink is source/sink agnostic. You can use Flink with Kafka as one of many connectors, but Flink also works with Kinesis, Pulsar, JDBC, S3, and more. Kafka Streams, by contrast, requires Kafka — it reads from and writes to Kafka topics exclusively.

Can Kafka Streams handle large state?

Yes, but with caveats. Kafka Streams uses RocksDB for local state, so state size is bounded by instance disk. For very large state (hundreds of GB to terabytes), recovery from changelog topics can be slow unless standby replicas are configured. Flink with RocksDB + checkpoints to object storage scales more gracefully for massive state.

Is Flink harder to operate than Kafka Streams?

Yes, meaningfully so. Flink requires a cluster (JobManager + TaskManagers), checkpoint storage configuration, job lifecycle management (submit, cancel, upgrade, savepoint), and cluster-level monitoring. Kafka Streams is a library — you operate it like any other JVM application, with no additional infrastructure.

Which has better exactly-once guarantees?

Both support exactly-once semantics. Kafka Streams achieves EOS via Kafka transactions (producer + consumer in a single transaction per commit interval). Flink achieves exactly-once internal state via checkpoints; end-to-end exactly-once with Kafka sinks uses a two-phase commit protocol coordinated with Flink checkpoints — both the source and sink must support it. See Exactly-Once Semantics in Kafka for the Kafka side of this story.

Conduktor Console: Inspect stream processing topology and debug stateful operators. Explore Conduktor Console →