# Flink vs Kafka Streams: When to Choose

**Apache Flink** is a distributed stream processing framework with its own cluster runtime — you submit jobs to a Flink cluster (JobManager + TaskManagers). **Kafka Streams** is a Java library you embed directly in your application — no separate cluster, no job submission, just a JAR. The core difference: Flink is infrastructure you operate; Kafka Streams is code you ship.

## TL;DR

| Dimension | Apache Flink | Kafka Streams |
|---|---|---|
| Deployment model | Cluster runtime (JobManager + TaskManagers) | Library embedded in your application |
| Language support | Java, Scala, Python, SQL | Java (Scala wrapper deprecated since 4.3) |
| Kafka dependency | Optional (any source/sink) | Mandatory (reads/writes Kafka only) |
| State backend | RocksDB, HashMap, custom | RocksDB (local), changelog topics for durability |
| Fault tolerance | Checkpoints to durable storage (S3, HDFS) | Changelog topics in Kafka |
| Event-time semantics | Full: watermarks, late data handling, timers | Full: event-time windowing, grace periods |
| Windowing | Tumbling, sliding, session, global, custom | Tumbling, hopping, session, sliding (since Kafka 2.7) |
| SQL / Table API | Yes (Flink SQL, Table API) | No native SQL (ksqlDB is separate) |
| Batch processing | Unified (Flink handles batch and stream) | No — streaming only |
| Scaling model | Task-level parallelism set at submission | Partition-based, scales with app instances |
| Operational complexity | High (cluster management, checkpoints) | Low (standard JVM service operations) |
| Ecosystem | Broad (non-Kafka sources/sinks) | Kafka-centric |

## What is Apache Flink?

Apache Flink is a distributed stateful stream processing framework. You write a processing job (Java, Scala, Python, or SQL), package it, and submit it to a running Flink cluster. The cluster consists of a **JobManager** (coordinator: scheduling, checkpointing, failure recovery) and one or more **TaskManagers** (workers: execute processing tasks). State is stored in a configurable backend, typically RocksDB for large state and HashMap for small in-memory state. Fault tolerance uses periodic checkpoints: Flink snapshots state to durable external storage (S3, HDFS, GCS), so after a failure a job restores the last successful checkpoint and replays only the events that arrived since then, rather than reprocessing from the beginning.

See [What is Apache Flink? Stateful Stream Processing](https://www.conduktor.io/glossary/what-is-apache-flink-stateful-stream-processing) for a deeper treatment.

## What is Kafka Streams?

Kafka Streams is a Java library (part of the Apache Kafka project since 0.10) for building stateful stream processing applications that read from and write to [Kafka topics](https://www.conduktor.io/glossary/kafka-topics-explained). A Scala wrapper exists but was deprecated in Kafka 4.3 and is targeted for removal in Kafka 5.0. There is no separate cluster: you add the `kafka-streams` dependency to your application and deploy instances like any other microservice. Kafka handles partition assignment and rebalancing across running instances. Local state is stored in RocksDB; durability comes from **changelog topics** in Kafka — if an instance crashes and restarts (possibly on a different host), it rebuilds local state by replaying the changelog.

See [Introduction to Kafka Streams](https://www.conduktor.io/glossary/introduction-to-kafka-streams) and [State Stores in Kafka Streams](https://www.conduktor.io/glossary/state-stores-in-kafka-streams) for detailed coverage.

## Architecture compared

### Cluster-managed vs embedded library

**Flink**: Requires a running Flink cluster before any job can execute. In Kubernetes environments, this typically means a Flink Operator (e.g., Apache Flink Kubernetes Operator) managing JobManager and TaskManager pods. Jobs are submitted via REST API or CLI. Multiple jobs can share a cluster (session mode) or each job can have a dedicated cluster (application mode). You get centralized job management, a web UI, and cluster-level metrics.

**Kafka Streams**: No cluster to manage. Your application IS the processing cluster. Deploy multiple instances of your app; Kafka Streams uses Kafka's [consumer group](https://www.conduktor.io/glossary/kafka-consumer-groups-explained) protocol to distribute partitions across instances automatically. Scale out: run more instances (useful parallelism is capped by the number of input partitions). Scale in: stop some instances, and Kafka reassigns their partitions to the remaining ones; how long that takes depends on session timeouts and on how much state has to be restored. No separate web UI: observability is through your application's metrics and Kafka's consumer group metrics.
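
The partition-based scaling described above can be pictured with a small model. This is a minimal sketch that uses only the JDK: the class name `PartitionAssignmentSketch` and the round-robin policy are assumptions made for illustration, and the real Kafka Streams assignor is more sophisticated (for example, it tries to keep tasks on the instances that already hold their state).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Simplified model of spreading input partitions over application instances. */
final class PartitionAssignmentSketch {

    /** Assigns partitions 0..partitionCount-1 round-robin to the given instances. */
    static Map<String, List<Integer>> assign(int partitionCount, List<String> instances) {
        if (instances.isEmpty()) {
            throw new IllegalArgumentException("at least one instance is required");
        }
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        for (String instance : instances) {
            assignment.put(instance, new ArrayList<>());
        }
        for (int partition = 0; partition < partitionCount; partition++) {
            String owner = instances.get(partition % instances.size());
            assignment.get(owner).add(partition);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Six partitions over three instances: two partitions each.
        System.out.println(assign(6, List.of("app-1", "app-2", "app-3")));
        // One instance stopped: the two survivors split its partitions.
        System.out.println(assign(6, List.of("app-1", "app-2")));
        // More instances than partitions: the extra instance gets nothing.
        System.out.println(assign(2, List.of("app-1", "app-2", "app-3")));
    }
}
```

The last case is the practical limit to keep in mind: adding instances beyond the number of input partitions does not add processing capacity.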

### State management

**Flink state backends:**
- `EmbeddedRocksDBStateBackend` (named `RocksDBStateBackend` before Flink 1.13): state stored on TaskManager local disk, checkpointed to remote storage. Handles very large state (terabytes) that doesn't fit in memory. Suited for joins with large lookup tables and long-window aggregations.
- `HashMapStateBackend`: state stored in JVM heap, checkpointed to remote storage. Fast for small-to-medium state but limited by TaskManager memory.
- Checkpoints are full or incremental snapshots to S3/HDFS/GCS. On failure, Flink restores from the last successful checkpoint and reprocesses events since that checkpoint.
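
The restore-and-reprocess behaviour can be sketched without any Flink dependency. The toy model below assumes a single operator that sums integers and a replayable input addressed by offset; `CheckpointReplaySketch` and its parameters are hypothetical names, not Flink APIs. It only illustrates why restoring a snapshot together with the matching input position gives the same result as a run without failure.

```java
import java.util.List;

/** Toy model of checkpoint-based recovery: snapshot state plus input position, restore, replay. */
final class CheckpointReplaySketch {

    /**
     * Sums the events, taking a snapshot every checkpointInterval events.
     * If failAtOffset is within range, simulates one crash just before that offset is processed.
     */
    static long run(List<Integer> events, int checkpointInterval, int failAtOffset) {
        if (checkpointInterval <= 0) {
            throw new IllegalArgumentException("checkpointInterval must be positive");
        }
        long checkpointedSum = 0L;   // state stored in the last snapshot
        int checkpointedOffset = 0;  // input position stored with that snapshot
        long sum = 0L;
        int offset = 0;
        boolean failed = false;
        while (offset < events.size()) {
            if (!failed && offset == failAtOffset) {
                // Crash: in-flight state is lost. Restore the last snapshot and
                // rewind the input to the position that was saved with it.
                failed = true;
                sum = checkpointedSum;
                offset = checkpointedOffset;
                continue;
            }
            sum += events.get(offset);
            offset++;
            if (offset % checkpointInterval == 0) {
                checkpointedSum = sum;
                checkpointedOffset = offset;
            }
        }
        return sum;
    }
}
```

Calling `run(events, 3, 5)` and `run(events, 3, -1)` on the same list returns the same sum: the events between the last checkpoint and the crash are processed twice, but their first effect on state was discarded with the lost in-flight state.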

**Kafka Streams state stores:**
- RocksDB by default, stored on the application instance's local disk.
- Each state store has a corresponding **changelog topic** in Kafka. Every write to local state is mirrored to the changelog topic.
- On failure/restart, Kafka Streams rebuilds state by replaying the changelog. For large state, standby replicas (configurable via `num.standby.replicas`) keep a pre-warmed copy of the state on other instances to minimize recovery time.
- State is co-partitioned: state store partitions align with input topic partitions, which simplifies joins but constrains topology design.
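
The changelog mechanism in the list above can be modelled in a few lines. This is a sketch under simplifying assumptions: an in-memory `HashMap` stands in for RocksDB, a `List` stands in for the changelog topic, and `ChangelogStoreSketch` is a hypothetical name rather than a Kafka Streams class.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy key-value store that mirrors every write to an append-only changelog. */
final class ChangelogStoreSketch {

    /** One changelog entry; a null value is treated as a delete (tombstone). */
    static final class Change {
        final String key;
        final Long value;

        Change(String key, Long value) {
            this.key = key;
            this.value = value;
        }
    }

    private final Map<String, Long> local = new HashMap<>(); // stands in for RocksDB
    private final List<Change> changelog;                    // stands in for the changelog topic

    ChangelogStoreSketch(List<Change> changelog) {
        this.changelog = changelog;
    }

    /** Writes to local state and mirrors the write to the changelog. */
    void put(String key, Long value) {
        apply(key, value);
        changelog.add(new Change(key, value));
    }

    Long get(String key) {
        return local.get(key);
    }

    /** Rebuilds local state on a fresh instance by replaying the changelog from the start. */
    static ChangelogStoreSketch restore(List<Change> changelog) {
        ChangelogStoreSketch restored = new ChangelogStoreSketch(changelog);
        for (Change change : changelog) {
            restored.apply(change.key, change.value);
        }
        return restored;
    }

    private void apply(String key, Long value) {
        if (value == null) {
            local.remove(key);
        } else {
            local.put(key, value);
        }
    }
}
```

In this model a standby replica would simply be a second store that keeps applying changelog entries as they are appended, which is why it has little left to replay when it takes over.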

### Event-time semantics

Both support event-time processing, but through different mechanisms:

**Flink**: A `WatermarkStrategy` assigns timestamps and generates watermarks, usually at the sources; watermarks then flow through the dataflow alongside the records, and an operator with several inputs advances its event-time clock to the minimum watermark across them. Flink's event-time model handles out-of-order events, configurable late data strategies (drop, side output, or allowed lateness), and triggers. See [Event Time and Watermarks in Flink](https://www.conduktor.io/glossary/event-time-and-watermarks-in-flink) and [Flink State Management and Checkpointing](https://www.conduktor.io/glossary/flink-state-management-and-checkpointing).
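
A common way to generate watermarks is to tolerate a fixed amount of out-of-orderness, which Flink exposes as `WatermarkStrategy.forBoundedOutOfOrderness`. The JDK-only sketch below models that idea; `BoundedOutOfOrdernessSketch` is a hypothetical class, and the exact boundary arithmetic (the `- 1` terms) should be read as an assumption of this model rather than a specification.

```java
/** Toy model of a watermark that trails the highest timestamp seen by a fixed bound. */
final class BoundedOutOfOrdernessSketch {

    private final long maxOutOfOrdernessMs;
    private long maxTimestampSeenMs;

    BoundedOutOfOrdernessSketch(long maxOutOfOrdernessMs) {
        this.maxOutOfOrdernessMs = maxOutOfOrdernessMs;
        // Start low enough that the first watermark computation cannot underflow.
        this.maxTimestampSeenMs = Long.MIN_VALUE + maxOutOfOrdernessMs + 1;
    }

    /** Records an event timestamp; out-of-order events never move the maximum backwards. */
    void onEvent(long eventTimestampMs) {
        maxTimestampSeenMs = Math.max(maxTimestampSeenMs, eventTimestampMs);
    }

    /** The watermark asserts that no event with a timestamp at or below it is still expected. */
    long currentWatermark() {
        return maxTimestampSeenMs - maxOutOfOrdernessMs - 1;
    }

    /** A window [start, end) may fire once the watermark reaches its last timestamp, end - 1. */
    boolean windowCanFire(long windowEndExclusiveMs) {
        return currentWatermark() >= windowEndExclusiveMs - 1;
    }
}
```

With a 5-second bound, events at 10 000, 12 000 and 9 000 ms leave the watermark at 6 999 ms, so a window ending at 10 000 ms stays open until an event at 15 000 ms or later pushes the watermark forward.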

**Kafka Streams**: Timestamps come from Kafka record metadata (`CreateTime` or `LogAppendTime`) by default; a custom `TimestampExtractor` can read them from the payload instead. Stream time advances as records arrive; late records are handled with configurable grace periods per window. Kafka Streams does not use Flink-style explicit watermarks: windowing advances based on stream time derived from record timestamps, and punctuation callbacks (stream-time or wall-clock) cover logic that must run on a schedule.
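
Stream time and grace periods can be modelled the same way. The sketch below assumes a single partition, tumbling windows, and non-negative timestamps; `StreamTimeGraceSketch` is a hypothetical class, and the exact comparison at the window-close boundary is an assumption of the model.

```java
/** Toy model of stream time with a per-window grace period (tumbling windows). */
final class StreamTimeGraceSketch {

    private final long windowSizeMs;
    private final long graceMs;
    private long streamTimeMs = -1L; // no record observed yet

    StreamTimeGraceSketch(long windowSizeMs, long graceMs) {
        this.windowSizeMs = windowSizeMs;
        this.graceMs = graceMs;
    }

    /** Start of the tumbling window that contains the given non-negative timestamp. */
    long windowStart(long timestampMs) {
        return timestampMs - (timestampMs % windowSizeMs);
    }

    /**
     * Advances stream time to the highest timestamp seen so far and reports whether
     * the record is still accepted (true) or dropped as too late (false).
     */
    boolean offer(long timestampMs) {
        streamTimeMs = Math.max(streamTimeMs, timestampMs);
        long windowEndMs = windowStart(timestampMs) + windowSizeMs;
        // The window stays open until stream time reaches window end + grace.
        return streamTimeMs < windowEndMs + graceMs;
    }

    long streamTime() {
        return streamTimeMs;
    }
}
```

With 10-second windows and a 2-second grace period, a record stamped 9 000 ms is still accepted while stream time is 11 000 ms, but the same window rejects a record stamped 9 500 ms once stream time has reached 12 500 ms. Note that stream time only advances when records arrive, so an idle partition keeps its windows open.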

### SQL and batch

Flink provides a unified Table API and Flink SQL that works on both streaming and batch data. You can write a single SQL query that runs on a bounded dataset (batch) or an unbounded stream. This is powerful for teams with SQL expertise and mixed batch/stream workloads. See [Flink SQL and Table API](https://www.conduktor.io/glossary/flink-sql-and-table-api-for-stream-processing).

Kafka Streams has no built-in SQL. ksqlDB (a separate product from Confluent) adds SQL over Kafka Streams, but it is its own deployment and operational surface.

### Sources and sinks

**Flink**: Source/sink agnostic. Flink connectors exist for Kafka, Kinesis, Pulsar, Cassandra, JDBC, Elasticsearch, S3, Hudi, Iceberg, Delta Lake, and many others. Flink is commonly used in non-Kafka architectures.

**Kafka Streams**: Kafka-only. Input must come from Kafka topics; output must go to Kafka topics. External lookups are possible via `GlobalKTable` (backed by a topic) or custom state store implementations, but the primary data flow is Kafka ↔ Kafka Streams ↔ Kafka.

## Operational trade-offs

**Flink advantages:**
- Handles very large state (terabytes) via RocksDB + checkpoints to object storage
- Cluster-level resource management and multi-job orchestration
- SQL support for analyst-friendly pipeline development
- Source/sink agnostic — works with any data system, not just Kafka
- Unified batch and stream processing in one framework

**Flink disadvantages:**
- Requires cluster infrastructure (JobManager, TaskManagers, checkpoint storage, cluster monitoring)
- Higher operational burden: job versioning, checkpoint management, TaskManager memory tuning
- Savepoints (for schema evolution or job upgrades) add operational ceremony
- Harder to debug locally — requires a local Flink cluster or MiniCluster test harness

**Kafka Streams advantages:**
- Zero additional infrastructure — deploy like any microservice
- Fault tolerance via Kafka changelog — no external storage dependency
- Simple horizontal scaling: add instances, Kafka rebalances
- Easier local development and testing (EmbeddedKafka or `TopologyTestDriver`)
- Natural fit for microservices architectures where each service owns its processing

**Kafka Streams disadvantages:**
- Kafka-only: cannot consume from or produce to non-Kafka systems natively
- No SQL interface — topology must be defined in Java/Scala code
- State store rebuild from changelog can be slow for large state on cold start (mitigated by standbys)
- No batch processing support

## When to choose Flink

- Your pipeline consumes from or writes to **non-Kafka sources/sinks** (JDBC, S3, Kinesis, Iceberg, etc.)
- You need **very large state** that exceeds what comfortably fits on a Kafka Streams instance's disk
- You require **SQL-based pipeline authoring** for analyst-driven transformations
- You have **batch and stream workloads** that benefit from a unified execution engine
- Your organization already runs Flink for other workloads (shared cluster amortizes cost)
- You need **complex event patterns** (CEP) via Flink's CEP library

## When to choose Kafka Streams

- Your entire pipeline is **Kafka ↔ Kafka** — no external sources or sinks needed
- You want **zero infrastructure overhead** — embed processing in your microservice
- Your team is **Java-first** and prefers library APIs over framework deployments
- Your state is **moderate in size** (up to tens of GB per partition, manageable on instance disk)
- You are building a **microservice** that owns both its business logic and its stream processing
- **Local development simplicity** is important: `TopologyTestDriver` makes unit testing easy

## Migration considerations

- Porting a Kafka Streams topology to Flink (or vice versa) requires rewriting the processing code — the APIs are different (Flink DataStream API vs Kafka Streams DSL).
- State migration is the hard part: Kafka Streams state lives in changelog topics (accessible via Kafka); Flink state lives in checkpoints on object storage (opaque binary format). There is no tooling for direct state migration between the two.
- If you need SQL, migrating from Kafka Streams to Flink SQL is the more common direction. There is no automatic translator: the Kafka Streams topology has to be re-expressed by hand as Flink SQL or Table API definitions.

See also: the [Kafka Streams vs Apache Flink](https://www.conduktor.io/glossary/kafka-streams-vs-apache-flink) page for additional architectural depth. For a hands-on Kafka Streams walkthrough, see the [Learn Kafka Streams course](https://www.conduktor.io/kafka-streams).

## FAQ

**Do I need a Kafka cluster to use Flink?**

No. Flink is source/sink agnostic. You can use Flink with Kafka as one of many connectors, but Flink also works with Kinesis, Pulsar, JDBC, S3, and more. Kafka Streams, by contrast, requires Kafka — it reads from and writes to Kafka topics exclusively.

**Can Kafka Streams handle large state?**

Yes, but with caveats. Kafka Streams uses RocksDB for local state, so state size is bounded by instance disk. For very large state (hundreds of GB to terabytes), recovery from changelog topics can be slow unless standby replicas are configured. Flink with RocksDB + checkpoints to object storage scales more gracefully for massive state.
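
To get a feel for "slow", a back-of-the-envelope estimate helps. The figures below are purely illustrative assumptions (a 500 GB store and a sustained restore rate of 50 MB/s), not measurements; `RestoreTimeEstimate` is a hypothetical helper.

```java
/** Back-of-the-envelope estimate of how long a changelog replay takes. */
final class RestoreTimeEstimate {

    /** Seconds needed to replay stateBytes at a sustained restoreBytesPerSecond. */
    static double restoreSeconds(long stateBytes, long restoreBytesPerSecond) {
        if (restoreBytesPerSecond <= 0) {
            throw new IllegalArgumentException("restore rate must be positive");
        }
        return (double) stateBytes / restoreBytesPerSecond;
    }

    public static void main(String[] args) {
        long stateBytes = 500_000_000_000L; // assumed: 500 GB of state (decimal units)
        long rateBytesPerSec = 50_000_000L; // assumed: 50 MB/s sustained restore rate
        double seconds = restoreSeconds(stateBytes, rateBytesPerSec);
        System.out.printf("cold restore: %.0f s (about %.1f h)%n", seconds, seconds / 3600.0);
    }
}
```

Under those assumptions a cold restore takes about 10 000 seconds, a little under three hours, whereas a standby that is only a few hundred megabytes behind would catch up in seconds. Real rates vary widely with record size, compaction, disks, and network.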

**Is Flink harder to operate than Kafka Streams?**

Yes, meaningfully so. Flink requires a cluster (JobManager + TaskManagers), checkpoint storage configuration, job lifecycle management (submit, cancel, upgrade, savepoint), and cluster-level monitoring. Kafka Streams is a library — you operate it like any other JVM application, with no additional infrastructure.

**Which has better exactly-once guarantees?**

Both support exactly-once semantics. Kafka Streams achieves EOS via Kafka transactions: output records, changelog writes, and consumed-offset commits are written atomically in one transaction per commit interval. Flink achieves exactly-once internal state via checkpoints; end-to-end exactly-once with Kafka sinks uses a two-phase commit protocol coordinated with Flink checkpoints, which requires a replayable source and a transactional sink. See [Exactly-Once Semantics in Kafka](https://www.conduktor.io/glossary/exactly-once-semantics-in-kafka) for the Kafka side of this story.
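
On the Kafka Streams side, exactly-once is switched on through configuration. The sketch below builds the relevant properties with plain `java.util.Properties` so that it runs without the Kafka client on the classpath. The configuration keys are real Kafka Streams settings; the class name, the application id, and the broker address in the usage note are placeholders.

```java
import java.util.Properties;

/** Builds a minimal Kafka Streams configuration with exactly-once processing enabled. */
final class ExactlyOnceConfigSketch {

    static Properties exactlyOnceProps(String applicationId, String bootstrapServers) {
        Properties props = new Properties();
        props.setProperty("application.id", applicationId);
        props.setProperty("bootstrap.servers", bootstrapServers);
        // The default guarantee is "at_least_once"; "exactly_once_v2" turns on
        // transactional processing and needs brokers on version 2.5 or newer.
        props.setProperty("processing.guarantee", "exactly_once_v2");
        return props;
    }
}
```

In a real application, something like `exactlyOnceProps("orders-aggregator", "localhost:9092")` would be passed to the `KafkaStreams` constructor together with the topology.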
