Why Kafka Streams rebalancing is slow
Kafka Streams rebalancing slow? Why stateful rebalances stall for minutes, the configs that trigger them, and how cooperative rebalancing fixes it.
Diagnose and fix slow Kafka Streams rebalances.
A rebalance in a plain consumer group is cheap: partitions move, consumers resume, life goes on. A rebalance in a stateful Kafka Streams app can stall every instance for minutes while local state is rebuilt, and the lag piling up during that stall is what gets you paged. This is the single most common production complaint about Kafka Streams, and the official troubleshooting for it is thin.
This page is about why that happens and the specific configs that make it stop.
What you'll learn:
- Why a stateful rebalance is effectively stop-the-world
- The real symptoms: multi-minute stalls,
Running↔Rebalancingloops, rebalance storms on Kubernetes - The configs that trigger rebalances you didn't ask for
- The fix toolbox: cooperative rebalancing, warm-up replicas, standbys, static membership
Why a stateful rebalance stops the world
When the consumer group protocol reassigns work, a stateless consumer just starts polling its new partitions. A Kafka Streams task can't. Each stateful task owns a state store (a local RocksDB database backed by a changelog topic) and it cannot process a single record until that store holds the right data for its partition.
So when a task lands on an instance that has no local copy of its state, the instance must restore the store by replaying the changelog topic before processing resumes. With the classic eager rebalance protocol, the sequence is brutal:
- Every instance revokes all of its tasks (and releases their stream threads).
- The group leader computes a new assignment.
- Each instance picks up its newly assigned tasks and restores any state it's missing.
- Only then does processing resume, across the whole application.
Steps 1 and 4 are the problem. Under eager rebalancing, one instance joining or leaving makes the entire app stop processing while everyone hands back their work and re-acquires it. If any reassigned task has a large store, step 3 dominates and the stall stretches from seconds into minutes. The restore itself (why a big store takes so long) is its own topic; see state restore time.
The headline number people quote ("my rebalance took 40 minutes") is almost always restore time wearing a rebalance costume.
What it actually looks like in production
The symptoms are distinctive once you've seen them:
- Processing stalls for minutes, then catches up. Consumer-group lag climbs in a straight line during the stall, then drains. A deploy or a single pod restart triggers it.
- A
Running↔Rebalancingloop after a client upgrade. The app never settles. State threads keep getting revoked. This usually means a mismatch in the rebalance protocol or assignor across instances during a rolling upgrade: half the group speaks one protocol, half speaks another. - Rebalance storms on Kubernetes. A team running 10 pods, 7 stream threads each, across ~180 partitions sees rebalances that take four to five minutes and seem to fire constantly. High thread-to-partition ratios make every reassignment touch more tasks, and spot-instance churn keeps the group membership moving.
- A rolling restart rebalances on every single pod. You expected one settle at the end. Instead each pod that bounces triggers a full group rebalance.
That last one surprises people, and it's not a bug.
Kafka Streams opts out of clean leave-on-close, on purpose. A normal consumer leaves the group on a graceful shutdown, triggering one prompt rebalance. Kafka Streams deliberately keeps a closing instance in the group: through Kafka 4.1 it forced
internal.leave.group.on.close=false; since 4.2 that internal config is gone andclose()defaults toREMAIN_IN_GROUP(KIP-1153), withKafkaStreams.close(CloseOptions.groupMembershipOperation(LEAVE_GROUP))as the explicit opt-out for scale-in. The reasoning: a Streams instance is usually coming back (a rolling restart, a pod reschedule), and you don't want to reshuffle expensive state every time one bounces. The cost is that a shut-down instance stays "in" the group until itssession.timeout.msexpires, so a rolling restart still rebalances, just on the session-timeout clock rather than instantly. Static membership (below) is the tool that turns this caveat into an advantage.
The configs that trigger rebalances you didn't ask for
Before tuning the rebalance mechanism, stop the spurious rebalances. Most "constant rebalancing" reports trace back to one of these:
| Config | Default | How it bites |
|---|---|---|
max.poll.interval.ms | 300000 (5 min) | If one poll()→poll() cycle takes longer than this (a slow punctuator, a blocking external call, a long restore on the same thread), the broker assumes the instance is dead and kicks it out. It rejoins → rebalance. A too-low value here is the classic self-inflicted rebalance loop. |
session.timeout.ms | 45000 (since 3.0) | The heartbeat deadline. Set too low for your network or GC pauses, a transient stall looks like death and ejects the instance. Set higher, and you tolerate blips at the cost of slower detection of real failures. |
heartbeat.interval.ms | 3000 | Should stay roughly ⅓ of session.timeout.ms. Rarely the root cause alone, but a mismatch makes the session timeout behave unexpectedly. |
session.timeout.ms when the real culprit is max.poll.interval.ms. They fail differently: blowing the session timeout means heartbeats stopped (the instance was unreachable or paused); blowing the poll interval means heartbeats kept flowing but the application thread was stuck between polls. Check which one the logs name before you touch either. 🚫 "We're on cooperative rebalancing, so rebalances don't pause processing anymore."
Cooperative rebalancing reduces the pause dramatically, but it does not make it zero, least of all for stateful apps. A task that moves to a new instance still has to restore its state before it runs there, and that instance is still blocked on that task during the restore. Cooperative rebalancing's win is that tasks which stay put keep running; it does not abolish the cost of moving state. If you've adopted it and your stalls didn't vanish, restore time is your real problem, not the protocol.
Fix #1: incremental cooperative rebalancing
The biggest single improvement, and you most likely already have it. The eager protocol's "revoke everything, then redistribute" dance is what made one membership change freeze the whole group. Incremental cooperative rebalancing (KIP-429) changes that: instances keep the tasks they're going to retain and only revoke the specific tasks that need to move. Most of the app keeps processing through the rebalance.
This has been the default in Kafka Streams since 2.4 via the StreamsPartitionAssignor. You don't enable it with a flag; you get it unless you've overridden the assignor or you're running a version old enough that you have other problems. (For the underlying consumer-protocol mechanics, see incremental rebalancing and static membership.)
How much does it matter? Confluent's published benchmark, a 10-instance stateful Streams app with RocksDB stores under a rolling bounce, measured total pause time dropping from 37,138 ms to 3,522 ms, roughly a 10x reduction in stop-the-world time, just by moving from eager to cooperative (confluent.io/blog).
The one operational catch: you cannot jump protocols in a single rolling bounce. Upgrading from a pre-2.4 eager deployment requires a documented two-rolling-restart upgrade path, because the group can't mix eager and cooperative members arbitrarily. That mismatch is, not coincidentally, a common cause of the Running↔Rebalancing loop after an upgrade.
Fix #2: warm-up replicas (probing rebalances)
Cooperative rebalancing stops unrelated tasks from pausing, but a task that genuinely needs to move to a new instance still blocks that instance while it restores. KIP-441 makes that migration non-disruptive.
When the assignor wants to move a stateful task, it does not hand the active task over immediately. It first places a warm-up replica on the destination instance, which restores the store in the background (off the hot path) while the original instance keeps serving the active task. Periodic probing rebalances check whether the warm-up has caught up. Only once it's within acceptable.recovery.lag of the changelog tail does the active task actually switch over, which is then near-instant.
| Config | Default | What it controls |
|---|---|---|
acceptable.recovery.lag | 10000 | How many records behind the changelog a warm-up store can be and still be treated as "caught up" and eligible to take over the active task. Lower = stricter (longer warm-up, faster handover); higher = looser. |
max.warmup.replicas | 2 | The cap on how many warm-up replicas the assignor will move at once. Raise it to migrate state faster when scaling out, at the cost of extra restore traffic and broker load. |
probing.rebalance.interval.ms | 600000 (10 min) | How often the assignor runs a probing rebalance to check warm-up progress and rebalance toward the balanced assignment. |
Fix #3: standby replicas
num.standby.replicas (default 0) tells Streams to keep N extra copies of each store continuously updated on other instances. These standbys tail the changelog in real time, so they're always close to current.
Standbys are primarily a crash insurance: if an instance dies, a task can fail over to an instance that already holds a hot copy of its state, skipping most of the restore. They cost extra disk, memory, and changelog-consumption traffic on every instance: you're maintaining N+1 copies of your state.
Crucially, standbys interact with warm-up replicas: a standby that's already hot is a much cheaper starting point for promotion than a cold instance restoring from zero. Set num.standby.replicas=1 if your restore times are painful and you can afford the resource overhead. Just be clear about what they do and don't cover: the rolling-restart caveat is covered in depth in state restore time.
Fix #4: static membership, do or don't
Static membership (group.instance.id) gives each instance a stable identity. When an instance with a static ID disconnects and reconnects within session.timeout.ms, the broker recognizes it as the same member and hands back its previous assignment without a rebalance at all.
This is the direct antidote to the rolling-restart problem. A pod bounces, comes back with the same group.instance.id inside the session window, and reclaims its tasks and its local state: no reshuffle, no restore.
# Each instance MUST get a UNIQUE, STABLE value.
# On Kubernetes, the StatefulSet pod ordinal works well:
group.instance.id=streams-app-${POD_NAME}
session.timeout.ms=120000 # must exceed your worst-case restart-to-rejoin time The honest tradeoff, and why it's a "do or don't" rather than an "always do":
- The IDs must be unique and stable. Two instances with the same
group.instance.idget fenced. Generate them from something durable (a StatefulSet ordinal), never a random UUID per process: a fresh UUID on every restart defeats the entire mechanism. - It widens your real-failure detection window. The same
session.timeout.msthat lets a planned restart skip the rebalance also delays detecting a genuinely dead instance. Set it long enough to cover a normal restart but short enough that a true crash doesn't strand partitions for minutes. This is a real tension, and there's no universally correct value; pick it from your actual restart timings. - Pair it with persistent local state. Static membership lets the rejoining pod reclaim its tasks, but it only avoids a restore if the local store survived the restart. On ephemeral pods with no persistent volume, the state is gone and you restore anyway; see state restore time.
Use it when restarts are frequent and predictable (Kubernetes, autoscaling). Skip it when your instances are genuinely ephemeral and identity churns, where it adds fragility for little gain.
What's changing: KIP-848 and KIP-1071
The rebalance machinery itself is being rebuilt, and it's worth knowing where it's headed.
- KIP-848: the new consumer group protocol. Moves rebalancing off the clients and onto the broker-side group coordinator, replacing the leader-driven "stop and recompute" model with incremental, coordinator-driven reconciliation. It went GA in Kafka 4.0. For plain consumers it removes whole classes of rebalance-storm behavior. Streams cannot ride it, though: the Streams main consumer hard-codes
group.protocol=classic, which is exactly why KIP-1071 exists. - KIP-1071: a Streams-specific rebalance protocol. Built on KIP-848's foundation but aware of Streams concepts the plain consumer protocol can't see: tasks, standbys, warm-ups, and stateful assignment. This is the one that targets the pain on this page directly. It reached GA in Kafka 4.2, but read the fine print: it's opt-in on the client (
group.protocol=streams; new 4.2 brokers ship the feature enabled) and ships a deliberate subset. Sticky assignment only (no warm-up or rack-aware placement yet), no static membership, no on-the-fly topology changes, and migration to or from the classic protocol is offline only (and broken for existing groups in 4.2.0 by a broker-side bug, KAFKA-20254). Pilot it; don't default-migrate production yet. No default-switch has been announced, and client protocol defaults historically flip only at major releases, so most apps still run the classic protocol today.
Until then, the four fixes above are what you have, and for most teams cooperative rebalancing plus warm-up replicas plus right-sized timeouts already turns a multi-minute stall into a tolerable blip.
Why is my Kafka Streams rebalance so slow?
The rebalance itself is fast; the slow part is state restore. A task that moves to a new instance must replay its changelog into a local RocksDB store before it can process, and a large store can stretch that from seconds into minutes. The "40-minute rebalance" people quote is almost always restore time wearing a rebalance costume.
Why does my Kafka Streams app keep rebalancing in a loop?
A Running ↔ Rebalancing loop after a client upgrade usually means a rebalance-protocol or assignor mismatch across instances during a rolling upgrade. Constant spurious rebalances more often trace to a too-low max.poll.interval.ms (a slow punctuator or blocking call exceeds it, so the broker ejects the instance and it rejoins).
What is cooperative rebalancing and does it stop the pause?
Incremental cooperative rebalancing (KIP-429, default in Streams since 2.4) lets instances keep the tasks they retain and only revoke tasks that must move, so most of the app keeps processing. It reduces the pause dramatically but does not make it zero: a task that moves still has to restore its state before it runs.
Should I use static membership to reduce rebalances?
Use it when restarts are frequent and predictable (Kubernetes, autoscaling): a stable group.instance.id lets a bouncing instance reclaim its tasks within session.timeout.ms with no rebalance. The IDs must be unique and stable (a StatefulSet ordinal, never a fresh UUID per process), and it widens your real-failure detection window. Pair it with persistent local state, or you still restore.
What is the KIP-1071 Streams rebalance protocol and can I use it?
KIP-1071 is a Streams-aware rebalance protocol built on KIP-848 that understands tasks, standbys, and warm-ups. It reached GA in Kafka 4.2 but is opt-in and ships a deliberate subset (sticky assignment only, no static membership, offline-only migration). Pilot it; don't default-migrate production. Most apps still run the classic protocol today.
See it in practice with Conduktor
A rebalance is, from the outside, your consumer group going quiet and lag climbing. Conduktor Console shows your Streams app's consumer-group lag and partition assignment in real time, so you can see exactly when the group stalls, watch lag drain as restore completes, and confirm tasks landed where you expected, instead of inferring all of it from application logs.
Next steps
- State restore time & standby replicas: why the stall is actually restore, and how to shorten it
- Scaling Kafka Streams: threads, tasks, and the parallelism ceiling that shapes rebalances
- Kafka Streams architecture: tasks, threads, and how assignment works under the hood