Kafka Streams anti-patterns

A field guide to Kafka Streams anti-patterns: blocking calls, unbounded state, broken co-partitioning, and topology mistakes that bite you in production.

Spot the eight that cause most production incidents.

Most Kafka Streams outages are not exotic. They trace back to a handful of design choices that look reasonable in a code review and fall over under load, during a rebalance, or six months later when someone edits the topology. This page is a map: each anti-pattern gets the symptom you'll see, why it's wrong, and a link to the page that fixes it properly.

If you read one section, read the first. A blocking call inside the topology is the single most common way a Kafka Streams app goes from "fast" to "stalled" with no error in the logs.

What you'll learn:

  • The eight anti-patterns behind most Streams production incidents
  • The symptom each one produces, so you can recognize it in your own app
  • Why each is wrong at the mechanism level, not just "best practice says so"
  • Where to go for the real fix

1. Synchronous external calls inside the topology

Symptom: throughput collapses under load; one slow dependency stalls a whole partition; consumer lag climbs while CPU sits idle; with exactly-once, transactions time out and you see producer-fenced errors.

A map or process that makes a blocking REST or database call per record is the most common Streams anti-pattern, and the most damaging. A stream thread processes records on one partition sequentially. Block it on a 50ms network round-trip and that partition's ceiling is ~20 records/second, no matter how many cores you have. It gets worse:

  • It breaks ordering and back-pressure. The thread can't move to the next record until the call returns; a single slow upstream caps the partition.
  • It breaks exactly-once. A transaction stays open across the blocking call. If the call is slow, the transaction can hit transaction.timeout.ms and abort, and a task migration fences the producer mid-flight.
  • It couples availability. Your Streams app is now only as available as the slowest service it calls per record.

The fix: bring the data to the stream instead of calling out per record. Load reference data as a KTable or GlobalKTable and join against it, a local lookup, no network. When you truly must call an external system, make it asynchronous and batched, and accept that you've left the comfortable part of the framework.

🚫 "It's just one quick REST call inside the map to enrich each event, it's fast enough."

2. Hand-rolling joins and aggregations

Symptom: a growing pile of bespoke code that keeps state in a plain HashMap, loses it on restart, double-counts after a rebalance, and silently corrupts on reprocessing.

Implementing a join by stashing one side in a static map, or an aggregation by mutating a field in a processor, throws away everything the framework gives you: changelog-backed fault tolerance, correct partitioning, and recovery after failure. A HashMap in your app is not fault-tolerant, when the instance dies, the state is gone, and a rebalance moves the partition to an instance that never saw the data.

The fix: use the DSL's aggregations and joins. They're backed by a state store and a changelog topic, so state survives crashes and follows partitions. Drop to the Processor API only for logic the DSL genuinely can't express, and even then, use a real state store, not a field.

3. Unbounded state with no TTL or windowing

Symptom: disk fills up over weeks; restarts get slower and slower; RocksDB memory creeps until the container is OOMKilled; the changelog topic grows without bound.

Aggregating on an ever-growing key space, user IDs, session IDs, request IDs, with a plain KTable keeps every key forever. There is no automatic expiry on a non-windowed aggregate. The state store, its changelog, and the restore time all grow with your key cardinality until something breaks.

The fix: bound the state deliberately. Use windowed aggregations with a retention period so old windows are purged, or a windowed store, or a Processor API punctuator that deletes expired keys on a timer. If state is genuinely unbounded by nature, plan for its size explicitly, see state restore time and RocksDB tuning.

4. Ignoring co-partitioning

Symptom: a join that produces nothing, or drops records, with no exception in the logs.

A KStream-KStream or KStream-KTable join requires both sides to be co-partitioned: same partition count and records placed by the same partitioner. You'd hope a partition-count mismatch fails fast with TopologyException, but Streams only runs that co-partitioning check when the topology contains at least one repartition topic (verified on Kafka 3.9 and 4.3). A plain join of two source topics with mismatched counts starts cleanly and silently misses matches on the extra partitions. A partitioner mismatch is just as quiet, if one topic was written by a producer using a different partitioning scheme, the same key lands on different partition numbers on each side, the join silently sees no match, and you get empty or partial output with nothing in the logs.

The fix: ensure both sides share a partition count and partitioner, or re-key and let Streams repartition. This failure is subtle enough to deserve its own page, see joins that drop data.

5. Not naming your operators

Symptom: a topology change you expected to be a rolling upgrade instead orphans all your state; the app refuses to start, or starts with an empty store, after you inserted one operator.

Kafka Streams names internal stores, changelog topics, and repartition topics positionally, KSTREAM-AGGREGATE-STATE-STORE-0000000005 and the like. Insert or reorder an operator and every downstream number shifts. The new topology looks for ...0000000006, finds nothing, and your existing state under ...0000000005 is orphaned.

The fix: name everything explicitly from day one, Materialized.as(...), Grouped.as(...), Repartitioned.as(...), StreamJoined.withName(...), Named.as(...). Names you control don't shift when the graph changes. Since Kafka 4.3, building a topology with unnamed internal resources logs a WARN listing each one, and setting ensure.explicit.internal.resource.naming=true (KIP-1111) turns any unnamed operator into a hard startup failure. This is the foundation of safe upgrades, see evolving a topology.

6. One giant topology

Symptom: a deploy of one unrelated feature forces a full-app rebalance and state restore; you can't scale or tune one part without affecting all of it; one poison record takes down the whole pipeline.

Cramming every unrelated stream into a single application.id couples concerns that should be independent. The blast radius of any change, any rebalance, and any failure becomes the entire application. Independent workloads end up sharing thread pools, restore time, and failure domains for no reason.

The fix: split unrelated pipelines into separate applications with their own application.id, deployed and scaled independently. Smaller topologies rebalance faster, restore less state, and isolate failures. Use sub-topologies (Streams splits the graph at repartition boundaries) within an app, and separate apps across unrelated domains, see scaling.

7. Using exactly-once as a deduplication substitute

Symptom: you turn on exactly_once_v2 expecting all duplicates to vanish, and downstream consumers still see repeats.

Exactly-once semantics make the consume-process-produce cycle inside Kafka atomic. They do not deduplicate records that were duplicated before they reached your topology, a producer that sent the same business event twice, an upstream retry, a re-ingested file. EOS is effect-once for the read-process-write loop, not a content-level dedup over your input.

The fix: understand exactly what EOS covers and what it doesn't (exactly-once, and why you still see duplicates), then deduplicate on a business key explicitly with a state store when you need content-level dedup (deduplication). Layer them: idempotent producer upstream, EOS for the cycle, explicit dedup for cross-producer repeats.

8. In-memory stores for large state

Symptom: steady-state performance is great, then a restart or rebalance takes minutes because the store rebuilds from scratch every time.

An in-memory store (Stores.inMemoryKeyValueStore) is bounded by your heap and, crucially, holds nothing on disk. So on every restart it must replay its entire changelog from the beginning before processing a single record. A persistent RocksDB store keeps data on local disk and only catches up on the delta it missed. For large state, in-memory turns a 10-second restart into a multi-minute one.

The fix: use the default persistent (RocksDB) store for anything but small, fast-churning state where you've accepted the full-restore cost. The trade-off is laid out in state stores, and the restore mechanics in state restore time.

The common thread

Seven of these eight share a root cause: treating Kafka Streams like an ordinary application and ignoring that it's a stateful, partitioned, fault-tolerant one. State lives in stores backed by changelogs, work is pinned to partitions, and tasks move between instances on rebalance. Design with that model and most of these anti-patterns never appear. Fight it, block a thread, keep state in a field, ignore partitioning, and you'll meet them in production.

What are the most common Kafka Streams anti-patterns?

The ones behind most production incidents are synchronous external calls inside the topology, hand-rolling joins and aggregations in a plain HashMap, unbounded state with no TTL or windowing, ignoring co-partitioning, not naming operators, cramming everything into one giant topology, treating exactly-once as a dedup mechanism, and using in-memory stores for large state.

Should I call a REST API or database inside map()?

No. A stream thread processes one partition sequentially, so a blocking 50ms call caps that partition at roughly 20 records/second regardless of cores, breaks back-pressure, and can blow transaction.timeout.ms under exactly-once. Bring the data to the stream as a KTable or GlobalKTable and join against it instead.

What is the "ever-growing state" anti-pattern and how do I avoid it?

Aggregating on an unbounded key space with a plain KTable keeps every key forever, so disk, changelog, and restore time grow until something is OOMKilled. Bound it deliberately with windowed aggregations and a retention period, or a Processor API punctuator that deletes expired keys on a timer.

Why is naming your operators a production best practice?

Kafka Streams names internal stores, changelog, and repartition topics positionally, so inserting or reordering one operator shifts every downstream number and orphans your existing state. Naming everything explicitly with Materialized.as(...), Grouped.as(...), and Named.as(...) keeps those names stable across topology changes.

Does exactly-once deduplicate records duplicated before they reach my topology?

No. Exactly-once makes the consume-process-produce cycle inside Kafka atomic, but it does not remove duplicates a producer sent twice, an upstream retry, or a re-ingested file. For content-level dedup you need an explicit dedup step on a business key with a state store.

See it in practice with Conduktor

Several of these anti-patterns show up first as Kafka symptoms, not application errors: climbing consumer group lag (the blocking-call stall), a sprawling set of internal topics (unbounded state), or an empty changelog after a deploy (orphaned state from an unnamed operator). Conduktor Console surfaces the lag, the changelog and repartition topics, and the partition assignment you need to catch these before they page you.

Next steps