# Kafka Streams anti-patterns

*Spot the eight that cause most production incidents.*

Most Kafka Streams outages are not exotic. They trace back to a handful of design choices that look reasonable in a code review and fall over under load, during a rebalance, or six months later when someone edits the topology. This page is a map: each anti-pattern gets the symptom you'll see, why it's wrong, and a link to the page that fixes it properly.

If you read one section, read the first. A blocking call inside the topology is the single most common way a Kafka Streams app goes from "fast" to "stalled" with no error in the logs.

**What you'll learn:**
- The eight anti-patterns behind most Streams production incidents
- The symptom each one produces, so you can recognize it in your own app
- Why each is wrong at the mechanism level, not just "best practice says so"
- Where to go for the real fix

## 1. Synchronous external calls inside the topology

**Symptom:** throughput collapses under load; one slow dependency stalls a whole partition; consumer lag climbs while CPU sits idle; with exactly-once, transactions time out and you see producer-fenced errors.

A `map` or `process` that makes a blocking REST or database call per record is the most common Streams anti-pattern, and the most damaging. A stream thread processes records on one partition *sequentially*. Block it on a 50ms network round-trip and that partition's ceiling is ~20 records/second, no matter how many cores you have. It gets worse:

- **It breaks ordering and back-pressure.** The thread can't move to the next record until the call returns; a single slow upstream caps the partition.
- **It breaks exactly-once.** A transaction stays open across the blocking call. If the call is slow, the transaction can hit `transaction.timeout.ms` and abort, and a [task migration fences the producer](https://www.conduktor.io/kafka-streams/exactly-once-duplicates) mid-flight.
- **It couples availability.** Your Streams app is now only as available as the slowest service it calls per record.

**The fix:** bring the data *to* the stream instead of calling out per record. Load reference data as a [`KTable` or `GlobalKTable`](https://www.conduktor.io/kafka-streams/kstream-ktable-globalktable) and [join](https://www.conduktor.io/kafka-streams/joins) against it, a local lookup, no network. When you truly must call an external system, make it asynchronous and batched, and accept that you've left the comfortable part of the framework.

> 🚫 *"It's just one quick REST call inside the `map` to enrich each event, it's fast enough."*

## 2. Hand-rolling joins and aggregations

**Symptom:** a growing pile of bespoke code that keeps state in a plain `HashMap`, loses it on restart, double-counts after a rebalance, and silently corrupts on reprocessing.

Implementing a join by stashing one side in a static map, or an aggregation by mutating a field in a processor, throws away everything the framework gives you: changelog-backed fault tolerance, correct partitioning, and recovery after failure. A `HashMap` in your app is not fault-tolerant, when the instance dies, the state is gone, and a rebalance moves the partition to an instance that never saw the data.

**The fix:** use the DSL's [aggregations](https://www.conduktor.io/kafka-streams/aggregations) and [joins](https://www.conduktor.io/kafka-streams/joins). They're backed by a state store and a changelog topic, so state survives crashes and follows partitions. Drop to the [Processor API](https://www.conduktor.io/kafka-streams/processor-api) only for logic the DSL genuinely can't express, and even then, use a *real* state store, not a field.

## 3. Unbounded state with no TTL or windowing

**Symptom:** disk fills up over weeks; restarts get slower and slower; RocksDB memory creeps until the container is OOMKilled; the changelog topic grows without bound.

Aggregating on an ever-growing key space, user IDs, session IDs, request IDs, with a plain `KTable` keeps *every* key forever. There is no automatic expiry on a non-windowed aggregate. The state store, its changelog, and the restore time all grow with your key cardinality until something breaks.

**The fix:** bound the state deliberately. Use [windowed aggregations](https://www.conduktor.io/kafka-streams/windowing) with a retention period so old windows are purged, or a windowed store, or a [Processor API punctuator](https://www.conduktor.io/kafka-streams/processor-api) that deletes expired keys on a timer. If state is genuinely unbounded by nature, plan for its size explicitly, see [state restore time](https://www.conduktor.io/kafka-streams/state-restore) and [RocksDB tuning](https://www.conduktor.io/kafka-streams/rocksdb-tuning).

## 4. Ignoring co-partitioning

**Symptom:** a join that produces nothing, or drops records, with no exception in the logs.

A `KStream`-`KStream` or `KStream`-`KTable` join requires both sides to be *co-partitioned*: same partition count **and** records placed by the same partitioner. You'd hope a partition-count mismatch fails fast with `TopologyException`, but Streams only runs that co-partitioning check when the topology contains at least one repartition topic (verified on Kafka 3.9 and 4.3). A plain join of two source topics with mismatched counts starts cleanly and silently misses matches on the extra partitions. A partitioner mismatch is just as quiet, if one topic was written by a producer using a different partitioning scheme, the same key lands on different partition numbers on each side, the join silently sees no match, and you get empty or partial output with nothing in the logs.

**The fix:** ensure both sides share a partition count and partitioner, or re-key and let Streams repartition. This failure is subtle enough to deserve its own page, see [joins that drop data](https://www.conduktor.io/kafka-streams/join-troubleshooting).

## 5. Not naming your operators

**Symptom:** a topology change you expected to be a rolling upgrade instead orphans all your state; the app refuses to start, or starts with an empty store, after you inserted one operator.

Kafka Streams names internal stores, changelog topics, and repartition topics *positionally*, `KSTREAM-AGGREGATE-STATE-STORE-0000000005` and the like. Insert or reorder an operator and every downstream number shifts. The new topology looks for `...0000000006`, finds nothing, and your existing state under `...0000000005` is orphaned.

**The fix:** name everything explicitly from day one, `Materialized.as(...)`, `Grouped.as(...)`, `Repartitioned.as(...)`, `StreamJoined.withName(...)`, `Named.as(...)`. Names you control don't shift when the graph changes. Since Kafka 4.3, building a topology with unnamed internal resources logs a WARN listing each one, and setting `ensure.explicit.internal.resource.naming=true` (KIP-1111) turns any unnamed operator into a hard startup failure. This is the foundation of safe upgrades, see [evolving a topology](https://www.conduktor.io/kafka-streams/topology-evolution).

## 6. One giant topology

**Symptom:** a deploy of one unrelated feature forces a full-app rebalance and state restore; you can't scale or tune one part without affecting all of it; one poison record takes down the whole pipeline.

Cramming every unrelated stream into a single `application.id` couples concerns that should be independent. The blast radius of any change, any rebalance, and any failure becomes the entire application. Independent workloads end up sharing thread pools, restore time, and failure domains for no reason.

**The fix:** split unrelated pipelines into separate applications with their own `application.id`, deployed and scaled independently. Smaller topologies rebalance faster, restore less state, and isolate failures. Use sub-topologies (Streams splits the graph at repartition boundaries) within an app, and separate apps across unrelated domains, see [scaling](https://www.conduktor.io/kafka-streams/scaling).

## 7. Using exactly-once as a deduplication substitute

**Symptom:** you turn on `exactly_once_v2` expecting all duplicates to vanish, and downstream consumers still see repeats.

Exactly-once semantics make the *consume-process-produce cycle inside Kafka* atomic. They do **not** deduplicate records that were duplicated *before* they reached your topology, a producer that sent the same business event twice, an upstream retry, a re-ingested file. EOS is effect-once for the read-process-write loop, not a content-level dedup over your input.

**The fix:** understand exactly what EOS covers and what it doesn't ([exactly-once, and why you still see duplicates](https://www.conduktor.io/kafka-streams/exactly-once-duplicates)), then deduplicate on a business key explicitly with a state store when you need content-level dedup ([deduplication](https://www.conduktor.io/kafka-streams/deduplication)). Layer them: idempotent producer upstream, EOS for the cycle, explicit dedup for cross-producer repeats.

## 8. In-memory stores for large state

**Symptom:** steady-state performance is great, then a restart or rebalance takes minutes because the store rebuilds from scratch every time.

An in-memory store (`Stores.inMemoryKeyValueStore`) is bounded by your heap and, crucially, holds nothing on disk. So on every restart it must replay its *entire* changelog from the beginning before processing a single record. A persistent RocksDB store keeps data on local disk and only catches up on the delta it missed. For large state, in-memory turns a 10-second restart into a multi-minute one.

**The fix:** use the default persistent (RocksDB) store for anything but small, fast-churning state where you've accepted the full-restore cost. The trade-off is laid out in [state stores](https://www.conduktor.io/kafka-streams/state-store), and the restore mechanics in [state restore time](https://www.conduktor.io/kafka-streams/state-restore).

## The common thread

Seven of these eight share a root cause: treating Kafka Streams like an ordinary application and ignoring that it's a *stateful, partitioned, fault-tolerant* one. State lives in stores backed by changelogs, work is pinned to partitions, and tasks move between instances on rebalance. Design with that model and most of these anti-patterns never appear. Fight it, block a thread, keep state in a field, ignore partitioning, and you'll meet them in production.

**What are the most common Kafka Streams anti-patterns?**

The ones behind most production incidents are synchronous external calls inside the topology, hand-rolling joins and aggregations in a plain `HashMap`, unbounded state with no TTL or windowing, ignoring co-partitioning, not naming operators, cramming everything into one giant topology, treating exactly-once as a dedup mechanism, and using in-memory stores for large state.

**Should I call a REST API or database inside map()?**

No. A stream thread processes one partition sequentially, so a blocking 50ms call caps that partition at roughly 20 records/second regardless of cores, breaks back-pressure, and can blow `transaction.timeout.ms` under exactly-once. Bring the data to the stream as a `KTable` or `GlobalKTable` and join against it instead.

**What is the "ever-growing state" anti-pattern and how do I avoid it?**

Aggregating on an unbounded key space with a plain `KTable` keeps every key forever, so disk, changelog, and restore time grow until something is OOMKilled. Bound it deliberately with windowed aggregations and a retention period, or a Processor API punctuator that deletes expired keys on a timer.

**Why is naming your operators a production best practice?**

Kafka Streams names internal stores, changelog, and repartition topics positionally, so inserting or reordering one operator shifts every downstream number and orphans your existing state. Naming everything explicitly with `Materialized.as(...)`, `Grouped.as(...)`, and `Named.as(...)` keeps those names stable across topology changes.

**Does exactly-once deduplicate records duplicated before they reach my topology?**

No. Exactly-once makes the consume-process-produce cycle inside Kafka atomic, but it does not remove duplicates a producer sent twice, an upstream retry, or a re-ingested file. For content-level dedup you need an explicit dedup step on a business key with a state store.

> **See it in practice with Conduktor**
> Several of these anti-patterns show up first as Kafka symptoms, not application errors: climbing consumer group lag (the blocking-call stall), a sprawling set of internal topics (unbounded state), or an empty changelog after a deploy (orphaned state from an unnamed operator). [Conduktor Console](https://docs.conduktor.io/guide/manage-kafka/kafka-resources/topics) surfaces the lag, the changelog and repartition topics, and the partition assignment you need to catch these before they page you.

## Next steps

- [Joins that drop data](https://www.conduktor.io/kafka-streams/join-troubleshooting), diagnosing the co-partitioning failure
- [Evolving a topology safely](https://www.conduktor.io/kafka-streams/topology-evolution), why naming operators matters
- [Deduplication in Kafka Streams](https://www.conduktor.io/kafka-streams/deduplication), content-level dedup the right way
