# Kafka Streams exactly-once but still seeing duplicates

*Diagnose duplicates that survive exactly_once_v2.*

You set `processing.guarantee=exactly_once_v2`, redeployed, and there are still duplicate records downstream. This is the most-pasted Kafka Streams support thread in existence, and the frustrating part is that the flag is almost always doing exactly what it promises. The duplicates are real, but they're coming from somewhere the Kafka transaction was never able to reach.

This page assumes you already know what exactly-once *is*. If not, start with [exactly-once](https://www.conduktor.io/kafka-streams/exactly-once). Here we go straight to the failure modes: the four ways duplicates slip past a correctly-configured EOS app, the alarming-but-usually-benign errors in your logs, and the upgrade hazard that produced *genuine* duplicate processing on real clusters.

**What you'll learn:**
- The four real causes of duplicates with EOS switched on, each with a fix
- Why `InvalidProducerEpochException` and "task may be migrated out" are usually noise
- The version-upgrade path that caused actual duplicate processing
- The broker config EOS needs, and why `initTransactions` hangs without it

> 🚫 *"exactly_once_v2 removes every duplicate record everywhere in my pipeline."*

Exactly-once makes the **consume → process → produce cycle inside Kafka** atomic. That is the entire scope. It does not reach your database, your REST calls, your downstream consumers' config, or two genuinely-distinct records that happen to mean the same thing. Every cause below is a place that boundary doesn't cover.

## Cause 1: a downstream consumer is reading uncommitted records

This is the single most common one, and it's not even a bug in your Streams app.

When EOS is on, your app writes output inside a Kafka transaction. Records from an *aborted* transaction physically exist in the partition (they were written, then marked aborted) and they stay there until compaction or retention removes them. Whether a consumer *sees* them is decided entirely by that consumer's `isolation.level`:

| `isolation.level` | What it reads |
|---|---|
| `read_uncommitted` (the consumer default) | Every record, including ones from aborted and in-flight transactions |
| `read_committed` | Only records from committed transactions |

The default for a plain Kafka consumer is `read_uncommitted`. So a *correctly* producing EOS Streams app can write a clean, de-duplicated output topic, and a downstream service reading that topic with default settings will still see the aborted attempts, which look exactly like duplicates.

The fix is one line on the **downstream** consumer:

```properties
isolation.level=read_committed
```

This applies to every reader of an EOS-produced topic: a second Streams app (it sets `read_committed` for you when *it* runs EOS, but not otherwise), a Kafka Connect sink, a microservice, an ad-hoc console consumer you're using to debug. If you're verifying "did EOS work?" with `kafka-console-consumer` and not passing `--isolation-level read_committed`, you are looking at the aborted records and concluding EOS is broken. It isn't; your consumer is.

## Cause 2: the external sink isn't idempotent

EOS covers Kafka reads, Kafka writes, and Kafka offset commits, as one transaction. It cannot enroll anything that isn't Kafka.

So if your `process()` or `foreach()` does this:

```java
// Inside the topology: NOT covered by the Kafka transaction
.foreach((key, order) -> jdbc.execute(
    "INSERT INTO orders(id, total) VALUES (?, ?)", order.id(), order.total()));
```

…then when a batch aborts and reprocesses (a crash, a rebalance, a transient broker error, all normal), that `INSERT` runs a second time. The Kafka side rolls back cleanly; the row in your database does not. You get a duplicate row with EOS reporting success, because from Kafka's point of view nothing went wrong.

The same applies to every non-Kafka effect: an HTTP `POST` to a payment API, an email, a webhook, a write to a cache or search index. The transaction aborts; the side effect already happened.

The fix is to make the *sink itself* absorb the duplicate, keyed by something deterministic from the record (not a generated UUID or a timestamp):

```sql
-- Idempotent write: reprocessing the same record is a no-op
INSERT INTO orders (id, total) VALUES (?, ?)
ON CONFLICT (id) DO UPDATE SET total = EXCLUDED.total;
```

For a REST call, prefer an endpoint that takes an idempotency key (most payment and messaging providers support this) and derive that key from the record's business id. The general rule: **EOS is effect-once for Kafka; for everything else, you make the edge idempotent.** This is also why many teams skip EOS entirely and run at-least-once with an idempotent sink: same result at the boundary that matters, lower cost. That trade-off is covered on the [exactly-once](https://www.conduktor.io/kafka-streams/exactly-once) page.

## Cause 3: non-deterministic processing logic

EOS quietly assumes that reprocessing the same input produces the same output. That assumption is what lets it abort an attempt and retry it as if the first never happened.

Break the assumption and you get duplicates that look impossible: two *different* output records for one input. The usual culprits live inside the topology:

- **Reading the wall clock**: `System.currentTimeMillis()`, `Instant.now()`. The retry sees a later time, emits a different value.
- **Random or UUID generation**: `UUID.randomUUID()` as a record key or field. Every reprocess produces a new id, so a downstream dedup keyed on it sees two distinct records.
- **External mutable state**: a lookup against a database or cache that changed between the first attempt and the retry.

After an abort-and-retry, the first attempt's output is invisible to `read_committed` readers, *if it was deterministic*. If your logic produced value A on the first try and value B on the retry, only B is committed, but if A leaked through a non-Kafka side effect (cause 2) or you generated a fresh key, you now have two records in play. The fix is to make the logic deterministic: derive keys and timestamps from the *record* (event-time carried in the payload, a business id from the value), never from ambient state.

## Cause 4: the records are genuinely distinct; EOS was never going to help

This is the case people miss because it's not a malfunction at all.

If an upstream producer publishes the same logical event **twice** (two separate records, two offsets, perhaps because *its* producer retried without [idempotence](https://www.conduktor.io/kafka/idempotent-kafka-producer), or a batch job ran twice), Kafka Streams sees two distinct inputs and faithfully processes both. EOS guarantees each input is handled exactly once. It has no idea the two inputs are "the same event"; they aren't, as far as Kafka is concerned. Two committed outputs, both correct, both unwanted.

No transaction setting fixes this, because nothing went wrong inside the cycle. De-duplicating by a *business key* (across producers, across restarts, across a time window) is a separate problem that needs a stateful dedup operator backed by a state store. That pattern is its own page: [deduplication](https://www.conduktor.io/kafka-streams/deduplication).

The quick triage for cause 4 vs the others: consume the **input** topic with `--isolation-level read_committed` and look for the duplicate there. If it's already in the input, EOS on your app was never going to remove it: fix it upstream or add a dedup step. If the input is clean and only the output duplicates, you're looking at causes 1–3.

## The alarming errors that usually aren't a problem

EOS apps log a family of exceptions that look catastrophic and are, most of the time, the system working as designed:

- `InvalidProducerEpochException`
- `ProducerFencedException`
- `Producer attempted to produce with an old epoch`
- `the producer is fenced, indicating the task may be migrated out`

Here's the mechanism. Under `exactly_once_v2` the transactional producer lives per **stream thread**, not per task, and fencing works through consumer group metadata ([KIP-447](https://cwiki.apache.org/confluence/display/KAFKA/KIP-447%3A+Producer+scalability+for+exactly+once+semantics)): every transactional offset commit carries the thread's group member and generation ids. After a rebalance moves tasks to another instance, a stale instance trying to commit a lingering transaction fails that generation check (and the transaction coordinator has since bumped the producer epoch), so the broker **fences** it and rejects the write. That rejection surfaces as one of the exceptions above. This is the fencing mechanism that *prevents* the zombie old instance from writing duplicate output. It firing means EOS is protecting you.

Kafka Streams treats these as recoverable: the affected task is reset, the transaction is aborted, and processing resumes after the rebalance settles. You'll see the error, then normal operation. Tightened transaction handling in [KIP-890](https://cwiki.apache.org/confluence/display/KAFKA/KIP-890%3A+Transactions+Server-Side+Defense) (recent Kafka releases) makes this fencing more robust against the edge cases where a hung or partially-committed transaction used to slip through.

When is it a *real* problem? When it's not transient: a tight loop of fence-rebalance-fence that never settles. That's not a transaction bug; it's a rebalancing problem (sessions timing out, `max.poll.interval.ms` too low, pods churning) wearing an EOS error message. Diagnose it as a rebalance: check how often the group is rebalancing and why, not the transaction config. Repeated fencing is a symptom, not the disease.

## The upgrade hazard: when duplicates were real

Everything above is "EOS is fine, look elsewhere." There is one class of exception worth flagging honestly: **specific version upgrades have caused genuine duplicate processing**, not the illusory kind.

The notable case involved the migration from the original `exactly_once` (v1) to `exactly_once_v2`, and certain broker/client version combinations across the 2.x-to-3.x range. The migration to v2 is deliberately a **two-rolling-bounce** procedure precisely because v1 and v2 use incompatible transactional semantics, and during the transition a mishandled bounce could reprocess records. This isn't a reason to avoid v2 (v2 is correct and you want it), it's a reason to do the upgrade by the documented steps:

> **Version note.** `exactly_once_v2` needs Kafka brokers on 2.5+ and a Streams app on 2.6+ (the literal `exactly_once_v2` config value arrived in 3.0; 2.6–2.8 spelled it `exactly_once_beta`). EOS v1 was removed in Kafka 4.0. Migrating an app from v1 to v2 is a **two-phase rolling upgrade**: first bounce every instance onto a 3.0+ build still running `exactly_once`, then bounce again switching to `exactly_once_v2`. Skipping the intermediate bounce, or mixing v1 and v2 instances in one group for an extended window, is the path that has produced real duplicates. Check the upgrade guide for your exact source and target versions before you start.

If you saw a burst of duplicates *during* an upgrade and they stopped once every instance was on the same version, that was the migration, not your topology. If duplicates persist after the dust settles, you're back to causes 1–4.

## The broker config EOS depends on

EOS doesn't run on the app alone. Transactions live in an internal topic, `__transaction_state`, and that topic has durability requirements that bite when a broker is down.

The key setting is `transaction.state.log.min.isr` (broker config, default 2), alongside `transaction.state.log.replication.factor` (default 3). For transactions to commit, enough replicas of the transaction-state partitions must be in-sync. On a small or degraded cluster this is a real constraint:

```properties
# Broker side: transaction state durability
transaction.state.log.replication.factor=3
transaction.state.log.min.isr=2
```

The symptom when it's wrong is unmistakable: a Streams app starting up with EOS **gets stuck on `initTransactions()`** and never begins processing. `initTransactions()` blocks until it can find the transaction coordinator, timing out after `max.block.ms` (60 seconds by default). Streams catches the timeout and retries on every loop while the app sits in **REBALANCING**, never reaching RUNNING; after `task.timeout.ms` (5 minutes by default) the thread gives up with `Task <id> did not make progress within 300000 ms` and the app goes to ERROR. For those six minutes it looks like a hang, and a restart loop repeats it indefinitely. People read this as "Streams is stuck on startup" and look at their topology; the actual cause is a broker outage starving the transaction log.

The practical implications:

- **A single-broker or RF-1 dev cluster cannot run EOS** with default transaction-state settings: `__transaction_state` is never even created (internal topic creation fails until the cluster meets the replication factor), so `initTransactions` never finds a coordinator and times out. Either lower `transaction.state.log.replication.factor` for local dev or run a 3-broker setup.
- **Losing one broker on a 3-broker cluster** with `min.isr=2` can stall transactions if it drops the relevant partition's ISR to 1. EOS trades some availability for its guarantee.
- **A hung `initTransactions` on deploy** after a broker incident is a cluster-health problem, not an app problem. Check broker and `__transaction_state` health first.

**Why am I still seeing duplicates with exactly_once_v2 enabled?**

The flag is almost always working; exactly-once only makes the consume-process-produce cycle inside Kafka atomic. Duplicates come from outside that boundary: a downstream consumer reading uncommitted records, a non-idempotent external sink, non-deterministic processing logic, or genuinely distinct records an upstream producer published twice.

**Why do my downstream consumers see records that look like duplicates?**

A consumer's default `isolation.level` is `read_uncommitted`, so it reads records from aborted transactions that physically exist in the partition until compaction or retention removes them. Set `isolation.level=read_committed` on every reader of an EOS-produced topic, including any `kafka-console-consumer` you use to verify it.

**What causes ProducerFencedException or InvalidProducerEpochException?**

Under exactly_once_v2 the transactional producer is per stream thread, and fencing works through consumer group metadata: a stale instance that tries to commit a lingering transaction after a rebalance fails the group generation check and the broker fences it. This is the mechanism that prevents a zombie instance from writing duplicates: it firing means EOS is protecting you, and Kafka Streams treats it as recoverable.

**Why does my Kafka Streams app hang on startup with exactly-once?**

EOS stores transactions in the internal `__transaction_state` topic, and `initTransactions()` blocks until the transaction coordinator is found, timing out after `max.block.ms` (60s default). A single-broker or RF-1 dev cluster, or a broker outage that drops the ISR below `transaction.state.log.min.isr`, starves the transaction log: Streams retries while stuck in REBALANCING, then fails to ERROR after `task.timeout.ms` (5 minutes default), a cluster-health problem, not a topology one.

**Does exactly-once deduplicate records produced twice by an upstream producer?**

No. Two separate records with two offsets are distinct inputs as far as Kafka is concerned, and EOS faithfully processes each exactly once. De-duplicating by a business key needs a stateful dedup operator backed by a state store; triage it by consuming the input topic with `read_committed` to see if the duplicate is already there.

> **See it in practice with Conduktor**
> Most "EOS still duplicates" investigations are answered by looking at the cluster, not the code. [Conduktor Console](https://docs.conduktor.io/guide/manage-kafka/kafka-resources/topics) lets you consume a topic with `read_committed` to confirm whether duplicates are real or just aborted records a `read_uncommitted` reader is seeing, watch how often the app's consumer group is rebalancing (the source of repeated fencing errors), and check broker and internal-topic health when `initTransactions` hangs on startup: the signals that tell you whether the problem is your topology or the Kafka underneath it.

## Next steps

- [Kafka Streams exactly-once](https://www.conduktor.io/kafka-streams/exactly-once): what `exactly_once_v2` actually guarantees, and its scope
- [Deduplication patterns](https://www.conduktor.io/kafka-streams/deduplication): removing genuinely-duplicate records by business key
- [Dead letter queues in Kafka Streams](https://www.conduktor.io/kafka-streams/dead-letter-queue): handling the bad records that abort batches in the first place
