# Kafka Streams dead letter queue

*Stop poison pills from killing your stream.*

A single malformed record can take down a Kafka Streams application. Not degrade it: stop it. The default behavior on a record Streams can't deserialize is to throw, kill the `StreamThread`, and refuse to make progress until you intervene. Worse, because the bad record stays at the same offset, a naive restart reads it again and dies again: a poison-pill loop.

Kafka Streams has *three* separate error-handling hooks, and each one fires at a different stage of the read-process-write cycle. Knowing which handler catches what (and where the historical gap was) is the difference between a stream that quarantines bad data and one that pages you at 3am. This page covers all three, the manual dead-letter-queue (DLQ) pattern that fills the gap, and the native support that recently landed.

**What you'll learn:**
- The three error handlers and exactly where each one fires
- Why a poison pill causes a reprocessing loop, not a clean skip
- The manual DLQ pattern: route the bad record and its error metadata to a `.DLT` topic
- The native processing-exception handler and DLQ (KIP-1033 / KIP-1034), and the version to check

## Three handlers, three stages

A Streams record flows through three stages where things break: deserializing the input, running your processing logic, and producing the output. There is a distinct handler for each.

| Stage | Handler | Fires on | Default behavior |
|---|---|---|---|
| **Read** (deserialize) | `DeserializationExceptionHandler` | A record the configured serde can't deserialize (a poison pill) | `LogAndFailExceptionHandler`: log, then **kill the thread** |
| **Process** (your logic) | `ProcessingExceptionHandler` (KIP-1033, Kafka 3.9+) | An exception thrown inside `map`, `filter`, a processor, etc. | Historically none: the exception propagated up and killed the thread. Since 3.9: `LogAndFailProcessingExceptionHandler` |
| **Write** (produce) | `ProductionExceptionHandler` | The producer fails to write a record (e.g. `RecordTooLargeException`) | `DefaultProductionExceptionHandler`: fail the thread |

Mixing these up wastes hours. A `RecordTooLargeException` on output is *not* caught by the deserialization handler. A `NullPointerException` in your `mapValues` was, until recently, not caught by *any* DSL handler. Match the handler to the stage.

### Read-side: the deserialization handler

This is the one most people mean by "skip bad records." Kafka ships two implementations:

```properties
# Kill the thread on a record the serde can't decode (the DEFAULT)
deserialization.exception.handler=org.apache.kafka.streams.errors.LogAndFailExceptionHandler

# Log the record and keep going, survive the poison pill
deserialization.exception.handler=org.apache.kafka.streams.errors.LogAndContinueExceptionHandler
```

On Kafka 3.x the key is `default.deserialization.exception.handler`; 4.0 deprecated the `default.` prefix (KIP-1056). The old key still works on 4.x, but Kafka's own error messages now point you at the new one.

`LogAndContinueExceptionHandler` is the quick fix that keeps the app alive. But read the next section before you reach for it: "continue" silently drops the record, and a dropped record is gone unless you also route it somewhere.

> 🚫 *"A bad record just gets skipped, Streams handles it."*

By default it does **not** skip. The default deserialization handler is `LogAndFail`, which stops the thread. Switching to `LogAndContinue` makes it skip, but skip means *discard with a log line*, with no copy of the record kept. If that record mattered, you've lost it and you'll find out downstream.

## The poison-pill reprocessing loop

Here is the failure mode that surprises people running the default handler. A poison pill sits at, say, offset 4711. The `StreamThread` tries to deserialize it, throws, and dies, **before committing the offset**. Streams either replaces the thread or the app restarts, the consumer resumes from the last committed offset (4711), reads the same poison pill, and dies again. The app flaps in a loop, making zero progress, while the offset never advances past the bad record.

This is why `LogAndFail` plus an automatic restarter is the worst of both worlds: it neither stops cleanly nor recovers. You either skip the record (`LogAndContinue`, ideally routing it to a DLQ) or you manually advance the offset past it. There is no version where the loop resolves itself.

## The uncaught-exception handler, and its gotcha

Separate from the per-stage handlers, there is an application-level safety net for any exception that nothing else caught:

```java
streams.setUncaughtExceptionHandler(exception -> {
    // Choose what happens after an uncaught exception kills a StreamThread
    return StreamsUncaughtExceptionHandler.StreamThreadExceptionResponse.REPLACE_THREAD;
    // or .SHUTDOWN_CLIENT: shut down this instance
    // or .SHUTDOWN_APPLICATION: shut down every instance with the same application.id
});
```

- `REPLACE_THREAD`: spin up a fresh `StreamThread` and carry on. Fine for transient faults; a *loop* if the cause is a poison pill, because the new thread hits the same record.
- `SHUTDOWN_CLIENT`: stop this instance only.
- `SHUTDOWN_APPLICATION`: stop the whole application across all instances.

The gotcha: this handler runs **once the `StreamThread` has hit a fatal error and stopped processing**. It is invoked on the dying thread as it shuts down, before any replacement is spawned. It is a notification-and-decision hook, not a `try/catch` around your code. It cannot resume the failed record or repair in-flight state; processing on that thread is over and the decision is only "what next." Don't mistake it for in-line error handling inside your topology.

## The historical gap: no handler for business logic

Notice what the three stages *didn't* cover for most of Kafka Streams' life: an exception thrown by **your own processing logic**. A `NumberFormatException` parsing a field, a null dereference, a downstream call that throws: none of these were deserialization or production errors. There was no DSL-level hook for them. The exception propagated up, killed the thread, and you were back in restart-loop territory.

The workaround the community settled on is the **manual DLQ pattern**: wrap your risky logic in a `try/catch` inside the processor, and on failure, route the offending record (plus enough metadata to debug it) to a dedicated dead-letter topic, conventionally named `<source-topic>.DLT`.

## The manual DLQ pattern

The idea: never let a business-logic exception escape the processor. Catch it, emit the original record to a side topic with error context in the headers, and keep the main stream flowing. A `branch`/`split` keeps it readable in the DSL:

```java
// Parse risky input; on failure, tag the record instead of throwing
KStream<String, String> parsed = builder.stream("orders",
        Consumed.with(Serdes.String(), Serdes.String()));

Map<String, KStream<String, String>> branches = parsed
    .mapValues(DeadLetterExample::tryParseOrTag)   // returns OK:<json> or ERR:<message>::<raw>
    .split(Named.as("route-"))
    .branch((k, v) -> v.startsWith("ERR:"), Branched.as("dead"))
    .defaultBranch(Branched.as("ok"));

// Good records continue down the topology
branches.get("route-ok")
    .mapValues(v -> v.substring(3))   // strip the OK: tag
    .to("orders-validated", Produced.with(Serdes.String(), Serdes.String()));

// Bad records: enrich with error headers, send to the dead-letter topic
branches.get("route-dead")
    .process(() -> new Processor<String, String, String, String>() {
        private ProcessorContext<String, String> ctx;
        public void init(ProcessorContext<String, String> context) { this.ctx = context; }
        public void process(Record<String, String> rec) {
            String[] parts = rec.value().substring(4).split("::", 2);  // message :: raw
            Record<String, String> dead = new Record<>(rec.key(), parts.length > 1 ? parts[1] : "", rec.timestamp());
            dead.headers()
                .add("dlq-error", parts[0].getBytes(StandardCharsets.UTF_8))
                .add("dlq-source-topic", "orders".getBytes(StandardCharsets.UTF_8))
                .add("dlq-ts", Long.toString(System.currentTimeMillis()).getBytes(StandardCharsets.UTF_8));
            ctx.forward(dead);
        }
    })
    .to("orders.DLT", Produced.with(Serdes.String(), Serdes.String()));
```

The headers are the point. A bare copy of the bad value in a `.DLT` topic tells you nothing six weeks later. Record the exception message, the source topic and partition/offset if you have them, and a timestamp, so the dead-letter topic is something you can triage and replay, not just a graveyard.

For deserialization failures specifically (which happen *before* your code sees the record), the same routing belongs in a custom `DeserializationExceptionHandler`: it receives the original `ConsumerRecord` with the raw bytes, so it can ship them to the DLQ before telling Streams to keep going. Up to Kafka 4.1 that means producing inside `handle(...)` and returning `CONTINUE`; since 4.2 the method is `handleError(...)` returning `Response.resume()`, and the `Response` itself can carry dead-letter records for Streams to produce for you (KIP-1034). The `handle(...)` variants are deprecated in 4.3 but still functional.

### Retry tier vs dead-letter tier

Not every failure is a poison pill. A transient error (a downstream timeout, a momentary dependency blip) usually succeeds on a retry; a malformed record never will. Routing both to the same place sends recoverable records to manual triage and dilutes the DLT's signal.

A common refinement splits them into two topics:

- **`<topic>.retry`** for transient failures. A small daemon consumes it, waits a backoff (1s, 2s, 4s, 8s…), and republishes to the original topic, carrying an attempt count in a header and giving up to the DLT after N attempts. Most failures self-heal with nobody looking.
- **`<topic>.DLT`** for terminal/poison records. Quarantined for human inspection and explicit reinjection, never auto-retried.

This retry-daemon-plus-DLT shape is exactly the one that shows up in [multi-agent AI systems on Kafka](https://www.conduktor.io/kafka-streams/ai-agents), where it matters even more: a non-deterministic worker (an LLM) can fail a record once and succeed on replay, so bounded automatic retries recover far more than they would with deterministic code.

## Native support: KIP-1033 and KIP-1034

The manual pattern was load-bearing for years because the framework had no built-in answer for processing errors. That changed recently.

> **Version-gate this.** Native processing-error handling arrived through **KIP-1033** (a `ProcessingExceptionHandler` for exceptions thrown in user processing code), shipped in **Kafka 3.9**; native dead-letter-queue support arrived through **KIP-1034**, shipped in **Kafka 4.2**. Confirm your exact Kafka and Kafka Streams version before depending on either. On older versions the manual pattern above is still the way.

Where it's available, **KIP-1033** gives the missing third handler: a `ProcessingExceptionHandler` invoked when your own logic throws, with the same `FAIL` / `CONTINUE` contract as the deserialization handler, so a business-logic exception no longer has to mean a dead thread. **KIP-1034** builds on the handler contracts so a failed record can be routed to a dead-letter topic through configuration (`errors.dead.letter.queue.topic.name`, unset by default) and the handler return value, rather than the hand-rolled `branch`-and-`process` plumbing. The manual pattern still works and still gives you the most control over headers and routing; the native path removes boilerplate once you're on a version that has it.

## Putting it together

A robust Streams app usually layers these:

- **Read**: a `DeserializationExceptionHandler` that routes poison pills to a DLQ, not bare `LogAndContinue` (so nothing is silently lost).
- **Process**: `try/catch` in your processors routing to a `.DLT` topic with error headers, or the KIP-1033 `ProcessingExceptionHandler` if your version has it.
- **Write**: a `ProductionExceptionHandler` for produce-side failures you've decided are non-fatal.
- **Backstop**: a `StreamsUncaughtExceptionHandler` so a genuinely unexpected exception triggers a deliberate `REPLACE_THREAD` or shutdown, not an accidental flap.

And whatever you skip, route it somewhere. A skipped record with no DLQ is a data-loss bug wearing a "handled" label.

**How do I handle a bad record or poison pill in Kafka Streams?**

The default deserialization handler is `LogAndFailExceptionHandler`, which kills the thread on an undecodable record. Switch to `LogAndContinue` to survive it, but pair it with routing to a dead-letter topic: "continue" silently discards the record with only a log line, so anything that mattered is lost.

**What is the difference between LogAndContinue and LogAndFail in Kafka Streams?**

`LogAndFailExceptionHandler` (the default) logs the bad record and stops the `StreamThread`; `LogAndContinueExceptionHandler` logs it and skips it so the app keeps running. Skipping discards the record, so route it to a DLQ if you can't afford to lose it.

**How do I implement a dead letter queue in Kafka Streams?**

Wrap risky logic in a `try/catch` inside a processor and, on failure, forward the offending record to a dedicated topic conventionally named `<source-topic>.DLT`, attaching the exception message, source topic/partition/offset, and a timestamp as headers. The headers are the point: a bare copy of the value tells you nothing weeks later when you try to triage or replay it.

**What does the StreamsUncaughtExceptionHandler do and what are REPLACE_THREAD versus SHUTDOWN?**

It's an application-level safety net, invoked on the dying `StreamThread` once a fatal error has stopped processing, that decides what happens next: `REPLACE_THREAD` starts a fresh thread, `SHUTDOWN_CLIENT` stops the instance, and `SHUTDOWN_APPLICATION` stops every instance sharing the `application.id`. It is a notification-and-decision hook, not a `try/catch` around your code, and `REPLACE_THREAD` becomes a loop if the cause is a poison pill.

**What is the KIP-1033 processing exception handler in Kafka Streams?**

KIP-1033 added a `ProcessingExceptionHandler` for exceptions thrown in your own processing code, with the same `FAIL`/`CONTINUE` contract as the deserialization handler, and shipped in Kafka 3.9; native DLQ support followed in KIP-1034, shipped in Kafka 4.2. Confirm your exact version before depending on either. On older versions the manual `branch`-and-`process` DLQ pattern is still the way.

> **See it in practice with Conduktor**
> A dead-letter topic is only useful if you can read it. [Conduktor Console](https://docs.conduktor.io/guide/manage-kafka/kafka-resources/topics) lets you browse your `.DLT` topics, read the error headers you attached to each bad record, and watch whether the count is growing, so quarantined records get triaged and replayed instead of piling up unseen. It also surfaces the consumer group lag that tells you a poison-pill loop is stalling the app in the first place.

## Next steps

- [Kafka Streams serde errors](https://www.conduktor.io/kafka-streams/serdes): the deserialization failures that feed a DLQ
- [Processor API & punctuators](https://www.conduktor.io/kafka-streams/processor-api): the low-level hooks the manual DLQ pattern uses
- [Deduplication in Kafka Streams](https://www.conduktor.io/kafka-streams/deduplication): handling the duplicate records a reprocessing loop can create
