Kafka Streams dead letter queue
Build a Kafka Streams dead letter queue: the three error handlers, the poison-pill loop, the manual DLQ pattern, and native KIP-1033/1034 support.
Stop poison pills from killing your stream.
A single malformed record can take down a Kafka Streams application. Not degrade it: stop it. The default behavior on a record Streams can't deserialize is to throw, kill the StreamThread, and refuse to make progress until you intervene. Worse, because the bad record stays at the same offset, a naive restart reads it again and dies again: a poison-pill loop.
Kafka Streams has three separate error-handling hooks, and each one fires at a different stage of the read-process-write cycle. Knowing which handler catches what (and where the historical gap was) is the difference between a stream that quarantines bad data and one that pages you at 3am. This page covers all three, the manual dead-letter-queue (DLQ) pattern that fills the gap, and the native support that recently landed.
What you'll learn:
- The three error handlers and exactly where each one fires
- Why a poison pill causes a reprocessing loop, not a clean skip
- The manual DLQ pattern: route the bad record and its error metadata to a .DLT topic
- The native processing-exception handler and DLQ (KIP-1033 / KIP-1034), and the version to check
Three handlers, three stages
A Streams record flows through three stages where things break: deserializing the input, running your processing logic, and producing the output. There is a distinct handler for each.
| Stage | Handler | Fires on | Default behavior |
|---|---|---|---|
| Read (deserialize) | DeserializationExceptionHandler | A record the configured serde can't deserialize (a poison pill) | LogAndFailExceptionHandler: log, then kill the thread |
| Process (your logic) | ProcessingExceptionHandler (KIP-1033, Kafka 3.9+) | An exception thrown inside map, filter, a processor, etc. | Historically none: the exception propagated up and killed the thread. Since 3.9: LogAndFailProcessingExceptionHandler |
| Write (produce) | ProductionExceptionHandler | The producer fails to write a record (e.g. RecordTooLargeException) | DefaultProductionExceptionHandler: fail the thread |
RecordTooLargeException on output is not caught by the deserialization handler. A NullPointerException in your mapValues was, until recently, not caught by any DSL handler. Match the handler to the stage.
Read-side: the deserialization handler
This is the one most people mean by "skip bad records." Kafka ships two implementations:
# Kill the thread on a record the serde can't decode (the DEFAULT)
deserialization.exception.handler=org.apache.kafka.streams.errors.LogAndFailExceptionHandler
# Log the record and keep going, survive the poison pill
deserialization.exception.handler=org.apache.kafka.streams.errors.LogAndContinueExceptionHandler
On Kafka 3.x the key is default.deserialization.exception.handler; 4.0 deprecated the "default." prefix (KIP-1056). The old key still works on 4.x, but Kafka's own error messages now point you at the new one.
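If you build the Streams configuration in code, the key choice described above can be made explicit. The following is a minimal sketch using only java.util.Properties; the class name, the method name, and the kafkaMajorVersion parameter are illustrative assumptions, and nothing here detects the version for you.

```java
import java.util.Properties;

/** Sketch: pick the deserialization-handler config key by Kafka Streams major version. */
final class DeserializationHandlerConfig {
    // Both spellings come from the text above: 3.x uses the "default." prefix, 4.x drops it.
    static final String KEY_3X = "default.deserialization.exception.handler";
    static final String KEY_4X = "deserialization.exception.handler";
    static final String LOG_AND_CONTINUE =
            "org.apache.kafka.streams.errors.LogAndContinueExceptionHandler";

    /** kafkaMajorVersion is a hypothetical parameter supplied by the caller. */
    static Properties logAndContinue(int kafkaMajorVersion) {
        Properties props = new Properties();
        String key = kafkaMajorVersion >= 4 ? KEY_4X : KEY_3X;
        props.setProperty(key, LOG_AND_CONTINUE);
        return props;
    }
}
```

The returned Properties would be merged into whatever configuration you already pass to KafkaStreams.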
LogAndContinueExceptionHandler is the quick fix that keeps the app alive. But read the next section before you reach for it: "continue" silently drops the record, and a dropped record is gone unless you also route it somewhere.
🚫 "A bad record just gets skipped, Streams handles it."
By default it does not skip. The default deserialization handler is LogAndFail, which stops the thread. Switching to LogAndContinue makes it skip, but skip means discard with a log line, with no copy of the record kept. If that record mattered, you've lost it and you'll find out downstream.
The poison-pill reprocessing loop
Here is the failure mode that surprises people running the default handler. A poison pill sits at, say, offset 4711. The StreamThread tries to deserialize it, throws, and dies, before committing the offset. Streams either replaces the thread or the app restarts, the consumer resumes from the last committed offset (4711), reads the same poison pill, and dies again. The app flaps in a loop, making zero progress, while the offset never advances past the bad record.
This is why LogAndFail plus an automatic restarter is the worst of both worlds: it neither stops cleanly nor recovers. You either skip the record (LogAndContinue, ideally routing it to a DLQ) or you manually advance the offset past it. There is no version where the loop resolves itself.
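The loop is easy to see in a toy model. The sketch below involves no Kafka at all: a list stands in for a partition, the string "POISON" stands in for an undecodable record, and the committed offset advances only after a record is handled. All names are hypothetical, and committing after every record is a simplification.

```java
import java.util.List;

/** Toy model of the poison-pill loop; a List stands in for a partition. */
final class PoisonPillLoop {
    /**
     * Runs `restarts` start-up attempts and returns the committed offset afterwards.
     * With skipBadRecords=false the "thread" dies on the poison pill before committing it.
     */
    static int committedOffsetAfter(List<String> partition, int restarts, boolean skipBadRecords) {
        int committed = 0;
        for (int attempt = 0; attempt < restarts; attempt++) {
            // Each attempt resumes from the last committed offset.
            for (int offset = committed; offset < partition.size(); offset++) {
                boolean poison = "POISON".equals(partition.get(offset));
                if (poison && !skipBadRecords) {
                    break; // fail: nothing is committed for this offset
                }
                // Either a good record was processed or a bad one was skipped.
                committed = offset + 1;
            }
        }
        return committed;
    }
}
```

With the partition ["a", "b", "POISON", "c"], failing on error leaves the committed offset at 2 no matter how many restarts you allow, while skipping reaches 4 on the first attempt.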
The uncaught-exception handler, and its gotcha
Separate from the per-stage handlers, there is an application-level safety net for any exception that nothing else caught:
streams.setUncaughtExceptionHandler(exception -> {
    // Choose what happens after an uncaught exception kills a StreamThread
    return StreamsUncaughtExceptionHandler.StreamThreadExceptionResponse.REPLACE_THREAD;
    // or .SHUTDOWN_CLIENT: shut down this instance
    // or .SHUTDOWN_APPLICATION: shut down every instance with the same application.id
});
- REPLACE_THREAD: spin up a fresh StreamThread and carry on. Fine for transient faults; a loop if the cause is a poison pill, because the new thread hits the same record.
- SHUTDOWN_CLIENT: stop this instance only.
- SHUTDOWN_APPLICATION: stop the whole application across all instances.
The gotcha: this handler runs once the StreamThread has hit a fatal error and stopped processing. It is invoked on the dying thread as it shuts down, before any replacement is spawned. It is a notification-and-decision hook, not a try/catch around your code. It cannot resume the failed record or repair in-flight state; processing on that thread is over and the decision is only "what next." Don't mistake it for in-line error handling inside your topology.
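Because the hook only decides "what next," one way to keep REPLACE_THREAD from turning into a flap is to count recent failures and stop replacing once they pile up. The sketch below is one possible policy, not something Kafka ships: the Response enum is a stand-in for StreamThreadExceptionResponse so the example compiles without Kafka, and the threshold and window are hypothetical values you would tune.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Sketch of a "replace a few times, then stop" policy for the uncaught-exception decision. */
final class ThreadFailurePolicy {
    /** Stand-in for Kafka's StreamThreadExceptionResponse (assumption for this sketch). */
    enum Response { REPLACE_THREAD, SHUTDOWN_CLIENT }

    private final int maxFailures;    // hypothetical: failures tolerated inside the window
    private final long windowMillis;  // hypothetical: how far back failures are counted
    private final Deque<Long> recentFailures = new ArrayDeque<>();

    ThreadFailurePolicy(int maxFailures, long windowMillis) {
        this.maxFailures = maxFailures;
        this.windowMillis = windowMillis;
    }

    /** Records a failure at nowMillis and returns the decision for it. */
    synchronized Response onFailure(long nowMillis) {
        // Forget failures that have aged out of the window.
        while (!recentFailures.isEmpty() && nowMillis - recentFailures.peekFirst() > windowMillis) {
            recentFailures.removeFirst();
        }
        recentFailures.addLast(nowMillis);
        return recentFailures.size() > maxFailures
                ? Response.SHUTDOWN_CLIENT
                : Response.REPLACE_THREAD;
    }
}
```

Inside the real handler you would call onFailure(System.currentTimeMillis()) and map the result onto the corresponding Kafka response constant.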
The historical gap: no handler for business logic
Notice what the three stages didn't cover for most of Kafka Streams' life: an exception thrown by your own processing logic. A NumberFormatException parsing a field, a null dereference, a downstream call that throws: none of these were deserialization or production errors. There was no DSL-level hook for them. The exception propagated up, killed the thread, and you were back in restart-loop territory.
The workaround the community settled on is the manual DLQ pattern: wrap your risky logic in a try/catch inside the processor, and on failure, route the offending record (plus enough metadata to debug it) to a dedicated dead-letter topic, conventionally named with a .DLT suffix (for example, orders.DLT).
The manual DLQ pattern
The idea: never let a business-logic exception escape the processor. Catch it, emit the original record to a side topic with error context in the headers, and keep the main stream flowing. A branch/split keeps it readable in the DSL:
// Parse risky input; on failure, tag the record instead of throwing
KStream<String, String> parsed = builder.stream("orders",
    Consumed.with(Serdes.String(), Serdes.String()));
Map<String, KStream<String, String>> branches = parsed
    .mapValues(DeadLetterExample::tryParseOrTag) // returns OK:<json> or ERR:<message>::<raw>
    .split(Named.as("route-"))
    .branch((k, v) -> v.startsWith("ERR:"), Branched.as("dead"))
    .defaultBranch(Branched.as("ok"));
// Good records continue down the topology
branches.get("route-ok")
    .mapValues(v -> v.substring(3)) // strip the OK: tag
    .to("orders-validated", Produced.with(Serdes.String(), Serdes.String()));
// Bad records: enrich with error headers, send to the dead-letter topic
branches.get("route-dead")
    .process(() -> new Processor<String, String, String, String>() {
        private ProcessorContext<String, String> ctx;
        public void init(ProcessorContext<String, String> context) { this.ctx = context; }
        public void process(Record<String, String> rec) {
            String[] parts = rec.value().substring(4).split("::", 2); // message :: raw
            Record<String, String> dead = new Record<>(rec.key(), parts.length > 1 ? parts[1] : "", rec.timestamp());
            dead.headers()
                .add("dlq-error", parts[0].getBytes(StandardCharsets.UTF_8))
                .add("dlq-source-topic", "orders".getBytes(StandardCharsets.UTF_8))
                .add("dlq-ts", Long.toString(System.currentTimeMillis()).getBytes(StandardCharsets.UTF_8));
            ctx.forward(dead);
        }
    })
    .to("orders.DLT", Produced.with(Serdes.String(), Serdes.String()));
The headers are the point. A bare copy of the bad value in a .DLT topic tells you nothing six weeks later. Record the exception message, the source topic and partition/offset if you have them, and a timestamp, so the dead-letter topic is something you can triage and replay, not just a graveyard.
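The topology references DeadLetterExample::tryParseOrTag without showing its body. One possible shape is sketched below, using only the JDK. The validation rule (a non-blank string that looks like a JSON object) is a placeholder assumption for whatever parsing your application really does; the OK: and ERR:<message>::<raw> tags follow the comment in the topology.

```java
/** Hypothetical body for the tryParseOrTag helper referenced in the topology above. */
final class DeadLetterExample {
    /** Returns "OK:<raw>" when validation passes, otherwise "ERR:<message>::<raw>". */
    static String tryParseOrTag(String raw) {
        try {
            validate(raw);
            return "OK:" + raw;
        } catch (RuntimeException e) {
            // Strip colons from the message so the downstream split("::", 2)
            // always finds the boundary between message and raw value.
            String message = String.valueOf(e.getMessage()).replace(':', ' ');
            return "ERR:" + message + "::" + (raw == null ? "" : raw);
        }
    }

    /** Placeholder check (assumption): real code would parse the payload here. */
    private static void validate(String raw) {
        if (raw == null || raw.isBlank()) {
            throw new IllegalArgumentException("empty value");
        }
        String trimmed = raw.trim();
        if (!(trimmed.startsWith("{") && trimmed.endsWith("}"))) {
            throw new IllegalArgumentException("not a JSON object");
        }
    }
}
```

Catching the exception here, rather than letting it escape mapValues, is what keeps the thread alive; the tag is only a carrier to get the error message to the branch that writes the headers.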
For deserialization failures specifically (which happen before your code sees the record), the same routing belongs in a custom DeserializationExceptionHandler: it receives the original ConsumerRecord with the raw bytes, so it can ship them to the DLQ before telling Streams to keep going. Up to Kafka 4.1 that means producing inside handle(...) and returning CONTINUE; since 4.2 the method is handleError(...) returning Response.resume(), and the Response itself can carry dead-letter records for Streams to produce for you (KIP-1034). The handle(...) variants are deprecated in 4.3 but still functional.
Retry tier vs dead-letter tier
Not every failure is a poison pill. A transient error (a downstream timeout, a momentary dependency blip) usually succeeds on a retry; a malformed record never will. Routing both to the same place sends recoverable records to manual triage and dilutes the DLT's signal.
A common refinement splits them into two topics:
- A .retry topic for transient failures. A small daemon consumes it, waits a backoff (1s, 2s, 4s, 8s…), and republishes to the original topic, carrying an attempt count in a header and giving up to the DLT after N attempts. Most failures self-heal with nobody looking.
- A .DLT topic for terminal/poison records. Quarantined for human inspection and explicit reinjection, never auto-retried.
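The two decisions the retry daemon makes, where a failed record goes next and how long to wait, fit in a few lines. This is a sketch under assumptions: the class and method names are hypothetical, whether a failure counts as transient is something your own code must classify, and the attempt count is taken to be 1-based.

```java
/** Sketch of the retry-daemon decisions: destination tier and backoff delay. */
final class RetryPolicy {
    enum Destination { RETRY_TOPIC, DEAD_LETTER_TOPIC }

    private final int maxAttempts;       // N in the text; the value is yours to choose
    private final long baseDelayMillis;  // 1000 gives the 1s, 2s, 4s, 8s progression

    RetryPolicy(int maxAttempts, long baseDelayMillis) {
        this.maxAttempts = maxAttempts;
        this.baseDelayMillis = baseDelayMillis;
    }

    /** attempt is 1-based: how many times this record has failed so far. */
    Destination route(int attempt, boolean transientFailure) {
        if (!transientFailure || attempt >= maxAttempts) {
            return Destination.DEAD_LETTER_TOPIC; // terminal, or retries exhausted
        }
        return Destination.RETRY_TOPIC;
    }

    /** Exponential backoff: base * 2^(attempt - 1), with the exponent capped at 20. */
    long backoffMillis(int attempt) {
        int exponent = Math.min(Math.max(attempt - 1, 0), 20);
        return baseDelayMillis << exponent;
    }
}
```

The attempt number would travel with the record in a header, as described above, so the daemon stays stateless between republishes.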
This retry-daemon-plus-DLT shape is exactly the one that shows up in multi-agent AI systems on Kafka, where it matters even more: a non-deterministic worker (an LLM) can fail a record once and succeed on replay, so bounded automatic retries recover far more than they would with deterministic code.
Native support: KIP-1033 and KIP-1034
The manual pattern was load-bearing for years because the framework had no built-in answer for processing errors. That changed recently.
Version-gate this. Native processing-error handling arrived through KIP-1033 (a ProcessingExceptionHandler for exceptions thrown in user processing code), shipped in Kafka 3.9; native dead-letter-queue support arrived through KIP-1034, shipped in Kafka 4.2. Confirm your exact Kafka and Kafka Streams version before depending on either. On older versions the manual pattern above is still the way.
Where it's available, KIP-1033 gives the missing third handler: a ProcessingExceptionHandler invoked when your own logic throws, with the same FAIL / CONTINUE contract as the deserialization handler, so a business-logic exception no longer has to mean a dead thread. KIP-1034 builds on the handler contracts so a failed record can be routed to a dead-letter topic through configuration (errors.dead.letter.queue.topic.name, unset by default) and the handler return value, rather than the hand-rolled branch-and-process plumbing. The manual pattern still works and still gives you the most control over headers and routing; the native path removes boilerplate once you're on a version that has it.
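As a rough illustration of the configuration-driven path, the sketch below assembles the two settings with plain java.util.Properties. The DLQ key is the one named above. The processing.exception.handler key and the LogAndContinueProcessingExceptionHandler class name reflect my understanding of KIP-1033 and should be checked against the StreamsConfig documentation for your version; the helper class and method are hypothetical.

```java
import java.util.Properties;

/** Sketch: settings for the native processing handler and DLQ topic. Verify against your version. */
final class NativeDlqConfig {
    static Properties forDeadLetterTopic(String dlqTopic) {
        Properties props = new Properties();
        // KIP-1033 (Kafka 3.9+): continue instead of failing when processing code throws.
        // Key and class name are assumptions to confirm in the StreamsConfig docs.
        props.setProperty("processing.exception.handler",
                "org.apache.kafka.streams.errors.LogAndContinueProcessingExceptionHandler");
        // KIP-1034: dead-letter topic name; unset by default, per the text above.
        props.setProperty("errors.dead.letter.queue.topic.name", dlqTopic);
        return props;
    }
}
```

Calling NativeDlqConfig.forDeadLetterTopic("orders.DLT") yields properties you could merge into an existing Streams configuration; on versions without these features the settings would not give you a DLQ, so the manual pattern remains the fallback.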
Putting it together
A robust Streams app usually layers these:
- Read: a DeserializationExceptionHandler that routes poison pills to a DLQ, not bare LogAndContinue (so nothing is silently lost).
- Process: try/catch in your processors routing to a .DLT topic with error headers, or the KIP-1033 ProcessingExceptionHandler if your version has it.
- Write: a ProductionExceptionHandler for produce-side failures you've decided are non-fatal.
- Backstop: a StreamsUncaughtExceptionHandler so a genuinely unexpected exception triggers a deliberate REPLACE_THREAD or shutdown, not an accidental flap.
And whatever you skip, route it somewhere. A skipped record with no DLQ is a data-loss bug wearing a "handled" label.
How do I handle a bad record or poison pill in Kafka Streams?
The default deserialization handler is LogAndFailExceptionHandler, which kills the thread on an undecodable record. Switch to LogAndContinue to survive it, but pair it with routing to a dead-letter topic: "continue" silently discards the record with only a log line, so anything that mattered is lost.
What is the difference between LogAndContinue and LogAndFail in Kafka Streams?
LogAndFailExceptionHandler (the default) logs the bad record and stops the StreamThread; LogAndContinueExceptionHandler logs it and skips it so the app keeps running. Skipping discards the record, so route it to a DLQ if you can't afford to lose it.
How do I implement a dead letter queue in Kafka Streams?
Wrap risky logic in a try/catch inside a processor and, on failure, forward the offending record to a dedicated topic conventionally named with a .DLT suffix (for example, orders.DLT), attaching the exception message, source topic/partition/offset, and a timestamp as headers. The headers are the point: a bare copy of the value tells you nothing weeks later when you try to triage or replay it.
What does the StreamsUncaughtExceptionHandler do and what are REPLACE_THREAD versus SHUTDOWN?
It's an application-level safety net, invoked on the dying StreamThread once a fatal error has stopped processing, that decides what happens next: REPLACE_THREAD starts a fresh thread, SHUTDOWN_CLIENT stops the instance, and SHUTDOWN_APPLICATION stops every instance sharing the application.id. It is a notification-and-decision hook, not a try/catch around your code, and REPLACE_THREAD becomes a loop if the cause is a poison pill.
What is the KIP-1033 processing exception handler in Kafka Streams?
KIP-1033 added a ProcessingExceptionHandler for exceptions thrown in your own processing code, with the same FAIL/CONTINUE contract as the deserialization handler, and shipped in Kafka 3.9; native DLQ support followed in KIP-1034, shipped in Kafka 4.2. Confirm your exact version before depending on either. On older versions the manual branch-and-process DLQ pattern is still the way.
See it in practice with Conduktor
A dead-letter topic is only useful if you can read it. Conduktor Console lets you browse your .DLT topics, read the error headers you attached to each bad record, and watch whether the count is growing, so quarantined records get triaged and replayed instead of piling up unseen. It also surfaces the consumer group lag that tells you a poison-pill loop is stalling the app in the first place.
Next steps
- Kafka Streams serde errors: the deserialization failures that feed a DLQ
- Processor API & punctuators: the low-level hooks the manual DLQ pattern uses
- Deduplication in Kafka Streams: handling the duplicate records a reprocessing loop can create