Chaos Engineering for Kafka: Test Recovery Before You Need It

Five tests that validate your Kafka disaster recovery plan before a real outage does. A practical guide for platform engineering teams.

Nicole Bouchard · March 4, 2026

Chaos engineering for Kafka is a structured way to answer one question: will our DR plan actually work when we need it? Not in theory, not on paper, but in practice with real traffic patterns and real failure modes.

In the complete strategy post, we made the case that a disaster recovery plan that hasn't been tested isn't a plan; it's a hypothesis. We covered how to prepare across six technical areas and how to execute a failover in minutes, but the step between preparation and execution, validation, is the one most teams skip entirely.

Think back to the PagerDuty outage: nine hours of downtime, 95% of events rejected, and the incident management platform that couldn't manage its own incident. The secondary infrastructure existed, so it wasn't a replication failure; it was a validation failure. The plan had never been proven under pressure.

This post covers the fundamentals of chaos testing applied to Kafka DR validation, from first principles to your first experiment.

Series context: This builds on the three-phase framework from The Complete Strategy Beyond Replication. Start there if you haven't read it.


Why untested DR plans fail

The DR strategy guide called out testing and continuous validation as "the most commonly skipped step" in DR readiness. There are specific reasons why skipping it is so dangerous for Kafka infrastructure.

Configuration drift is silent. Topic configs, ACLs, schemas, and quotas drift between primary and secondary clusters over weeks and months. We called this out directly: configuration drift between clusters is one of the most common silent DR failures. Without testing, you discover the drift during the real failover alongside everything else that's going wrong.
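Drift detection doesn't have to wait for a full chaos experiment. A minimal sketch, assuming topic configs have already been fetched from both clusters (in practice via an admin client; the inline dicts and topic name here are hypothetical):

```python
# Minimal sketch of config-drift detection between primary and secondary
# clusters. The config dicts would come from an admin client in practice;
# the values below are hypothetical.

def find_drift(primary: dict, secondary: dict) -> dict:
    """Return {key: (primary_value, secondary_value)} for mismatched settings."""
    keys = primary.keys() | secondary.keys()
    return {
        k: (primary.get(k), secondary.get(k))
        for k in keys
        if primary.get(k) != secondary.get(k)
    }

primary_orders = {"retention.ms": "604800000", "min.insync.replicas": "2"}
secondary_orders = {"retention.ms": "86400000", "min.insync.replicas": "2"}

drift = find_drift(primary_orders, secondary_orders)
print(drift)  # {'retention.ms': ('604800000', '86400000')}
```

Run a check like this on a schedule and the drift surfaces in a dashboard instead of in the middle of a failover.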

Runbooks decay. A runbook written by someone who left six months ago is a liability, not an asset. Chaos testing is how you prove the runbook still matches reality. It either works or it doesn't, and you'd rather find out during a controlled experiment than at 3 AM.

Assumptions compound. Teams assume producer retry logic works under real broker failure conditions. They assume consumer group rebalancing completes within the RTO window. They assume monitoring will fire when the infrastructure it monitors is the thing that just failed. Each untested assumption multiplies recovery time.

The RTO gap is invisible until you measure it. As we put it in the strategy post: "If your RTO target is 15 minutes but your drill takes 90, you don't have a disagreement, you have a gap." Chaos testing is how you measure that gap before a real incident does it for you.

From the DR Checklist: "Pre-stage and verify DR configurations in both regions. Topic configs, ACLs, schemas, certificates, quotas should all be in place and verified before you need it." Chaos testing is how you verify.


The experiment cycle (and why Kafka makes it harder)

Chaos engineering is deliberately injecting controlled failures to discover how your system actually behaves, not how you think it behaves. It's not random destruction; it's hypothesis-driven experimentation with a repeatable four-step cycle.

  1. Hypothesize. State what you expect to happen: "If 20% of produce requests fail, our producer retries successfully and no messages are lost."
  2. Inject. Introduce the failure condition in a controlled, scoped way.
  3. Observe. Watch what actually happens: consumer lag, error rates, rebalance timing, message counts.
  4. Learn. Compare observed behavior to the hypothesis. Document the delta and fix what broke.
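The four steps above can be sketched as a small driver loop. This is a hedged illustration, not a real harness: the `inject` and `observe` callables are placeholders for toggling a Gateway interceptor and querying your metrics backend, and the metric names are invented.

```python
# Sketch of the four-step experiment cycle. inject/observe are placeholders;
# in practice they would enable an interceptor and poll real metrics.

def run_experiment(hypothesis: str, inject, observe, tolerances: dict) -> dict:
    print(f"Hypothesis: {hypothesis}")      # 1. state the expectation
    inject()                                # 2. introduce the scoped failure
    observed = observe()                    # 3. watch what actually happens
    return {                                # 4. document the delta
        metric: (observed[metric], limit)
        for metric, limit in tolerances.items()
        if observed[metric] > limit
    }

# Hypothetical run: pretend observation showed low lag and no loss.
deltas = run_experiment(
    "10% produce errors cause no message loss",
    inject=lambda: None,
    observe=lambda: {"consumer_lag_s": 12, "messages_lost": 0},
    tolerances={"consumer_lag_s": 300, "messages_lost": 0},
)
print(deltas)  # {} -> hypothesis held
```

An empty delta means the hypothesis held; anything else is the list of things to fix before the next run.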

Kafka makes this harder than stateless services. A web service either returns a 200 or a 500. A Kafka consumer might silently fall behind, double-process messages, or get stuck in an infinite rebalance loop, and none of those produce an immediate, obvious error. Stateful partitions, consumer group coordination, offset management, and exactly-once semantics create failure modes that simply don't exist in request/response systems.

Traditional chaos tools don't fit well. Killing broker pods or injecting network partitions requires infrastructure access, risks actual data loss, and is hard to scope narrowly. Protocol-aware proxies solve this by sitting between Kafka clients and brokers, intercepting the Kafka protocol and injecting failures at the application layer while brokers remain untouched.

This is the approach Conduktor Gateway takes. Its chaos interceptors simulate broker errors, latency, leader elections, and message corruption without SSH access, Kubernetes privileges, or risk to production data. Experiments can be scoped to specific virtual clusters, topics, or consumer groups, which means you can target a single workload without affecting everything else.


Five failure scenarios every DR plan should test

These five scenarios use Gateway's chaos interceptors to test the technical areas from our DR framework. Each one tells you what to inject, what to watch, and what it proves about your DR readiness.

1. Simulate broker unavailability to validate client switching

Inject UNKNOWN_SERVER_ERROR and CORRUPT_MESSAGE responses via the SimulateBrokenBrokersPlugin interceptor at 10-25% of requests, targeting both produce and fetch paths. Watch producer retry rates, consumer error handling, and whether clients reconnect cleanly or get stuck. This proves whether your client topology can handle the transition period during a real failover. As we covered in the strategy post, "there is no mechanism built into Kafka to redirect running clients during failover."
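Before injecting errors, it's worth reviewing the producer settings the experiment will exercise. A sketch with illustrative starting values, not recommendations; tune them against your own RTO:

```python
# Illustrative producer settings to review before running scenario 1.
# Property names are standard Kafka producer configs; values are examples.
producer_config = {
    "enable.idempotence": True,     # broker-side dedupe on retry
    "acks": "all",                  # don't accept writes a failover could lose
    "retries": 2147483647,          # bound retries by time, not by count
    "delivery.timeout.ms": 300000,  # must exceed detection + decision window
    "request.timeout.ms": 30000,
    "retry.backoff.ms": 500,        # avoid retry storms at 10-25% error rates
}

# Sanity check: the delivery budget must cover at least one full request
# timeout plus a backoff, or a retry can never actually happen.
assert producer_config["delivery.timeout.ms"] >= (
    producer_config["request.timeout.ms"] + producer_config["retry.backoff.ms"]
)
```

If the experiment shows messages being dropped, this config is the first place to look.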

2. Trigger leader election storms to validate observability

Inject LEADER_NOT_AVAILABLE, NOT_LEADER_OR_FOLLOWER, and BROKER_NOT_AVAILABLE errors via the SimulateLeaderElectionsErrorsPlugin at 30%. Watch consumer rebalance timing, partition assignment churn, and whether your alerting can distinguish a brief election (normal, no action needed) from a sustained leadership crisis (requires intervention).
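The alerting distinction can be encoded directly. A hypothetical classifier, assuming you track how long leadership errors persist per partition (the function name and 30-second threshold are illustrative):

```python
# Hypothetical helper separating a routine leader election from a sustained
# leadership crisis, based on how long leadership errors persist.

def classify_election(error_window_s: float, brief_threshold_s: float = 30.0) -> str:
    """Return 'routine' for short blips, 'page-oncall' for sustained loss."""
    return "routine" if error_window_s <= brief_threshold_s else "page-oncall"

print(classify_election(5))    # routine: normal election, no action needed
print(classify_election(120))  # page-oncall: sustained leadership crisis
```

The chaos experiment tells you whether your real alerting makes this same distinction, or pages on every blip.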

3. Inject latency spikes to validate capacity planning

Inject 500ms-2s delays on produce and fetch requests via SimulateSlowBrokerPlugin at 50%. Watch request.timeout.ms and session.timeout.ms behavior, backpressure propagation, and whether consumers get kicked from their groups. This tests whether your timeout configurations match your RTO expectations. The DR strategy guide flagged the "two-minute default" for delivery.timeout.ms. If your detection-plus-decision window is longer than that, producers will silently discard data.
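The timeout relationships this scenario exposes can be sanity-checked up front. A sketch under stated assumptions: the `expected.worst.poll.ms` and `rto.detection.window.ms` keys are hypothetical inputs you supply, while the other two are standard client configs.

```python
# Pre-experiment sanity check on client timeout relationships. The
# "expected" and "rto" keys are hypothetical inputs, not Kafka configs.

def check_timeouts(cfg: dict) -> list:
    problems = []
    # A consumer that polls slower than max.poll.interval.ms is evicted.
    if cfg["max.poll.interval.ms"] <= cfg["expected.worst.poll.ms"]:
        problems.append("max.poll.interval.ms too low for injected latency")
    # Producers give up silently once delivery.timeout.ms elapses.
    if cfg["delivery.timeout.ms"] < cfg["rto.detection.window.ms"]:
        problems.append("delivery.timeout.ms shorter than detection window")
    return problems

cfg = {
    "max.poll.interval.ms": 300000,
    "expected.worst.poll.ms": 2000,   # 2s injected fetch latency
    "delivery.timeout.ms": 120000,    # the two-minute default
    "rto.detection.window.ms": 180000,
}
print(check_timeouts(cfg))  # ['delivery.timeout.ms shorter than detection window']
```

With these example numbers, the check flags exactly the silent-discard gap the strategy guide warned about.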

4. Force schema registry failures to validate schema parity

Inject invalid schema IDs via SimulateInvalidSchemaIdPlugin on fetch requests for a specific topic. Watch consumer behavior when schema lookups fail: does the consumer gracefully degrade? Fall back to a cached schema? Crash? This tests whether your schema registry replication and failover actually work. Our six-area framework flagged schema parity as a critical DR dimension, and an unreachable registry will break every Avro/Protobuf consumer even if the brokers are fine. For a complementary test, use FetchSimulateMessageCorruptionPlugin to inject random bytes and verify that consumers handle raw deserialization errors without crashing.
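The "gracefully degrade" path can be sketched as a small guard around the decoder. Everything here is illustrative: `lookup_schema` simulates an unreachable registry, and the cache and dead-letter handling stand in for whatever your consumer actually does.

```python
# Hedged sketch of a consumer-side schema-lookup guard for scenario 4.
# lookup_schema simulates a registry outage; the cache is a hypothetical
# local fallback.

schema_cache = {1: "orders-v1"}  # schema_id -> previously fetched schema

def lookup_schema(schema_id: int) -> str:
    raise KeyError(schema_id)  # simulate an unreachable schema registry

def decode(schema_id: int, payload: bytes):
    try:
        schema = lookup_schema(schema_id)     # may raise during an outage
    except KeyError:
        schema = schema_cache.get(schema_id)  # graceful degradation
        if schema is None:
            return ("dead-letter", payload)   # park the record, don't crash
    return ("ok", schema)

print(decode(1, b"..."))   # served from the cache
print(decode(99, b"..."))  # unknown schema -> dead-lettered
```

The experiment tells you which of the three behaviors (degrade, fall back, crash) your real consumers exhibit; this sketch is the shape of the first two.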

5. Duplicate messages to validate application resilience

Inject duplicate message delivery via the DuplicateMessagesPlugin interceptor to test idempotency handling. Watch downstream processing: are payments double-charged? Are events counted twice? Are deduplication mechanisms actually working? This tests whether your exactly-once semantics hold under the conditions that actually occur during failover: producers retrying after connection resets, consumers re-fetching uncommitted batches.
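The idempotency property under test can be reduced to a minimal sketch: dedupe by a stable event ID before any side effect. A production version would use a durable, TTL'd store rather than an in-memory set; the event shape here is hypothetical.

```python
# Minimal idempotency sketch for scenario 5: dedupe by event ID before
# side effects. In-memory structures stand in for a durable store.

seen_ids = set()
charges = []

def handle_payment(event: dict) -> bool:
    """Apply the charge once; return False for duplicate deliveries."""
    if event["event_id"] in seen_ids:
        return False                 # redelivered during failover
    seen_ids.add(event["event_id"])
    charges.append(event["amount"])  # the side effect we must not repeat
    return True

evt = {"event_id": "pay-42", "amount": 19.99}
handle_payment(evt)
handle_payment(evt)      # duplicate delivery
print(sum(charges))      # charged exactly once
```

If injecting duplicates double-charges anything downstream, the deduplication layer is missing or scoped to the wrong key.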


Walking through your first experiment

That's the full list, but you don't need to run all five at once. Let's walk through scenario #1 end to end so you can see what a single experiment looks like in practice.

Start with broker unavailability at 10% on a non-critical topic in staging. It's the most common real-world failure mode, the easiest to reason about, and the results are immediately useful.

Define the hypothesis: "If 10% of produce requests to the staging-orders topic return UNKNOWN_SERVER_ERROR, our producer will retry successfully and no messages will be lost."

Configure the interceptor:

pluginClass: io.conduktor.gateway.interceptor.chaos.SimulateBrokenBrokersPlugin
config:
  rateInPercent: 10
  errorMap:
    PRODUCE: UNKNOWN_SERVER_ERROR
    FETCH: UNKNOWN_SERVER_ERROR

Scope this to a specific virtual cluster or topic to limit the blast radius.

Establish baselines: Record current producer success rate, consumer lag, and end-to-end latency. Set abort criteria before starting: if consumer lag exceeds 5 minutes or producer error rate exceeds 50%, disable the interceptor immediately.
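Those abort criteria are worth automating rather than eyeballing. A sketch, assuming a poller that feeds it current metrics and disables the interceptor when it trips (the thresholds mirror the ones above):

```python
# Automated abort check for the first experiment. In practice a poller
# would call this every few seconds with live metrics and disable the
# interceptor when it returns True. Thresholds match the abort criteria.

def should_abort(consumer_lag_s: float, producer_error_rate: float) -> bool:
    return consumer_lag_s > 300 or producer_error_rate > 0.50

print(should_abort(consumer_lag_s=45, producer_error_rate=0.12))   # keep going
print(should_abort(consumer_lag_s=400, producer_error_rate=0.12))  # abort
```

Deciding the abort logic before the experiment starts is the point; under pressure is the wrong time to debate thresholds.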

Run for 15-30 minutes. Monitor producer metrics (retries, failures, record-send-rate), consumer metrics (lag, rebalance count), and application logs.

Document everything: What you tested, what you expected, what actually happened, and what you'll change. Common findings from first experiments include:

  • Retry storms from overly aggressive configurations
  • Timeout values that don't match actual recovery times
  • Error handling that logs but never alerts
  • Consumer groups that churn through rebalances instead of stabilizing

Then graduate. Increase the failure rate. Move to a more critical topic. Try it in a pre-production environment that mirrors production configuration. Eventually, run it as a game day with the full on-call team.

Time investment: About 2 hours for setup, execution, and documentation. Compare that to the hours you'll spend during an unvalidated failover.

Once you've completed this first experiment, you have the pattern. The remaining four scenarios follow the same cycle: hypothesize, inject, observe, learn. Each one targets a different failure mode, but the process stays the same.


The plan is only as good as the last test

A DR plan that was tested six months ago is a stale hypothesis. Your system has changed since then with new topics, new consumers, updated configurations, and different team members on call. Chaos engineering makes validation continuous rather than one-off. Plan quarterly experiments for mission-critical workloads, automated tests on infrastructure changes, and game days that exercise the full decision chain.

Testing recovery tells you whether your plan works, but the deeper value of chaos testing is what it reveals about your system that monitoring never will. In the next post, we'll look at what Kafka chaos tests actually teach you: the hidden dependencies, timing assumptions, and recovery bottlenecks that only surface under controlled failure.

Download the Disaster Recovery Readiness Checklist and bring it to your next design review or disaster recovery audit.

Explore Gateway's chaos testing interceptors to see the full list of failure modes you can test without touching your brokers.

Book a Disaster Recovery Workshop: 45 minutes to review your Kafka estate and design a disaster recovery plan.


This is part of a series on Kafka Disaster Recovery.

Previously: The Complete Strategy Beyond Replication