Chaos Engineering for Kafka

Chaos engineering for Kafka is the practice of deliberately injecting failures (broker outages, latency spikes, message corruption) into streaming systems to verify resilience before production incidents occur. Proxy-based approaches enable this testing without infrastructure access or risk to actual data.

Streaming platforms fail in ways that batch systems never experience. A Kafka broker goes down during peak traffic, consumers lag behind as partitions rebalance, or corrupted messages propagate through downstream services before anyone notices. Injecting these failures deliberately, under controlled conditions, exposes those weaknesses before they cause outages.

The challenge with traditional chaos engineering is that killing brokers or introducing network partitions requires infrastructure access and risks impacting production data. Protocol-aware proxies solve this by intercepting Kafka traffic and injecting failures at the client-visible layer, without touching the actual brokers.

Why Kafka Needs Chaos Testing

Kafka's distributed architecture creates failure scenarios that don't exist in simpler systems:

Broker failures during rebalancing: When a broker goes down, Kafka elects new partition leaders and consumers rebalance. Applications must handle temporary NOT_LEADER_OR_FOLLOWER errors and connection resets gracefully.
Consumer lag cascades: A consumer that is too slow to poll within max.poll.interval.ms is removed from its group, which triggers a rebalance. Other consumers pause while partitions are reassigned, creating a lag spiral that can take minutes to recover.
Exactly-once semantics under failure: Idempotent producers and transactions behave correctly during normal operation, but what happens when a broker rejects a producer with INVALID_PRODUCER_EPOCH because its epoch is stale?
Schema evolution failures: A consumer receives a message with an unknown schema ID. Does it crash, skip the record, or enter an infinite retry loop?

Testing these scenarios in production risks data loss or service degradation. Testing them in staging often misses configuration differences. Chaos engineering bridges this gap by injecting failures in a controlled way.
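
For the schema evolution case above, the "skip the record" path can be sketched as follows. This is a minimal illustration, not a real deserializer: it assumes payloads use the Confluent Schema Registry wire format (one magic byte of 0 followed by a 4-byte big-endian schema ID), and `known_schema_ids` is a hypothetical stand-in for a registry lookup.

```python
import struct

MAGIC_BYTE = 0  # first byte of the Schema Registry wire format (assumed framing)


def read_schema_id(payload: bytes) -> int:
    """Return the schema ID from a framed payload, or raise ValueError."""
    if len(payload) < 5 or payload[0] != MAGIC_BYTE:
        raise ValueError("payload is not in Schema Registry wire format")
    (schema_id,) = struct.unpack(">I", payload[1:5])
    return schema_id


def route_record(payload: bytes, known_schema_ids: set) -> str:
    """Return 'process' or 'dead_letter'; never raises and never retries."""
    try:
        schema_id = read_schema_id(payload)
    except ValueError:
        return "dead_letter"
    # known_schema_ids is a hypothetical local cache standing in for the registry.
    return "process" if schema_id in known_schema_ids else "dead_letter"
```

Routing unknown IDs to a dead-letter path keeps the consumer moving, which is the behavior an invalid-schema-ID experiment should confirm.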

Common Kafka Failure Scenarios to Test

Effective chaos engineering targets specific failure modes rather than random disruption:

Broker unavailability: Simulate UNKNOWN_SERVER_ERROR or CORRUPT_MESSAGE responses to test producer retry logic and consumer error handling
Leader election delays: Inject LEADER_NOT_AVAILABLE, NOT_LEADER_OR_FOLLOWER, and BROKER_NOT_AVAILABLE errors to validate client behavior during partition leadership changes
Latency spikes: Add 500ms-2s delays to produce and fetch requests to test timeout configurations and backpressure handling
Message corruption: Append random bytes to message payloads to verify consumer deserialization error handling
Duplicate messages: Return the same message multiple times to test idempotency in downstream processing
Invalid schema IDs: Overwrite schema registry IDs with invalid values to test deserialization failure paths

Each scenario maps to a real-world failure. Broker unavailability mirrors hardware failures or network partitions. Latency spikes replicate cross-AZ network degradation. Duplicate messages simulate what happens when consumers crash before committing offsets.
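
As one illustration of what the duplicate-message scenario is meant to exercise, the sketch below makes a downstream handler idempotent by remembering message IDs it has already applied. The `Deduplicator` class, the `(message_id, amount)` record shape, and the bounded in-memory cache are all assumptions made for the example; a production system would more likely keep this state in a durable store.

```python
from collections import OrderedDict


class Deduplicator:
    """Remembers recently seen message IDs, evicting the oldest beyond max_entries."""

    def __init__(self, max_entries: int = 10_000):
        self._seen = OrderedDict()
        self._max_entries = max_entries

    def first_time(self, message_id) -> bool:
        """Return True only the first time a given ID is observed."""
        if message_id in self._seen:
            self._seen.move_to_end(message_id)
            return False
        self._seen[message_id] = True
        if len(self._seen) > self._max_entries:
            self._seen.popitem(last=False)  # drop the least recently seen ID
        return True


def apply_payments(messages, dedup: Deduplicator) -> int:
    """Sum amounts from (message_id, amount) pairs, ignoring redelivered IDs."""
    total = 0
    for message_id, amount in messages:
        if dedup.first_time(message_id):
            total += amount
    return total
```

If a duplicate-injection experiment changes the total, the handler is not idempotent.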

Proxy-Based Chaos Testing with Conduktor Gateway

Traditional chaos tools require infrastructure access to kill processes or inject network faults. Conduktor Gateway takes a different approach: it sits between Kafka clients and brokers, intercepting the Kafka protocol and injecting failures at the application layer.

This architecture enables chaos testing without:

SSH access to broker nodes
Kubernetes privileges to kill pods
Network-level fault injection tools
Risk of corrupting actual topic data

Gateway's chaos interceptors simulate failures that clients experience, not infrastructure-level outages. The brokers remain healthy while clients receive error responses that match real failure scenarios. For complete interceptor documentation, see Chaos Testing in Conduktor Gateway.

Simulating Broken Brokers

Test how producers and consumers handle broker errors:

pluginClass: io.conduktor.gateway.interceptor.chaos.SimulateBrokenBrokersPlugin
config:
  rateInPercent: 25
  errorMap:
    FETCH: UNKNOWN_SERVER_ERROR
    PRODUCE: CORRUPT_MESSAGE

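With this configuration, 25% of fetch requests fail with UNKNOWN_SERVER_ERROR and 25% of produce requests fail with CORRUPT_MESSAGE. Monitor your application's retry behavior and alert thresholds.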
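
To pick a sensible retry budget for an experiment like this, a rough estimate helps. The sketch below assumes each attempt is failed independently at the configured rate, which is an assumption about the interceptor and not something documented here; both function names are hypothetical.

```python
def exhaustion_probability(rate_in_percent: float, retries: int) -> float:
    """Probability that the first attempt and every retry all hit an injected error."""
    p = rate_in_percent / 100.0
    return p ** (retries + 1)


def retries_needed(rate_in_percent: float, target: float) -> int:
    """Smallest retry count whose exhaustion probability is at or below target."""
    if not 0 <= rate_in_percent < 100:
        raise ValueError("rate_in_percent must be in [0, 100)")
    if target <= 0:
        raise ValueError("target must be positive")
    retries = 0
    while exhaustion_probability(rate_in_percent, retries) > target:
        retries += 1
    return retries
```

Under that independence assumption, a 25% injection rate with three retries leaves roughly a 0.4% chance that a request exhausts its retries, so a handful of failed requests in a large load test is expected and not necessarily a bug.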

Simulating Latency

Test timeout configurations under network degradation:

pluginClass: io.conduktor.gateway.interceptor.chaos.SimulateSlowBrokerPlugin
config:
  rateInPercent: 50
  minLatencyMs: 200
  maxLatencyMs: 2000

Half of all requests are delayed by 200 ms to 2 s, which exercises your request.timeout.ms and session.timeout.ms settings.
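
Whether this latency range actually trips a timeout depends on the configured value. The sketch below estimates the share of requests whose injected delay alone would exceed a given timeout, assuming delays are drawn uniformly between minLatencyMs and maxLatencyMs (an assumption; the distribution is not stated here). The function name is hypothetical.

```python
def fraction_exceeding(timeout_ms: float, rate_in_percent: float,
                       min_latency_ms: float, max_latency_ms: float) -> float:
    """Estimated share of all requests whose injected delay exceeds timeout_ms."""
    if timeout_ms >= max_latency_ms:
        return 0.0  # even the slowest injected delay fits within the timeout
    if timeout_ms <= min_latency_ms:
        share_of_delayed = 1.0  # every delayed request exceeds the timeout
    else:
        share_of_delayed = (max_latency_ms - timeout_ms) / (max_latency_ms - min_latency_ms)
    return (rate_in_percent / 100.0) * share_of_delayed
```

With the Java clients' default request.timeout.ms of 30 seconds (check your client version), `fraction_exceeding(30000, 50, 200, 2000)` is 0.0, so the configuration above adds latency without causing request timeouts. To exercise the timeout path itself, either lower the client timeout in the test environment or raise maxLatencyMs above it.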

Simulating Leader Elections

Test client resilience to partition leadership changes:

pluginClass: io.conduktor.gateway.interceptor.chaos.SimulateLeaderElectionsErrorsPlugin
config:
  rateInPercent: 30

Clients receive LEADER_NOT_AVAILABLE, NOT_LEADER_OR_FOLLOWER, and BROKER_NOT_AVAILABLE errors.
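
When reading the results of these experiments, it helps to know which injected errors a client is expected to retry. The sketch below encodes the retriable flags for the error names used in this article as they appear in the Kafka protocol error table; treat the flags as an assumption to verify against the client and broker versions you run. `next_action` is a hypothetical helper, not a client API.

```python
# Retriable flags per the Kafka protocol error table (assumption: verify
# against the client and broker versions you actually run).
RETRIABLE = {
    "UNKNOWN_SERVER_ERROR": False,
    "CORRUPT_MESSAGE": True,
    "LEADER_NOT_AVAILABLE": True,
    "NOT_LEADER_OR_FOLLOWER": True,
    "BROKER_NOT_AVAILABLE": False,
}

# Errors that indicate the client's partition metadata is out of date.
STALE_METADATA = {"LEADER_NOT_AVAILABLE", "NOT_LEADER_OR_FOLLOWER"}


def next_action(error_name: str, attempt: int, max_attempts: int) -> str:
    """Return 'fail', 'retry', or 'refresh_metadata_and_retry' for a failed attempt."""
    if not RETRIABLE.get(error_name, False) or attempt >= max_attempts:
        return "fail"
    if error_name in STALE_METADATA:
        return "refresh_metadata_and_retry"
    return "retry"
```

An application that surfaces a fatal error for NOT_LEADER_OR_FOLLOWER, or retries BROKER_NOT_AVAILABLE indefinitely, is a finding worth documenting.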

Simulating Message Corruption

Verify consumer deserialization error handling:

pluginClass: io.conduktor.gateway.interceptor.chaos.FetchSimulateMessageCorruptionPlugin
config:
  topic: "orders.*"
  rateInPercent: 10
  sizeInBytes: 50

10% of messages fetched from topics matching orders.* have 50 random bytes appended. This tests whether consumers log the error and continue, or crash.
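
The "log and continue" outcome can be sketched as below. The example assumes JSON-encoded payloads and an in-memory dead-letter list, both chosen only for illustration; the same pattern applies to other serialization formats with their own decode errors.

```python
import json


def consume_batch(payloads):
    """Decode each payload; collect undecodable ones instead of crashing.

    Returns (processed, dead_letters), where dead_letters holds
    (raw_payload, error_message) pairs for later inspection.
    """
    processed, dead_letters = [], []
    for raw in payloads:
        try:
            processed.append(json.loads(raw))
        except ValueError as exc:
            # Both JSONDecodeError and UnicodeDecodeError are ValueError subclasses.
            dead_letters.append((raw, str(exc)))
    return processed, dead_letters
```

During the experiment, the dead-letter count should roughly track the injection rate while the count of processed records keeps growing.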

Running Chaos Experiments Safely

Chaos engineering requires discipline to avoid causing the very outages you're trying to prevent:

Start with low failure rates: Begin at 5-10% and increase gradually. A 100% failure rate in production isn't chaos engineering; it's an outage.
Scope experiments narrowly: Apply interceptors to specific virtual clusters, topics, or consumer groups rather than all traffic. Gateway's scope configuration limits blast radius.
Define success metrics before starting: "End-to-end latency stays below 2s" or "Consumer lag recovers within 60 seconds." Without measurable criteria, you can't determine if the experiment succeeded.
Monitor during experiments: Watch broker metrics, consumer lag, producer error rates, and application logs in real time. Abort if metrics exceed acceptable thresholds.
Document findings: Each experiment should produce actionable results: configuration changes, code fixes, or increased confidence in existing resilience.
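
A success metric such as "Consumer lag recovers within 60 seconds" can be turned into an automated check. The sketch below assumes lag has been sampled as `(timestamp_seconds, lag)` pairs sorted by time; the function names, the sample format, and the lag threshold are hypothetical.

```python
def lag_recovery_seconds(samples, fault_end, threshold):
    """Seconds from fault_end until the first sample with lag at or below threshold.

    samples: iterable of (timestamp_seconds, lag), sorted by timestamp.
    Returns None if lag never recovers within the sampled window.
    """
    for timestamp, lag in samples:
        if timestamp >= fault_end and lag <= threshold:
            return timestamp - fault_end
    return None


def experiment_passed(samples, fault_end, threshold, max_recovery_seconds=60):
    """True if lag recovered within max_recovery_seconds after the fault ended."""
    recovery = lag_recovery_seconds(samples, fault_end, threshold)
    return recovery is not None and recovery <= max_recovery_seconds
```

Writing the criterion as code before the experiment starts removes any temptation to adjust the threshold after seeing the results.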

A typical experiment workflow:

Define hypothesis: "If 20% of produce requests fail, producer retries will succeed and no messages will be lost"
Configure interceptor with 20% failure rate on a non-critical topic
Run load test while monitoring producer success rate and topic message count
Verify message counts match (no data loss) and producer metrics show retries
Remove interceptor and document results
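
The verification step in this workflow can be sketched as a comparison of message IDs. The example assumes the load test tags every message with a unique ID and that a retry count is read from producer metrics; `verify_no_loss` and its result keys are hypothetical names.

```python
def verify_no_loss(sent_ids, received_ids, retry_count):
    """Compare what the load test sent with what landed on the topic."""
    received_list = list(received_ids)
    sent, received = set(sent_ids), set(received_list)
    return {
        "lost": sorted(sent - received),                     # sent but never seen
        "duplicates": len(received_list) - len(received),    # extra copies on the topic
        "retries_observed": retry_count > 0,
        # Hypothesis: retries happened and every sent message arrived.
        "hypothesis_holds": sent <= received and retry_count > 0,
    }
```

Comparing unique IDs instead of raw counts matters because retries can write duplicates when idempotence is not enabled, and a raw count could then mask a lost message.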

Summary

Chaos engineering for Kafka validates that streaming applications survive the failures they'll inevitably encounter in production. Broker outages, latency spikes, leader elections, and message corruption all occur in real deployments. The choice is whether to discover resilience gaps during controlled experiments or during 3 AM incidents.

Proxy-based chaos testing with tools like Conduktor Gateway enables failure injection without infrastructure access, making it practical to run experiments regularly as configurations and applications evolve. Start with the failure scenarios that match your operational experience: if broker restarts have caused issues before, test leader election handling first.

The goal isn't to prove systems are perfectly resilient. It's to find the configuration values, retry policies, and error handling paths that need improvement before production traffic reveals them.

Zero Trust Architecture for Kafka: Security framework where chaos testing validates continuous verification mechanisms
Chaos Engineering for Streaming Systems: Broader chaos engineering principles for Kafka, Flink, and other streaming platforms
Disaster Recovery Strategies for Kafka Clusters: Planning for and recovering from cluster-level failures
Exactly-Once Semantics in Kafka: Understanding delivery guarantees under failure conditions

Sources and References

Conduktor Gateway Chaos Testing Documentation - Complete reference for chaos interceptors including configuration options and error types
Apache Kafka Documentation - Design and Reliability Guarantees - Kafka's built-in failure handling and delivery semantics
Principles of Chaos Engineering - Foundational principles from Netflix's chaos engineering practice
Rosenthal, C., & Hochstein, L. (2020). Chaos Engineering: System Resiliency in Practice. O'Reilly Media

Written by Stéphane Derosiaux · Last updated February 18, 2026