PagerDuty's Kafka Outage: Lessons from 4.2 Million Rogue Producers
A code pattern created 4.2M Kafka producers per hour, crashed the cluster, and silenced alerts for 9+ hours. Dissecting the postmortem.

On August 28, 2025, PagerDuty's Kafka cluster collapsed under 4.2 million producers created per hour. The alerting platform that tells you when things break couldn't tell anyone it was broken. The incident lasted more than 9 hours; at its peak, 95% of events were rejected.
The official postmortem reveals a single code pattern that caused cascading failure.
> "We've seen this anti-pattern repeatedly. Producer-per-request is the most expensive mistake in Kafka client usage."
> (Platform Engineer at a financial services firm)
The Bug: One Producer Per Request
A new auditing feature introduced code that created a fresh KafkaProducer for every API request instead of reusing a shared instance:
```scala
// WRONG: Creates a new producer per request
val settings = ProducerSettings(config, new StringSerializer, new StringSerializer)
Source(records).runWith(Producer.plainSink(settings))
```

The fix:

```scala
// CORRECT: Create once, reuse everywhere
val kafkaProducer = producerSettings.createKafkaProducer()
val settingsWithProducer = producerSettings.withProducer(kafkaProducer)
Source(records).runWith(Producer.plainSink(settingsWithProducer))
```

The absence of an explicit `new` keyword made this "a blind spot visually" during code review.
The Math: 84x Normal Load
Each KafkaProducer instance allocates:
| Resource | Per Producer |
|---|---|
| Buffer memory | 32 MB by default (`buffer.memory`) |
| Network I/O thread | 1 dedicated background sender thread |
| Broker connections | 1+ per broker it talks to |
| Metadata requests | Periodic cluster metadata refresh |
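To put those per-producer costs in perspective, here is a rough back-of-envelope sketch in Java. The average producer lifetime is an illustrative assumption of ours, not a figure from the postmortem; even a few seconds of lifetime at that creation rate implies hundreds of gigabytes of buffer allocations alone.

```java
// Back-of-envelope: what 4.2M producers/hour implies.
// The average producer lifetime below is an illustrative assumption, not a postmortem figure.
public class OutageMath {
    public static void main(String[] args) {
        double producersPerHour = 4_200_000;
        double producersPerSecond = producersPerHour / 3600;   // ~1,167 per second
        double assumedLifetimeSeconds = 5;                      // assumption: short-lived producers
        double concurrentProducers = producersPerSecond * assumedLifetimeSeconds;
        double bufferGiB = concurrentProducers * 32.0 / 1024;   // 32 MB buffer.memory default each

        System.out.printf("~%.0f producers created per second%n", producersPerSecond);
        System.out.printf("~%.0f producers alive at once, assuming a %.0f-second lifetime%n",
                concurrentProducers, assumedLifetimeSeconds);
        System.out.printf("~%.0f GiB of producer buffer memory alone%n", bufferGiB);
    }
}
```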
The Cascade
- JVM heap pressure → GC thrashing
- Heap exhaustion → single broker crash
- Partition reassignment → load shifts
- Other brokers exhaust → cluster collapse
The feature had been incrementally rolled out (1% → 5% → 25% → 75%). Both incidents correlated with rollout percentage increases.
Lesson: Gradual rollouts protect against immediate failures. They don't protect against resource exhaustion that scales with traffic percentage.
What They Didn't Monitor
- Producer count anomalies
- Producer creation rate
- JVM heap trends tied to producer metrics
An 84x spike in producers went undetected until brokers crashed. Proactive alerting on connection counts and resource metrics catches these anomalies before they cascade.
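One lightweight guard is to funnel all producer creation through a single factory that counts instantiations, so a producer-per-request leak shows up as a climbing counter long before the brokers feel the pressure. A minimal sketch, with an illustrative factory name and plain stdout standing in for a real metrics backend:

```java
import java.util.Properties;
import java.util.concurrent.atomic.AtomicLong;
import org.apache.kafka.clients.producer.KafkaProducer;

// Illustrative factory: every producer in the codebase is created here,
// so the creation count becomes a single metric you can alert on.
public final class InstrumentedProducerFactory {
    private static final AtomicLong PRODUCERS_CREATED = new AtomicLong();

    private InstrumentedProducerFactory() {}

    public static KafkaProducer<String, String> create(Properties props) {
        long total = PRODUCERS_CREATED.incrementAndGet();
        // Export this counter to your metrics system; a steadily climbing value
        // under constant traffic is the producer-per-request signature.
        System.out.println("Kafka producers created so far: " + total);
        return new KafkaProducer<>(props);
    }

    public static long producersCreated() {
        return PRODUCERS_CREATED.get();
    }
}
```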
Key Takeaways
Never create producers per request:
```java
// ANTI-PATTERN: constructs and tears down a producer on every call
public void sendMessage(String message) {
    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
        producer.send(new ProducerRecord<>(topic, message));
    }
}
```

KafkaProducer is thread-safe. One instance handles concurrent requests efficiently. Creating producers is expensive; sending messages is cheap.
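For contrast, here is a minimal sketch of the correct lifecycle: create one producer at application startup, share it across request handlers, and close it at shutdown. The class and method names are illustrative.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// CORRECT PATTERN: one long-lived, shared producer.
// KafkaProducer is thread-safe, so concurrent request handlers can share this instance.
public class EventPublisher implements AutoCloseable {
    private final KafkaProducer<String, String> producer;
    private final String topic;

    public EventPublisher(Properties props, String topic) {
        this.producer = new KafkaProducer<>(props); // created once, at application startup
        this.topic = topic;
    }

    public void sendMessage(String message) {
        // Cheap per-request work: enqueue onto the shared producer's buffer
        producer.send(new ProducerRecord<>(topic, message));
    }

    @Override
    public void close() {
        producer.close(); // flush and release connections at shutdown, not per request
    }
}
```

In most services this object is wired once as a singleton so its lifetime matches the application's.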
Monitor producer counts:
| Metric | Purpose |
|---|---|
| `kafka.server:type=socket-server-metrics,name=connection-count` | Active connections |
| `jvm_memory_bytes_used{area="heap"}` | Heap consumption |
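On the client side, each KafkaProducer registers a producer-metrics MBean in its own JVM, so a simple self-check can count live producers in-process. A minimal sketch; the threshold is an illustrative assumption you would tune to your service:

```java
import java.lang.management.ManagementFactory;
import java.util.Set;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Counts live KafkaProducer instances in this JVM by querying the
// producer-metrics MBeans each client registers.
public class ProducerCountCheck {
    private static final int EXPECTED_MAX_PRODUCERS = 5; // assumption: most services need only a handful

    public static void main(String[] args) throws Exception {
        MBeanServer mbs = ManagementFactory.getPlatformMBeanServer();
        Set<ObjectName> producers =
                mbs.queryNames(new ObjectName("kafka.producer:type=producer-metrics,client-id=*"), null);
        System.out.println("Live producers in this JVM: " + producers.size());
        if (producers.size() > EXPECTED_MAX_PRODUCERS) {
            System.err.println("WARNING: producer count exceeds the expected maximum; possible producer-per-request leak");
        }
    }
}
```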
Know your client library's lifecycle management. Wrappers that hide producer creation can hide expensive bugs.
Book a demo to see how Conduktor Console provides real-time producer metrics and anomaly detection.