PagerDuty's Kafka Outage: Lessons from 4.2 Million Rogue Producers

A code pattern created 4.2M Kafka producers per hour, crashed the cluster, and silenced alerts for 9+ hours. Dissecting the postmortem.

Stéphane Derosiaux · June 28, 2024

On August 28, 2025, PagerDuty's Kafka cluster collapsed under 4.2 million producers created per hour. The alerting platform that tells you when things break couldn't tell anyone it was broken. The outage lasted more than nine hours, and at its peak 95% of events were rejected.

The official postmortem reveals a single code pattern that caused cascading failure.

"We've seen this anti-pattern repeatedly. Producer-per-request is the most expensive mistake in Kafka client usage."

Platform Engineer at a financial services firm

The Bug: One Producer Per Request

A new auditing feature introduced code that created a fresh KafkaProducer for every API request instead of reusing a shared instance:

// WRONG: creates a new KafkaProducer on every request (hidden inside plainSink)
val settings = ProducerSettings(config, new StringSerializer, new StringSerializer)
Source(records).runWith(Producer.plainSink(settings))

The fix:

// CORRECT: Create once, reuse everywhere
val kafkaProducer = producerSettings.createKafkaProducer()
val settingsWithProducer = producerSettings.withProducer(kafkaProducer)
Source(records).runWith(Producer.plainSink(settingsWithProducer))

Because the allocation happens inside the library call rather than through an explicit new keyword, the bug was "a blind spot visually" during code review.
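
One way to keep the fix from regressing is to make the shared producer an explicit, application-scoped object. Here is a minimal sketch assuming the akka-stream-kafka (Alpakka) API used above; the object name, topic type, bootstrap address, and shutdown wiring are illustrative, not PagerDuty's implementation:

import akka.actor.ActorSystem
import akka.kafka.ProducerSettings
import akka.kafka.scaladsl.Producer
import akka.stream.scaladsl.Source
import org.apache.kafka.clients.producer.ProducerRecord
import org.apache.kafka.common.serialization.StringSerializer

object AuditPublisher {
  implicit val system: ActorSystem = ActorSystem("audit")

  private val producerSettings =
    ProducerSettings(system, new StringSerializer, new StringSerializer)
      .withBootstrapServers("localhost:9092") // placeholder: your brokers

  // Created exactly once for the whole application.
  private val sharedProducer = producerSettings.createKafkaProducer()
  private val settingsWithSharedProducer = producerSettings.withProducer(sharedProducer)

  sys.addShutdownHook(sharedProducer.close()) // closed at shutdown, not per request

  // Safe to call on every API request: each call runs a new stream,
  // but no new KafkaProducer is created.
  def publish(records: List[ProducerRecord[String, String]]) =
    Source(records).runWith(Producer.plainSink(settingsWithSharedProducer))
}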

The Math: 84x Normal Load

Each KafkaProducer instance:

  • Buffer memory: 32 MB by default
  • Network threads: a dedicated thread pool
  • Broker connections: at least one per broker
  • Metadata requests: periodic refreshes

At that rate, brokers had to track 4.2 million new producer IDs in memory every hour.
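
To make the scale concrete, here is a rough back-of-the-envelope sketch. These are not figures from the postmortem; it assumes the 32 MB buffer.memory default, which is an upper bound each producer can allocate lazily on the client side:

// Rough numbers for 4.2M producers created per hour.
object ProducerChurnMath extends App {
  val producersPerHour   = 4200000L
  val producersPerSecond = producersPerHour / 3600   // ~1,166 new producers every second

  val bufferMemoryBytes       = 32L * 1024 * 1024    // buffer.memory default (an upper bound)
  val bufferCapacityPerSecond = producersPerSecond * bufferMemoryBytes

  // ≈39 GB of producer buffer capacity requested per second on the client side,
  // plus one or more broker connections and a metadata refresh per producer.
  println(f"producers/s: $producersPerSecond, buffer capacity churn: ${bufferCapacityPerSecond / 1e9}%.1f GB/s")
}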

The Cascade

  1. JVM heap pressure → GC thrashing
  2. Heap exhaustion → single broker crash
  3. Partition reassignment → load shifts
  4. Other brokers exhaust → cluster collapse

The feature had been incrementally rolled out (1% → 5% → 25% → 75%). Both incidents correlated with rollout percentage increases.

Lesson: Gradual rollouts protect against immediate failures. They don't protect against resource exhaustion that scales with traffic percentage.

What They Didn't Monitor

  • Producer count anomalies
  • Producer creation rate
  • JVM heap trends tied to producer metrics

An 84x spike in producers went undetected until brokers crashed. Proactive alerting on connection counts and resource metrics catches these anomalies before they cascade.

Key Takeaways

Never create producers per request:

// ANTI-PATTERN: creates and closes a full KafkaProducer on every call
public void sendMessage(String message) {
    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
        producer.send(new ProducerRecord<>(topic, message));
    }
}

KafkaProducer is thread-safe. One instance handles concurrent requests efficiently. Creating producers is expensive; sending messages is cheap.
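
As an illustration of that point, here is a sketch using the plain Kafka client from Scala: one shared instance serving many concurrent request handlers. The topic name, broker address, and object name are placeholders, not code from the incident:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer
import scala.concurrent.{ExecutionContext, Future}

object SharedProducer {
  private val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092") // placeholder: your brokers
  props.put("key.serializer", classOf[StringSerializer].getName)
  props.put("value.serializer", classOf[StringSerializer].getName)

  // Created once, shared by every thread in the application.
  val producer = new KafkaProducer[String, String](props)

  // Called concurrently from many request-handling threads; KafkaProducer is thread-safe.
  def sendAuditEvent(message: String)(implicit ec: ExecutionContext): Future[Unit] =
    Future {
      producer.send(new ProducerRecord[String, String]("audit-events", message)) // placeholder topic
      ()
    }

  sys.addShutdownHook(producer.close())
}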

Monitor producer counts:

  • kafka.server:type=socket-server-metrics,name=connection-count (active connections)
  • jvm_memory_bytes_used{area="heap"} (heap consumption)

Set up anomaly detection on these metrics. An 84x spike should trigger an alert immediately.
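
For example, an out-of-band check could poll a broker's socket-server metrics over JMX and compare them to a baseline. This is only a sketch: the JMX URL, MBean pattern, and baseline value are assumptions you would adapt to your own setup:

import javax.management.ObjectName
import javax.management.remote.{JMXConnectorFactory, JMXServiceURL}
import scala.jdk.CollectionConverters._
import scala.util.Try

object ConnectionCountCheck extends App {
  // Assumption: the broker exposes remote JMX on this host/port.
  val url  = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://broker-1:9999/jmxrmi")
  val mbsc = JMXConnectorFactory.connect(url).getMBeanServerConnection

  // One MBean per listener/network processor; sum their connection-count gauges.
  val pattern = new ObjectName("kafka.server:type=socket-server-metrics,*")
  val totalConnections = mbsc
    .queryNames(pattern, null)
    .asScala
    .toList
    .flatMap(name => Try(mbsc.getAttribute(name, "connection-count")).toOption)
    .collect { case d: java.lang.Double => d.doubleValue }
    .sum

  val baseline = 500.0 // assumption: your normal steady-state connection count
  if (totalConnections > baseline * 10) // fire long before an 84x spike
    println(s"ALERT: $totalConnections connections, more than 10x the $baseline baseline")
}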

Know your client library's lifecycle management. Wrappers that hide producer creation can hide expensive bugs.

Book a demo to see how Conduktor Console provides real-time producer metrics and anomaly detection.