PagerDuty's Kafka Outage: Lessons from 4.2 Million Rogue Producers

A code pattern created 4.2M Kafka producers per hour, crashed the cluster, and silenced alerts for 9+ hours. Dissecting the postmortem.

Stéphane Derosiaux · June 28, 2024

On August 28, 2025, PagerDuty's Kafka cluster collapsed under 4.2 million producers created per hour. The alerting platform that tells you when things break couldn't tell anyone it was broken. The outage lasted more than nine hours, and at its peak 95% of events were rejected.

The official postmortem reveals a single code pattern that caused cascading failure.

"We've seen this anti-pattern repeatedly. Producer-per-request is the most expensive mistake in Kafka client usage."

Platform Engineer at a financial services firm

The Bug: One Producer Per Request

A new auditing feature introduced code that created a fresh KafkaProducer for every API request instead of reusing a shared instance:

// WRONG: creates a new KafkaProducer on every request (hidden inside plainSink)
val settings = ProducerSettings(config, new StringSerializer, new StringSerializer)
Source(records).runWith(Producer.plainSink(settings))

The fix:

// CORRECT: Create once, reuse everywhere
val kafkaProducer = producerSettings.createKafkaProducer()
val settingsWithProducer = producerSettings.withProducer(kafkaProducer)
Source(records).runWith(Producer.plainSink(settingsWithProducer))

Because the allocation happens inside the library call rather than through an explicit new keyword, the bug was "a blind spot visually" during code review.
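
One way to keep the fix from regressing is to make the shared producer an explicit, application-scoped object. Here is a minimal sketch assuming the akka-stream-kafka (Alpakka) API used above; the object name, topic type, bootstrap address, and shutdown wiring are illustrative, not PagerDuty's implementation:

import akka.actor.ActorSystem
import akka.kafka.ProducerSettings
import akka.kafka.scaladsl.Producer
import akka.stream.scaladsl.Source
import org.apache.kafka.clients.producer.ProducerRecord
import org.apache.kafka.common.serialization.StringSerializer

object AuditPublisher {
  implicit val system: ActorSystem = ActorSystem("audit")

  private val producerSettings =
    ProducerSettings(system, new StringSerializer, new StringSerializer)
      .withBootstrapServers("localhost:9092") // placeholder: your brokers

  // Created exactly once for the whole application.
  private val sharedProducer = producerSettings.createKafkaProducer()
  private val settingsWithSharedProducer = producerSettings.withProducer(sharedProducer)

  sys.addShutdownHook(sharedProducer.close()) // closed at shutdown, not per request

  // Safe to call on every API request: each call runs a new stream,
  // but no new KafkaProducer is created.
  def publish(records: List[ProducerRecord[String, String]]) =
    Source(records).runWith(Producer.plainSink(settingsWithSharedProducer))
}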

The Math: 84x Normal Load

Each KafkaProducer instance:

  • Buffer memory: 32 MB by default
  • Network threads: a dedicated thread pool
  • Broker connections: at least one per broker
  • Metadata requests: periodic refreshes

At that rate, brokers had to track 4.2 million new producer IDs in memory every hour.
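
To make the scale concrete, here is a rough back-of-the-envelope sketch. These are not figures from the postmortem; it assumes the 32 MB buffer.memory default, which is an upper bound each producer can allocate lazily on the client side:

// Rough numbers for 4.2M producers created per hour.
object ProducerChurnMath extends App {
  val producersPerHour   = 4200000L
  val producersPerSecond = producersPerHour / 3600   // ~1,166 new producers every second

  val bufferMemoryBytes       = 32L * 1024 * 1024    // buffer.memory default (an upper bound)
  val bufferCapacityPerSecond = producersPerSecond * bufferMemoryBytes

  // ≈39 GB of producer buffer capacity requested per second on the client side,
  // plus one or more broker connections and a metadata refresh per producer.
  println(f"producers/s: $producersPerSecond, buffer capacity churn: ${bufferCapacityPerSecond / 1e9}%.1f GB/s")
}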

The Cascade

  1. JVM heap pressure → GC thrashing
  2. Heap exhaustion → single broker crash
  3. Partition reassignment → load shifts
  4. Other brokers exhaust → cluster collapse

The feature had been incrementally rolled out (1% → 5% → 25% → 75%). Both incidents correlated with rollout percentage increases.

Lesson: Gradual rollouts protect against immediate failures. They don't protect against resource exhaustion that scales with traffic percentage.

What They Didn't Monitor

  • Producer count anomalies
  • Producer creation rate
  • JVM heap trends tied to producer metrics

An 84x spike in producers went undetected until brokers crashed. Proactive alerting on connection counts and resource metrics catches these anomalies before they cascade.

Key Takeaways

Never create producers per request:

// ANTI-PATTERN: creates and closes a full KafkaProducer on every call
public void sendMessage(String message) {
    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
        producer.send(new ProducerRecord<>(topic, message));
    }
}

KafkaProducer is thread-safe. One instance handles concurrent requests efficiently. Creating producers is expensive; sending messages is cheap.
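
As an illustration of that point, here is a sketch using the plain Kafka client from Scala: one shared instance serving many concurrent request handlers. The topic name, broker address, and object name are placeholders, not code from the incident:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer
import scala.concurrent.{ExecutionContext, Future}

object SharedProducer {
  private val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092") // placeholder: your brokers
  props.put("key.serializer", classOf[StringSerializer].getName)
  props.put("value.serializer", classOf[StringSerializer].getName)

  // Created once, shared by every thread in the application.
  val producer = new KafkaProducer[String, String](props)

  // Called concurrently from many request-handling threads; KafkaProducer is thread-safe.
  def sendAuditEvent(message: String)(implicit ec: ExecutionContext): Future[Unit] =
    Future {
      producer.send(new ProducerRecord[String, String]("audit-events", message)) // placeholder topic
      ()
    }

  sys.addShutdownHook(producer.close())
}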

Monitor producer counts:

  • kafka.server:type=socket-server-metrics,name=connection-count (active connections)
  • jvm_memory_bytes_used{area="heap"} (heap consumption)

Set up anomaly detection on these metrics. An 84x spike should trigger an alert immediately.
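
For example, an out-of-band check could poll a broker's socket-server metrics over JMX and compare them to a baseline. This is only a sketch: the JMX URL, MBean pattern, and baseline value are assumptions you would adapt to your own setup:

import javax.management.ObjectName
import javax.management.remote.{JMXConnectorFactory, JMXServiceURL}
import scala.jdk.CollectionConverters._
import scala.util.Try

object ConnectionCountCheck extends App {
  // Assumption: the broker exposes remote JMX on this host/port.
  val url  = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://broker-1:9999/jmxrmi")
  val mbsc = JMXConnectorFactory.connect(url).getMBeanServerConnection

  // One MBean per listener/network processor; sum their connection-count gauges.
  val pattern = new ObjectName("kafka.server:type=socket-server-metrics,*")
  val totalConnections = mbsc
    .queryNames(pattern, null)
    .asScala
    .toList
    .flatMap(name => Try(mbsc.getAttribute(name, "connection-count")).toOption)
    .collect { case d: java.lang.Double => d.doubleValue }
    .sum

  val baseline = 500.0 // assumption: your normal steady-state connection count
  if (totalConnections > baseline * 10) // fire long before an 84x spike
    println(s"ALERT: $totalConnections connections, more than 10x the $baseline baseline")
}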

Know your client library's lifecycle management. Wrappers that hide producer creation can hide expensive bugs.

Book a demo to see how Conduktor Console provides real-time producer metrics and anomaly detection.