Kafka Latency: Why Averages Lie
Kafka latency averages hide tail problems. Optimize p99, not p50, and measure end-to-end from producer commit to consumer processing.

"Our Kafka latency is 5ms" tells you nothing.
Is that p50 (median)? p99 (99th percentile)? p999? Producer latency from send to acknowledgment? Consumer latency from produce to consume? End-to-end latency including application processing? The mean is meaningless for understanding tail latency: benchmarking Kafka correctly requires focusing on p95 and p99 percentiles for real-world performance.
Latency distributions matter more than averages because user experience is dominated by outliers. If p50 latency is 5ms but p99 latency is 500ms, then 1 in 100 requests takes 100x longer than typical. For high-throughput systems processing millions of requests per day, that means tens of thousands of requests experience this degraded performance. Users notice.
Real-world example: At Allegro, median produce latency was single-digit milliseconds, but p99 latency hit 1 second and p999 latency reached 3 seconds. After filesystem optimization (migrating to XFS), produce requests exceeding 65ms dropped by 82%. The fix didn't change average latency significantly, but it eliminated tail latency spikes that were causing incidents.
Latency Components
Kafka latency has multiple stages. Understanding which stage is slow reveals where to optimize.
Producer latency measures time from send to acknowledgment. This includes:
- Application serialization (converting object to bytes)
- Network transfer (producer to broker)
- Broker write (appending to log)
- Replication (broker to ISR replicas if acks=all)
- Acknowledgment (broker to producer)
Each stage adds latency. A tier-1 bank achieved sub-5ms p99 end-to-end latency for trading pipelines, sustaining this at 1.6 million messages/sec with sub-5KB messages. They optimized every stage: fast serialization (Protobuf), low-latency network (10GbE), tuned broker settings, and minimal replication delay.
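One way to see producer latency directly is to time each send in the acknowledgment callback. Below is a minimal sketch with the Java client; the broker address (localhost:9092) and topic name (latency-test) are assumptions, and the acks level should match whatever you run in production.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerLatencyProbe {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all"); // measure with your production durability level

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 1000; i++) {
                long sendNanos = System.nanoTime();
                producer.send(new ProducerRecord<>("latency-test", "key", "payload-" + i),
                    (metadata, exception) -> {
                        // The callback fires when the broker acknowledges (or the send fails):
                        // the delta covers serialization, batching, network, append, and replication.
                        long ackMillis = (System.nanoTime() - sendNanos) / 1_000_000;
                        if (exception == null) {
                            System.out.println("send-to-ack latency: " + ackMillis + " ms");
                        }
                    });
            }
            producer.flush();
        }
    }
}
```

Running the same probe with acks=1 versus acks=all shows how much of the latency budget replication consumes.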
Broker latency measures time from receiving a produce request to acknowledgment. This includes:
- Request parsing
- Log append (disk write)
- Replication to ISR replicas (if acks=all)
- Sending acknowledgment
Broker latency is dominated by disk write and replication. SSDs reduce write latency. More ISR replicas increase replication latency.
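The broker already breaks this down in its request metrics. The sketch below reads the produce-request TotalTimeMs histogram over JMX; it assumes the broker was started with JMX enabled on port 9999, which is not the default.

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class BrokerProduceLatency {
    public static void main(String[] args) throws Exception {
        // Assumes the broker was started with JMX enabled, e.g. JMX_PORT=9999.
        JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbeans = connector.getMBeanServerConnection();
            // Total time the broker spent handling produce requests, including the local
            // append and the remote (replication) wait when acks=all.
            ObjectName produceTime = new ObjectName(
                "kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce");
            System.out.println("produce mean = " + mbeans.getAttribute(produceTime, "Mean") + " ms");
            System.out.println("produce p99  = " + mbeans.getAttribute(produceTime, "99thPercentile") + " ms");
            System.out.println("produce p999 = " + mbeans.getAttribute(produceTime, "999thPercentile") + " ms");
        }
    }
}
```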
Consumer latency measures time from broker commit to consumer processing. This includes:
- Fetch request (consumer to broker)
- Network transfer (broker to consumer)
- Deserialization (bytes to object)
- Application processing
Consumer latency is dominated by application processing. If consuming a message involves database writes, external API calls, or complex computation, this swamps Kafka fetch latency.
End-to-end latency measures time from producer send to consumer processing complete. This combines all stages. Optimizing end-to-end latency requires understanding which stage dominates.
If producer-to-broker takes 50ms but consumer processing takes 500ms, optimizing producer settings won't improve end-to-end latency. Fix consumer processing first.
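A rough way to find the dominant stage is to split end-to-end latency on the consumer using the record's producer timestamp. The sketch below assumes the topic uses the default CreateTime timestamps, that producer and consumer clocks are reasonably synchronized, and a hypothetical topic named orders.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class EndToEndLatencyProbe {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "e2e-latency-probe");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders")); // hypothetical topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    long fetchedAt = System.currentTimeMillis();
                    process(record);                       // your application logic
                    long processedAt = System.currentTimeMillis();

                    // record.timestamp() is the producer-side CreateTime, so these two deltas
                    // split end-to-end latency into "Kafka + delivery" vs "application processing".
                    System.out.printf("delivery=%dms processing=%dms%n",
                        fetchedAt - record.timestamp(), processedAt - fetchedAt);
                }
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        // placeholder for real processing
    }
}
```

If the processing delta dwarfs the delivery delta, tuning producer or broker settings won't move end-to-end latency.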
p50 vs p99 vs p999
Percentiles reveal latency distribution. p50 (median) is the latency half of requests experience. p99 is the latency 1% of requests exceed. p999 is the latency 0.1% of requests exceed.
p50 latency shows typical performance. If p50 is 10ms, half of requests complete in under 10ms. This is useful for understanding baseline but hides outliers.
p99 latency shows tail performance. The 99th percentile captures the slowest 1% of requests: the moments when broker, disk, or network hiccups create unpredictable latency. If p99 is 200ms while p50 is 10ms, 1% of requests take 20x longer than typical. For systems processing 100,000 requests/day, 1,000 requests experience this degradation.
p999 latency shows worst-case performance. If p999 is 3 seconds while p50 is 10ms, 0.1% of requests take 300x longer than typical. These extreme outliers are often caused by garbage collection pauses, disk stalls, or network retries.
Why p99 matters more than p50: User experience is defined by outliers. If your application serves 1 million requests/day, 10,000 requests will see latency at or above the p99 value. If that's 1 second instead of 10ms, 10,000 users per day experience slow responses. They're the ones who file support tickets and churn.
When benchmarking, collect the full set of percentile latencies (p50, p95, p99, and p999) to understand tail behavior.
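A minimal sketch of producing that kind of report with the HdrHistogram library, using simulated latencies as a stand-in for real measurements:

```java
import org.HdrHistogram.Histogram;
import java.util.concurrent.ThreadLocalRandom;

public class PercentileReport {
    public static void main(String[] args) {
        // Track values from 1 microsecond to 1 minute with 3 significant digits of precision.
        Histogram histogram = new Histogram(60_000_000L, 3);

        // Stand-in for real measurements: record each observed latency in microseconds.
        for (int i = 0; i < 100_000; i++) {
            histogram.recordValue(simulatedLatencyMicros());
        }

        System.out.printf("p50  = %.1f ms%n", histogram.getValueAtPercentile(50.0) / 1000.0);
        System.out.printf("p95  = %.1f ms%n", histogram.getValueAtPercentile(95.0) / 1000.0);
        System.out.printf("p99  = %.1f ms%n", histogram.getValueAtPercentile(99.0) / 1000.0);
        System.out.printf("p999 = %.1f ms%n", histogram.getValueAtPercentile(99.9) / 1000.0);
        System.out.printf("mean = %.1f ms%n", histogram.getMean() / 1000.0);
    }

    private static long simulatedLatencyMicros() {
        // Mostly fast, occasionally very slow: the shape that averages hide.
        long base = ThreadLocalRandom.current().nextLong(2_000, 10_000);
        return ThreadLocalRandom.current().nextInt(100) == 0 ? base + 500_000 : base;
    }
}
```

With this simulated distribution the mean lands near the typical case and far below the p99, which is exactly the distortion averages introduce.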
Latency vs. Throughput Tradeoffs
Latency and throughput are inversely related. Optimizing one often degrades the other.
Low-latency configuration prioritizes speed: linger.ms=0 (send immediately, don't wait for batches), acks=1 (leader acknowledgment only, skip ISR replication wait), compression.type=none (skip compression overhead). This minimizes latency but reduces throughput because messages are sent individually instead of batched.
High-throughput configuration prioritizes volume: linger.ms=100 (wait up to 100ms to fill batches), acks=all (wait for ISR replication), compression.type=snappy (compress batches to reduce network transfer). This maximizes throughput but increases latency because messages wait for batching and replication.
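As a sketch, the two profiles differ in just a few producer properties. The values mirror the examples above; treat them as starting points, not definitive settings.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class ProducerProfiles {
    // Low-latency profile: send immediately, wait only for the leader, skip compression.
    static Properties lowLatency() {
        Properties props = new Properties();
        props.put(ProducerConfig.LINGER_MS_CONFIG, "0");
        props.put(ProducerConfig.ACKS_CONFIG, "1");
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "none");
        return props;
    }

    // High-throughput profile: batch aggressively, wait for the full ISR, compress batches.
    static Properties highThroughput() {
        Properties props = new Properties();
        props.put(ProducerConfig.LINGER_MS_CONFIG, "100");
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy");
        return props;
    }
}
```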
The choice depends on workload:
- Real-time fraud detection needs low latency. Process transactions in milliseconds to block fraudulent charges before they clear.
- Clickstream analytics needs high throughput. Processing millions of events per second matters more than processing each event in 5ms vs 50ms.
For workloads that need both (low latency AND high throughput), the solution is parallelism: scale horizontally to handle high volume while maintaining low latency per message.
Configuration Tuning for Low Latency
Producer settings that reduce latency:
- linger.ms=0: Send messages immediately without waiting for batches. Latency drops to the network round-trip time, but throughput decreases because messages aren't batched.
- acks=1: Leader acknowledgment only; skip waiting for ISR replication. Latency is 2-3x lower than acks=all, but durability is weaker: if the leader fails before replication, messages are lost.
- compression.type=none: Skip compression overhead. Saves CPU cycles on producer and broker, reducing latency, but increases network transfer size.
- batch.size=16384 (16KB): A small batch size limits how many messages accumulate before sending. Combined with linger.ms=0, this ensures messages send quickly instead of waiting.
Broker settings that reduce latency:
- num.replica.fetchers=4: Increase replication threads. Faster replication reduces the time spent waiting for ISR acknowledgment when using acks=all.
- replica.lag.time.max.ms=30000: Increase the ISR timeout. Prevents replicas from falling out of the ISR due to temporary lag spikes, which would cause ISR churn and increase latency.
- Fast storage (NVMe SSDs): Broker latency is dominated by disk writes. Faster disks directly reduce p99 latency. AWS Graviton R8g instances showed p99 produce latency improving by nearly 20%, cutting almost 23 milliseconds from tail latency.
Consumer settings that reduce latency:
- fetch.min.bytes=1: Don't wait for large batches. Consumers fetch as soon as any data is available, reducing wait time.
- fetch.max.wait.ms=100: Limit how long consumers wait for fetch.min.bytes to accumulate. This ensures consumers get data within 100ms even if volume is low.
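A sketch of those two consumer settings as client properties, to be merged with whatever deserializers and group configuration the application already uses:

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;

public class LowLatencyConsumerConfig {
    static Properties lowLatency() {
        Properties props = new Properties();
        // Return fetches as soon as any data is available instead of waiting for a full batch.
        props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, "1");
        // Cap how long the broker may hold a fetch request waiting for fetch.min.bytes.
        props.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, "100");
        return props;
    }
}
```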
Monitoring and Alerting on Latency
Configure latency alerts to track distributions, not averages. Dashboards showing "average latency: 10ms" hide tail latency problems.
Percentile-based metrics show the full picture:
- p50: 5ms (typical case)
- p95: 20ms (slightly slower)
- p99: 150ms (tail latency)
- p999: 2000ms (outliers)
This reveals that while most requests complete quickly, 1% take 30x longer and 0.1% take 400x longer than typical. Without percentiles, you'd only see "average: 8ms" and miss the problem.
Alert on p99, not average. If p99 latency exceeds 200ms for 5+ minutes, something is degrading. This catches issues that average latency hides: garbage collection pauses, disk stalls, network congestion.
Latency heatmaps visualize distributions over time. Tools like HdrHistogram capture the underlying distributions; plotted over time, they show how latency changes throughout the day: is p99 latency spiking during peak traffic? During garbage collection? During rebalancing?
Heatmaps reveal patterns that single metrics miss: if p99 latency spikes every 5 minutes, that correlates with something periodic (cron job, compaction, replication lag recovery). Investigate the correlation.
Common Latency Culprits
Garbage collection pauses cause latency spikes on broker JVMs. When GC runs, brokers pause all processing. Produce requests queue up, waiting for GC to finish. When GC completes, queued requests process, but with added latency from waiting.
Solution: Tune JVM settings (G1GC with appropriate heap size), monitor GC logs, alert on long GC pauses (exceeding 100ms). Modern JVMs (Java 11+) have lower GC pause times than older versions.
Disk I/O stalls happen when disks can't keep up with write rate. Kafka appends all messages to disk. If disk write latency spikes (due to saturation, competing workloads, or hardware issues), produce latency spikes correspondingly.
Solution: Use SSDs (HDDs are too slow for production Kafka), monitor disk I/O latency, provision sufficient I/O capacity for peak throughput.
Network congestion increases transfer latency. If broker network saturates (10 Gbps link fully utilized), new requests queue waiting for bandwidth. Latency increases from network queuing, not Kafka processing.
Solution: Upgrade network interfaces (1 Gbps to 10 Gbps), use compression to reduce bytes transferred, monitor network utilization and alert at 70% capacity.
Rebalancing pauses consumer processing. When consumer instances join or leave, partitions reassign. During reassignment, affected partitions stop processing. Consumer lag grows, and when processing resumes, latency includes accumulated lag.
Solution: Stabilize consumer instances (fix crashlooping pods, tune health checks), increase session timeouts to prevent unnecessary rebalancing, monitor rebalancing frequency.
Under-replicated partitions slow replication. If ISR replicas fall behind, the leader waits longer for replication to complete (for acks=all). Produce latency increases because requests wait for slow replicas.
Solution: Monitor under-replicated partitions (should be zero), investigate broker resource usage if replicas lag consistently, scale broker capacity if needed.
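One way to monitor this is with the Java AdminClient: a partition is under-replicated when its ISR is smaller than its full replica set. A sketch assuming a broker at localhost:9092 and hypothetical topic names:

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

public class UnderReplicatedCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address

        try (Admin admin = Admin.create(props)) {
            for (TopicDescription topic :
                    admin.describeTopics(List.of("orders", "payments")).allTopicNames().get().values()) {
                for (TopicPartitionInfo partition : topic.partitions()) {
                    // Under-replicated: the in-sync replica set is smaller than the replica set.
                    if (partition.isr().size() < partition.replicas().size()) {
                        System.out.printf("UNDER-REPLICATED %s-%d: isr=%d replicas=%d%n",
                            topic.name(), partition.partition(),
                            partition.isr().size(), partition.replicas().size());
                    }
                }
            }
        }
    }
}
```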
Measuring Latency Correctly
OpenMessaging Benchmark (OMB) is an open source test harness built for scalable, reproducible workload generation and latency/throughput observability, capturing high-fidelity latency histograms.
Use benchmarking tools to measure latency under load. Don't measure latency on an idle cluster—measure it under realistic traffic patterns, realistic message sizes, and realistic consumer counts.
Benchmark methodology:
- Warm up the cluster (run traffic for 5 minutes to fill page cache, stabilize GC)
- Run benchmark at target throughput (e.g., 100k msg/sec for 10 minutes)
- Collect p50, p95, p99, p999 latency from producer and consumer
- Repeat with different configs (batch size, compression, acks)
- Compare percentiles to identify which config reduces tail latency
Key: measure percentiles, not averages. Compare configs based on p99 impact, not p50 impact.
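A minimal sketch of that methodology, reusing the callback-plus-histogram approach from earlier. A real benchmark should use a harness like OMB or kafka-producer-perf-test; this only illustrates the warm-up/measure split and percentile reporting, and the broker address, topic, message size, and pacing are all assumptions.

```java
import java.util.Properties;
import org.HdrHistogram.ConcurrentHistogram;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.ByteArraySerializer;

public class MiniProduceBenchmark {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());

        ConcurrentHistogram ackLatencyMicros = new ConcurrentHistogram(60_000_000L, 3);
        byte[] payload = new byte[1024]; // pick a realistic message size; 1 KB here as an example

        try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
            produceFor(producer, payload, 60_000, null);               // warm-up: results discarded
            produceFor(producer, payload, 600_000, ackLatencyMicros);  // measured 10-minute run
        }

        System.out.printf("p50=%dus p99=%dus p999=%dus%n",
            ackLatencyMicros.getValueAtPercentile(50.0),
            ackLatencyMicros.getValueAtPercentile(99.0),
            ackLatencyMicros.getValueAtPercentile(99.9));
    }

    // Sends at a modest steady rate for the given duration, recording send-to-ack latency.
    static void produceFor(KafkaProducer<byte[], byte[]> producer, byte[] payload,
                           long durationMillis, ConcurrentHistogram histogram) throws InterruptedException {
        long deadline = System.currentTimeMillis() + durationMillis;
        while (System.currentTimeMillis() < deadline) {
            long sendNanos = System.nanoTime();
            producer.send(new ProducerRecord<>("latency-test", payload), (metadata, exception) -> {
                if (histogram != null && exception == null) {
                    histogram.recordValue((System.nanoTime() - sendNanos) / 1_000);
                }
            });
            Thread.sleep(1); // crude pacing; a real harness controls the target rate precisely
        }
        producer.flush();
    }
}
```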
The Path Forward
Kafka latency optimization starts with measurement: track percentiles (p50, p99, p999), identify which stage dominates latency (producer, broker, consumer), then tune. Most latency problems are infrastructure issues (slow disks, network congestion, GC pauses) or configuration tradeoffs (batching for throughput vs immediate send for latency).
Conduktor provides latency monitoring with percentile-based alerting, configuration recommendations based on workload patterns, and correlation with cluster events (rebalancing, GC, under-replicated partitions) via cluster insights. Teams diagnose latency issues through visibility into what's slow and when, not guesswork.
If your latency reporting is "average: 10ms" without percentiles, the problem isn't Kafka—it's measurement hiding tail latency.
Related: Consumer Group Rebalancing · Kafka Performance · Conduktor Console