Kafka Latency: Measuring p50, p99, p99.9 in Production

"Our Kafka latency is 5ms" tells you nothing.
An average latency stat tells you nothing about whether your users are happy. The 1% of requests that take 100x longer are the ones filing tickets, not the median.
Is that p50, p99, p999? Producer-to-broker, or end-to-end including app processing? "Average: 5ms" can hide a p99 of 500ms and a p999 of 3 seconds. Allegro saw exactly that: single-digit ms median, 1s at p99, 3s at p999. Migrating to XFS cut produce requests >65ms by 82%. The average barely moved.
The mean is meaningless for understanding tail latency. Benchmarking Kafka correctly means looking at the 95th and 99th percentiles, not the average.
Where latency actually lives
Kafka latency happens in stages. Knowing which stage is slow tells you where to tune.
Producer latency is the time from send() to acknowledgment. It covers serialization, the network hop to the broker, the broker's log append, replication to ISR replicas (if acks=all), and the ack coming back. A tier-1 bank held sub-5ms p99 end-to-end at 1.6M messages/sec with sub-5KB messages by tuning every one of those stages: Protobuf serialization, 10GbE network, broker settings, minimal replication delay.
Broker latency is the time from receiving a produce request to sending the ack. Dominated by disk write and ISR replication. SSDs cut write latency; more ISR replicas extend it.
Consumer latency is the time from broker commit to consumer processing complete. Dominated by what the application does with the message. Database writes, external API calls, or expensive compute swamp the Kafka fetch itself.
End-to-end latency is the sum. If producer-to-broker is 50ms but consumer processing is 500ms, no amount of producer tuning helps. Fix the slow stage.
What percentiles actually tell you
| Percentile | What it represents | What it hides | Use it for |
|---|---|---|---|
| p50 (median) | Typical request | The 50% that are slower | Baseline reporting |
| p95 | The 5% slowest | Tail outliers | SLO targets |
| p99 | The 1% slowest, where GC pauses, disk hiccups, and network retries show up | Catastrophic outliers | Real user experience |
| p999 | The 0.1% slowest, worst-case stalls | Almost nothing, this IS the worst case | Capacity planning, SLA design |
The 99th percentile captures the rarest and largest outliers: moments when brokers, disks, or networks hiccup and create unpredictable latency. When benchmarking, collect p50, p95, p99, and p99.9 together. Anything less is theater.
Low-latency vs high-throughput: pick one
You can't optimize both at once. Same dials, opposite directions.
| Setting | Low-latency | High-throughput | Why |
|---|---|---|---|
linger.ms | 0 | 100 | Wait time before sending a batch |
acks | 1 | all | Leader-only vs full ISR replication |
compression.type | none | snappy or zstd | CPU cycles vs network bytes |
batch.size | 16KB | 1MB | Per-request payload size |
| Expected p99 | 5–20ms | 50–200ms | At equal load |
| Tradeoff | Lower throughput, weaker durability | Higher latency, batching delay |
- Fraud detection or trading: low-latency. A 50ms delay misses the auction.
- Clickstream or log aggregation: high-throughput. 100ms per event is invisible; the number you care about is events/sec.
If you genuinely need both, the answer isn't a magic setting. It's horizontal scale: more partitions, more consumers, smaller load per instance.
Minimum viable low-latency config
Producer (Java/Kotlin properties):
# Send immediately, don't wait for batches
linger.ms=0
batch.size=16384
# Skip ISR replication wait: 2-3x lower latency than acks=all.
# Tradeoff: if the leader dies before replication, those messages are gone.
acks=1
# Skip compression CPU overhead
compression.type=none
# Fail fast: don't sit on retries that add tail latency
delivery.timeout.ms=5000
request.timeout.ms=2000 Broker (server.properties):
# More replication threads = faster ISR ack when using acks=all
num.replica.fetchers=4
# Tolerate temporary replica lag without triggering rebalance
replica.lag.time.max.ms=30000 Use NVMe SSDs. Broker p99 is dominated by disk write latency. AWS Graviton R8g cut p99 produce latency by ~20%, 23ms off the tail, just from the storage profile.
Consumer (Java properties):
# Don't wait for large batches to fill
fetch.min.bytes=1
fetch.max.wait.ms=100 Where p99 latency actually comes from
| Culprit | Why it spikes p99 | Fix | Watch in monitoring |
|---|---|---|---|
| JVM GC pauses | Broker pauses all processing during collection; produce requests queue | G1GC with sized heap, Java 11+ (lower pause times than older JVMs), alert on GC >100ms | jvm_gc_pause_seconds |
| Disk I/O stalls | Write rate exceeds disk capacity | NVMe SSDs (HDDs are too slow for production Kafka), monitor write latency | kafka.log.LogFlushStats |
| Network congestion | 10Gbps link near saturation, requests queue for bandwidth | Upgrade 1Gbps → 10Gbps NICs, compression, alert at 70% utilisation | node_network_transmit_bytes |
| Consumer rebalances | Partitions stop processing during reassignment; lag accumulates and shows up as latency when processing resumes | Stabilize pods, raise session.timeout.ms, fix crashloops | kafka_consumer_rebalance_rate |
| Under-replicated partitions | Leader waits for slow ISR when acks=all | Investigate broker resources, scale | UnderReplicatedPartitions (should be 0) |
Monitor distributions, not averages
Alert on percentiles, not means. A dashboard showing "average latency: 10ms" hides every tail problem you have.
A useful per-topic snapshot:
| Percentile | Value | What it tells you |
|---|---|---|
| p50 | 5ms | Typical case, looks healthy |
| p95 | 20ms | Slightly slower, still fine |
| p99 | 150ms | 1% of users wait 30x longer than typical |
| p999 | 2000ms | 0.1% wait 400x longer, this is what generates support tickets |
Latency heatmaps (HdrHistogram, similar tools) show the distribution over time. They reveal periodic patterns single metrics hide: p99 spikes every 5 minutes correlates with cron, compaction, or replication catch-up. Find the period, find the cause.
Measuring under realistic load
OpenMessaging Benchmark (OMB) is the standard open-source tool for reproducible Kafka workloads and high-fidelity latency histograms.
Don't measure on an idle cluster. Measure under realistic traffic, message sizes, and consumer counts. A method that works:
- Warm up for 5 minutes (fills page cache, stabilizes GC).
- Run at target throughput for 10+ minutes (e.g. 100k msg/sec).
- Collect p50, p95, p99, p999 from both producer and consumer.
- Change one config (batch size, compression, acks), repeat.
- Compare percentiles. Decisions belong on p99 deltas, not p50.
What to actually measure
Three things, in order:
- Percentile alerting on p50/p95/p99/p999, producer-side AND broker-side. Drop the average dashboards.
- Find the dominant stage. Producer ack, broker write/replication, or consumer fetch+process: the fix lives in whichever one is slow. Tuning the others is wasted effort.
- Benchmark under realistic load with OMB. 5min warmup, 10min run, at your real target throughput.
The most common pattern I've seen: an "average: 10ms" dashboard, a happy SRE, and a support backlog full of "the app feels slow." Once you start looking at p999, the support tickets stop being surprising.
If you can answer "what's our p99?" but not "what's our p999?", you're reporting averages dressed up as percentiles. Real tail latency lives in the 0.1%.
Related: Consumer Group Rebalancing · Kafka Performance · Conduktor Console