Kafka Latency: Measuring p50, p99, p99.9 in Production

Stéphane Derosiaux January 30, 2026 6 min read

"Our Kafka latency is 5ms" tells you nothing.

An average latency says nothing about whether your users are happy. The users behind the 1% of requests that take 100x longer are the ones filing tickets, not the ones at the median.

Is that p50, p99, or p999 (the 99.9th percentile, also written p99.9)? Producer-to-broker, or end-to-end including app processing? "Average: 5ms" can hide a p99 of 500ms and a p999 of 3 seconds. Allegro saw that pattern: single-digit ms median, 1s at p99, 3s at p999. Migrating to XFS cut produce requests >65ms by 82%. The average barely moved.

The mean is meaningless for understanding tail latency. Benchmarking Kafka correctly means looking at the 95th and 99th percentiles, not the average.
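To see how a mean hides the tail, here is a minimal sketch on synthetic data. The sample counts and latencies are made up for illustration, and `percentile` is a hypothetical nearest-rank helper, not part of any Kafka client.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic latencies in ms (illustrative only): mostly fast, with a thin slow tail.
samples = [2] * 9_880 + [120] * 100 + [1_500] * 20

mean_ms = statistics.mean(samples)
report = {f"p{p}": percentile(samples, p) for p in (50, 95, 99, 99.9)}

print(f"mean={mean_ms:.1f}ms", report)
```

On this data the mean lands near 6ms while p99 sits at 120ms and p99.9 at 1.5s. The average looks fine; the tail does not.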

Where latency actually lives

Kafka latency happens in stages. Knowing which stage is slow tells you where to tune.

Producer latency is the time from send() to acknowledgment. It covers serialization, the network hop to the broker, the broker's log append, replication to ISR replicas (if acks=all), and the ack coming back. A tier-1 bank held sub-5ms p99 end-to-end at 1.6M messages/sec with sub-5KB messages by tuning every one of those stages: Protobuf serialization, 10GbE network, broker settings, minimal replication delay.

Broker latency is the time from receiving a produce request to sending the ack. Dominated by disk write and ISR replication. SSDs cut write latency; more ISR replicas extend it.

Consumer latency is the time from broker commit to consumer processing complete. Dominated by what the application does with the message. Database writes, external API calls, or expensive compute swamp the Kafka fetch itself.

End-to-end latency is the sum. If producer-to-broker is 50ms but consumer processing is 500ms, no amount of producer tuning helps. Fix the slow stage.
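One way to find the slow stage is to stamp each message at the stage boundaries and compare per-stage percentiles. The sketch below assumes you already collect four timestamps per message (the `Trace` field names are hypothetical) and that producer and consumer clocks are synchronized. It does not talk to Kafka.

```python
import math
from dataclasses import dataclass


@dataclass
class Trace:
    """Per-message timestamps in ms. Field names are illustrative, not a Kafka API."""
    sent: float       # producer called send()
    acked: float      # producer received the broker ack
    fetched: float    # consumer received the record
    processed: float  # consumer finished handling the record


def percentile(samples, pct):
    """Nearest-rank percentile."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def stage_p99(traces):
    """p99 latency per stage, so the slowest stage stands out."""
    stages = {
        "producer_ack": [t.acked - t.sent for t in traces],
        "fetch_wait": [t.fetched - t.acked for t in traces],
        "consumer_processing": [t.processed - t.fetched for t in traces],
    }
    return {name: percentile(values, 99) for name, values in stages.items()}


def dominant_stage(traces):
    """Name of the stage with the highest p99."""
    p99s = stage_p99(traces)
    return max(p99s, key=p99s.get)


# Mirrors the example above: ~50ms to get acked, ~500ms of consumer work.
traces = [Trace(sent=0, acked=50, fetched=55, processed=555) for _ in range(100)]
print(stage_p99(traces), dominant_stage(traces))
```

Splitting the consumer side into fetch wait and processing is a design choice here: it separates time Kafka is responsible for from time your application is responsible for.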

What percentiles actually tell you

| Percentile | What it represents | What it hides | Use it for |
| --- | --- | --- | --- |
| p50 (median) | Typical request | The 50% that are slower | Baseline reporting |
| p95 | The 5% slowest | Tail outliers | SLO targets |
| p99 | The 1% slowest, where GC pauses, disk hiccups, and network retries show up | Catastrophic outliers | Real user experience |
| p999 | The 0.1% slowest, worst-case stalls | Almost nothing, this IS the worst case | Capacity planning, SLA design |

If you serve 1M requests/day, the slowest 1% (everything beyond p99) is 10,000 requests and the slowest 0.1% (beyond p999) is 1,000. At p99 = 1s, 10,000 requests a day wait at least a full second. Those users are the ones who churn.
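The arithmetic generalizes to any traffic volume. A small sketch (`requests_beyond` is a hypothetical helper name):

```python
def requests_beyond(pct, daily_requests):
    """How many requests per day are slower than the given percentile."""
    return round(daily_requests * (100 - pct) / 100)


print(requests_beyond(99, 1_000_000))    # 10000
print(requests_beyond(99.9, 1_000_000))  # 1000
```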

The 99th and 99.9th percentiles capture the rare, large outliers: moments when brokers, disks, or networks hiccup and create unpredictable latency. When benchmarking, collect p50, p95, p99, and p99.9 together. Anything less is theater.

Low-latency vs high-throughput: pick one

You can't optimize both at once. Same dials, opposite directions.

| Setting | Low-latency | High-throughput | Why |
| --- | --- | --- | --- |
| linger.ms | 0 | 100 | Wait time before sending a batch |
| acks | 1 | all | Leader-only vs full ISR replication |
| compression.type | none | snappy or zstd | CPU cycles vs network bytes |
| batch.size | 16KB | 1MB | Maximum bytes per partition batch |
| Expected p99 | 5–20ms | 50–200ms | At equal load |
| Tradeoff | Lower throughput, weaker durability | Higher latency, batching delay | |

A few real workloads to anchor this:
  • Fraud detection or trading: low-latency. A 50ms delay misses the auction.
  • Clickstream or log aggregation: high-throughput. 100ms per event is invisible; the number you care about is events/sec.

If you genuinely need both, the answer isn't a magic setting. It's horizontal scale: more partitions, more consumers, smaller load per instance.

Minimum viable low-latency config

Producer (Java/Kotlin properties):

# Send immediately, don't wait for batches
linger.ms=0
batch.size=16384

# Skip ISR replication wait: 2-3x lower latency than acks=all.
# Tradeoff: if the leader dies before replication, those messages are gone.
acks=1

# Skip compression CPU overhead
compression.type=none

# Fail fast: don't sit on retries that add tail latency
delivery.timeout.ms=5000
request.timeout.ms=2000
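The Java producer rejects a config where delivery.timeout.ms is smaller than linger.ms + request.timeout.ms, so that constraint is worth checking whenever you shrink timeouts. A minimal sketch of the check follows. The parser and function names are hypothetical, and treating a missing linger.ms as 0 is an assumption, since the default varies by client version.

```python
PRODUCER_PROPS = """
# Values copied from the producer config above
linger.ms=0
batch.size=16384
acks=1
compression.type=none
delivery.timeout.ms=5000
request.timeout.ms=2000
"""


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def timeouts_consistent(props):
    """True if delivery.timeout.ms >= linger.ms + request.timeout.ms."""
    linger = int(props.get("linger.ms", 0))  # assumption: 0 when unset
    request = int(props["request.timeout.ms"])
    delivery = int(props["delivery.timeout.ms"])
    return delivery >= linger + request


props = parse_properties(PRODUCER_PROPS)
print(timeouts_consistent(props))
```

With the values above the check passes (5000 >= 0 + 2000). Drop delivery.timeout.ms below 2000 and it fails.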

Broker (server.properties):

# More replication threads = faster ISR ack when using acks=all
num.replica.fetchers=4

# Tolerate temporary replica lag before a follower is dropped from the ISR
replica.lag.time.max.ms=30000

Use NVMe SSDs. Broker p99 is dominated by disk write latency. AWS Graviton R8g cut p99 produce latency by ~20%, 23ms off the tail, just from the storage profile.

Consumer (Java properties):

# Don't wait for large batches to fill
fetch.min.bytes=1
fetch.max.wait.ms=100

Where p99 latency actually comes from

| Culprit | Why it spikes p99 | Fix | Watch in monitoring |
| --- | --- | --- | --- |
| JVM GC pauses | Broker pauses all processing during collection; produce requests queue | G1GC with sized heap, Java 11+ (lower pause times than older JVMs), alert on GC >100ms | jvm_gc_pause_seconds |
| Disk I/O stalls | Write rate exceeds disk capacity | NVMe SSDs (HDDs are too slow for production Kafka), monitor write latency | kafka.log.LogFlushStats |
| Network congestion | Link near saturation; requests queue for bandwidth | Upgrade 1Gbps → 10Gbps NICs, compression, alert at 70% utilisation | node_network_transmit_bytes |
| Consumer rebalances | Partitions stop processing during reassignment; lag accumulates and shows up as latency when processing resumes | Stabilize pods, raise session.timeout.ms, fix crashloops | kafka_consumer_rebalance_rate |
| Under-replicated partitions | Leader waits for slow ISR when acks=all | Investigate broker resources, scale | UnderReplicatedPartitions (should be 0) |

The fix column is mostly known. The hard part is figuring out WHICH row caused last Tuesday's spike. That's correlation work: GC logs, broker metrics, and consumer rebalance events all on the same timeline. Conduktor cluster insights does that overlay; well-configured Grafana dashboards do the same if you wire up the right exporters.
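The overlay itself is simple once the events share a clock. A rough sketch, assuming you have already exported spike times and candidate events as epoch seconds. The event labels, the sample timeline, and the 30-second window are all illustrative choices, not output from any real tool.

```python
from collections import Counter


def suspects(spike_times, events, window_s=30):
    """Count, per event source, how many spikes had that event in the preceding window."""
    hits = Counter()
    for spike in spike_times:
        # A set, so one source counts at most once per spike.
        sources = {src for src, ts in events if 0 <= spike - ts <= window_s}
        hits.update(sources)
    return hits.most_common()


# Hypothetical timeline in epoch seconds, made up for illustration.
events = [
    ("gc_pause", 995), ("gc_pause", 1990), ("gc_pause", 2985),
    ("consumer_rebalance", 1985),
    ("isr_shrink", 400),
]
spikes = [1000, 2000, 3000]

print(suspects(spikes, events))
```

Here GC pauses precede all three spikes and a rebalance precedes one, so GC is the first thing to look at. Co-occurrence is a lead, not proof.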

Monitor distributions, not averages

Alert on percentiles, not means. A dashboard showing "average latency: 10ms" hides every tail problem you have.

A useful per-topic snapshot:

| Percentile | Value | What it tells you |
| --- | --- | --- |
| p50 | 5ms | Typical case, looks healthy |
| p95 | 20ms | Slightly slower, still fine |
| p99 | 150ms | 1% of users wait 30x longer than typical |
| p999 | 2000ms | 0.1% wait 400x longer, this is what generates support tickets |

Alert on p99 staying above your threshold (say, 200ms) for 5+ minutes. Average dashboards miss this completely.
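The "above threshold for 5+ minutes" rule can be sketched as a small state machine over a p99 time series. This is an illustration of the logic, not a replacement for your alerting system; `sustained_breach` and the input shape are assumptions.

```python
def sustained_breach(p99_series, threshold_ms=200, min_duration_s=300):
    """True if p99 stays above threshold_ms for at least min_duration_s without a dip.

    p99_series: (timestamp_s, p99_ms) pairs sorted by time.
    """
    breach_start = None
    for ts, p99 in p99_series:
        if p99 > threshold_ms:
            if breach_start is None:
                breach_start = ts  # first sample of a new breach
            if ts - breach_start >= min_duration_s:
                return True
        else:
            breach_start = None  # any dip resets the clock
    return False


# One sample per minute, synthetic values.
steady = [(t, 250) for t in range(0, 360, 60)]
flapping = [(0, 250), (60, 250), (120, 100), (180, 250), (240, 250), (300, 250)]

print(sustained_breach(steady), sustained_breach(flapping))
```

Requiring a sustained breach keeps a single slow scrape from paging anyone, at the cost of reacting a few minutes late.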

Latency heatmaps (built from HdrHistogram output or similar tools) show the distribution over time. They reveal periodic patterns that single metrics hide: a p99 spike every 5 minutes usually correlates with a cron job, log compaction, or replication catch-up. Find the period, find the cause.
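Finding the period does not need a heatmap if you already have a p99 time series. A minimal sketch, assuming evenly scraped samples; `spike_period` is a hypothetical helper and the series is synthetic.

```python
import statistics


def spike_period(series, threshold_ms):
    """Median gap in seconds between spike onsets, or None if fewer than two spikes.

    series: (timestamp_s, p99_ms) pairs sorted by time.
    """
    onsets = []
    previous_high = False
    for ts, p99 in series:
        high = p99 > threshold_ms
        if high and not previous_high:
            onsets.append(ts)  # count only the rising edge of each spike
        previous_high = high
    gaps = [later - earlier for earlier, later in zip(onsets, onsets[1:])]
    return statistics.median(gaps) if gaps else None


# Synthetic series: one sample every 10s for 20 minutes, spiking for 20s every 5 minutes.
series = [(t, 400 if t % 300 < 20 else 15) for t in range(0, 1200, 10)]

print(spike_period(series, threshold_ms=200))
```

A stable gap of 300 seconds points at something scheduled. Irregular gaps point at load or hardware instead.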

Measuring under realistic load

OpenMessaging Benchmark (OMB) is the standard open-source tool for reproducible Kafka workloads and high-fidelity latency histograms.

Don't measure on an idle cluster. Measure under realistic traffic, message sizes, and consumer counts. A method that works:

  1. Warm up for 5 minutes (fills page cache, stabilizes GC).
  2. Run at target throughput for 10+ minutes (e.g. 100k msg/sec).
  3. Collect p50, p95, p99, p999 from both producer and consumer.
  4. Change one config (batch size, compression, acks), repeat.
  5. Compare percentiles. Decisions belong on p99 deltas, not p50.
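Step 5 can be sketched as a percentile diff between two runs. The latency lists below are synthetic stand-ins for whatever your benchmark exports, and `compare_runs` is a hypothetical helper.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def compare_runs(baseline, candidate, pcts=(50, 95, 99, 99.9)):
    """Per-percentile delta in ms (candidate minus baseline). Negative means faster."""
    return {p: percentile(candidate, p) - percentile(baseline, p) for p in pcts}


# Synthetic runs: the candidate config is 1ms slower at the median
# but removes most of the tail.
baseline = [5] * 980 + [200] * 20
candidate = [6] * 980 + [40] * 20

print(compare_runs(baseline, candidate))
```

Judged on p50, the candidate looks like a regression. Judged on p99, it is a 160ms win, which is the number that should drive the decision.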

What to actually measure

Three things, in order:

  1. Percentile alerting on p50/p95/p99/p999, producer-side AND broker-side. Drop the average dashboards.
  2. Find the dominant stage. Producer ack, broker write/replication, or consumer fetch+process: the fix lives in whichever one is slow. Tuning the others is wasted effort.
  3. Benchmark under realistic load with OMB. 5min warmup, 10min run, at your real target throughput.

The most common pattern I've seen: an "average: 10ms" dashboard, a happy SRE, and a support backlog full of "the app feels slow." Once you start looking at p999, the support tickets stop being surprising.

If you can answer "what's our p99?" but not "what's our p999?", you're reporting averages dressed up as percentiles. Real tail latency lives in the 0.1%.

