JVM Tuning for Kafka Brokers: G1GC vs ZGC in Production

Configure G1GC and ZGC for Kafka brokers. Heap sizing, pause time targets, and when to switch collectors in production.

Stéphane Derosiaux · December 10, 2024

A 500ms GC pause can trigger consumer rebalances, cause ZooKeeper session timeouts, and create cascading failures across your cluster.

I've debugged GC-related Kafka outages more times than I'd like. The fix is usually straightforward once you understand what's happening.

"Our production cluster had random 2-second latency spikes. Turned out to be Full GC pauses. Fixed the heap sizing and haven't had an incident in 8 months."

SRE at a payments company

The Standard G1GC Configuration

# Fixed heap: -Xms equal to -Xmx avoids resizing during operation
export KAFKA_HEAP_OPTS="-Xms6G -Xmx6G"
# G1 with a 20ms pause target; concurrent marking starts at 35% heap occupancy;
# explicit System.gc() calls run concurrently instead of stopping the world
export KAFKA_JVM_PERFORMANCE_OPTS="-XX:+UseG1GC \
  -XX:MaxGCPauseMillis=20 \
  -XX:InitiatingHeapOccupancyPercent=35 \
  -XX:+ExplicitGCInvokesConcurrent \
  -XX:G1HeapRegionSize=16M"

This is the battle-tested Confluent/LinkedIn configuration. Works for most workloads without tuning.

Heap Sizing: Keep It Small

Kafka brokers don't need large heaps. Data sits in the OS page cache, not the JVM.

Workload                        Heap Size
Development                     1-2 GB
Standard production             6 GB
High partition count (>10k)     8 GB

Always set -Xms equal to -Xmx to prevent heap resizing during operation, and monitor broker health metrics to correlate GC behavior with cluster performance.

Tradeoff: Larger heaps mean longer pauses. A 32GB heap with G1GC can have 100-200ms pauses.
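
A quick way to verify the heap actually stays small under load is jstat, which ships with the JDK. A minimal sketch; the pgrep pattern assumes the broker runs under the standard kafka.Kafka main class:

# Print heap occupancy and GC counts every 5 seconds for the broker JVM
jstat -gcutil $(pgrep -f kafka.Kafka) 5000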

When G1GC Causes Problems

Symptom: ZooKeeper session timeouts (ZK mode)

INFO Session expired; client is trying to reconnect to ZooKeeper

GC pause longer than session timeout. Fix: reduce pauses or increase zookeeper.session.timeout.ms.

For KRaft clusters: Similar issues manifest as controller election problems. Monitor controller.quorum.election.timeout.ms.
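
Both timeouts are broker settings in server.properties. A minimal sketch, using the current defaults as starting values:

# ZK mode: broker sessions survive pauses shorter than this
zookeeper.session.timeout.ms=18000

# KRaft mode: how long the controller quorum waits before a new election
controller.quorum.election.timeout.ms=1000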

Symptom: Consumer rebalances during pauses

Marking the coordinator dead for group my-group
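
The durable fix is shorter broker pauses, but loosening consumer-side timeouts can reduce rebalance sensitivity to brief ones. A sketch with illustrative values:

# consumer.properties: tolerate coordinator silence up to 45 seconds
session.timeout.ms=45000
# send heartbeats well inside the session timeout (roughly one third)
heartbeat.interval.ms=3000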

Symptom: Full GC

GC(45) Pause Full (Allocation Failure) 7800M->7500M(8192M) 12500.000ms

A 12-second Full GC will definitely cause broker disconnections. Increase heap or investigate memory usage.
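
Full GC events like this one only show up if GC logging is enabled. On JDK 9+ the unified logging flag looks like the sketch below, passed via the env var the stock Kafka start scripts read; the path and rotation settings are illustrative:

export KAFKA_GC_LOG_OPTS="-Xlog:gc*:file=/var/log/kafka/kafkaServer-gc.log:time,uptime:filecount=10,filesize=100M"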

When to Switch to ZGC

ZGC promises sub-millisecond pauses regardless of heap size. Netflix switched to Generational ZGC in 2024.

# Java 21+
export KAFKA_HEAP_OPTS="-Xms12G -Xmx12G"
export KAFKA_JVM_PERFORMANCE_OPTS="-XX:+UseZGC -XX:+ZGenerational"

Factor             G1GC      ZGC
Heap size          < 16GB    > 16GB
Pause target       < 50ms    < 10ms
CPU overhead       Lower     Higher (~5-10%)
Memory overhead    Lower     ~20% more needed
Java version       8+        15+ (21+ for Generational ZGC)

Choose G1GC: Heap under 16GB, Java 8-16, CPU constrained.

Choose ZGC: Heap over 16GB, need sub-10ms pauses, Java 17+.
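
Before rolling ZGC out, it's worth a sanity check that the JDK on your brokers actually accepts the flags; the JVM refuses to start on an unrecognized option:

# Fails with "Unrecognized VM option" on JDKs without Generational ZGC
java -XX:+UseZGC -XX:+ZGenerational -version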

Monitoring GC Health

# Parse GC log for pause times
grep -oP 'Pause Young.*?\K[\d.]+(?=ms)' /var/log/kafka/kafkaServer-gc.log | \
  awk '{sum+=$1; count++; if($1>max)max=$1} END {print "avg:",sum/count,"ms, max:",max,"ms"}'

Metric           Warning     Critical
GC pause P99     > 50ms      > 200ms
GC frequency     > 10/min    > 30/min
Heap after GC    > 70%       > 85%

Look for:
  • No "Pause Full" entries (any at all signals heap pressure)
  • No "Humongous Allocation" warnings
  • Pause times under your target
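
To check GC pause P99 against the thresholds above, a sketch extending the earlier one-liner (assumes the same unified GC log format):

# Collect all pause times, sort, and pick the 99th percentile
grep -oP 'Pause (Young|Full).*?\K[\d.]+(?=ms)' /var/log/kafka/kafkaServer-gc.log | \
  sort -n | awk '{a[NR]=$1} END {i=int(NR*0.99); if (i<1) i=1; print "p99:", a[i], "ms"}'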

Quick Fixes

Humongous objects: Kafka messages larger than half a G1 region (8MB with 16M regions) trigger inefficient humongous allocation. Increase -XX:G1HeapRegionSize to 32M.

Concurrent mode failure: Marking didn't finish before the heap filled. Lower -XX:InitiatingHeapOccupancyPercent to 25 so marking starts earlier.
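
Applied together, the adjusted opts would look like this sketch, layered on the baseline configuration above:

export KAFKA_JVM_PERFORMANCE_OPTS="-XX:+UseG1GC \
  -XX:MaxGCPauseMillis=20 \
  -XX:InitiatingHeapOccupancyPercent=25 \
  -XX:+ExplicitGCInvokesConcurrent \
  -XX:G1HeapRegionSize=32M"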

GC tuning is iterative. Start with recommended settings, monitor under load, adjust based on behavior. Premature optimization often makes things worse.

Book a demo to see how Conduktor Console surfaces GC health alongside Kafka metrics.