Kafka Offset Management: How Consumer Groups Track Progress

Understand Kafka offset commits, auto-commit pitfalls, CommitFailedException, and offset reset behavior. Stop losing messages.

Stéphane Derosiaux · November 1, 2025

Every "lost message" incident I've debugged traced back to offset management. Auto-commit combined with crashes. Offset reset misconfiguration. Offset expiration during maintenance windows.

Offsets are simple in concept—Kafka stores your position so you can resume where you left off. The failure modes are subtle.

We lost 10,000 orders during a deploy. Auto-commit was enabled, and our consumer crashed between commit and processing. Switching to manual commit with idempotent handlers eliminated these incidents.

Platform Engineer at an e-commerce company

Auto-Commit: The Silent Data Loss

enable.auto.commit=true
auto.commit.interval.ms=5000

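With auto-commit, the consumer commits offsets from inside poll() once auto.commit.interval.ms has elapsed. The commit covers everything the previous poll() returned, whether or not your code has finished processing it. The failure scenario, typical when records are handed to worker threads or processed asynchronously: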

  1. Consumer calls poll(), receives messages 100-199
  2. Auto-commit fires, commits offset 200
  3. Consumer crashes while processing message 150
  4. Consumer restarts, fetches from offset 200
  5. Messages 150-199 are lost

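In that situation auto-commit degrades to at-most-once delivery: a committed offset says nothing about whether the record was actually processed. If you can't tolerate message loss, disable it.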
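
The arithmetic behind the scenario above can be made explicit. This is a minimal, Kafka-free sketch: the class AutoCommitLoss and its method are hypothetical names, and it only models the numbers in the list (committed offset 200, crash while handling offset 150).

```java
// Hypothetical helper for illustration; not part of the Kafka client API.
final class AutoCommitLoss {
    private AutoCommitLoss() {}

    /**
     * @param committedOffset  offset the group resumes from after restart (e.g. 200)
     * @param firstUnprocessed first offset whose processing never completed (e.g. 150)
     * @return number of records that are skipped after the restart
     */
    static long skippedAfterRestart(long committedOffset, long firstUnprocessed) {
        // Nothing is skipped when the commit is at or behind the processing position.
        return Math.max(0L, committedOffset - firstUnprocessed);
    }
}
```

With the numbers from the list, skippedAfterRestart(200, 150) is 50: offsets 150 through 199. When the committed offset never runs ahead of processing, the result is 0, which is the property manual commits are meant to restore.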

Manual Commit for At-Least-Once

props.put("enable.auto.commit", "false");

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records) {
        processOrder(record);  // Process first
    }
    consumer.commitSync();  // Commit after success
}

Tradeoff: At-least-once means duplicates on crashes. Your processing logic must be idempotent—processing the same order twice shouldn't charge the customer twice.
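
One common way to get idempotency is to deduplicate on a business key before applying side effects. The sketch below is an assumption-laden illustration, not the article's implementation: IdempotentOrderHandler is a hypothetical class, the order ID is assumed to be a stable unique key, and the in-memory set stands in for a durable store such as a unique constraint in the orders database.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: skip side effects for orders that were already handled.
final class IdempotentOrderHandler {
    // In-memory only for illustration; it is lost on restart, so a real
    // system would need a durable store shared by all consumer instances.
    private final Set<String> processedOrderIds = ConcurrentHashMap.newKeySet();
    private int charges = 0;

    /** Returns true if the order was charged now, false if it was a duplicate. */
    synchronized boolean handle(String orderId) {
        if (!processedOrderIds.add(orderId)) {
            return false; // redelivered after a crash or rebalance: do nothing
        }
        charges++; // stand-in for the real side effect (charging the customer)
        return true;
    }

    synchronized int chargeCount() {
        return charges;
    }
}
```

Calling handle("order-42") twice charges once: the first call returns true, the second returns false and leaves chargeCount() at 1.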

CommitFailedException: You're Too Slow

CommitFailedException: the group has already rebalanced and assigned
the partitions to another member.

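Your consumer took longer than max.poll.interval.ms (default 300000, i.e. 5 minutes) between poll() calls. Kafka assumed it was dead, removed it from the group, and reassigned its partitions, so the commit was rejected.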

Fix                       Setting
More time between polls   max.poll.interval.ms=600000
Fewer messages per poll   max.poll.records=100
Commit more frequently    Commit every N records

Tradeoff: Higher max.poll.interval.ms means Kafka takes longer to detect actually dead consumers.
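
The first two fixes in the table come down to one inequality: the worst-case time to process a full batch must stay below max.poll.interval.ms. Here is a rough sketch of that check. PollBudget is a hypothetical helper, and the 700 ms per-record cost used below is an assumed figure, not a measurement.

```java
// Hypothetical helper: does one full poll() batch fit in the poll interval?
final class PollBudget {
    private PollBudget() {}

    /** Worst-case time to process one full batch, in milliseconds. */
    static long worstCaseBatchMs(int maxPollRecords, long perRecordMs) {
        return Math.multiplyExact((long) maxPollRecords, perRecordMs);
    }

    /** True if a full batch finishes before max.poll.interval.ms expires. */
    static boolean fitsInInterval(int maxPollRecords, long perRecordMs, long maxPollIntervalMs) {
        return worstCaseBatchMs(maxPollRecords, perRecordMs) < maxPollIntervalMs;
    }
}
```

With the client defaults (max.poll.records=500, max.poll.interval.ms=300000) and 700 ms per record, a full batch needs 350,000 ms and does not fit. Lowering max.poll.records to 100 brings it to 70,000 ms, and raising the interval to 600000 also makes the original batch fit.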

Offset Reset: earliest vs latest

When a consumer group has no committed offset for a partition, or its committed offset points to data that retention has already deleted, Kafka needs a starting point:

auto.offset.reset=earliest  # Start from oldest available message
auto.offset.reset=latest    # Skip to newest message
auto.offset.reset=none      # Throw exception instead

The dangerous scenario: a consumer group runs for months. You shut it down for 8 days of maintenance. Topic retention is 7 days, so by the time the group restarts, the messages at its last committed position have been deleted.

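  • With earliest: Resumes from the oldest retained message. Whatever retention deleted during the outage is gone, and any still-retained messages that were already processed are read again (duplicates)
  • With latest: Skips everything produced during the 8 days (data loss)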

There's no universally correct choice.
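
The decision can be modeled in a few lines. This is a simplified model for reasoning about the three policies, not Kafka client code: StartPosition and its parameters are hypothetical names, and the real client signals the none case with exceptions such as NoOffsetForPartitionException or OffsetOutOfRangeException rather than the generic one used here.

```java
import java.util.OptionalLong;

// Simplified model of how a starting position is chosen; not Kafka client code.
final class StartPosition {
    enum ResetPolicy { EARLIEST, LATEST, NONE }

    private StartPosition() {}

    /**
     * @param committed committed offset, empty if the group has none
     * @param logStart  oldest offset still retained in the partition
     * @param logEnd    offset the next produced record will receive
     */
    static long resolve(OptionalLong committed, long logStart, long logEnd, ResetPolicy policy) {
        if (committed.isPresent()
                && committed.getAsLong() >= logStart
                && committed.getAsLong() <= logEnd) {
            return committed.getAsLong(); // normal case: resume where we left off
        }
        // No usable committed offset: fall back to auto.offset.reset.
        switch (policy) {
            case EARLIEST: return logStart;
            case LATEST:   return logEnd;
            default:
                throw new IllegalStateException("no valid offset and auto.offset.reset=none");
        }
    }
}
```

For a group that committed offset 1000 on a partition now retaining offsets 5000 to 9000, earliest resolves to 5000 and latest to 9000. With none, the application fails fast and an operator decides, which is often the safer default for order-like data.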

Offset Expiration: The Silent Killer

Kafka doesn't keep offsets forever. Default: 7 days after a consumer group goes inactive.

# Broker config (default: 10080 = 7 days)
offsets.retention.minutes=10080

The failure sequence:

  1. Consumer group commits offsets
  2. You deploy a new version that stops consuming
  3. More than 7 days pass
  4. Kafka deletes the offsets silently
  5. Consumer restarts from auto.offset.reset

No error logged. Consumer starts "successfully" but from the wrong position.

Fix: Increase to 30 days: offsets.retention.minutes=43200
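
To sanity-check the value against your longest planned downtime, the conversion and comparison are simple enough to script. A small sketch follows, with OffsetRetention as a hypothetical helper. It treats expiry as an exact cutoff, whereas the broker removes expired offsets on a periodic check, so real timing is approximate.

```java
import java.time.Duration;

// Hypothetical helper for sanity-checking offsets.retention.minutes.
final class OffsetRetention {
    private OffsetRetention() {}

    /** Converts a retention period into a value for offsets.retention.minutes. */
    static long toRetentionMinutes(Duration retention) {
        return retention.toMinutes();
    }

    /** True if a group that stays empty for the given downtime would lose its offsets. */
    static boolean offsetsWouldExpire(Duration downtime, long retentionMinutes) {
        return downtime.toMinutes() > retentionMinutes;
    }
}
```

Here toRetentionMinutes(Duration.ofDays(30)) yields 43200, matching the setting above. An 8-day outage expires offsets under the 10080-minute default but not under 43200.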

Manual Offset Reset

Stop all consumers first (the tool refuses to reset offsets while the group has active members), then:

# Preview a reset to the earliest offsets (--dry-run changes nothing)
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --group order-processor --topic orders \
  --reset-offsets --to-earliest --dry-run

# Execute a reset to the offsets at a given timestamp
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --group order-processor --topic orders \
  --reset-offsets --to-datetime 2026-02-01T00:00:00.000 --execute

Monitoring Offset Lag

Metric               Warning         Critical
records-lag-max      > 10,000        > 100,000
records-lag-avg      Growing trend   Growing 5+ min
commit-latency-avg   > 100ms         > 500ms

Consumer lag is your primary health indicator. Monitor it continuously, alert on sustained growth. Consumer group monitoring gives you real-time visibility into offset positions and lag across all partitions.
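
If you compute lag yourself, it is the log end offset minus the group's committed offset, per partition. The sketch below applies the records-lag-max thresholds from the table to that value. LagCheck is a hypothetical class, and this mirrors the lag that kafka-consumer-groups.sh --describe reports. The client's own records-lag-max metric is measured against the consumer's fetch position instead, so the two numbers can differ slightly.

```java
import java.util.Map;

// Hypothetical sketch: compute group lag and classify it with the thresholds above.
final class LagCheck {
    enum Level { OK, WARNING, CRITICAL }

    private LagCheck() {}

    /** Lag of one partition: records written but not yet committed by the group. */
    static long lag(long logEndOffset, long committedOffset) {
        return Math.max(0L, logEndOffset - committedOffset);
    }

    /** Largest lag across partitions; both maps are keyed by partition number. */
    static long maxLag(Map<Integer, Long> logEndOffsets, Map<Integer, Long> committedOffsets) {
        long max = 0L;
        for (Map.Entry<Integer, Long> e : logEndOffsets.entrySet()) {
            // Assumption: a partition with no committed offset counts from 0.
            long committed = committedOffsets.getOrDefault(e.getKey(), 0L);
            max = Math.max(max, lag(e.getValue(), committed));
        }
        return max;
    }

    static Level classify(long maxLag) {
        if (maxLag > 100_000L) return Level.CRITICAL;
        if (maxLag > 10_000L) return Level.WARNING;
        return Level.OK;
    }
}
```

For two partitions with end offsets 150,000 and 20,000 and committed offsets 149,000 and 5,000, the maximum lag is 15,000, which classifies as WARNING. A single snapshot says little on its own; the growth trend over several minutes is what the records-lag-avg row is meant to catch.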

Offset management is deceptively simple. The defaults favor convenience over safety—which is wrong for most production workloads.
