Kafka Offset Management: How Consumer Groups Track Progress

Stéphane Derosiaux, November 1, 2025

Every "lost message" incident I've debugged traced back to offset management. Auto-commit combined with crashes. Offset reset misconfiguration. Offset expiration during maintenance windows.

Offsets are simple in concept—Kafka stores your position so you can resume where you left off. The failure modes are subtle.

"We lost 10,000 orders during a deploy. Auto-commit was enabled, and our consumer crashed between commit and processing. Switching to manual commit with idempotent handlers eliminated these incidents."
Platform Engineer at an e-commerce company

Auto-Commit: The Silent Data Loss

enable.auto.commit=true
auto.commit.interval.ms=5000

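With auto-commit, the consumer commits offsets from inside poll() once the configured interval has elapsed. Nothing checks that your code has finished with the records it already received, so when processing runs outside the poll loop (a worker thread, an async callback) the commit can run ahead of the work. The failure scenario: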

1. Consumer calls poll(), receives messages 100-199
2. Auto-commit fires, commits offset 200
3. Consumer crashes while processing message 150
4. Consumer restarts, fetches from offset 200
5. Messages 150-199 are lost
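The arithmetic behind this scenario can be sketched in a few lines. This is a toy model, not Kafka client code: the class and method names are made up for illustration, and the only Kafka fact it relies on is that a committed offset is the next offset the group will read.

```java
// Toy model of the crash scenario above. It does not use the Kafka client.
final class AutoCommitLoss {
    // Records that were never fully processed but sit below the committed offset:
    // the range [crashedAtOffset, committedOffset). They are not fetched again.
    static long lostRecords(long committedOffset, long crashedAtOffset) {
        return Math.max(0L, committedOffset - crashedAtOffset);
    }

    // Records that were processed but sit at or above the committed offset:
    // the range [committedOffset, crashedAtOffset). They are delivered again.
    static long redeliveredRecords(long committedOffset, long crashedAtOffset) {
        return Math.max(0L, crashedAtOffset - committedOffset);
    }

    public static void main(String[] args) {
        // Batch 100-199 polled, offset 200 committed early, crash at message 150.
        System.out.println(lostRecords(200, 150));        // 50 (messages 150-199)
        // Same crash, but nothing committed yet beyond offset 100.
        System.out.println(lostRecords(100, 150));        // 0
        System.out.println(redeliveredRecords(100, 150)); // 50 (messages 100-149)
    }
}
```

The second case is the at-least-once outcome: nothing is lost, but messages 100-149 come back as duplicates.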

In this failure mode auto-commit degrades to at-most-once delivery: the offset is committed whether or not the message was actually processed. If you can't tolerate message loss, disable it.

Manual Commit for At-Least-Once

props.put("enable.auto.commit", "false");

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records) {
        processOrder(record);  // Process first
    }
    consumer.commitSync();  // Commit after success
}

Tradeoff: At-least-once means duplicates on crashes. Your processing logic must be idempotent—processing the same order twice shouldn't charge the customer twice.
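One common way to get idempotency is to remember which order IDs have already been handled and skip repeats. The sketch below is a minimal illustration under loud assumptions: `IdempotentOrderHandler` and its methods are hypothetical names, the "charge" is just a counter, and the seen-ID set lives in memory. A real handler would likely keep that set in a durable store and update it atomically with the side effect.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical handler: deduplicates by order id before applying the side effect.
final class IdempotentOrderHandler {
    private final Set<String> processedOrderIds = new HashSet<>();
    private int chargesIssued = 0;

    // Returns true if the order was charged, false if it was a duplicate delivery.
    boolean handle(String orderId) {
        // Set.add returns false when the id is already present,
        // so a redelivered order is detected and skipped.
        if (!processedOrderIds.add(orderId)) {
            return false;
        }
        chargesIssued++; // stand-in for the real side effect (charging the customer)
        return true;
    }

    int chargesIssued() {
        return chargesIssued;
    }

    public static void main(String[] args) {
        IdempotentOrderHandler handler = new IdempotentOrderHandler();
        handler.handle("order-42");
        handler.handle("order-42"); // redelivered after a crash
        System.out.println(handler.chargesIssued()); // 1
    }
}
```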

CommitFailedException: You're Too Slow

CommitFailedException: the group has already rebalanced and assigned
the partitions to another member.

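Your consumer took longer than max.poll.interval.ms (default 300000, i.e. 5 minutes) between poll() calls. Kafka assumed it was dead, rebalanced the group, and rejected the late commit.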

Fix	Setting
More time between polls	`max.poll.interval.ms=600000`
Fewer messages per poll	`max.poll.records=100`
Commit more frequently	Commit every N records

Tradeoff: Higher max.poll.interval.ms means Kafka takes longer to detect actually dead consumers.
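To pick values for these two settings, it helps to do the budget arithmetic: a full batch has to be processed before max.poll.interval.ms runs out. The helper below is a rough sketch; `PollBudget`, the worst-case per-record time, and the safety factor are all assumptions you would replace with your own measurements.

```java
// Back-of-the-envelope sizing for max.poll.records. Not part of the Kafka client.
final class PollBudget {
    // Largest batch whose worst-case processing time fits inside
    // safetyFactor * maxPollIntervalMs.
    static long maxRecordsPerPoll(long maxPollIntervalMs, long worstCaseMsPerRecord, double safetyFactor) {
        if (worstCaseMsPerRecord <= 0 || safetyFactor <= 0 || safetyFactor > 1) {
            throw new IllegalArgumentException("need worstCaseMsPerRecord > 0 and 0 < safetyFactor <= 1");
        }
        long budgetMs = (long) (maxPollIntervalMs * safetyFactor);
        return budgetMs / worstCaseMsPerRecord;
    }

    public static void main(String[] args) {
        // Default interval (5 min), 2 s per record worst case, use half the budget.
        System.out.println(maxRecordsPerPoll(300_000, 2_000, 0.5)); // 75
        // Interval raised to 10 min, same workload.
        System.out.println(maxRecordsPerPoll(600_000, 2_000, 0.5)); // 150
    }
}
```

With the default max.poll.records of 500 and 2 seconds per record, one batch would need 1,000 seconds, well past the 300-second default interval, which is exactly the situation that produces this exception.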

Offset Reset: earliest vs latest

When a consumer group starts fresh, or its committed offset no longer points at retained data (the messages were deleted by retention, or the offset itself expired), Kafka needs a starting point:

auto.offset.reset=earliest  # Start from oldest available message
auto.offset.reset=latest    # Skip to newest message
auto.offset.reset=none      # Throw exception instead

The dangerous scenario: Consumer group runs for months. You shut it down for 8-day maintenance. Topic retention is 7 days. By the time it restarts, the messages at its committed offset have been deleted.

With earliest: Resumes from the oldest retained message (anything already deleted is gone; anything still retained that was processed before is a duplicate)
With latest: Skips everything produced during 8 days (data loss)

There's no universally correct choice.
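The decision Kafka makes here can be modeled as a small function. This is a simplified sketch of the behavior described above, not the client's actual code: `OffsetResetModel` is a made-up name, and -1 is used as an assumed marker for "no committed offset".

```java
// Simplified model of where a consumer starts reading. Not Kafka client code.
final class OffsetResetModel {
    enum Policy { EARLIEST, LATEST, NONE }

    // committed: the group's committed offset, or -1 if it has none (assumed marker).
    // logStart:  oldest offset still retained in the partition.
    // logEnd:    offset the next produced message will receive.
    static long startingOffset(long committed, long logStart, long logEnd, Policy policy) {
        boolean usable = committed >= logStart && committed <= logEnd;
        if (usable) {
            return committed; // the reset policy is ignored while the offset is valid
        }
        switch (policy) {
            case EARLIEST:
                return logStart;
            case LATEST:
                return logEnd;
            default:
                throw new IllegalStateException("no usable offset and auto.offset.reset=none");
        }
    }

    public static void main(String[] args) {
        // Committed offset 1000 was deleted; the log now spans offsets 5000 to 9000.
        System.out.println(startingOffset(1000, 5000, 9000, Policy.EARLIEST)); // 5000
        System.out.println(startingOffset(1000, 5000, 9000, Policy.LATEST));   // 9000
    }
}
```

The point the model makes is that the setting only matters in the failure case; as long as the committed offset is valid, all three policies behave identically.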

Offset Expiration: The Silent Killer

Kafka doesn't keep offsets forever. By default they expire 7 days after the consumer group becomes empty, that is, after its last member has left.

# Broker config (default: 10080 = 7 days)
offsets.retention.minutes=10080

1. Consumer group commits offsets
2. You deploy a new version that stops consuming
3. More than 7 days pass
4. Kafka deletes the offsets silently
5. Consumer restarts from auto.offset.reset

No error logged. Consumer starts "successfully" but from the wrong position.

Fix: Raise the broker setting to 30 days with offsets.retention.minutes=43200
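A quick way to sanity-check a planned downtime against this setting is to compare the two durations in minutes. The sketch below only does that arithmetic; `OffsetExpiry` is a hypothetical helper, and it ignores the fact that the broker removes expired offsets in a periodic background check, so actual deletion can lag the deadline slightly.

```java
import java.time.Duration;

// Arithmetic check only: would a group left empty this long outlive its offsets?
final class OffsetExpiry {
    static long daysToMinutes(long days) {
        return Duration.ofDays(days).toMinutes();
    }

    // True when the group has been empty for longer than offsets.retention.minutes.
    static boolean offsetsExpired(Duration groupEmptyFor, long offsetsRetentionMinutes) {
        return groupEmptyFor.toMinutes() > offsetsRetentionMinutes;
    }

    public static void main(String[] args) {
        System.out.println(daysToMinutes(7));  // 10080
        System.out.println(daysToMinutes(30)); // 43200
        // 8 days of downtime against the 7-day default, then against 30 days.
        System.out.println(offsetsExpired(Duration.ofDays(8), 10080)); // true
        System.out.println(offsetsExpired(Duration.ofDays(8), 43200)); // false
    }
}
```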

Manual Offset Reset

Stop all consumers first (the tool refuses to reset offsets while the group still has active members), then:

# Preview changes (prints the new offsets without applying them)
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --group order-processor --topic orders \
  --reset-offsets --to-datetime 2026-02-01T00:00:00.000 --dry-run

# Execute reset
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --group order-processor --topic orders \
  --reset-offsets --to-datetime 2026-02-01T00:00:00.000 --execute

Monitoring Offset Lag

Metric	Warning	Critical
`records-lag-max`	> 10,000	> 100,000
`records-lag-avg`	Growing trend	Growing 5+ min
`commit-latency-avg`	> 100ms	> 500ms

Consumer lag is your primary health indicator. Monitor it continuously, alert on sustained growth. Consumer group monitoring gives you real-time visibility into offset positions and lag across all partitions.
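Lag itself is just a subtraction, and the thresholds in the table translate directly into a check. The following is an illustrative sketch: `LagCheck` and its `Level` enum are made-up names, and the cutoffs are the example values from the table above rather than universal recommendations.

```java
// Illustrative lag calculation and threshold check. Not tied to any metrics library.
final class LagCheck {
    enum Level { OK, WARNING, CRITICAL }

    // Lag for one partition: how far the consumer's offset trails the end of the log.
    static long lag(long logEndOffset, long consumerOffset) {
        return Math.max(0L, logEndOffset - consumerOffset);
    }

    // Applies the records-lag-max thresholds from the table above.
    static Level classify(long recordsLagMax) {
        if (recordsLagMax > 100_000) {
            return Level.CRITICAL;
        }
        if (recordsLagMax > 10_000) {
            return Level.WARNING;
        }
        return Level.OK;
    }

    public static void main(String[] args) {
        long partitionLag = lag(250_000, 200_000);
        System.out.println(partitionLag);           // 50000
        System.out.println(classify(partitionLag)); // WARNING
    }
}
```

A single snapshot like this says little on its own; the "growing trend" rows in the table are about comparing successive readings over time.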

Offset management is deceptively simple. The defaults favor convenience over safety—which is wrong for most production workloads.
