Kafka Consumer Lag Monitoring

The cheapest signal you have for a stuck Kafka consumer. Measure it, alert on it, and stop running a Prometheus stack just to read one number.

What Kafka consumer lag actually is

Consumer lag is the number of messages a consumer group still has to read on a partition. For partition p, lag is LogEndOffset(p) − CommittedOffset(group, p). It is reported per partition; the lag of a group is the sum across the partitions it owns.
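As a minimal sketch of that arithmetic, the function below reads lines of "partition log_end_offset committed_offset" on stdin and prints the lag per partition followed by the group total. The function name group_lag and the three-column input format are assumptions made for illustration; no Kafka tool emits this format directly.

```shell
#!/bin/sh
# Hypothetical helper: applies lag = log_end - committed to each input line
# of the form "partition log_end_offset committed_offset".
group_lag() {
  awk '
    NF == 3 {
      lag = $2 - $3            # LogEndOffset(p) - CommittedOffset(group, p)
      total += lag             # group lag is the sum over its partitions
      printf "partition %s lag %d\n", $1, lag
    }
    END { printf "group lag %d\n", total }
  '
}

# Sample offsets, chosen for illustration only.
printf '%s\n' "0 15243 15243" "1 18402 14817" "2 15102 15102" | group_lag
```

With these sample values, partitions 0 and 2 report a lag of 0, partition 1 reports 3585, and the group total is 3585.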

It's the most direct answer to "is my pipeline keeping up?". A flat non-zero lag is fine: the consumer sits a constant distance behind and is keeping pace with producers. Lag that climbs and never drains means the consumer is falling behind. Zero lag is ambiguous: the consumer may be caught up, or the topic may be idle, in which case even a dead consumer shows zero. You can't tell from lag alone, so watch the rate at which LogEndOffset advances alongside it.
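One rough way to encode that reasoning is to compare two samples of lag and LogEndOffset taken some interval apart. The sketch below is an assumption-laden illustration, not a Kafka or Conduktor feature: the function name classify_lag and its labels are made up, and two samples are a crude stand-in for a real trend.

```shell
#!/bin/sh
# Hypothetical classifier. Arguments: previous lag, current lag,
# previous log-end offset, current log-end offset (all integers).
classify_lag() {
  prev_lag=$1; cur_lag=$2; prev_end=$3; cur_end=$4
  if [ "$cur_lag" -eq 0 ]; then
    if [ "$cur_end" -gt "$prev_end" ]; then
      echo "caught-up"        # new messages arrived and were consumed
    else
      echo "idle"             # no new messages: consumer liveness unknown
    fi
  elif [ "$cur_lag" -gt "$prev_lag" ]; then
    echo "falling-behind"     # lag is climbing
  elif [ "$cur_lag" -lt "$prev_lag" ]; then
    echo "draining"           # lag is shrinking
  else
    echo "steady"             # flat non-zero lag
  fi
}

classify_lag 0 0 15243 15243        # prints "idle"
classify_lag 1200 3585 17000 18402  # prints "falling-behind"
```

The "idle" label is the case worth a second check: zero lag with no writes says nothing about whether the consumer is still running.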

[Figure: Conduktor Console showing offset lag and time lag for a consumer group, with consume rate and historical graphs]

Measuring lag with the Kafka CLI

Kafka ships kafka-consumer-groups.sh. Run it against any group to see the lag per partition:

bin/kafka-consumer-groups.sh \
  --bootstrap-server localhost:9092 \
  --describe --group payments-processor

# GROUP               TOPIC     PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG    CONSUMER-ID  HOST           CLIENT-ID
# payments-processor  payments  0          15243           15243            0     consumer-1   /10.0.1.4      worker-1
# payments-processor  payments  1          14817           18402         3585     consumer-2   /10.0.1.5      worker-2
# payments-processor  payments  2          15102           15102            0     consumer-3   /10.0.1.6      worker-3

That gives you one point in time. To watch lag, you either script the command on a cron, or you publish JMX metrics from the broker and consumer JVM and scrape them with Prometheus. Both work. Both are work you have to maintain.
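If you take the cron route, a small filter over the CLI output is usually enough to start with. The sketch below assumes the column layout of the --describe output (it locates the TOPIC, PARTITION, and LAG columns from the header row); the function name lag_over_threshold and the threshold value are hypothetical.

```shell
#!/bin/sh
# Hypothetical filter: reads `kafka-consumer-groups.sh --describe` output on
# stdin and prints "topic partition lag" for every partition whose LAG is
# greater than $1. Exits 1 if any partition is over, 0 otherwise.
lag_over_threshold() {
  awk -v max="$1" '
    # Until the header row is seen, look for the column positions.
    !lag_col {
      for (i = 1; i <= NF; i++) {
        if ($i == "TOPIC") topic_col = i
        if ($i == "PARTITION") part_col = i
        if ($i == "LAG") lag_col = i
      }
      next
    }
    # Data rows: ignore non-numeric LAG values such as "-".
    $lag_col ~ /^[0-9]+$/ && $lag_col + 0 > max + 0 {
      print $topic_col, $part_col, $lag_col
      over = 1
    }
    END { exit over + 0 }
  '
}

# Example cron usage (not executed here):
#   bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
#     --describe --group payments-processor \
#     | lag_over_threshold 1000 || echo "lag alert"
```

This still only samples a point in time on each run; storing the results and detecting trends is the part that turns into ongoing maintenance.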

What a healthy consumer lag looks like

There is no single threshold. Lag is meaningful relative to the consumer's throughput and the SLA you owe downstream.
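One way to make that concrete is to convert a message count into an estimated drain time and compare it with a latency budget. This is a rough sketch under stated assumptions: the consume rate is a steady positive integer in messages per second, producers are ignored, and the names drain_seconds and within_budget are hypothetical. It is not how any particular tool computes time lag.

```shell
#!/bin/sh
# Estimated seconds needed to drain $1 messages at $2 messages/second,
# rounded up. Assumes $2 is a positive integer.
drain_seconds() {
  lag=$1; rate=$2
  echo $(( (lag + rate - 1) / rate ))
}

# Succeeds if the backlog can be drained within $3 seconds.
within_budget() {
  [ "$(drain_seconds "$1" "$2")" -le "$3" ]
}

drain_seconds 3585 500                    # prints 8
within_budget 3585 500 30 && echo "ok"    # prints "ok"
```

The same 3585-message backlog that fits a 30-second budget at 500 messages per second would blow through it at 50 messages per second, which is why a raw count makes a poor threshold.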

Monitoring consumer lag with Conduktor Console

Conduktor Console connects to your Kafka clusters and reads consumer lag through the same admin API that kafka-consumer-groups.sh uses — no agents on your brokers, no consumer-side instrumentation. What you get on top of the raw number:

  1. A live view of lag per group and per partition across every cluster you connect, with the lag trend over time rather than a single snapshot.
  2. Threshold-based alerts routed to Slack, Microsoft Teams, email, or a webhook (the webhook covers PagerDuty, Opsgenie, and anything else with an HTTP endpoint). Set a threshold on offset lag (messages) or on time lag (seconds). Teams own their own group alerts; the platform team owns cluster-level ones.
  3. Ownership context. Consumer groups registered through the application catalog can be mapped to the application and team that runs them, so the alert lands with the person who can fix it — not the platform on-call.

For the broader monitoring picture (broker health, partition replication, connector failures), see Kafka monitoring with Conduktor. For a step-by-step walkthrough of setting an alert, see the alerting documentation.

Conduktor vs Prometheus + Grafana for consumer lag

Both options work. The right choice depends on what you already run, how many teams own consumers, and how much glue code you want to maintain.

| Concern | Prometheus + Grafana | Conduktor Console |
| --- | --- | --- |
| Setup | kafka_exporter (or similar) + Prometheus scrape config + Grafana dashboard. Add a JMX exporter on each consumer if you also want client-side records-lag | Connect cluster, lag is visible immediately |
| Lag granularity | Per-group, per-partition (with the right exporter) | Per-group, per-partition |
| Alerting | Prometheus Alertmanager rules in PromQL | UI thresholds, per-group, owned by the team |
| Ownership / per-team alerts | DIY with labels and routing trees | First-class: groups mapped to applications and owners |
| Multi-cluster | Per-cluster scrape, federate or remote-write into Thanos / Cortex / a central Prometheus | Single dashboard across every connected cluster |
| Works alongside the other | Yes | Yes: Conduktor exposes its own metrics endpoint to Prometheus |

Conduktor does not replace Prometheus for infrastructure-level metrics (node CPU, disk, network). It replaces the part of your Prometheus setup that exists only to scrape Kafka consumer lag and render it in Grafana.

Common consumer lag problems and where to look

For the concept-level primer, see the glossary entry on consumer lag monitoring. For a deeper take on choosing thresholds, see Kafka Consumer Lag Alerting Thresholds.

Frequently asked questions

What is Kafka consumer lag?

Consumer lag is the number of messages a consumer group still has to process on a partition. It equals the partition's log-end offset minus the group's last committed offset. Group-level lag is the sum across the partitions the group owns.

How do I check consumer lag from the command line?

Use kafka-consumer-groups.sh --bootstrap-server <host:port> --describe --group <group-name>. The output includes CURRENT-OFFSET, LOG-END-OFFSET, and LAG per partition.

What's a normal consumer lag threshold?

There isn't a universal one. Express the threshold in seconds-of-traffic relative to your consumer's throughput rather than a raw message count, then alert when lag exceeds the latency budget you owe downstream.

How does Conduktor monitor consumer lag?

Conduktor Console reads lag through the Kafka admin API the same way the CLI does, then renders it per group and per partition across every connected cluster. You set thresholds per group on either offset lag or time lag, and route alerts to Slack, Teams, email, or any webhook endpoint (which covers PagerDuty, Opsgenie, and similar).

Do I need to install agents on my brokers or consumers?

No. Conduktor connects to the cluster as a Kafka client and uses the standard admin API. There is nothing to install on the brokers or the consumer applications.

Can Conduktor work alongside Prometheus and Grafana?

Yes. Conduktor exposes its own metrics endpoint that Prometheus can scrape. Teams who already have Grafana dashboards keep them; Conduktor adds the per-team ownership and self-service alerting that raw metrics dashboards don't provide.

Monitor Consumer Lag in Minutes

Connect your cluster and see lag per group, per partition, across every environment. Alerts routed to the team that owns the consumer, not the platform on-call.