Houston, We Have 7 CLI Tools and Zero Answers
You've accepted that debugging matters more than dashboards. Then you open a terminal and realize Kafka's native tools were never built for this.

In the last post, we made the case that most Kafka teams are stuck in a loop: alert fires, dashboard confirms the symptom, nobody can find the cause, someone restarts the consumer. The fix starts with debugging capability: looking at actual data, not just metrics about it.
So you accept that. You're ready to debug. You open a terminal.
And that's where the second problem starts.
What debugging actually looks like
Monday morning. Consumer lag alert fired. You know the topic, you know the consumer group. Time to figure out what's wrong.
Find the consumer group.
```
kafka-consumer-groups.sh --bootstrap-server broker:9092 --list

order-processing-v2
order-processing-v2-dlq
payment-events-consumer
payment-events-consumer-old
inventory-sync
inventory-sync-BACKUP-DO-NOT-DELETE
analytics-pipeline
analytics-pipeline-test
...
```

Forty-seven consumer groups. No descriptions, no labels, no indication of which service owns which group. Hope your naming conventions are good, or that you remember the exact group ID for the service that's paging you.
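The usual first move, assuming the group names contain something recognizable, is to grep the list down to a shortlist:

```
# Narrow 47 groups down to the ones that might belong to the paging service
kafka-consumer-groups.sh --bootstrap-server broker:9092 --list | grep -i order
```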
Describe the group.
```
kafka-consumer-groups.sh --bootstrap-server broker:9092 \
  --describe --group order-processing-v2

GROUP                TOPIC   PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
order-processing-v2  orders  0          4892710         4892712         2
order-processing-v2  orders  1          5765813         5765813         0
order-processing-v2  orders  2          5102449         5102449         0
order-processing-v2  orders  3          4998301         5003782         5481
order-processing-v2  orders  4          5234567         5234567         0
order-processing-v2  orders  5          4871092         4871092         0
order-processing-v2  orders  6          5345678         5345678         0
order-processing-v2  orders  7          5012345         5012345         0
```

Partition 3 has lag of 5,481. Raw numbers. No timestamps. Is this lag growing or recovering? When did it start? Was there a spike or has it been climbing for hours? The output can't tell you. This is a snapshot with no history.
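The usual stopgap is to poll the same command and eyeball whether the numbers are moving. A sketch, not a substitute for actual history:

```
# Re-run the describe every 10 seconds and watch whether partition 3's LAG grows or shrinks
watch -n 10 "kafka-consumer-groups.sh --bootstrap-server broker:9092 --describe --group order-processing-v2"
```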
You also can't tell which consumer instance owns partition 3 from this view. You'd need to add --members --verbose to get a different table with member assignments, then cross-reference it yourself against the lag output. Two queries, two different output formats, and you're the join engine.
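That second query looks something like this; the exact columns vary by Kafka version, so treat the output as approximate:

```
# Prints one row per consumer instance: client id, host, and assigned partitions,
# which you then match against the lag table above by hand
kafka-consumer-groups.sh --bootstrap-server broker:9092 \
  --describe --group order-processing-v2 --members --verbose
```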
Look at the stuck messages.
```
kafka-console-consumer.sh --bootstrap-server broker:9092 \
  --topic orders --partition 3 --offset 4998301 \
  --max-messages 3
```

If you're lucky and messages are JSON:

```
{"orderId":"abc-123","status":"PENDING","amount":42.50,"ts":"2026-02-06T01:47:00Z"}
```

If your team uses Avro (and most production teams do):

```
??orders?abc-123PENDING?L?????
```

Binary garbage. You need a different tool entirely. kafka-avro-console-consumer lives in the Confluent distribution, not the Apache one, and it needs its own configuration:
```
kafka-avro-console-consumer --bootstrap-server broker:9092 \
  --topic orders --partition 3 --offset 4998301 \
  --property schema.registry.url=http://schema-registry:8081 \
  --property print.key=true \
  --property key.deserializer=org.apache.kafka.common.serialization.StringDeserializer \
  --max-messages 3
```

Got Protobuf instead of Avro? Different tool again: kafka-protobuf-console-consumer. Different flags. Same friction.
And once you can finally read the messages, you still can't filter them. You can't say "show me messages where status=FAILED" or "show me messages with this key from the last hour." The console consumer dumps everything sequentially. Your options are piping to grep and hoping the data is on one line, or writing a custom consumer. During an active incident.
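The grep route, for what it's worth, looks roughly like this, and only works if the payloads are single-line JSON:

```
# Dump a chunk of the partition and grep for the condition you actually care about.
# Assumes JSON on one line; Avro/Protobuf has to go through the schema-aware consumer first.
kafka-console-consumer.sh --bootstrap-server broker:9092 \
  --topic orders --partition 3 --offset 4998301 \
  --max-messages 10000 \
  | grep '"status":"FAILED"'
```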
Check if the schema changed.
There's no CLI for this in the Kafka distribution. You're hitting the Schema Registry REST API:
```
# Get latest version
curl -s http://schema-registry:8081/subjects/orders-value/versions/latest | jq .

# Get previous version
curl -s http://schema-registry:8081/subjects/orders-value/versions/3 | jq .
```

Now diff the two. Mentally. Or pipe both to files and run diff. There's no quick way to check whether the new version is actually compatible with existing consumers, either. The compatibility endpoint exists, but nobody remembers the syntax off the top of their head:
```
curl -X POST -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  --data '{"schema": "{...escaped json...}"}' \
  http://schema-registry:8081/compatibility/subjects/orders-value/versions/latest
```

All this during an incident where Slack is blowing up and someone just asked "ETA?" for the third time.
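If you have jq 1.6 or newer on the box, you can at least avoid doing the diff mentally and the escaping by hand. A sketch, assuming the candidate schema you want to test is saved locally as new-orders.avsc (a name made up for this example):

```
# Pull both versions, unescape the schema field, and diff them properly
curl -s http://schema-registry:8081/subjects/orders-value/versions/latest \
  | jq '.schema | fromjson' > schema-latest.json
curl -s http://schema-registry:8081/subjects/orders-value/versions/3 \
  | jq '.schema | fromjson' > schema-v3.json
diff schema-v3.json schema-latest.json

# Let jq build the escaped {"schema": "..."} payload for the compatibility check
jq -n --rawfile s new-orders.avsc '{schema: $s}' \
  | curl -s -X POST -H "Content-Type: application/vnd.schemaregistry.v1+json" \
      --data @- \
      http://schema-registry:8081/compatibility/subjects/orders-value/versions/latest
```

Still four commands and a jq incantation to remember under pressure.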
Check if rebalancing is the issue.
```
kafka-consumer-groups.sh --bootstrap-server broker:9092 \
  --describe --group order-processing-v2 --state

GROUP                COORDINATOR (ID)  ASSIGNMENT-STRATEGY  STATE
order-processing-v2  broker-2 (2)      range                Stable
```

State says Stable. But was it rebalancing ten minutes ago when the lag started? No way to know. There's no rebalance history in the CLI. You'd need to grep broker logs, if you have access to the broker host. If you're on a managed service like MSK or Confluent Cloud, you probably don't.
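If you do have broker access, the breadcrumbs are usually in the group coordinator's log lines. A rough sketch, assuming logs live under /var/log/kafka (the path and file names vary by distribution and deployment):

```
# Look for recent rebalance activity involving this group in the broker log
grep -i "rebalance" /var/log/kafka/server.log | grep order-processing-v2 | tail -20
```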
Check the connector.
The orders topic is populated by a JDBC source connector. Maybe the problem isn't the consumer. Maybe it's the source. Let's check:
```
curl -s http://connect:8083/connectors/orders-jdbc-source/status | jq .

{
  "name": "orders-jdbc-source",
  "connector": { "state": "RUNNING", "worker_id": "connect-1:8083" },
  "tasks": [
    { "id": 0, "state": "RUNNING", "worker_id": "connect-1:8083" },
    { "id": 1, "state": "FAILED", "worker_id": "connect-2:8083", "trace": "org.apache.kafka.connect.errors.ConnectException: java.sql.SQLException..." },
    { "id": 2, "state": "RUNNING", "worker_id": "connect-1:8083" }
  ]
}
```

Task 1 is FAILED. The connector status shows RUNNING because the connector process itself is up. It's only one of its tasks that died. If you'd checked just the connector state without expanding tasks, you'd have missed it entirely. And to restart just that one task:
```
curl -X POST http://connect:8083/connectors/orders-jdbc-source/tasks/1/restart
```

All of this is REST API calls. No dedicated CLI. Config validation technically exists, as yet another REST endpoint buried in the Connect API, which is why in practice most teams just push and see what breaks.
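For the record, that validation call looks roughly like this, assuming the config is saved as orders-jdbc.json and the plugin is the Confluent JDBC source connector (both names here are illustrative):

```
# error_count > 0 means required fields are missing or mistyped;
# it does not guarantee the connector will work against the real database
curl -s -X PUT -H "Content-Type: application/json" \
  --data @orders-jdbc.json \
  http://connect:8083/connector-plugins/io.confluent.connect.jdbc.JdbcSourceConnector/config/validate \
  | jq '.error_count'
```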
Check topic configuration.
```
kafka-configs.sh --describe --topic orders \
  --bootstrap-server broker:9092

Dynamic configs for topic orders:
  retention.ms=604800000
  cleanup.policy=delete
  segment.bytes=1073741824
```

Is retention.ms the reason messages are disappearing? You need to know when those messages were produced, which takes you back to the consumer command, where you'd need --property print.timestamp=true (a flag you might not know exists), and then you'd need to manually compare message timestamps against retention.
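For reference, the consumer command with timestamps enabled looks something like this; whether you get CreateTime or LogAppendTime depends on the topic's message.timestamp.type setting:

```
# Same three messages as before, now prefixed with their timestamps,
# so you can compare them against retention.ms by hand
kafka-console-consumer.sh --bootstrap-server broker:9092 \
  --topic orders --partition 3 --offset 4998301 \
  --max-messages 3 \
  --property print.timestamp=true
```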
That's seven tools, seven separate contexts, to investigate a single consumer lag alert.
And you still might not have the answer.
The mismatch
These tools were designed when Kafka was a LinkedIn internal project. The assumptions were reasonable for that era:
| 2011 Assumption | 2026 Reality |
|---|---|
| One team runs Kafka | Dozens of teams produce and consume |
| Operators debug Kafka | Application developers debug Kafka |
| You know the cluster intimately | You barely know which broker is which |
| Debugging = cluster operations | Debugging = understanding application data |
| Plain text messages | Avro, Protobuf, JSON Schema with evolution |
| Few topics, few consumer groups | Hundreds of topics, complex dependency graphs |
So you're not debugging. You're doing archaeology. Digging through layers of raw output, piecing together what happened from disconnected fragments, hoping you know which commands to run.
The one person who knows
Every team running Kafka has that one engineer. The person who knows that --property print.timestamp=true exists. Who remembers that kafka-avro-console-consumer is a separate binary from a separate distribution. Who can tell you the Schema Registry API uses /subjects/{name}/versions/ and not /schemas/. Who knows that a connector showing RUNNING can still have failed tasks.
When that person is around, incidents get resolved. When they're not (vacation, sick day, different timezone, already left the company), the team stalls.
The tooling makes this inevitable:
- kafka-consumer-groups.sh has nearly 30 flags. The ones you need during an incident (--describe, --state, --members, --verbose) aren't obvious from the help text.
- kafka-console-consumer.sh has separate --property flags for printing keys, timestamps, headers, and partition info. None are on by default. The format changes depending on which properties you enable.
- Schema operations require knowing the REST API of a separate service, with different URL patterns for subjects vs. schemas vs. compatibility checks vs. config.
- Connect operations require a different REST API on a different port, with its own conventions for status, config, tasks, and restarts.
- Connecting the dots requires running all of these in sequence and mentally correlating consumer lag to specific messages to schema versions to connector state.
Kafka's core concepts (topics, partitions, offsets, consumer groups) are well-designed and well-documented. Understanding Kafka isn't the hard part. The hard part is that debugging requires memorizing a disconnected set of CLI tools, REST APIs, and output formats that were never designed to work together.
Most developers just escalate to the platform team for issues they could investigate themselves, if the tools didn't require an afternoon of Stack Overflow to figure out. On-call becomes a lottery where resolution speed depends on who's holding the pager, not what happened. And when that one engineer moves on, months of accumulated debugging intuition walk out the door. No runbook replaces it because the runbook would be "run these 12 commands in sequence, mentally track state across them, and interpret the output based on experience."
What you just can't do
Everything above is painful but possible. Then there's the stuff the native tools just don't support.
There is no --where status=FAILED flag on the console consumer. If you want to find messages matching a condition, you write a consumer application with custom logic, deploy it, run it, and parse its output. For a one-time debugging query.
The CLI gives you lag right now. Not lag five minutes ago, not lag trending over the last hour. If you want history, you need to have already set up a Prometheus exporter, a metrics pipeline, and a dashboard before the incident started.
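The mid-incident fallback is to start collecting your own samples from that moment on. A crude sketch, assuming the describe output's usual column order (topic in column 2, partition in 3, lag in 6):

```
# Append a timestamped lag sample per partition of orders every 30 seconds
while true; do
  kafka-consumer-groups.sh --bootstrap-server broker:9092 \
    --describe --group order-processing-v2 \
    | awk -v ts="$(date -u +%FT%TZ)" '$2 == "orders" {print ts, $3, $6}' >> lag-history.log
  sleep 30
done
```

Which gives you history starting now, and nothing about the hour that actually mattered.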
When the fix is "reset the consumer group to re-process from an hour ago," the process is:
```
kafka-consumer-groups.sh --bootstrap-server broker:9092 \
  --group order-processing-v2 --topic orders \
  --reset-offsets --to-datetime 2026-02-06T01:00:00.000 \
  --dry-run
```

The --dry-run flag shows you the target offset for each partition, but not what you're about to skip or reprocess. How many messages is that? How far back does it actually rewind each partition? You won't know until after you --execute it. On production. With fingers crossed.
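You can reconstruct the blast radius by hand, comparing the dry-run targets against the group's current offsets. A sketch that, again, assumes the standard column layout (partition in column 3, offset in column 4); verify against your Kafka version before trusting the arithmetic:

```
# Current committed offsets per partition
kafka-consumer-groups.sh --bootstrap-server broker:9092 \
  --describe --group order-processing-v2 \
  | awk '$2 == "orders" {print $3, $4}' | sort > current.txt

# Offsets the reset would move to
kafka-consumer-groups.sh --bootstrap-server broker:9092 \
  --group order-processing-v2 --topic orders \
  --reset-offsets --to-datetime 2026-02-06T01:00:00.000 --dry-run \
  | awk '$2 == "orders" {print $3, $4}' | sort > target.txt

# current minus target = messages each partition would reprocess
join current.txt target.txt \
  | awk '{print "partition", $1, "would reprocess", $2 - $3, "messages"}'
```

That's two more commands and a join, mid-incident, to answer a question the tool could have answered itself.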
You can register a new schema version. If it's incompatible with existing consumers, those consumers start failing. The Schema Registry has a compatibility check endpoint, but it's a POST request with escaped JSON that nobody types correctly on the first try. So most teams skip it and find out the hard way.
The connector config tells you its intended source, but not which topics it has actually touched. For sink connectors consuming from regex topic patterns, the actual topic list lives nowhere accessible.
Every one of these is a place where an investigation stalls because the tools just won't show you what you need.
It's not Kafka. It's the tooling.
Kafka itself is well-designed. It went from internal infrastructure project to the backbone of event-driven architecture across thousands of companies, and the protocol and ecosystem have kept up. The debugging experience hasn't.
The out-of-the-box tools still assume a world where one team manages a handful of topics and debugging means checking if the broker process is healthy. That world ended years ago.
Hard debugging makes incidents longer, which makes teams build more defensive alerts, which creates more noise, which buries the real issues. And when the real issue hits, the team is back in the CLI, trying to remember whether it's --offset earliest or --from-beginning.
(It's --from-beginning. Unless you're using kafka-avro-console-consumer, where it's --from-beginning too, but you also need --property schema.registry.url or it defaults to localhost:8081 and throws an unhelpful TimeoutException when nothing's there. Obviously.)
The cycle doesn't break by adding more monitoring. It breaks when the questions from the first article (what's in this topic, which consumer is stuck, what changed) have answers that any developer can find in under a minute. Without memorizing incantations, without seven CLI tools, without grepping broker logs they may not even have access to.
This is part of a series on Kafka debugging. Previously: Why Every Kafka Incident Ends with "Restart It".
The Conduktor Community Slack is full of engineers swapping war stories about CLI debugging sessions and the workarounds they've cobbled together. Come vent.
Next up: What debugging Kafka should actually look like, and why the answer isn't more dashboards or better CLI skills.