Kafka Cluster Management: Beyond SSH

Managing 10+ Kafka clusters via SSH and scripts wastes senior engineering time. Centralized tooling gives visibility and control at scale.

Stéphane Derosiaux · September 19, 2025

SSH commands don't scale past three clusters.

Managing one Kafka cluster means: SSH to brokers, run kafka-topics.sh to check status, manually adjust configurations, and remember which commands work on which cluster. Managing ten clusters means: repeating these operations ten times, keeping mental context about which cluster is which, and hoping you don't accidentally run production commands in staging.

This operational model wastes senior engineering time. Tasks that should take minutes (check consumer lag across all clusters, verify replication factor consistency) take hours of SSH jumping between brokers and manual aggregation of results.

Real cluster management provides unified control across all clusters: see health scores at a glance, drill down to specific clusters for details, apply configuration changes consistently, and detect drift automatically. A unified control plane makes this possible, so platform teams spend their time managing infrastructure instead of SSHing between brokers.

Why Multiple Clusters

Organizations run multiple Kafka clusters for isolation, compliance, and resilience.

Environment separation (dev, staging, prod) isolates testing from production. Developers experiment in dev without risk of breaking production. Staging mirrors production for realistic testing. Production runs customer-facing workloads with strict SLAs.

A single-cluster model (shared dev/prod) creates risk: a dev experiment consumes broker resources and degrades production performance, or a dev misconfiguration breaks ACLs and accidentally grants production access.

Geographic distribution places clusters near users for low latency. US cluster serves US customers, EU cluster serves EU customers, Asia cluster serves Asian customers. Latency drops from 200ms cross-region to 20ms in-region.

Cross-region replication provides disaster recovery: US cluster fails, fail over to EU cluster. Services continue with acceptable latency until US cluster recovers.

Compliance and data residency require keeping data within jurisdiction. GDPR mandates that EU customer data stays in the EU. HIPAA requires US healthcare data to stay in the US. Separate clusters per region satisfy data sovereignty requirements.

Separation of concerns isolates use cases. High-throughput batch processing gets a dedicated cluster (so it can't impact real-time workloads). Multi-tenant SaaS uses separate clusters per large customer (isolation, security, billing separation).

Result: enterprise organizations run 10-50+ Kafka clusters. Managing them without centralized tooling doesn't scale.

The Multi-Cluster Management Challenge

Managing multiple clusters manually creates operational burden.

Configuration drift happens when clusters start identical but diverge over time. Someone changes retention policy in production but forgets staging. Another engineer enables TLS in staging but production still uses PLAINTEXT.

Drift causes deployment failures: code works in staging (no TLS) but fails in production (requires TLS). Or performance differs between environments because configurations don't match.

Operational repetition wastes time. Checking consumer lag means: SSH to cluster-1, run command, note results. SSH to cluster-2, run command, note results. Repeat for 10 clusters. What should take 30 seconds takes 30 minutes.
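
Here's roughly what the 30-second version looks like: one loop over a cluster inventory instead of ten SSH sessions. This is a minimal sketch using the confluent-kafka Python client; the cluster names, addresses, consumer group, and topic are placeholders, not part of the original setup.

```python
# Minimal sketch: sum committed-offset lag for one consumer group across clusters.
# Assumes the confluent-kafka Python client; names and addresses are placeholders.
from confluent_kafka import Consumer, TopicPartition

CLUSTERS = {
    "prod-us-west": "kafka-usw.example.com:9092",
    "prod-eu-west": "kafka-euw.example.com:9092",
}

def group_lag(bootstrap, group, topic):
    """Return total lag (high watermark minus committed offset) for one group/topic."""
    c = Consumer({"bootstrap.servers": bootstrap, "group.id": group,
                  "enable.auto.commit": False})
    try:
        metadata = c.list_topics(topic, timeout=10)
        partitions = [TopicPartition(topic, p)
                      for p in metadata.topics[topic].partitions]
        committed = c.committed(partitions, timeout=10)
        lag = 0
        for tp in committed:
            _, high = c.get_watermark_offsets(tp, timeout=10)
            if tp.offset >= 0:            # skip partitions with no committed offset yet
                lag += max(high - tp.offset, 0)
        return lag
    finally:
        c.close()

for name, servers in CLUSTERS.items():
    print(name, group_lag(servers, "payments-service", "customer-events"))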

Applying changes is worse: updating a topic configuration across 10 clusters requires 10 manual operations, 10 verification steps, and high error risk (typos, wrong cluster, forgotten clusters).

Lack of visibility means you can't answer "are all clusters healthy?" without checking each individually. No aggregate view, no cross-cluster correlation, no early warning that issues are spreading across infrastructure.

Knowledge concentration happens when only senior engineers know cluster topologies, configurations, and quirks. "SSH to broker-3 in prod-us-west and run X" works when you know which IP is broker-3. New engineers are blocked waiting for seniors to execute routine operations.

Centralized Management Benefits

Unified control plane provides single interface for all clusters.

Aggregate health dashboard shows: all clusters, their status (healthy, degraded, critical), key metrics (total throughput, under-replicated partitions, consumer lag).

At-a-glance view answers: "Are all clusters healthy?" If aggregate status is green, move on. If one cluster is red, drill down for details.

Cross-cluster search finds resources: "Which clusters have a topic named customer-events?" Search once, see results across all clusters. No SSHing to each cluster to run grep commands.
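
A sketch of what that search boils down to, assuming the confluent-kafka Admin API and a placeholder cluster inventory:

```python
# Minimal sketch: find which clusters contain a given topic.
from confluent_kafka.admin import AdminClient

CLUSTERS = {                                   # placeholder inventory
    "prod-us-west": "kafka-usw.example.com:9092",
    "prod-eu-west": "kafka-euw.example.com:9092",
    "staging":      "kafka-stg.example.com:9092",
}

def clusters_with_topic(clusters: dict, topic: str) -> list:
    """Return the cluster names where the topic exists."""
    hits = []
    for name, bootstrap in clusters.items():
        admin = AdminClient({"bootstrap.servers": bootstrap})
        if topic in admin.list_topics(timeout=10).topics:
            hits.append(name)
    return hits

print(clusters_with_topic(CLUSTERS, "customer-events"))
```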

Consistent operations apply changes uniformly: enable TLS across all production clusters, update retention policy for analytics topics everywhere, or rotate certificates cluster-wide.

Single operation replaces 10 manual operations. Lower error rate (consistency), faster execution (parallel), and audit trail (who changed what across which clusters).

Configuration drift detection compares clusters: "Show me configuration differences between prod-us-west and prod-eu-west." Drift is surfaced automatically instead of discovered during incidents.

Remediation: "Make prod-eu-west match prod-us-west" applies the differences in a single operation.
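
A minimal sketch of the detection side, assuming the confluent-kafka Admin API; the addresses and topic name are placeholders:

```python
# Minimal sketch: compare one topic's configuration between two clusters.
from confluent_kafka.admin import AdminClient, ConfigResource

def topic_config(bootstrap: str, topic: str) -> dict:
    admin = AdminClient({"bootstrap.servers": bootstrap})
    resource = ConfigResource(ConfigResource.Type.TOPIC, topic)
    entries = list(admin.describe_configs([resource]).values())[0].result()
    return {name: entry.value for name, entry in entries.items()}

def config_drift(a: dict, b: dict) -> dict:
    """Keys whose values differ, mapped to (cluster_a_value, cluster_b_value)."""
    return {k: (a.get(k), b.get(k)) for k in set(a) | set(b) if a.get(k) != b.get(k)}

us = topic_config("kafka-usw.example.com:9092", "customer-events")
eu = topic_config("kafka-euw.example.com:9092", "customer-events")
print(config_drift(us, eu))   # e.g. {"retention.ms": ("604800000", "259200000")}
```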

Cluster Provisioning and Lifecycle Management

Clusters have lifecycles: provisioning, operation, upgrade, decommission.

Provisioning automation creates clusters from templates. Instead of manually configuring brokers, ZooKeeper/KRaft, security settings, and monitoring, you specify cluster requirements (region, size, security level) and automation provisions the infrastructure.

Templates ensure consistency: all production clusters use the same security settings, same monitoring configuration, same operational standards. No cluster starts as a unique snowflake requiring special operational knowledge.

Version management tracks Kafka versions across clusters. Dashboard shows: 5 clusters on version 3.6, 3 clusters on 3.5, 2 clusters on 3.4 (EOL warning).

This enables coordinated upgrades: plan upgrade cycle targeting EOL clusters first, test in dev, roll out to staging, then production.

Capacity planning predicts when clusters need scaling. Current disk usage, growth rate (GB/day), and retention policies predict: "cluster-prod-us-west will reach 80% disk capacity in 45 days."

Early warning allows proactive scaling before hitting capacity limits. Reactive scaling (adding capacity after hitting limits) causes incidents.
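
The math is simple enough to sketch in a few lines; the numbers below are illustrative and happen to land on the 45-day example above.

```python
# Back-of-the-envelope capacity math; inputs are illustrative.
def days_until_threshold(used_gb: float, capacity_gb: float,
                         growth_gb_per_day: float, threshold: float = 0.80) -> float:
    """Days until disk usage crosses the threshold fraction of total capacity."""
    headroom_gb = capacity_gb * threshold - used_gb
    if growth_gb_per_day <= 0:
        return float("inf")
    return max(headroom_gb, 0) / growth_gb_per_day

# 12 TB used of 20 TB, growing ~88 GB/day -> roughly 45 days to the 80% mark.
print(round(days_until_threshold(12_000, 20_000, 88)))
```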

Decommissioning removes clusters safely. Before deletion, verify: zero active consumers, zero active producers, topics replicated elsewhere if needed. Automated checks prevent accidental deletion of active clusters.
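
A sketch of the consumer-side check, assuming a recent confluent-kafka release that exposes AdminClient.list_consumer_groups(); the producer and replication checks are left as comments:

```python
# Minimal pre-decommission check; the bootstrap address is a placeholder.
from confluent_kafka.admin import AdminClient

def safe_to_decommission(bootstrap: str) -> bool:
    admin = AdminClient({"bootstrap.servers": bootstrap})
    # Lists all consumer groups known to the cluster (no state filtering here).
    groups = admin.list_consumer_groups().result()
    attached = [g.group_id for g in groups.valid]
    if attached:
        print("Consumer groups still attached:", attached)
        return False
    # Producer-activity and topic-replication checks would go here before deletion.
    return True

print(safe_to_decommission("kafka-old.example.com:9092"))
```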

Multi-Cluster Monitoring and Alerting

Health monitoring needs aggregate view and per-cluster drill-down.

Aggregate metrics (and the alerts built on them) show: total throughput across all clusters, worst p99 latency on any cluster, total under-replicated partitions, and consumer lag exceeding SLA on any cluster.

These answer: "Is Kafka infrastructure healthy overall?" Yes or no at a glance, not after checking 10 dashboards.

Per-cluster drill-down shows: which cluster has issues, which brokers within cluster, which topics/partitions affected. Navigate from aggregate problem (something is wrong) to specific issue (broker-5 in cluster-prod-us-east is saturated).

Cross-cluster correlation detects widespread issues. If all clusters experience elevated latency simultaneously, root cause is shared infrastructure (network, cloud provider) not individual cluster issues.

Single-cluster monitoring wouldn't reveal correlation. Centralized monitoring surfaces it immediately.

Alerting hierarchy routes alerts appropriately: critical alerts (under-replicated partitions across multiple clusters) page immediately, high-severity alerts (single-cluster degradation) notify during business hours, and low-severity alerts (resource utilization trends) are logged for analysis.

Alerts include cluster context: "Under-replicated partitions in prod-us-west cluster" vs. "Under-replicated partitions" (which cluster?).
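
A sketch of that routing logic; the severities and destinations are illustrative and not tied to any particular alerting product.

```python
# Minimal routing sketch: severity picks the destination, cluster name travels with the alert.
from dataclasses import dataclass

@dataclass
class Alert:
    cluster: str
    message: str
    severity: str          # "critical" | "high" | "low"

ROUTES = {
    "critical": "page-on-call",          # multi-cluster impact or data-loss risk
    "high":     "notify-business-hours",
    "low":      "log-for-trend-analysis",
}

def route(alert: Alert) -> str:
    destination = ROUTES.get(alert.severity, "log-for-trend-analysis")
    # Always carry the cluster name so the alert is actionable on its own.
    print(f"[{destination}] {alert.cluster}: {alert.message}")
    return destination

route(Alert("prod-us-west", "Under-replicated partitions > 0", "critical"))
```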

Configuration Management

Consistent configuration across clusters prevents drift and ensures reliable deployments.

Configuration as code stores cluster configurations in Git (using Terraform or CLI tooling): broker settings, topic configurations, security settings, monitoring configs. Changes are pull requests, reviews are code reviews, deployment is merge.

Benefits: version control (track changes over time), audit trails (who changed what), rollback capability (revert commit), and validation (CI checks before merge).

Configuration drift detection compares actual cluster state against declared state in Git. If production cluster has settings not in Git (manual changes), drift is flagged for correction.

Automation brings actual state to match declared state (GitOps reconciliation loop).
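
A sketch of the detection half of that loop, assuming the confluent-kafka Admin API, with a plain dict standing in for the Git-tracked file; names and values are illustrative.

```python
# Minimal GitOps-style drift detection: declared (Git) vs. actual (live cluster).
from confluent_kafka.admin import AdminClient, ConfigResource

DECLARED = {"retention.ms": "604800000", "min.insync.replicas": "2"}  # stand-in for Git

def detect_drift(bootstrap: str, topic: str, declared: dict) -> dict:
    admin = AdminClient({"bootstrap.servers": bootstrap})
    resource = ConfigResource(ConfigResource.Type.TOPIC, topic)
    actual = list(admin.describe_configs([resource]).values())[0].result()
    # Keys declared in Git whose live value differs or is missing.
    return {key: (actual[key].value if key in actual else None, value)
            for key, value in declared.items()
            if key not in actual or actual[key].value != value}

drift = detect_drift("kafka-usw.example.com:9092", "customer-events", DECLARED)
if drift:
    print("Drift to reconcile (actual, declared):", drift)
    # The apply step pushes declared values back via whatever tool owns the
    # config (Terraform, a CLI, Conduktor) to close the reconciliation loop.
```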

Environment-specific overrides allow customization while maintaining consistency. Dev clusters use lower replication factors (cost savings), production uses higher (availability). Base config is shared, overrides apply per environment.

This balances consistency (same operational standards) with flexibility (environment-appropriate settings).
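
A sketch of how base-plus-override merging might look; the keys are real topic configs but the values and environment names are illustrative.

```python
# Minimal sketch: shared base settings with per-environment overrides.
BASE = {"retention.ms": "604800000", "min.insync.replicas": "2",
        "cleanup.policy": "delete"}

OVERRIDES = {
    "dev":  {"retention.ms": "86400000", "min.insync.replicas": "1"},  # cost savings
    "prod": {},                                                        # inherit base
}

def effective_config(env: str) -> dict:
    # Later dict wins, so environment overrides take precedence over the base.
    return {**BASE, **OVERRIDES.get(env, {})}

print(effective_config("dev"))
```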

Security and Access Control

Multi-cluster security requires consistent policy enforcement.

Unified RBAC assigns permissions across clusters: a Developer role grants topic creation in dev/staging clusters but read-only access in production. Permissions apply consistently, with no per-cluster permission management.

Certificate management rotates TLS certificates cluster-wide. Certificates expiring in 30 days trigger renewal across all clusters. Centralized tracking prevents outages from forgotten certificate expirations.
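
A sketch of the expiry check itself, using only the Python standard library; the host map, the TLS listener port (9093), and a publicly trusted broker certificate are assumptions about your environment.

```python
# Minimal stdlib-only check: days until a broker's TLS certificate expires.
import socket
import ssl
import time

def days_until_expiry(host: str, port: int = 9093) -> float:
    # Assumes the broker certificate chains to a CA trusted by the local system.
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            not_after = tls.getpeercert()["notAfter"]
    return (ssl.cert_time_to_seconds(not_after) - time.time()) / 86400

for cluster, host in {"prod-us-west": "kafka-usw.example.com"}.items():  # placeholder
    remaining = days_until_expiry(host)
    if remaining < 30:
        print(f"{cluster}: broker certificate expires in {remaining:.0f} days, renew now")
```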

Audit log aggregation consolidates security events: authentication failures, authorization denials, admin operations across all clusters. Security teams search once instead of checking logs per cluster.

Cross-cluster correlation reveals: same attacker probing multiple clusters, credential theft spreading across infrastructure, or policy violations happening organization-wide.

Measuring Management Efficiency

Track operational efficiency through: time spent on routine operations, number of clusters per platform engineer, incident response time.

Time on routine operations measures engineering hours spent on: checking status, applying configurations, investigating issues. Target: under 20% of platform team time on routine operations.

If engineers spend 60% of time SSHing between clusters, automation hasn't scaled. If under 10%, automation is effective and engineers focus on strategic work.

Clusters per platform engineer measures scaling efficiency. How many clusters can one engineer effectively manage?

Without automation: 2-3 clusters per engineer (full-time firefighting and manual operations). With automation: 10-20 clusters per engineer (routine operations are self-service or automated).

Incident response time measures MTTR (mean time to resolution). Centralized monitoring and management reduces MTTR by surfacing root causes faster and enabling quick remediation across affected clusters.

The Path Forward

Kafka cluster management scales through centralized control planes (single interface for all clusters), configuration management (drift detection and remediation), automated provisioning (consistent cluster creation), and unified monitoring (aggregate health with drill-down).

Conduktor provides unified multi-cluster management, configuration drift detection, centralized monitoring, and GitOps integration. Platform teams manage 20+ clusters without SSH jumping between brokers.

If your cluster management strategy is SSH and tribal knowledge, the problem isn't Kafka—it's operational tooling that hasn't scaled with your infrastructure.


Related: Kafka Control Plane · Multi-Cloud Management · Kafka DR