Strimzi on Kubernetes: From Zero to Production Kafka

Deploy production-grade Kafka on Kubernetes with Strimzi. Separate node pools, KRaft mode, and the configuration choices that matter.

Stéphane Derosiaux · April 19, 2025

Running Kafka on Kubernetes used to be a bad idea. Stateful workloads and container orchestration didn't mix well. Strimzi changed that.

I've deployed Strimzi clusters across AWS, GCP, and on-prem environments. The operator automates rolling upgrades and certificate rotation, and applies rack awareness once you tell it which node label identifies a zone. You declare what you want, and Strimzi makes it happen.

We moved to Strimzi because we wanted Kafka to be as declarative as the rest of our infrastructure. GitOps for Kafka wasn't possible before.

Platform Engineer at a Fortune 500 retailer

Install the Operator

kubectl create namespace kafka
helm repo add strimzi https://strimzi.io/charts/
helm install strimzi strimzi/strimzi-kafka-operator \
  --namespace kafka --set replicas=2

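Setting replicas=2 runs a standby operator replica. Leader election (enabled by default in the chart) keeps only one replica reconciling at a time, and the standby takes over if the leader's pod or node fails.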
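If the operator itself is managed through GitOps, the same settings can live in a values file instead of --set flags. A minimal sketch, assuming the chart's replicas, leaderElection.enable, and watchNamespaces values; check the values.yaml of the chart version you install, and treat the file name as hypothetical:

```yaml
# operator-values.yaml (hypothetical file name)
# Install with: helm install strimzi strimzi/strimzi-kafka-operator -n kafka -f operator-values.yaml
replicas: 2            # one active operator, one standby
leaderElection:
  enable: true         # only the elected leader reconciles; needed when replicas > 1
watchNamespaces: []    # no additional namespaces: watch only the release namespace (kafka)
```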

The Two Resources That Matter

Strimzi 0.46+ runs Kafka in KRaft mode. No ZooKeeper. Two resources define your cluster:

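| Resource | Purpose |
|---|---|
| Kafka | Cluster-wide config: listeners, security, Entity Operator |
| KafkaNodePool | Node groups: replicas, storage, roles, resources |

For production, always separate controllers and brokers. Controllers shouldn't compete with broker workloads.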
For production, always separate controllers and brokers. Controllers shouldn't compete with broker workloads.

Production Configuration

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaNodePool
metadata:
  name: controllers
  labels:
    strimzi.io/cluster: prod-cluster
spec:
  replicas: 3
  roles: [controller]
  storage:
    type: jbod
    volumes:
      - id: 0
        type: persistent-claim
        size: 10Gi
        class: fast-ssd
        deleteClaim: false
  resources:
    requests: { memory: 2Gi, cpu: "1" }
---
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaNodePool
metadata:
  name: brokers
  labels:
    strimzi.io/cluster: prod-cluster
spec:
  replicas: 3
  roles: [broker]
  storage:
    type: jbod
    volumes:
      - id: 0
        type: persistent-claim
        size: 500Gi
        class: fast-ssd
        deleteClaim: false
  resources:
    requests: { memory: 8Gi, cpu: "2" }
  jvmOptions:
    -Xms: 4096m
    -Xmx: 4096m

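JVM heap rule: set the heap to 25-50% of container memory; the 4096m heap above is 50% of the 8Gi request. Kafka relies on the OS page cache for reads and writes, so the memory left outside the heap is not wasted.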
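The node pools attach, through their strimzi.io/cluster label, to a Kafka resource that isn't shown above. A minimal sketch of it for prod-cluster, modeled on Strimzi's v1beta2 KRaft examples. The internal SCRAM listener, the replication settings, and the zone label used for rack awareness are assumptions to adapt; the external listener from the next section would go in the same listeners list.

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: prod-cluster                 # must match strimzi.io/cluster on both node pools
  namespace: kafka
  annotations:
    strimzi.io/node-pools: enabled   # use KafkaNodePool resources for nodes
    strimzi.io/kraft: enabled        # KRaft mode, as in Strimzi's v1beta2 examples
spec:
  kafka:
    version: 3.9.0
    rack:
      topologyKey: topology.kubernetes.io/zone   # assumes nodes carry the standard zone label
    listeners:
      - name: tls
        port: 9093
        type: internal
        tls: true
        authentication:
          type: scram-sha-512
    authorization:
      type: simple                   # required for KafkaUser ACLs to be enforced
    config:
      default.replication.factor: 3
      min.insync.replicas: 2
      offsets.topic.replication.factor: 3
      transaction.state.log.replication.factor: 3
      transaction.state.log.min.isr: 2
  entityOperator:
    topicOperator: {}                # manages KafkaTopic resources
    userOperator: {}                 # manages KafkaUser resources
```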

Storage: The Critical Decision

Kafka requires low-latency block storage. NFS and EFS are not recommended due to high latency and potential consistency issues under load.

| Cloud | Storage class / disk type |
|---|---|
| AWS | EBS gp3 or io2 |
| GCP | pd-ssd |
| Azure | managed-premium |
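Always keep deleteClaim: false on every persistent volume. It is the default, but setting it explicitly makes the intent visible and guarantees that deleting the cluster (kubectl delete kafka prod-cluster -n kafka) doesn't also delete the PersistentVolumeClaims holding your data.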
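The node pools reference a storage class named fast-ssd, which you create yourself. A sketch for AWS, assuming the EBS CSI driver (provisioner ebs.csi.aws.com) is installed; Retain adds a second safety net on top of deleteClaim: false:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com
parameters:
  type: gp3                               # EBS volume type
volumeBindingMode: WaitForFirstConsumer   # provision the volume in the zone where the pod is scheduled
allowVolumeExpansion: true                # lets you grow a node pool's volume size later
reclaimPolicy: Retain                     # keep the EBS volume even if its PVC is deleted
```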

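External Access

To reach the brokers from outside Kubernetes, add a listener to spec.kafka.listeners in the Kafka resource: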

listeners:
  - name: external
    port: 9094
    type: loadbalancer
    tls: true
    authentication:
      type: scram-sha-512

Cost note: with type: loadbalancer, a 3-broker cluster gets 4 load balancers (1 bootstrap plus 1 per broker), roughly $60/month on AWS in hourly charges before traffic costs. For cost-sensitive environments, use a nodeport listener instead.
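Switching the same listener to NodePort exposes each broker on a port of the worker nodes instead of a cloud load balancer. A sketch; pinning the bootstrap port is optional, 32100 is an arbitrary choice, and clients must be able to route to the node addresses Strimzi advertises:

```yaml
listeners:
  - name: external
    port: 9094
    type: nodeport
    tls: true
    authentication:
      type: scram-sha-512
    configuration:
      bootstrap:
        nodePort: 32100   # hypothetical; must be inside the cluster's NodePort range (default 30000-32767)
```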

User and Topic Management

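The Entity Operator (its Topic Operator and User Operator) manages topics and users as Kubernetes resources. ACLs in a KafkaUser are only enforced if the Kafka resource enables authorization (spec.kafka.authorization.type: simple). A producer with write access to one topic: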

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaUser
metadata:
  name: app-producer
  labels:
    strimzi.io/cluster: prod-cluster
spec:
  authentication:
    type: scram-sha-512
  authorization:
    type: simple
    acls:
      - resource: { type: topic, name: events, patternType: literal }
        operations: [Write, Describe]

Get the password, which the User Operator stores in a Secret named after the user:

kubectl get secret app-producer -n kafka -o jsonpath='{.data.password}' | base64 -d
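Topics follow the same pattern. A sketch of a KafkaTopic for the events topic that the ACL above grants access to; the partition count and retention are illustrative assumptions:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: events
  namespace: kafka
  labels:
    strimzi.io/cluster: prod-cluster   # must match the Kafka resource name
spec:
  partitions: 12                       # illustrative; size for consumer parallelism
  replicas: 3                          # one replica per broker in the 3-broker pool
  config:
    retention.ms: 604800000            # 7 days
    min.insync.replicas: 2             # with acks=all, writes survive one broker down
```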

Monitoring
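The metric names in the table below are the ones produced by Strimzi's example Prometheus JMX Exporter rules, so they only exist once metrics are enabled on the Kafka resource. A sketch, assuming a ConfigMap named kafka-metrics with a kafka-metrics-config.yml key (the names used in Strimzi's example metrics files); Prometheus still has to scrape the pods, for example through a PodMonitor:

```yaml
spec:
  kafka:
    metricsConfig:
      type: jmxPrometheusExporter
      valueFrom:
        configMapKeyRef:
          name: kafka-metrics              # ConfigMap holding the exporter rules
          key: kafka-metrics-config.yml
```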

Key metrics to alert on:

| Metric | Critical threshold |
|---|---|
| kafka_server_replicamanager_underreplicatedpartitions | > 0 for 5 minutes |
| kafka_controller_kafkacontroller_offlinepartitionscount | > 0 |
Under-replicated partitions is the single most important health indicator. Unified cluster visibility across Strimzi, MSK, and Confluent environments makes monitoring multiple clusters practical.
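With the Prometheus Operator, the two thresholds translate into a PrometheusRule. A sketch; the role: alert-rules label is an assumption and must match your Prometheus ruleSelector:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kafka-alerts
  namespace: kafka
  labels:
    role: alert-rules        # assumption: must match Prometheus ruleSelector
spec:
  groups:
    - name: kafka
      rules:
        - alert: KafkaUnderReplicatedPartitions
          expr: sum(kafka_server_replicamanager_underreplicatedpartitions) > 0
          for: 5m            # matches the "> 0 for 5 minutes" threshold
          labels:
            severity: critical
          annotations:
            summary: Kafka has under-replicated partitions
        - alert: KafkaOfflinePartitions
          expr: sum(kafka_controller_kafkacontroller_offlinepartitionscount) > 0
          labels:
            severity: critical
          annotations:
            summary: Kafka has offline partitions
```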

Upgrades

Kafka version upgrades are one YAML change:

spec:
  kafka:
    version: 3.9.0  # Changed from 3.8.0

The operator handles rolling upgrades: brokers first, then follower controllers, then the active controller last. Upgrade Strimzi operator first, then Kafka.
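In KRaft mode there is a second knob: the cluster metadata version, exposed by Strimzi as spec.kafka.metadataVersion. A sketch of a cautious upgrade that keeps it pinned while the new binaries roll out and raises it as a separate step; the values are illustrative. Raising the metadata version generally can't be undone, which is why it is worth doing only after the cluster has proven healthy.

```yaml
spec:
  kafka:
    version: 3.9.0             # step 1: roll every node onto the new Kafka version
    metadataVersion: 3.8-IV0   # step 2 (later): change to 3.9-IV0 once the cluster is healthy
```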

Pod Disruption Budget

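Kubernetes cluster upgrades drain nodes, which can evict several Kafka pods at once. Strimzi already creates a PodDisruptionBudget for the cluster's Kafka pods with maxUnavailable: 1 by default. To set it explicitly or change it, configure it in the Kafka resource; it is a cluster-wide setting, not a KafkaNodePool one:

spec:
  kafka:
    template:
      podDisruptionBudget:
        maxUnavailable: 1

With at most one Kafka pod allowed to be unavailable, a node drain blocks further evictions until the evicted pod is rescheduled and ready again.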
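A PodDisruptionBudget only throttles evictions; it also helps if no two brokers share a Kubernetes node in the first place. A sketch of a fragment for the brokers KafkaNodePool, assuming the strimzi.io/pool-name label that Strimzi sets on node pool pods. Because the rule is required rather than preferred, it needs at least as many schedulable nodes as broker replicas:

```yaml
# Fragment to merge into the brokers KafkaNodePool
spec:
  template:
    pod:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  strimzi.io/cluster: prod-cluster
                  strimzi.io/pool-name: brokers
              topologyKey: kubernetes.io/hostname   # at most one broker per node
```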

Common Issues

Pods stuck in Pending: Storage class doesn't exist or can't provision. Check kubectl get pvc -n kafka.

Connection refused from external clients: Verify LoadBalancer provisioning with kubectl get svc -n kafka.

Topics not created: Check Entity Operator logs and verify strimzi.io/cluster label matches.

Strimzi makes Kafka deployment repeatable and version-controlled. The operational complexity of rolling upgrades, certificate rotation, and config changes is managed by software that's better at it than manual runbooks.
