# Strimzi on Kubernetes: From Zero to Production Kafka

Running Kafka on Kubernetes used to be a bad idea. Stateful workloads and container orchestration didn't mix well. Strimzi changed that.

I've deployed Strimzi clusters across AWS, GCP, and on-prem environments. The operator handles rolling upgrades, certificate rotation, and rack awareness automatically. You declare what you want, and Strimzi makes it happen.

> *We moved to Strimzi because we wanted Kafka to be as declarative as the rest of our infrastructure. GitOps for Kafka wasn't possible before.*
>
> *Platform Engineer at a Fortune 500 retailer*

## Install the Operator

```bash
kubectl create namespace kafka
helm repo add strimzi https://strimzi.io/charts/
helm install strimzi strimzi/strimzi-kafka-operator \
  --namespace kafka --set replicas=2
```

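Setting `replicas=2` runs two operator pods for high availability: one acts as the leader, and the other is a standby that takes over through leader election if the leader fails.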

## The Two Resources That Matter

Strimzi 0.46+ runs Kafka in KRaft mode. No ZooKeeper. Two resources define your cluster:

| Resource | Purpose |
|----------|---------|
| `Kafka` | Cluster-wide config: listeners, security, entity operator |
| `KafkaNodePool` | Node groups: replicas, storage, roles, resources |

For production, always separate controllers and brokers. Controllers shouldn't compete with broker workloads.
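
The node pools are shown in the next section, but the `Kafka` resource they attach to is easy to overlook. Here is a minimal sketch of what it might look like. The listener name, the replication settings, and the Kafka version are assumptions for illustration; only the cluster name `prod-cluster` comes from the node pool labels used in this guide.

```yaml
# Sketch of the cluster-wide resource. Values are illustrative, not prescriptive.
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: prod-cluster          # must match the strimzi.io/cluster label on each KafkaNodePool
  namespace: kafka
spec:
  kafka:
    version: 3.9.0            # assumed; use a version your operator release supports
    listeners:
      - name: tls             # hypothetical internal listener for in-cluster clients
        port: 9093
        type: internal
        tls: true
        authentication:
          type: scram-sha-512
    authorization:
      type: simple            # required for KafkaUser ACLs to be enforced
    config:
      default.replication.factor: 3
      min.insync.replicas: 2
      offsets.topic.replication.factor: 3
      transaction.state.log.replication.factor: 3
      transaction.state.log.min.isr: 2
  entityOperator:             # enables topic and user management via custom resources
    topicOperator: {}
    userOperator: {}
```

Replicas, storage, and roles are deliberately absent here: with node pools they live in the `KafkaNodePool` resources instead.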

## Production Configuration

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaNodePool
metadata:
  name: controllers
  labels:
    strimzi.io/cluster: prod-cluster
spec:
  replicas: 3
  roles: [controller]
  storage:
    type: jbod
    volumes:
      - id: 0
        type: persistent-claim
        size: 10Gi
        class: fast-ssd
  resources:
    requests: { memory: 2Gi, cpu: "1" }
---
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaNodePool
metadata:
  name: brokers
  labels:
    strimzi.io/cluster: prod-cluster
spec:
  replicas: 3
  roles: [broker]
  storage:
    type: jbod
    volumes:
      - id: 0
        type: persistent-claim
        size: 500Gi
        class: fast-ssd
  resources:
    requests: { memory: 8Gi, cpu: "2" }
  jvmOptions:
    -Xms: 4096m
    -Xmx: 4096m
```

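**JVM heap rule:** Set heap to 25-50% of container memory and leave the rest to the OS page cache, which Kafka relies on for reads and writes. The brokers above follow this: a 4 GiB heap against an 8Gi memory request.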

## Storage: The Critical Decision

Kafka requires low-latency block storage. NFS and EFS are not recommended due to high latency and potential consistency issues under load.

| Cloud | Storage Class |
|-------|--------------|
| AWS | gp3, io2 |
| GCP | pd-ssd |
| Azure | managed-premium |
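
The node pools in this guide reference a storage class named `fast-ssd`. As a sketch of how that name could map onto one of the disk types above, here is a possible definition for AWS gp3. It assumes the AWS EBS CSI driver is installed in the cluster; the binding mode and expansion settings are my choices, not requirements.

```yaml
# Hypothetical StorageClass backing the "fast-ssd" name used by the node pools.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com              # assumes the AWS EBS CSI driver
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer   # provision the volume in the zone where the pod is scheduled
allowVolumeExpansion: true                # lets you grow broker disks later by editing the node pool
```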

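Always keep `deleteClaim: false` on every volume. It is the default, but stating it explicitly guards against copy-pasted examples that set it to `true`. You don't want `kubectl delete kafka` to wipe your data.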
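
For reference, this is where the flag sits. The fragment below reuses the broker volume from the production configuration and only adds the `deleteClaim` line.

```yaml
# Fragment of spec.storage in a KafkaNodePool.
storage:
  type: jbod
  volumes:
    - id: 0
      type: persistent-claim
      size: 500Gi
      class: fast-ssd
      deleteClaim: false   # keep the PersistentVolumeClaim when the cluster or pool is deleted
```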

## External Access

Add the external listener under `spec.kafka.listeners` in the `Kafka` resource:

```yaml
listeners:
  - name: external
    port: 9094
    type: loadbalancer
    tls: true
    authentication:
      type: scram-sha-512
```

**Cost note:** A 3-broker cluster creates 4 load balancers (1 bootstrap + 1 per broker). On AWS that comes to roughly $60 or more per month in hourly charges alone, before data processing fees. Use NodePort for cost-sensitive environments.
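
A NodePort variant of the same listener might look like the sketch below. Only the `type` changes. It assumes your clients can reach the Kubernetes node addresses directly, which is often not the case for clients outside the VPC.

```yaml
# Fragment of spec.kafka in the Kafka resource: same listener, no cloud load balancers.
listeners:
  - name: external
    port: 9094
    type: nodeport        # exposes one NodePort service for bootstrap plus one per broker
    tls: true
    authentication:
      type: scram-sha-512
```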

## User and Topic Management

The Entity Operator manages topics and users as Kubernetes resources:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaUser
metadata:
  name: app-producer
  labels:
    strimzi.io/cluster: prod-cluster
spec:
  authentication:
    type: scram-sha-512
  authorization:
    type: simple
    acls:
      - resource: { type: topic, name: events, patternType: literal }
        operations: [Write, Describe]
```

Get the password: `kubectl get secret app-producer -n kafka -o jsonpath='{.data.password}' | base64 -d`
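
The ACL above grants access to a topic named `events`, so that topic has to exist. A sketch of the matching `KafkaTopic` follows. The partition count and retention are placeholder values chosen for illustration; size them for your own workload.

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: events
  namespace: kafka
  labels:
    strimzi.io/cluster: prod-cluster   # same label the KafkaUser carries
spec:
  partitions: 12                       # assumed value
  replicas: 3
  config:
    min.insync.replicas: 2
    retention.ms: 604800000            # 7 days, assumed value
```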

## Monitoring

Key metrics to alert on:

| Metric | Critical Threshold |
|--------|-------------------|
| `kafka_server_replicamanager_underreplicatedpartitions` | > 0 for 5 minutes |
| `kafka_controller_kafkacontroller_offlinepartitionscount` | > 0 |

Under-replicated partitions is the single most important health indicator. [Unified cluster visibility](https://docs.conduktor.io/guide) across Strimzi, MSK, and Confluent environments makes monitoring multiple clusters practical.
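
If you run the Prometheus Operator, the two thresholds in the table could be expressed roughly as follows. This is a sketch under assumptions: the metric names must match what your Kafka metrics configuration actually exports, and the rule name and `role` label are hypothetical and need to match your Prometheus `ruleSelector`.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kafka-critical-alerts      # hypothetical name
  namespace: kafka
  labels:
    role: alert-rules              # hypothetical; match your Prometheus ruleSelector
spec:
  groups:
    - name: kafka
      rules:
        - alert: KafkaUnderReplicatedPartitions
          expr: sum(kafka_server_replicamanager_underreplicatedpartitions) > 0
          for: 5m                  # tolerate brief dips during rolling restarts
          labels:
            severity: critical
          annotations:
            summary: Kafka has under-replicated partitions
        - alert: KafkaOfflinePartitions
          expr: sum(kafka_controller_kafkacontroller_offlinepartitionscount) > 0
          labels:
            severity: critical
          annotations:
            summary: Kafka has offline partitions
```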

## Upgrades

Kafka version upgrades are one YAML change:

```yaml
spec:
  kafka:
    version: 3.9.0  # Changed from 3.8.0
```

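The operator handles the rolling upgrade one pod at a time: controller-only nodes first, with the active controller restarted last among them, then the brokers. Upgrade the Strimzi operator first, then Kafka, and confirm that the target Kafka version is supported by the operator release you run.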
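
In KRaft mode the metadata version is upgraded separately from the broker binaries. One cautious approach, sketched below, is to pin `metadataVersion` to the old release during the rollout and raise it in a later commit. The version strings assume the 3.8.0 to 3.9.0 upgrade shown above; check the Strimzi documentation for the values that apply to your versions.

```yaml
# Step 1: roll out the new binaries while keeping the previous metadata version.
spec:
  kafka:
    version: 3.9.0
    metadataVersion: 3.8-IV0
# Step 2 (a later change, once the cluster is healthy): set metadataVersion to 3.9-IV0.
```

Keeping the two steps apart gives you a window to verify the new brokers before committing to a metadata format that is hard to roll back.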

## Pod Disruption Budget

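Kubernetes cluster upgrades drain nodes, which without a disruption budget can evict multiple brokers simultaneously. Strimzi creates a PodDisruptionBudget for the Kafka pods with `maxUnavailable: 1` by default. To set it explicitly, configure it in the `Kafka` resource (the `KafkaNodePool` template has no such field):

```yaml
spec:
  kafka:
    template:
      podDisruptionBudget:
        maxUnavailable: 1
```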

This ensures node drains wait for one broker to restart before evicting the next.

## Common Issues

**Pods stuck in Pending:** Storage class doesn't exist or can't provision. Check `kubectl get pvc -n kafka`.

**Connection refused from external clients:** Verify LoadBalancer provisioning with `kubectl get svc -n kafka`.

**Topics not created:** Check Entity Operator logs and verify `strimzi.io/cluster` label matches.

Strimzi makes Kafka deployment repeatable and version-controlled. The operational complexity of rolling upgrades, certificate rotation, and config changes is managed by software that's better at it than manual runbooks.

[Book a demo](https://www.conduktor.io/contact/demo) to see how Conduktor Console provides unified visibility across your Strimzi clusters, MSK, and Confluent environments.
