Disk Full: Emergency Recovery When Kafka Runs Out of Space

Emergency runbook for Kafka disk-full scenarios: immediate triage commands, safe segment deletion, recovery steps, and retention tuning to prevent recurrence.

Stéphane Derosiaux · January 17, 2024

Your broker just crashed with java.io.IOException: No space left on device. The logs show Exit.halt(1). Kafka didn't gracefully shut down—it terminated immediately, skipping shutdown hooks entirely.

I've been paged for this exact scenario more times than I'd like to admit. The panic is real, but the fix is straightforward if you work through it systematically.

Our disk-full incident turned into a 4-hour outage because we didn't have a runbook. Now we drill this quarterly.

SRE at a payments company

Assess First (2 Minutes)

Before touching anything, understand the scope.

# From any healthy broker
kafka-broker-api-versions.sh --bootstrap-server kafka1:9092,kafka2:9092,kafka3:9092
# Timeout = broker is down

SSH to the affected broker:

df -h
# /dev/sda1       500G  500G    0  100% /var/kafka

du -sh /var/kafka/* | sort -rh | head -10
# 180G    /var/kafka/data/high-volume-topic-0

Decision point: If only one broker is down and the replication factor is >= 2, your cluster is still serving traffic. You have time.
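Before choosing a recovery path, you can confirm from a healthy broker that the affected partitions still have in-sync replicas elsewhere (the bootstrap address and topic name here are placeholders for your cluster):

```shell
kafka-topics.sh --bootstrap-server kafka1:9092 --describe --topic high-volume-topic
# The "Isr:" column lists the brokers still in sync. If it shows replicas
# other than the down broker, those partitions are still being served.
```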

Free Space Immediately

Pick the fastest option for your situation.

Option A: Delete Old Segments (Fastest, Riskiest)

⚠️ CRITICAL: Stop the broker first. Deleting segment files while Kafka is running causes immediate data corruption and broker crashes. Always run kafka-server-stop.sh before proceeding.

# ONLY after broker is stopped - verify with: ps aux | grep kafka
find /var/kafka/data -name "*.log" -mtime +7 -type f -delete
find /var/kafka/data -name "*.index" -mtime +7 -type f -delete
find /var/kafka/data -name "*.timeindex" -mtime +7 -type f -delete

Never delete the active segment (newest .log file in each partition). Deleting it corrupts the partition.
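To see which file is the active segment in each partition directory, a quick sketch (assuming the data directory shown in the earlier df output; adjust the path for your layout):

```shell
# Print the newest .log file (the active segment) in each partition
# directory, so you know which files to exclude from manual deletion.
for dir in /var/kafka/data/*/; do
  ls -t "$dir"*.log 2>/dev/null | head -1
done
```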

Option B: Reduce Retention Dynamically (Safer)

You can also adjust topic retention settings through Conduktor Console's UI.

kafka-configs.sh --bootstrap-server kafka2:9092 \
  --alter --entity-type topics --entity-name high-volume-topic \
  --add-config retention.ms=3600000,retention.bytes=10737418240

This sets 1-hour retention and a 10 GB cap per partition. The retention check runs every 5 minutes by default (log.retention.check.interval.ms).

Setting            Emergency             Normal
retention.ms       3600000 (1h)          604800000 (7d)
retention.bytes    10737418240 (10 GB)   -1 (unlimited)
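Once the broker is healthy again, remove the emergency overrides so the topic reverts to the cluster defaults:

```shell
kafka-configs.sh --bootstrap-server kafka2:9092 \
  --alter --entity-type topics --entity-name high-volume-topic \
  --delete-config retention.ms,retention.bytes
```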

Option C: Expand Disk (Cloud)

# AWS EBS
aws ec2 modify-volume --volume-id vol-xxxx --size 1000
sudo growpart /dev/xvda 1
sudo resize2fs /dev/xvda1

Restart the Broker

Once you have 10-20% free space:

kafka-server-start.sh -daemon /etc/kafka/server.properties
tail -f /var/log/kafka/server.log

Common Startup Failures

Corrupt index files:

ERROR Found a corrupted index file /var/kafka/data/my-topic-0/00000000000012345.index

Delete the corrupt indexes. Kafka rebuilds them:

rm /var/kafka/data/my-topic-0/00000000000012345.index
rm /var/kafka/data/my-topic-0/00000000000012345.timeindex

Empty snapshot files:

find /var/kafka/data -name "*.snapshot" -size 0 -delete

All log dirs failed (JBOD): Temporarily exclude the failed disk in server.properties:

# Original: log.dirs=/data1/kafka,/data2/kafka,/data3/kafka
# Temporary: log.dirs=/data1/kafka,/data2/kafka

Partitions on the excluded disk become under-replicated. Reassign them after recovery.
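Once the disk is back in service, the displaced replicas can be moved back with the reassignment tool. A sketch, assuming a recent Kafka version with --bootstrap-server support (the topics.json content and broker IDs are illustrative):

```shell
cat > /tmp/topics.json <<'EOF'
{"version": 1, "topics": [{"topic": "high-volume-topic"}]}
EOF
kafka-reassign-partitions.sh --bootstrap-server kafka1:9092 \
  --topics-to-move-json-file /tmp/topics.json \
  --broker-list "1,2,3" --generate
# Review the proposed plan it prints, save it to a file,
# then re-run with --reassignment-json-file <file> --execute
```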

Verify Recovery

# Check for under-replicated partitions
kafka-topics.sh --bootstrap-server kafka1:9092 --describe --under-replicated-partitions
# Output should be empty once caught up

Recovery time depends on data volume. 100 GB at 100 MB/s network = ~17 minutes per replica.
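The estimate above is simple integer arithmetic, easy to sanity-check in the shell (the numbers are the example figures, not measurements):

```shell
# 100 GB ≈ 100 * 1024 MB; at 100 MB/s that is 1024 s ≈ 17 minutes
echo "$(( 100 * 1024 / 100 / 60 )) minutes"
```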

Prevent Recurrence

Configure disk usage alerts to catch problems before they become emergencies.

Metric                      Warning    Critical
Disk usage %                70%        85%
OfflineLogDirectoryCount    > 0        > 0

Set size-based retention as a backstop:

log.retention.bytes=107374182400       # 100 GB per partition
log.retention.check.interval.ms=60000  # Check every minute
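A minimal cron-able check along these lines can bridge the gap until proper monitoring is wired up (the threshold, mount path, and GNU df are assumptions; replace the echo with your real alerting hook):

```shell
# Warn when the Kafka data volume crosses the 70% warning threshold.
usage=$(df --output=pcent /var/kafka 2>/dev/null | tail -1 | tr -dc '0-9')
if [ "${usage:-0}" -ge 70 ]; then
  echo "WARN: /var/kafka at ${usage}% disk usage"
fi
```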

Disk full is recoverable if you have replication. Without replicas, you lose data. The real fix is alerting at 70%, not recovering at 100%.

Book a demo to see how Conduktor Console monitors disk usage across all your Kafka clusters.