# Disk Full: Emergency Recovery When Kafka Runs Out of Space

Your broker just crashed with `java.io.IOException: No space left on device`. The logs show `Exit.halt(1)`. Kafka didn't shut down gracefully; it halted immediately, skipping shutdown hooks entirely.

I've been paged for this exact scenario more times than I'd like to admit. The panic is real, but the fix is straightforward if you work through it systematically.

> *Our disk-full incident turned into a 4-hour outage because we didn't have a runbook. Now we drill this quarterly.*
>
> *SRE at a payments company*

## Assess First (2 Minutes)

Before touching anything, understand the scope.

```bash
# From any healthy broker
kafka-broker-api-versions.sh --bootstrap-server kafka1:9092,kafka2:9092,kafka3:9092
# Timeout = broker is down
```

SSH to the affected broker:

```bash
df -h
# /dev/sda1       500G  500G    0  100% /var/kafka

du -sh /var/kafka/data/* | sort -rh | head -10
# 180G    /var/kafka/data/high-volume-topic-0
```
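
Per-partition sizes can hide which topic is actually filling the disk. Here is a minimal sketch that sums partition directories by topic, assuming the `/var/kafka/data` layout shown above and the usual `<topic>-<partition>` directory naming. `topic_sizes_kb` is a hypothetical helper name, not a Kafka tool.

```bash
# topic_sizes_kb DATA_DIR (hypothetical helper)
# Prints "<kilobytes> <topic>" for each topic, largest first.
topic_sizes_kb() {
  local data_dir=$1
  du -sk "$data_dir"/*/ 2>/dev/null |
    awk '{
      path = $2
      sub(/\/+$/, "", path)      # drop the trailing slash
      sub(/^.*\//, "", path)     # keep only the directory name
      sub(/-[0-9]+$/, "", path)  # strip the partition number
      total[path] += $1
    }
    END { for (t in total) print total[t], t }' |
    sort -rn
}

# Example: topic_sizes_kb /var/kafka/data | head -5
```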

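**Decision point:** If only one broker is down and replication factor >= 2, your cluster is still serving traffic, provided the remaining replicas still satisfy `min.insync.replicas` for producers that use `acks=all`. You have time.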

## Free Space Immediately

Pick the fastest option for your situation.

### Option A: Delete Old Segments (Fastest, Most Risk)

**⚠️ CRITICAL: Stop the broker first.** Deleting segment files while Kafka is running causes immediate data corruption and broker crashes. Always run `kafka-server-stop.sh` before proceeding.

```bash
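# ONLY after the broker is stopped. Verify nothing is left running: pgrep -f 'kafka\.Kafka'
# -mtime cannot tell a quiet partition's active segment from an old one,
# so run each command without -delete first and review the matches.
find /var/kafka/data -name "*.log" -mtime +7 -type f -delete
find /var/kafka/data -name "*.index" -mtime +7 -type f -delete
find /var/kafka/data -name "*.timeindex" -mtime +7 -type f -delete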
```

**Never delete the active segment** (newest `.log` file in each partition). Deleting it corrupts the partition.
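
One way to respect that rule is to list candidates per partition and skip the newest segment explicitly. The sketch below assumes the `/var/kafka/data` layout used in this post and relies on segment file names being zero-padded base offsets, so the last name in sorted order is the active segment. `list_old_segments` is a hypothetical helper. It only prints paths, and it still trusts modification times, so review the list before deleting anything.

```bash
# list_old_segments DATA_DIR DAYS (hypothetical helper)
# Prints .log segments older than DAYS days, excluding the active
# (highest base offset) segment in every partition directory.
list_old_segments() {
  local data_dir=$1 days=$2 dir active
  for dir in "$data_dir"/*/; do
    dir=${dir%/}
    [ -d "$dir" ] || continue
    # Newest segment = last file name in sorted order.
    active=$(find "$dir" -maxdepth 1 -type f -name '*.log' | sort | tail -n 1)
    [ -n "$active" ] || continue
    find "$dir" -maxdepth 1 -type f -name '*.log' -mtime +"$days" ! -path "$active"
  done | sort
}

# Review first, then delete each segment together with its index files:
# list_old_segments /var/kafka/data 7 | while read -r seg; do
#   rm -f "$seg" "${seg%.log}.index" "${seg%.log}.timeindex"
# done
```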

### Option B: Reduce Retention Dynamically (Safer)

You can also adjust [topic retention settings](https://docs.conduktor.io/guide/manage-kafka/kafka-resources/topics) through Conduktor Console's UI.

```bash
kafka-configs.sh --bootstrap-server kafka2:9092 \
  --alter --entity-type topics --entity-name high-volume-topic \
  --add-config retention.ms=3600000,retention.bytes=10737418240
```

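This sets 1-hour retention and 10 GiB per partition. Retention is enforced by the broker's periodic retention check (`log.retention.check.interval.ms`, 5 minutes by default), not by the log cleaner, which handles compaction. Only closed segments are removed, so running brokers reclaim space within a few minutes; the crashed broker picks up the new limits only once it is running again.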

| Setting | Emergency | Normal |
|---------|-----------|--------|
| `retention.ms` | 3600000 (1h) | 604800000 (7d) |
| `retention.bytes` | 10737418240 (10 GiB) | -1 (unlimited) |
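
Emergency overrides are easy to forget once the incident is over. Here is a small sketch for checking and then removing them, using the `--describe` and `--delete-config` options of `kafka-configs.sh`. The function names and the `KAFKA_CONFIGS` variable are hypothetical conveniences. Removing an override returns the topic to the broker defaults, so a topic that had its own retention before the incident needs that value set again.

```bash
# Assumption: kafka-configs.sh is on PATH; override KAFKA_CONFIGS otherwise.
KAFKA_CONFIGS=${KAFKA_CONFIGS:-kafka-configs.sh}

# show_topic_overrides BOOTSTRAP TOPIC: list topic-level overrides.
show_topic_overrides() {
  "$KAFKA_CONFIGS" --bootstrap-server "$1" \
    --entity-type topics --entity-name "$2" --describe
}

# revert_emergency_retention BOOTSTRAP TOPIC: drop both retention overrides.
revert_emergency_retention() {
  "$KAFKA_CONFIGS" --bootstrap-server "$1" \
    --alter --entity-type topics --entity-name "$2" \
    --delete-config retention.ms,retention.bytes
}

# Example:
# show_topic_overrides kafka2:9092 high-volume-topic
# revert_emergency_retention kafka2:9092 high-volume-topic
```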

### Option C: Expand Disk (Cloud)

```bash
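# AWS EBS: --size is the new volume size in GiB
aws ec2 modify-volume --volume-id vol-xxxx --size 1000
# Grow partition 1 to fill the enlarged volume
sudo growpart /dev/xvda 1
# Grow the filesystem (resize2fs handles ext2/3/4 only)
sudo resize2fs /dev/xvda1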
```
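
The last step depends on the filesystem: `resize2fs` takes the device and works on ext2/3/4, while XFS is grown with `xfs_growfs` against the mount point. A small sketch that picks the right command; `resize_command` is a hypothetical helper, and the device and mount point in the example are the ones assumed in this post.

```bash
# resize_command FSTYPE DEVICE MOUNTPOINT (hypothetical helper)
# Prints the command that grows the filesystem after the partition was extended.
resize_command() {
  case $1 in
    ext2|ext3|ext4) echo "resize2fs $2" ;;   # ext* is grown by device
    xfs)            echo "xfs_growfs $3" ;;  # XFS is grown by mount point
    *)              echo "unsupported filesystem: $1" >&2; return 1 ;;
  esac
}

# Example, with findmnt (util-linux) reporting the filesystem type:
# resize_command "$(findmnt -n -o FSTYPE --target /var/kafka)" /dev/xvda1 /var/kafka
```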

## Restart the Broker

Once you have 10-20% free space:

```bash
kafka-server-start.sh -daemon /etc/kafka/server.properties
tail -f /var/log/kafka/server.log
```
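
The free-space threshold can be checked mechanically before the start command is issued. This is a sketch under the assumption that `df -P` reports the Kafka mount on a single line; `free_pct` and `has_free_space` are hypothetical helpers.

```bash
# free_pct MOUNT: free space in percent, derived from df's Capacity column.
free_pct() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print 100 - $5 }'
}

# has_free_space MOUNT MIN_PCT: succeeds when at least MIN_PCT percent is free.
has_free_space() {
  [ "$(free_pct "$1")" -ge "$2" ]
}

# Example: refuse to start the broker below 10% free.
# if has_free_space /var/kafka 10; then
#   kafka-server-start.sh -daemon /etc/kafka/server.properties
# else
#   echo "only $(free_pct /var/kafka)% free, not starting" >&2
# fi
```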

### Common Startup Failures

**Corrupt index files:**
```text
ERROR Found a corrupted index file /var/kafka/data/my-topic-0/00000000000000012345.index
```

Delete the corrupt indexes. Kafka rebuilds them on startup:

```bash
rm /var/kafka/data/my-topic-0/00000000000000012345.index
rm /var/kafka/data/my-topic-0/00000000000000012345.timeindex
```

**Empty snapshot files:**
```bash
find /var/kafka/data -name "*.snapshot" -size 0 -delete
```

**All log dirs failed (JBOD):** Temporarily exclude the failed disk in `server.properties`:

```properties
# Original:
# log.dirs=/data1/kafka,/data2/kafka,/data3/kafka
# Temporary, with the failed /data3 disk excluded:
log.dirs=/data1/kafka,/data2/kafka
```

Partitions on the excluded disk become under-replicated. Reassign them after recovery.

## Verify Recovery

```bash
# Check for under-replicated partitions
kafka-topics.sh --bootstrap-server kafka1:9092 --describe --under-replicated-partitions
# Output should be empty once caught up
```
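
Rather than rerunning the check by hand, it can be polled until the list is empty. The sketch assumes each under-replicated partition is printed on a line containing `Partition:`, which matches the describe output I have seen but should be confirmed on your Kafka version. `count_urp`, `wait_for_isr`, and `KAFKA_TOPICS` are hypothetical names.

```bash
# Assumption: kafka-topics.sh is on PATH; override KAFKA_TOPICS otherwise.
KAFKA_TOPICS=${KAFKA_TOPICS:-kafka-topics.sh}

# count_urp BOOTSTRAP: number of under-replicated partitions.
# Fails if the tool itself fails, so an unreachable cluster is not read as healthy.
count_urp() {
  local out
  out=$("$KAFKA_TOPICS" --bootstrap-server "$1" \
        --describe --under-replicated-partitions) || return 1
  printf '%s\n' "$out" | grep -c 'Partition:' || true
}

# wait_for_isr BOOTSTRAP [INTERVAL_SECONDS]: poll until nothing is under-replicated.
wait_for_isr() {
  local bootstrap=$1 interval=${2:-30} n
  while :; do
    n=$(count_urp "$bootstrap") || { echo "cannot query cluster" >&2; return 1; }
    [ "$n" -eq 0 ] && break
    echo "under-replicated partitions: $n"
    sleep "$interval"
  done
  echo "all partitions in sync"
}

# Example: wait_for_isr kafka1:9092 30
```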

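Recovery time depends on how much data the broker has to re-replicate. As a rough estimate, 100 GB at a sustained 100 MB/s takes about 17 minutes per replica (100,000 MB ÷ 100 MB/s = 1,000 s).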
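
The same back-of-the-envelope estimate as a one-liner, for plugging in your own numbers. It assumes decimal units (1 GB = 1,000 MB) and that the full rate is available to replication, which is optimistic when client traffic shares the network. `catchup_minutes` is a hypothetical helper.

```bash
# catchup_minutes DATA_GB RATE_MB_PER_S: estimated catch-up time in whole minutes.
catchup_minutes() {
  awk -v gb="$1" -v mbps="$2" 'BEGIN { printf "%.0f\n", gb * 1000 / mbps / 60 }'
}

catchup_minutes 100 100   # prints 17 (the example above)
catchup_minutes 180 100   # prints 30 (the 180 GB partition from the du output)
```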

## Prevent Recurrence

Configure [disk usage alerts](https://docs.conduktor.io/guide/monitor-brokers-apps/alerts) to catch problems before they become emergencies.

| Metric | Warning | Critical |
|--------|---------|----------|
| Disk usage % | 70% | 85% |
| OfflineLogDirectoryCount | > 0 | > 0 |
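
If no monitoring stack is in place yet, the disk thresholds from the table can be approximated with a cron-driven check. A minimal sketch; `disk_alert_level` is a hypothetical helper, and the 70/85 defaults simply mirror the table.

```bash
# disk_alert_level USED_PCT [WARN_PCT] [CRIT_PCT]
# Prints ok, warning, or critical for a disk usage percentage.
disk_alert_level() {
  local used=$1 warn=${2:-70} crit=${3:-85}
  if [ "$used" -ge "$crit" ]; then
    echo critical
  elif [ "$used" -ge "$warn" ]; then
    echo warning
  else
    echo ok
  fi
}

# Example for a cron job, reading usage from df's Capacity column:
# used=$(df -P /var/kafka | awk 'NR == 2 { sub(/%/, "", $5); print $5 }')
# level=$(disk_alert_level "$used")
# [ "$level" = ok ] || echo "kafka disk $level: ${used}% used" >&2
```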

Set size-based retention as a backstop:

```properties
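# 100 GiB per partition. Comments must sit on their own line:
# a trailing "# ..." would be read as part of the value.
log.retention.bytes=107374182400
# Run the retention check every minute
log.retention.check.interval.ms=60000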
```

Disk full is recoverable if you have replication. Without replicas, you lose data. The real fix is alerting at 70%, not recovering at 100%.

[Book a demo](https://www.conduktor.io/contact/demo) to see how Conduktor Console monitors disk usage across all your Kafka clusters.