# Kafka producer retries

*Learn how to configure producer retries for reliable message delivery in 14 minutes*

Kafka producers can automatically retry failed requests to improve reliability and handle transient failures in distributed systems.

**What you'll learn:**
- How producer retries work and when they're triggered
- The difference between retriable and non-retriable errors
- How to configure retry behavior for idempotent producers
- Best practices for retry timeout and backoff settings

## Why retries matter

In distributed systems, temporary failures are common:
- Network connectivity issues
- Broker leadership changes
- Temporary resource constraints
- Replication delays

Without retries, each of these transient issues surfaces as a failed send, and the message is lost unless the application resends it. Retries let the producer recover from such failures on its own.

## Retry configuration

### Basic retry settings

```properties
# Number of retry attempts (default: 2147483647 since Kafka 2.1, 0 in earlier versions)
retries=2147483647

# Time to wait between retries (default: 100ms)
retry.backoff.ms=100

# Maximum time to wait for the broker's response to a single request (default: 30s)
request.timeout.ms=30000

# Maximum time to deliver a message including retries (default: 2 minutes)
delivery.timeout.ms=120000
```
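If you keep these settings in a `.properties` file, a sketch like the following shows how they load in Java. It uses only `java.util.Properties`; the class name `RetrySettings` and its `parse` helper are illustrative, not part of the Kafka API. It also demonstrates one pitfall: in this file format a `#` starts a comment only at the beginning of a line, so a trailing comment becomes part of the value.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

final class RetrySettings {
    /** Parses .properties text (hypothetical helper for this example). */
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        Properties props = parse(String.join("\n",
            "# Comments must sit on their own line",
            "retries=2147483647",
            "retry.backoff.ms=100",
            "request.timeout.ms=30000",
            "delivery.timeout.ms=120000"));
        // 2147483647 is Integer.MAX_VALUE written out in decimal
        System.out.println(Integer.parseInt(props.getProperty("retries")) == Integer.MAX_VALUE);

        // A trailing "# ..." is NOT a comment: it ends up inside the value
        Properties bad = parse("retry.backoff.ms=100  # wait 100ms");
        System.out.println("[" + bad.getProperty("retry.backoff.ms") + "]");
    }
}
```

The second value prints as `[100  # wait 100ms]`, which would not parse as a number, so keep comments on their own lines.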

### Kafka version differences

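**Kafka < 2.1:**
- `retries=0` (no retries by default)
- Must explicitly enable retries

**Kafka >= 2.1:**
- `retries=Integer.MAX_VALUE` (effectively unlimited retries)
- `delivery.timeout.ms` (default: 2 minutes) bounds the total time spent retrying

**Kafka >= 3.0:**
- `enable.idempotence=true` and `acks=all` by default, so retries no longer risk duplicates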

## Types of errors

### Retriable errors
These errors can potentially be resolved by retrying:

- **TimeoutException**: Request timed out
- **NotEnoughReplicasException**: Not enough in-sync replicas
- **NotEnoughReplicasAfterAppendException**: Replication issues
- **RetriableException**: Base class of all retriable errors
- **LeaderNotAvailableException**: Leader election in progress
- **NetworkException**: Network connectivity issues

### Non-retriable errors
These errors indicate permanent failures that won't be resolved by retrying:

- **RecordTooLargeException**: Message exceeds size limits
- **SerializationException**: Message serialization failed
- **OffsetMetadataTooLarge**: Offset metadata too large
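- **InvalidTopicException**: Topic name is illegal
- **AuthorizationException**: Client is not authorized for the operation (for example, `TopicAuthorizationException`)

`UnknownTopicOrPartitionException` is retriable, despite its name: it extends `InvalidMetadataException`, so the producer refreshes its metadata and tries again, which covers topics that are still being created.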

### Error handling decision tree

This decision tree helps you understand how to handle different types of producer errors:

![](https://www.conduktor.io/assets/kafka/error-retries.png)

### Retry decision flowchart

This flowchart shows how the producer decides whether to retry a failed request:

![](https://www.conduktor.io/assets/kafka/retries-decision.png)
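The decision can be approximated in a few lines. This is a simplified model for reasoning about the settings, not the client's actual code; the class `RetryDecision` and its parameters are hypothetical names chosen for this sketch.

```java
final class RetryDecision {
    enum Outcome { RETRY, FAIL }

    /**
     * Simplified model of the producer's choice after a failed request.
     *
     * @param retriable         whether the error is a retriable one
     * @param retriesUsed       retries already performed for this batch
     * @param maxRetries        the configured "retries" value
     * @param elapsedMs         time since the record was handed to send()
     * @param deliveryTimeoutMs the configured "delivery.timeout.ms" value
     */
    static Outcome decide(boolean retriable, int retriesUsed, int maxRetries,
                          long elapsedMs, long deliveryTimeoutMs) {
        if (!retriable) {
            return Outcome.FAIL;   // permanent error: retrying cannot help
        }
        if (retriesUsed >= maxRetries) {
            return Outcome.FAIL;   // retry budget exhausted
        }
        if (elapsedMs >= deliveryTimeoutMs) {
            return Outcome.FAIL;   // total delivery time exceeded
        }
        return Outcome.RETRY;      // wait retry.backoff.ms, then resend
    }

    public static void main(String[] args) {
        System.out.println(decide(true, 0, 5, 1_000, 120_000));   // RETRY
        System.out.println(decide(false, 0, 5, 1_000, 120_000));  // FAIL
    }
}
```

With unlimited `retries`, the third check is the one that ends the loop in practice, which is why `delivery.timeout.ms` is the setting to tune.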

> **Idempotent producers (Kafka 0.11+):** With `enable.idempotence=true` and `acks=all` (both defaults since Kafka 3.0), retries cannot introduce duplicates, so you can leave `retries` at its unlimited default and bound delivery with `delivery.timeout.ms`.

## Retry backoff strategies

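### Fixed backoff
Clients before Kafka 3.7 wait a fixed amount of time between retries:

```properties
# Always wait 100ms between retries
retry.backoff.ms=100
```

**Pattern:** Wait → Retry → Wait → Retry → Wait → Retry

### Exponential backoff
Clients from Kafka 3.7 onward back off exponentially, with jitter, starting at `retry.backoff.ms` and capped at `retry.backoff.max.ms` (default: 1000 ms). Older clients do not support this natively, but you can implement it at the application level: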

```
Attempt 1: Wait 100ms
Attempt 2: Wait 200ms
Attempt 3: Wait 400ms
Attempt 4: Wait 800ms
```
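The doubling schedule above can be computed with a small helper. This is a minimal sketch for application-level retry loops; the class `Backoff` and the cap parameter `maxMs` are assumptions of this example, and jitter is left out to keep the output deterministic.

```java
final class Backoff {
    /**
     * Delay before retry number `attempt` (1-based):
     * baseMs * 2^(attempt - 1), capped at maxMs.
     */
    static long delayMs(long baseMs, long maxMs, int attempt) {
        if (attempt < 1) {
            throw new IllegalArgumentException("attempt must be >= 1");
        }
        // Limit the exponent so the shift stays well inside a long
        int exponent = Math.min(attempt - 1, 30);
        return Math.min(maxMs, baseMs * (1L << exponent));
    }

    public static void main(String[] args) {
        for (int attempt = 1; attempt <= 4; attempt++) {
            System.out.println("Attempt " + attempt + ": wait "
                + delayMs(100, 10_000, attempt) + "ms");
        }
    }
}
```

In a real retry loop you would usually add random jitter to each delay so that many clients recovering from the same outage do not retry in lockstep.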

## Impact on message ordering

### With retries enabled
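Without idempotence, retries can reorder messages within a partition when more than one request is in flight (`max.in.flight.requests.per.connection` > 1):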

```
Message A sent → Fails → Retry scheduled
Message B sent → Succeeds immediately
Message A retry → Succeeds

Result: Message B appears before Message A in partition
```
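The scenario can be reproduced with a toy model. The sketch below is not how the producer is implemented: it sends batches in rounds, lets one named batch fail once, and ignores idempotence and sequence numbers entirely. All names (`OrderingSketch`, `deliver`, `failsOnce`) are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class OrderingSketch {
    /**
     * Sends batches in rounds of up to maxInFlight requests. The batch named
     * failsOnce fails on its first attempt and is retried in the next round.
     * Returns the order in which batches reach the partition log.
     */
    static List<String> deliver(List<String> batches, String failsOnce, int maxInFlight) {
        if (maxInFlight < 1) {
            throw new IllegalArgumentException("maxInFlight must be >= 1");
        }
        List<String> log = new ArrayList<>();
        Deque<String> pending = new ArrayDeque<>(batches);
        boolean alreadyFailed = false;
        while (!pending.isEmpty()) {
            // Put up to maxInFlight batches on the wire at once
            List<String> inFlight = new ArrayList<>();
            while (inFlight.size() < maxInFlight && !pending.isEmpty()) {
                inFlight.add(pending.poll());
            }
            List<String> toRetry = new ArrayList<>();
            for (String batch : inFlight) {
                if (batch.equals(failsOnce) && !alreadyFailed) {
                    alreadyFailed = true;   // transient failure: schedule a retry
                    toRetry.add(batch);
                } else {
                    log.add(batch);         // the broker appended this batch
                }
            }
            // Retries go back to the head of the queue, ahead of unsent batches
            for (int i = toRetry.size() - 1; i >= 0; i--) {
                pending.addFirst(toRetry.get(i));
            }
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(deliver(List.of("A", "B"), "A", 2)); // [B, A]
        System.out.println(deliver(List.of("A", "B"), "A", 1)); // [A, B]
    }
}
```

With two requests in flight, B lands before the retried A. With one request in flight, B is not sent until A has succeeded, so the order holds.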

### Preserve order
To maintain strict ordering, configure:

```properties
# Option 1: allow only one in-flight request per connection (lowers throughput)
max.in.flight.requests.per.connection=1

# Option 2 (recommended): use the idempotent producer, which preserves
# order with up to 5 in-flight requests
enable.idempotence=true
max.in.flight.requests.per.connection=5
```

## Delivery timeout vs request timeout

### Request timeout
Time to wait for a single request attempt:

```properties
# 30 seconds per attempt
request.timeout.ms=30000
```

### Delivery timeout
Total time limit for delivering a message (including all retries):

```properties
# 2 minutes total
delivery.timeout.ms=120000
```

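**Relationship:**
```
delivery.timeout.ms >= linger.ms + request.timeout.ms
```

The producer rejects an explicitly configured `delivery.timeout.ms` that violates this constraint. The value of `retries` does not enter into it: the producer stops retrying when either `retries` is exhausted or `delivery.timeout.ms` expires, whichever comes first.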
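To get a feel for how many attempts fit into the delivery budget, consider the worst case where every attempt runs into `request.timeout.ms` and is followed by a fixed `retry.backoff.ms` wait. The sketch below is a back-of-the-envelope estimate under those assumptions, not the client's scheduling logic: it ignores `linger.ms`, queueing time, and metadata refreshes, and errors that come back quickly leave room for far more attempts. `AttemptBudget` is a hypothetical name.

```java
final class AttemptBudget {
    /**
     * Worst-case number of attempts that can start before the delivery
     * deadline. Attempt k (0-based) starts at k * (request timeout + backoff).
     */
    static long maxAttempts(long deliveryTimeoutMs, long requestTimeoutMs, long retryBackoffMs) {
        long cycleMs = requestTimeoutMs + retryBackoffMs;
        if (cycleMs <= 0 || deliveryTimeoutMs <= 0) {
            throw new IllegalArgumentException("timeouts must be positive");
        }
        // Ceiling division: counts the start times that fall before the deadline
        return (deliveryTimeoutMs + cycleMs - 1) / cycleMs;
    }

    public static void main(String[] args) {
        // 2-minute delivery budget, 30 s per request, 100 ms backoff
        System.out.println(maxAttempts(120_000, 30_000, 100)); // 4
        // 10 s delivery budget, 5 s per request, 100 ms backoff
        System.out.println(maxAttempts(10_000, 5_000, 100));   // 2
    }
}
```

The second case shows why a small `retries` value can be moot: with a 10-second budget and 5-second request timeout, only two timed-out attempts fit, regardless of a higher retry count.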

## Configuration examples

### High reliability (recommended)
```properties
# Unlimited retries with delivery timeout
retries=2147483647
# 5 minutes total
delivery.timeout.ms=300000
# 30 seconds per attempt
request.timeout.ms=30000
# 100ms between retries
retry.backoff.ms=100
# Preserve ordering and avoid duplicates
enable.idempotence=true
```

### Fast failure
```properties
# Limited retries for quick feedback
retries=3
# 10 seconds total
delivery.timeout.ms=10000
# 5 seconds per attempt
request.timeout.ms=5000
# 100ms between retries
retry.backoff.ms=100
```

### No retries (not recommended for production)
```properties
retries=0
request.timeout.ms=30000
```

## Monitor retry behavior

### Key metrics to track
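- **record-retry-rate**: Average number of retried record sends per second
- **record-retry-total**: Total number of retried record sends
- **record-error-rate**: Average number of record sends per second that ended in an error (after all retries)
- **request-latency-avg**: Average latency of individual requests, in milliseconds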

### JMX metrics
```
kafka.producer:type=producer-metrics,client-id=<client-id>
- record-retry-rate
- record-retry-total
- request-rate
- request-latency-avg
```
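From inside the producer's own JVM, these MBeans can be read with the standard `javax.management` API. The following is a minimal sketch under that assumption; the class `ProducerRetryMetrics` is hypothetical, and the attribute name passed in must match a metric the client actually registers (the example uses `record-retry-rate`).

```java
import java.lang.management.ManagementFactory;
import java.util.Map;
import java.util.TreeMap;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;

final class ProducerRetryMetrics {
    /** Reads one attribute from every producer-metrics MBean, keyed by client-id. */
    static Map<String, Object> read(String attribute) {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        Map<String, Object> values = new TreeMap<>();
        try {
            // The trailing * matches every client-id registered in this JVM
            ObjectName pattern =
                new ObjectName("kafka.producer:type=producer-metrics,client-id=*");
            for (ObjectName name : server.queryNames(pattern, null)) {
                values.put(name.getKeyProperty("client-id"),
                           server.getAttribute(name, attribute));
            }
        } catch (JMException e) {
            throw new IllegalStateException("Could not read " + attribute, e);
        }
        return values;
    }

    public static void main(String[] args) {
        // Prints an empty map unless a KafkaProducer is running in this JVM
        System.out.println(read("record-retry-rate"));
    }
}
```

To read the same metrics from another process, you would connect through a remote JMX connector instead of the platform MBean server.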

## Error handling strategies

### Synchronous error handling
```java
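import java.util.Properties;
import java.util.concurrent.ExecutionException;
import org.apache.kafka.clients.producer.*;

Properties props = new Properties();
// Assumed local broker; replace with your cluster's address
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("retries", 5);
props.put("retry.backoff.ms", 100);

Producer<String, String> producer = new KafkaProducer<>(props);

try {
    ProducerRecord<String, String> record = new ProducerRecord<>("topic", "key", "value");
    // get() blocks until the broker acknowledges the record or the send fails for good
    RecordMetadata metadata = producer.send(record).get();
    System.out.println("Message sent to " + metadata.topic() + ":" + metadata.partition());
} catch (ExecutionException e) {
    // The cause is the final error: retries exhausted, delivery timeout, or a non-retriable failure
    System.err.println("Send failed: " + e.getCause().getMessage());
} catch (InterruptedException e) {
    Thread.currentThread().interrupt(); // restore the interrupt flag
}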
```

### Asynchronous error handling
```java
ProducerRecord<String, String> record = new ProducerRecord<>("topic", "key", "value");

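// The callback runs once per record, after the producer has finished retrying
producer.send(record, (metadata, exception) -> {
    if (exception != null) {
        // Retries exhausted, delivery timeout expired, or a non-retriable error
        System.err.println("Send failed: " + exception.getMessage());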
    } else {
        System.out.println("Message sent successfully");
    }
});
```

## Best practices

### Production recommendations
1. **Enable unlimited retries**: Set `retries=Integer.MAX_VALUE`
2. **Use delivery timeout**: Set `delivery.timeout.ms` to control total time
3. **Enable idempotency**: Prevents duplicates during retries
4. **Monitor retry metrics**: Track retry rates and error patterns
5. **Handle non-retriable errors**: Implement proper error handling for permanent failures

### Configuration checklist
- ✅ `retries=Integer.MAX_VALUE` (unlimited retries)
- ✅ `delivery.timeout.ms=120000` (reasonable total timeout)
- ✅ `request.timeout.ms=30000` (reasonable per-request timeout)
- ✅ `retry.backoff.ms=100` (reasonable delay between retries)
- ✅ `enable.idempotence=true` (prevent duplicates)

### Common mistakes to avoid
- Setting `retries=0` in production
- Not handling non-retriable errors
- Setting delivery timeout too low
- Ignoring retry metrics and error rates
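Some of these mistakes can be caught before the producer starts. The sketch below checks a `Properties` object for three of them. It is an illustrative helper, not part of the Kafka client: the class name `RetryConfigCheck` is hypothetical, and the fallback values are assumptions based on the defaults of Kafka 2.1 through 3.x (`linger.ms=0`, `request.timeout.ms=30000`, `delivery.timeout.ms=120000`).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

final class RetryConfigCheck {
    /** Returns a list of problems found in the retry-related settings. */
    static List<String> check(Properties p) {
        List<String> problems = new ArrayList<>();
        long retries = num(p, "retries", Integer.MAX_VALUE);
        long deliveryMs = num(p, "delivery.timeout.ms", 120_000);
        long requestMs = num(p, "request.timeout.ms", 30_000);
        long lingerMs = num(p, "linger.ms", 0);

        if (retries == 0) {
            problems.add("retries=0 disables retries");
        }
        if (deliveryMs < lingerMs + requestMs) {
            problems.add("delivery.timeout.ms is below linger.ms + request.timeout.ms");
        }
        if ("false".equals(p.getProperty("enable.idempotence"))) {
            problems.add("enable.idempotence=false allows duplicates on retry");
        }
        return problems;
    }

    /** Reads a numeric setting, falling back to the assumed default. */
    private static long num(Properties p, String key, long fallback) {
        String value = p.getProperty(key);
        return value == null ? fallback : Long.parseLong(value.trim());
    }

    public static void main(String[] args) {
        Properties noRetries = new Properties();
        noRetries.setProperty("retries", "0");
        noRetries.setProperty("request.timeout.ms", "30000");
        System.out.println(check(noRetries)); // [retries=0 disables retries]
    }
}
```

A check like this fits naturally into a startup routine or a unit test for your configuration files.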

Recent Kafka versions ship sensible retry defaults: since 2.1, unlimited retries bounded by a 2-minute delivery timeout, and since 3.0, idempotence and `acks=all` enabled out of the box. On older clients, or when idempotence is disabled, retries can reorder messages within a partition. Use `enable.idempotence=true` or `max.in.flight.requests.per.connection=1` if strict ordering is required.

> **See it in practice with Conduktor**
> [Conduktor Console](https://docs.conduktor.io/guide/monitor-brokers-apps/) displays producer retry metrics and error rates in real-time. Monitor retry attempts, successful retries, and failed messages to validate your retry configuration and identify patterns in transient versus permanent failures.

## Next steps

- [Enable idempotent producers](https://www.conduktor.io/kafka/idempotent-kafka-producer) to avoid duplicates on retry
- [Understand acknowledgment settings](https://www.conduktor.io/kafka/kafka-producer-acks-deep-dive) for delivery guarantees
- [Optimize producer batching](https://www.conduktor.io/kafka/kafka-producer-batching) for throughput
