# Multi-Region Kafka: Active-Active vs Active-Passive

Your Kafka cluster will fail. Region outages happen. The question is whether you lose minutes or hours of data, and whether recovery takes seconds or days.

I've seen both extremes. A retail company lost four hours of order data because their DR test was a checkbox exercise. A fintech failed over in under a minute because they practiced quarterly. The difference wasn't technology—it was architecture choice and operational discipline.

Multi-region Kafka falls into two patterns: **active-passive** (one cluster serves traffic, one stands by) and **active-active** (both serve traffic). Each has different tradeoffs.

> *We thought active-passive was simpler until our first real failover. Updating configs across 40 services took longer than the actual outage.*
>
> *SRE Lead at a logistics company*

## Active-Passive: The Starting Point

One primary cluster handles production traffic. A secondary cluster receives replicated data and waits.

```text
US-EAST (Primary) ──MirrorMaker 2──> US-WEST (Standby)
     ▲
 All traffic                          No traffic
```

When the primary fails, you switch traffic to the secondary. Consumers resume from the translated offsets that MirrorMaker 2 has synced to the secondary.

Basic MirrorMaker 2 configuration:

```properties
clusters = us-east, us-west
us-east.bootstrap.servers = kafka-us-east:9092
us-west.bootstrap.servers = kafka-us-west:9092

us-east->us-west.enabled = true
us-east->us-west.topics = orders, payments, events

# Critical for failover
us-east->us-west.sync.group.offsets.enabled = true
```

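Under the default replication policy, topics appear on the secondary with the source cluster alias as a prefix: `us-east.orders`, `us-east.payments`. Consumers that fail over must subscribe to these prefixed names, not to `orders` and `payments`.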

## The Offset Translation Problem

When MirrorMaker replicates messages, the target cluster assigns new offsets. Message at offset 1000 in `orders` might land at offset 998 in `us-east.orders`.

The checkpoint connector maintains a mapping. With `sync.group.offsets.enabled=true`, MirrorMaker writes translated offsets to the target cluster's `__consumer_offsets`.

**Warning:** Offset sync lags by `emit.checkpoints.interval.seconds` (default 60s). After a failover, consumers may reprocess messages they had already processed on the primary since the last checkpoint, so processing should tolerate duplicates.
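To see why a stale checkpoint causes reprocessing instead of data loss, here is a minimal sketch of offset translation. It is a simplified model, not MirrorMaker 2's actual implementation: `translate_offset` and the `syncs` list are hypothetical, and the sync points are made-up numbers in the spirit of the 1000 → 998 example above.

```python
from bisect import bisect_right

def translate_offset(syncs, committed_upstream):
    """Map a committed source offset to a safe offset on the target.

    syncs: (upstream_offset, downstream_offset) pairs sorted by upstream
    offset. Hypothetical stand-in for MirrorMaker's recorded sync points.
    """
    upstream_points = [up for up, _ in syncs]
    # Find the latest sync point at or before the committed offset.
    i = bisect_right(upstream_points, committed_upstream) - 1
    if i < 0:
        return None  # nothing recorded yet; translation is not possible
    # Round down to the sync point: never skip a record, possibly re-read some.
    return syncs[i][1]

syncs = [(900, 898), (1000, 998), (1100, 1097)]
print(translate_offset(syncs, 1000))  # exact sync point
print(translate_offset(syncs, 1050))  # falls back to the sync point at 1000
```

A consumer that had committed offset 1050 on the source resumes at 998 on the target in this model, so it re-reads the records between the last sync point and its real position.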

## Active-Passive Failover

When you fail over:

1. Stop producers on the primary (if reachable)
2. Wait for replication to catch up
3. Update client configurations to point to secondary
4. Restart consumers

Step 3 is where it hurts. Updating dozens of services—environment variables, Kubernetes secrets, config files—takes time. The coordination complexity hits when you can least afford it.
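One way to shrink step 3 is to have every service derive its Kafka settings from a single "active cluster" value, so a failover changes one value and leaves the 40 config files alone. The sketch below is a hypothetical helper: `CLUSTERS`, `PRIMARY`, and `consumer_settings` are illustrative names, not part of Kafka. It covers consumers only and assumes the default prefix naming on the standby.

```python
# Hypothetical cluster registry; hostnames reuse the examples in this article.
CLUSTERS = {
    "us-east": "kafka-us-east:9092",
    "us-west": "kafka-us-west:9092",
}
PRIMARY = "us-east"

def consumer_settings(active, topics):
    """Return bootstrap servers and topic names for the active cluster."""
    if active not in CLUSTERS:
        raise ValueError(f"unknown cluster: {active}")
    if active == PRIMARY:
        names = list(topics)
    else:
        # On the standby, mirrored topics carry the primary's alias as a prefix.
        names = [f"{PRIMARY}.{topic}" for topic in topics]
    return {"bootstrap.servers": CLUSTERS[active], "topics": names}

print(consumer_settings("us-east", ["orders", "payments"]))
print(consumer_settings("us-west", ["orders", "payments"]))
```

Where the `active` value lives (a config service, a DNS record, a feature flag) is a design choice this sketch leaves open.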

## Active-Active: Both Clusters Serve Traffic

Active-active runs both clusters simultaneously. US-East producers write to the US-East cluster. US-West producers write to US-West. MirrorMaker replicates bidirectionally.

```text
US-EAST (Active) ←──MirrorMaker 2──→ US-WEST (Active)
     ▲                                    ▲
 US-East traffic                     US-West traffic
```

Each cluster has local topics (`orders`) and replicated topics from the other region (`us-west.orders`).

```properties
us-east->us-west.enabled = true
us-west->us-east.enabled = true
us-east->us-west.topics = orders, payments
us-west->us-east.topics = orders, payments
```

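With the default replication policy, MirrorMaker 2 prevents infinite loops by never replicating a topic back to the cluster named in its prefix: `us-east.orders` on US-West is not copied back to US-East.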
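The loop check can be modeled in a few lines. This is a simplified illustration of the idea, not MirrorMaker's code: `should_replicate` is a hypothetical function, and it assumes cluster aliases are joined to topic names with a `.` separator.

```python
def should_replicate(topic, target_alias, separator="."):
    """Return False if the topic already passed through the target cluster."""
    # Every segment except the last is treated as a cluster alias the
    # topic has been mirrored from, e.g. "us-west.orders" -> ["us-west"].
    hops = topic.split(separator)[:-1]
    return target_alias not in hops

# Topics that live on US-East, considered for replication to US-West:
print(should_replicate("orders", "us-west"))          # local topic
print(should_replicate("us-west.orders", "us-west"))  # came from US-West
```

Note that a local topic whose own name contains the separator and a cluster alias would be misread by this model, which is one reason to keep cluster aliases out of topic names.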

## The Conflict Problem

Active-active introduces write conflicts:

```text
T=0: US-East writes order-123 status="SHIPPED"
T=0: US-West writes order-123 status="CANCELLED"
T=1: Both messages replicate to both clusters
```

Both clusters now have both messages. Which status is correct?

**MirrorMaker does not resolve conflicts.** It replicates both. Your application must handle this.

### Conflict Resolution Strategies

**Regional authority:** Designate one region as authoritative for specific entities. Orders prefixed `US-*` always process in US-East. This avoids conflicts entirely.
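A minimal sketch of regional authority, assuming order IDs carry a region prefix before the first hyphen. The `AUTHORITY` mapping and both function names are hypothetical, and the `APAC` entry is invented for illustration.

```python
# Hypothetical mapping from order-ID prefix to the authoritative region.
AUTHORITY = {"US": "us-east", "APAC": "us-west"}

def owning_region(order_id):
    """Look up which region is allowed to write events for this order."""
    prefix = order_id.split("-", 1)[0]
    if prefix not in AUTHORITY:
        raise ValueError(f"no authoritative region for {order_id!r}")
    return AUTHORITY[prefix]

def accept_write(order_id, local_region):
    """Only the owning region writes; other regions forward the request."""
    return owning_region(order_id) == local_region

print(accept_write("US-123", "us-east"))
print(accept_write("US-123", "us-west"))
```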

**Last-writer-wins:** Include timestamps in events. Consumers keep the latest version. Simple but can silently drop legitimate updates if clocks drift.
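A sketch of last-writer-wins applied to the conflict above, assuming each event carries a `ts` timestamp and the `region` that produced it (both field names are hypothetical). The region acts as a tie-breaker so both clusters converge on the same winner even when timestamps are equal.

```python
def last_writer_wins(current, incoming):
    """Return whichever event should be kept for a key."""
    if current is None:
        return incoming

    # Compare (timestamp, region): the region name breaks exact-timestamp
    # ties, so the result does not depend on arrival order.
    def rank(event):
        return (event["ts"], event["region"])

    return incoming if rank(incoming) > rank(current) else current

east = {"order": "order-123", "status": "SHIPPED", "ts": 0, "region": "us-east"}
west = {"order": "order-123", "status": "CANCELLED", "ts": 0, "region": "us-west"}

# Same winner whichever message arrives first.
print(last_writer_wins(east, west)["status"])
print(last_writer_wins(west, east)["status"])
```

The tie-breaker is arbitrary and does nothing about clock drift: an event stamped by a fast clock still beats a later event stamped by a slow one.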

**Idempotent events:** Instead of overwriting state with `SET status=SHIPPED`, emit facts such as `ORDER_SHIPPED` that are safe to apply more than once. Downstream consumers derive the final state from the full event sequence.
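A sketch of deriving state from events, assuming each event has a `type` field. The `SHIPPED_NEEDS_RETURN` outcome is a made-up business rule for illustration; the point is that the result depends only on which events exist, not on their order or on duplicates.

```python
def final_status(events):
    """Derive an order's status from the set of event types seen so far."""
    seen = {event["type"] for event in events}
    if "ORDER_SHIPPED" in seen and "ORDER_CANCELLED" in seen:
        # Hypothetical rule: a shipped order that was also cancelled
        # is flagged for a return, whichever event arrived first.
        return "SHIPPED_NEEDS_RETURN"
    if "ORDER_SHIPPED" in seen:
        return "SHIPPED"
    if "ORDER_CANCELLED" in seen:
        return "CANCELLED"
    return "PLACED"

shipped = {"order": "order-123", "type": "ORDER_SHIPPED"}
cancelled = {"order": "order-123", "type": "ORDER_CANCELLED"}

# Both clusters reach the same answer despite different arrival orders.
print(final_status([shipped, cancelled]))
print(final_status([cancelled, shipped, shipped]))
```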

## Comparing the Patterns

| Aspect | Active-Passive | Active-Active |
|--------|----------------|---------------|
| **RPO** | Replication lag (seconds to minutes) | Replication lag for the failed region's writes; zero for the surviving region's |
| **RTO** | Minutes to hours (config changes) | Seconds (traffic already flows) |
| **Complexity** | Lower normally, higher during failover | Higher always |
| **Consistency** | Strong (single source of truth) | Eventual (conflicts possible) |

### When to Use Active-Passive

- Application can't handle conflicts
- Single source of truth required
- Failover is rare (quarterly or less)
- Team lacks operational maturity for active-active

### When to Use Active-Active

- Users geographically distributed
- Regional autonomy matters
- Data can partition by region or entity
- Application designed for eventual consistency

## Testing Your DR

Set up [monitoring alerts](https://docs.conduktor.io/guide/monitor-brokers-apps/alerts) for replication lag and consumer health to detect issues before they become outages.

Don't wait for a real outage. Run failover drills at least quarterly:

1. Simulate primary failure
2. Time detection and decision
3. Execute failover
4. Verify consumers resume without data loss
5. Measure actual RTO and RPO
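Step 5 only works if the drill records timestamps. A minimal sketch of the arithmetic, assuming you capture three moments during the drill; the function name and the sample times are hypothetical.

```python
from datetime import datetime, timedelta

def drill_metrics(failure_at, recovered_at, last_replicated_at):
    """Compute RTO and RPO from timestamps recorded during a drill.

    RTO: how long traffic was down.
    RPO: how much of the primary's latest data never reached the standby.
    """
    rto = recovered_at - failure_at
    # Clamp at zero in case replication was fully caught up at failure time.
    rpo = max(failure_at - last_replicated_at, timedelta(0))
    return rto, rpo

rto, rpo = drill_metrics(
    failure_at=datetime(2024, 1, 1, 12, 0, 0),
    recovered_at=datetime(2024, 1, 1, 12, 7, 30),
    last_replicated_at=datetime(2024, 1, 1, 11, 59, 15),
)
print(f"RTO={rto} RPO={rpo}")  # RTO=0:07:30 RPO=0:00:45
```

Tracking these two numbers across drills shows whether changes to the runbook are actually helping.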

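Monitor MirrorMaker's heartbeat topic as replicated to the standby. Under the default naming it appears there as `us-east.heartbeats`. If no new heartbeats arrive, replication is broken:

```bash
# Exits after 30s without a new record, which signals stalled replication
kafka-console-consumer --bootstrap-server kafka-us-west:9092 \
  --topic us-east.heartbeats --timeout-ms 30000
```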
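The same idea can feed an alert instead of a manual check. A minimal sketch, assuming your monitoring job already knows the timestamp of the most recent heartbeat record; `replication_stale` and the 30-second threshold are hypothetical choices.

```python
import time

def replication_stale(last_heartbeat_ms, now_ms=None, max_age_ms=30_000):
    """Return True if the newest heartbeat is older than the allowed age."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return now_ms - last_heartbeat_ms > max_age_ms

# A heartbeat seen 45 seconds ago exceeds the 30-second threshold.
print(replication_stale(last_heartbeat_ms=0, now_ms=45_000))
```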

The technology works. Whether your DR works depends on whether you've practiced.

[Book a demo](https://www.conduktor.io/contact/demo) to see how Conduktor Console provides visibility into replication lag and consumer health across multi-region deployments.
