Multi-Region Kafka: Active-Active vs Active-Passive

Compare Kafka DR patterns. MirrorMaker 2 setup, offset translation, conflict resolution, and when each architecture makes sense.

Stéphane Derosiaux · January 11, 2024

Your Kafka cluster will fail. Region outages happen. The question is whether you lose minutes or hours of data, and whether recovery takes seconds or days.

I've seen both extremes. A retail company lost four hours of order data because their DR test was a checkbox exercise. A fintech failed over in under a minute because they practiced quarterly. The difference wasn't technology—it was architecture choice and operational discipline.

Multi-region Kafka falls into two patterns: active-passive (one cluster serves traffic, one stands by) and active-active (both serve traffic). Each has different tradeoffs.

We thought active-passive was simpler until our first real failover. Updating configs across 40 services took longer than the actual outage.

SRE Lead at a logistics company

Active-Passive: The Starting Point

One primary cluster handles production traffic. A secondary cluster receives replicated data and waits.

US-EAST (Primary) ──MirrorMaker 2──> US-WEST (Standby)
     ▲
 All traffic                          No traffic

When the primary fails, you switch traffic to the secondary. Consumers resume from replicated offsets.

Basic MirrorMaker 2 configuration:

clusters = us-east, us-west
us-east.bootstrap.servers = kafka-us-east:9092
us-west.bootstrap.servers = kafka-us-west:9092

us-east->us-west.enabled = true
us-east->us-west.topics = orders, payments, events

# Critical for failover
us-east->us-west.sync.group.offsets.enabled = true

Topics appear on the secondary with a prefix: us-east.orders, us-east.payments.

The Offset Translation Problem

When MirrorMaker replicates messages, the target cluster assigns new offsets. Message at offset 1000 in orders might land at offset 998 in us-east.orders.

The checkpoint connector maintains a mapping. With sync.group.offsets.enabled=true, MirrorMaker writes translated offsets to the target cluster's __consumer_offsets, but only for groups with no active members on the target.

Warning: Offset sync lags by emit.checkpoints.interval.seconds (default 60s). In a failover, consumers may reprocess messages produced in the last checkpoint interval.
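
You can inspect the mapping yourself with RemoteClusterUtils, which ships in Kafka's connect-mirror-client artifact. A minimal sketch, run against the secondary cluster; the order-processor group name is hypothetical:

import java.time.Duration;
import java.util.Map;

import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.connect.mirror.RemoteClusterUtils;

public class TranslatedOffsets {
    public static void main(String[] args) throws Exception {
        // Connect to the secondary; the checkpoint data lives there.
        Map<String, Object> props = Map.of("bootstrap.servers", "kafka-us-west:9092");

        // Translate the committed offsets of a hypothetical "order-processor"
        // group from the us-east cluster into offsets valid on us-west.
        Map<TopicPartition, OffsetAndMetadata> translated =
            RemoteClusterUtils.translateOffsets(
                props, "us-east", "order-processor", Duration.ofSeconds(30));

        translated.forEach((tp, om) ->
            System.out.printf("%s -> offset %d%n", tp, om.offset()));
    }
}

Even with sync.group.offsets.enabled=true, this is handy for verifying what MirrorMaker has actually synced before you fail over.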

Active-Passive Failover

When you fail over:

  1. Stop producers on the primary (if reachable)
  2. Wait for replication to catch up
  3. Update client configurations to point to secondary
  4. Restart consumers

Step 3 is where it hurts. Updating dozens of services—environment variables, Kubernetes secrets, config files—takes time. The coordination complexity hits when you can least afford it.
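
Steps 2 and 4 are also where automation pays off. If group offset sync is lagging or disabled, you can apply translated offsets yourself before restarting consumers. A sketch under the same assumptions as above; note that alterConsumerGroupOffsets only succeeds while the group has no live members:

import java.time.Duration;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.connect.mirror.RemoteClusterUtils;

public class FailoverOffsets {
    public static void main(String[] args) throws Exception {
        String secondary = "kafka-us-west:9092";
        String group = "order-processor"; // hypothetical group name

        // Translate the group's committed us-east offsets into their
        // us-west equivalents via the checkpoint topic.
        Map<String, Object> clientProps = Map.of("bootstrap.servers", secondary);
        Map<TopicPartition, OffsetAndMetadata> translated =
            RemoteClusterUtils.translateOffsets(
                clientProps, "us-east", group, Duration.ofSeconds(30));

        // Commit them on the secondary so restarted consumers resume in
        // place. The group must be fully stopped or this call fails.
        Properties adminProps = new Properties();
        adminProps.put("bootstrap.servers", secondary);
        try (Admin admin = Admin.create(adminProps)) {
            admin.alterConsumerGroupOffsets(group, translated).all().get();
        }
    }
}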

Active-Active: Both Clusters Serve Traffic

Active-active runs both clusters simultaneously. US-East producers write to the US-East cluster. US-West producers write to US-West. MirrorMaker replicates bidirectionally.

US-EAST (Active) ←──MirrorMaker 2──→ US-WEST (Active)
     ▲                                    ▲
 US-East traffic                     US-West traffic

Each cluster has local topics (orders) and replicated topics from the other region (us-west.orders).

us-east->us-west.enabled = true
us-west->us-east.enabled = true
us-east->us-west.topics = orders, payments
us-west->us-east.topics = orders, payments

MirrorMaker prevents infinite loops by not replicating topics that already have a cluster prefix.
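
A consumer that needs the global view subscribes to both the local topic and its replicated twin. A sketch on the US-East cluster; the orders-global group name is hypothetical:

import java.time.Duration;
import java.util.Properties;
import java.util.regex.Pattern;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class GlobalOrdersConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-us-east:9092");
        props.put("group.id", "orders-global"); // hypothetical group
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Matches the local "orders" topic and the replicated
            // "us-west.orders", giving this region a global view.
            consumer.subscribe(Pattern.compile("^(us-west\\.)?orders$"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> r : records) {
                    System.out.printf("%s p%d: %s%n", r.topic(), r.partition(), r.value());
                }
            }
        }
    }
}

Consumers that only care about local traffic subscribe to orders alone.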

The Conflict Problem

Active-active introduces write conflicts:

T=0: US-East writes order-123 status="SHIPPED"
T=0: US-West writes order-123 status="CANCELLED"
T=1: Both messages replicate to both clusters

Both clusters now have both messages. Which status is correct?

MirrorMaker does not resolve conflicts. It replicates both. Your application must handle this.

Conflict Resolution Strategies

Regional authority: Designate one region as authoritative for specific entities. Orders prefixed US-* always process in US-East. This avoids conflicts entirely.
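
A sketch of the routing rule, assuming the US-* prefix convention above (the non-US fallback to US-West is an illustration):

public class RegionalRouter {
    // Assumed convention: US-* orders are authoritative in US-East;
    // everything else is written in US-West. Each entity has exactly
    // one home cluster, so cross-region writes never conflict.
    static String homeBootstrap(String orderId) {
        return orderId.startsWith("US-")
            ? "kafka-us-east:9092"
            : "kafka-us-west:9092";
    }
}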

Last-writer-wins: Include timestamps in events. Consumers keep the latest version. Simple but can silently drop legitimate updates if clocks drift.
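
A minimal last-writer-wins merge, assuming each event carries a producer-side timestamp; the OrderStatus shape is illustrative:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class LastWriterWins {
    // Illustrative event shape; timestampMs is set by the producer.
    record OrderStatus(String status, long timestampMs) {}

    private final Map<String, OrderStatus> state = new ConcurrentHashMap<>();

    // Keep whichever version of the order carries the newest timestamp.
    // The caveat above applies: clock drift between regions can silently
    // drop a legitimate update.
    public void apply(String orderId, OrderStatus incoming) {
        state.merge(orderId, incoming, (current, candidate) ->
            candidate.timestampMs() >= current.timestampMs() ? candidate : current);
    }
}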

Idempotent events: Instead of SET status=SHIPPED, emit ORDER_SHIPPED events. Let downstream consumers determine final state from the event sequence.
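
A sketch of reducing an event sequence to a final state; the event types and the cancellation-wins rule are domain assumptions, not something MirrorMaker provides:

import java.util.Comparator;
import java.util.List;

public class OrderStateReducer {
    enum EventType { ORDER_CREATED, ORDER_SHIPPED, ORDER_CANCELLED }

    record OrderEvent(String orderId, EventType type, long timestampMs) {}

    // Replay the full sequence from both regions and apply a domain rule:
    // a cancellation wins regardless of arrival order, so every replica
    // converges on the same answer.
    static String finalStatus(List<OrderEvent> events) {
        if (events.stream().anyMatch(e -> e.type() == EventType.ORDER_CANCELLED)) {
            return "CANCELLED";
        }
        return events.stream()
            .max(Comparator.comparingLong(OrderEvent::timestampMs))
            .map(e -> e.type().name())
            .orElse("UNKNOWN");
    }
}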

Comparing the Patterns

Aspect      | Active-Passive                          | Active-Active
------------|------------------------------------------|---------------------------------
RPO         | Replication lag (seconds to minutes)     | Near-zero for regional data
RTO         | Minutes to hours (config changes)        | Seconds (traffic already flows)
Complexity  | Lower normally, higher during failover   | Higher always
Consistency | Strong (single source of truth)          | Eventual (conflicts possible)

When to Use Active-Passive

  • Application can't handle conflicts
  • Single source of truth required
  • Failover is rare (quarterly or less)
  • Team lacks operational maturity for active-active

When to Use Active-Active

  • Users geographically distributed
  • Regional autonomy matters
  • Data can partition by region or entity
  • Application designed for eventual consistency

Testing Your DR

Set up monitoring alerts for replication lag and consumer health to detect issues before they become outages.

Don't wait for a real outage. Run quarterly failover drills:

  1. Simulate primary failure
  2. Time detection and decision
  3. Execute failover
  4. Verify consumers resume without data loss
  5. Measure actual RTO and RPO

Monitor MirrorMaker's heartbeat topic. If heartbeats stop, replication is broken:

kafka-console-consumer --bootstrap-server kafka-us-west:9092 \
  --topic heartbeats --timeout-ms 30000
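
Heartbeats tell you the pipeline is alive, not how far behind it is. One rough approach is comparing end offsets between a topic and its replicated counterpart; treat the result as an estimate, since the two clusters assign offsets independently:

import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.common.TopicPartition;

public class ReplicationLagCheck {
    static long endOffset(String bootstrap, TopicPartition tp) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrap);
        try (Admin admin = Admin.create(props)) {
            return admin.listOffsets(Map.of(tp, OffsetSpec.latest()))
                .partitionResult(tp).get().offset();
        }
    }

    public static void main(String[] args) throws Exception {
        // Compare the source topic with its prefixed replica on the target.
        long source = endOffset("kafka-us-east:9092", new TopicPartition("orders", 0));
        long target = endOffset("kafka-us-west:9092", new TopicPartition("us-east.orders", 0));
        System.out.printf("orders partition 0: ~%d records behind%n", source - target);
    }
}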

The technology works. Whether your DR works depends on whether you've practiced.

Book a demo to see how Conduktor Console provides visibility into replication lag and consumer health across multi-region deployments.