How Replication and ISR work in Apache Kafka

Apache Kafka

Initially devised as a messaging queue, Kafka evolved quickly to be a real-time event streaming platform. It is a distributed system used by thousands of organizations across the world to process transactions in real-time, to collect events, to collect metrics and do real-time analysis, to monitor logs, and much more.

Kafka works on the basic model of decoupling the source system and target system, so instead of Source systems (Producers) directly sending data to the target system, now Source systems publish data to Kafka and Target systems or Consumers subscribes to these events to get data directly from Kafka.

alt text

Why Kafka

Kafka is a distributed system, which means its cluster consists of one or more servers running Kafka and each server holds a subset of data. This distributed placement of data is very important for scalability as it allows applications to both read and write data from/to many brokers at the same time. In these distributed systems it is easy to scale in by adding more machines or scale-out by removing few servers from the cluster as per load on our system.

In Kafka, all the data are stored in the form of Topics, These topics are further divided into a number of partitions that allow parallelizing a topic by splitting the data in a particular topic across multiple brokers and to make the data fault-tolerant. Each topic can be replicated across different servers, so that there are always multiple servers having a copy of data in case few brokers go down.

overview screen

Here, we can see in Conduktor UI, that we have a cluster with:

  • 3 brokers
  • 1 topic with 8 partitions

We have created our topic with a replication factor of 1x, this means these 8 partitions are divided among these 3 brokers, so we can say the load of this topic is divided among these 3 brokers.

Conduktor shows you how the repartition is made among the brokers:

partitions per broker

The first broker 1001 has 2 partitions, and the remaining brokers 1002 and 1003 have 3 partitions each.

Replication in Kafka and ISR

Replication is the process of having multiple copies of the data available across different servers for purpose of availability in case one of the brokers goes down. In Kafka, replication happens at the partition level i.e. copies of the partition are maintained at multiple broker instances.

When we say a topic has a replication factor of 3, this means we will be having three copies of each of its partitions. Kafka considers that a record is committed when all replicas in the In-Sync Replica set (ISR) have confirmed that they have taking the record into account.

While creating a Kafka topic, we can define the number of copies we want to have for the data. We define this using the replication-factor config setting.

Let’s say we have created another topic with a 3x replication factor, here is how it is dispatched among brokers:

partitions details

Here in Conduktor we can see:

  • under the Replicas column, each partition has 3 replicas
  • under the ISR column, we have 3 replicas which mean all the replicas are In Sync with the partition leader, i.e. those followers that have the same data as the leader.

It’s not mandatory to have ISR equal to the number of replicas. By default, if a replica is or has been fully caught up with the leader in the last 10 seconds, it is said to be “in-sync”. The setting for this time period has a server default of 10 sec that can be overridden on a per topic basis.

What is a Partition Leader in Kafka?

In Kafka, there is a concept of leader for each partition.

At any point in time, a partition can have only one broker as the leader. And only that leader can serve the data for the partition. Followers will sync the data from the leader.

In the case of partition 4 and 1, broker 1001 is the leader and broker 1002 and 1003 are in sync with this leader, and replication happens from broker 1001 to broker 1002 and broker 1003.

What if a Kafka broker goes down?

Now an obvious question arises, what if one of our broker dies? or what if we have to bring one broker down for maintenance? what happens to the partition of that broker?

In the above image, Broker 1001 is the leader of partition 4 and 1, what if broker 1001 dies? what would happen to these partitions?

partitions details in error

As we lost broker 1001, it is removed from ISR after 10 seconds if not configured, and Broker 1002, and 1003 still have a copy of data, and can still serve the data.

Leader election will happen for partition 1 and 4. Now partition 1 has broker 1003 and partition 4 has broker 1002 as leaders as they were In Sync with the partition leader earlier.

If broker 1001 comes back to life, it will try to become a leader again after replicating the data. This is handled by Kafka itself, there is no manual intervention required.

partitions details back to a normal situation

In case of any unforeseen events like hardware failure or for maintenance purposes, if we want to bring the broker down, it is recommended to have at least two brokers in the ISR (the leader + a replica) so no loss of data is guaranteed by Kafka.

Conduktor, a Kafka Desktop client, made it easy to manage the replication status, in-sync replicas, broker status of your cluster at one place in an easy to use GUI without running multiple commands.