What is Apache Kafka? Part 2

Definition of Core Apache Kafka Concepts

Now that we've learned about Apache Kafka at a high level, let's dive in and learn how to use the tool. In this lesson, we will cover the basics of Kafka topics, producers, and consumers, and introduce Kafka Streams and Kafka Connect.

Apache Kafka Components. An overview of the relationship between Kafka clusters, Kafka topics, Kafka producers, and Kafka consumers.

What is a Kafka Topic?

Kafka topics organize related events. For example, we may have a topic called logs, which contains logs from an application. Topics are roughly analogous to SQL tables. However, unlike SQL tables, Kafka topics are not queryable. Instead, we must create Kafka producers and consumers to use the data. Data in a topic is stored as key-value pairs in binary format.
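To make the key-value, binary storage idea concrete, here is a minimal in-memory sketch. It is not Kafka client code: the Record class, the to_record and from_record helpers, and the choice of JSON encoded as UTF-8 are all assumptions made for illustration, and the topic is modelled as a plain Python list.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Record:
    """One event as a topic holds it: a key and a value, both raw bytes."""
    key: Optional[bytes]
    value: bytes


def to_record(key: str, event: dict) -> Record:
    # The application turns its own objects into bytes before they are stored.
    # JSON over UTF-8 is an assumed encoding chosen for this sketch.
    return Record(key=key.encode("utf-8"), value=json.dumps(event).encode("utf-8"))


def from_record(record: Record) -> tuple:
    # Reading the data back requires knowing how it was encoded.
    key = record.key.decode("utf-8") if record.key is not None else None
    return key, json.loads(record.value.decode("utf-8"))


# A hypothetical "logs" topic, modelled here as an append-only list.
logs_topic = []
logs_topic.append(to_record("web-1", {"level": "INFO", "msg": "started"}))
logs_topic.append(to_record("web-2", {"level": "ERROR", "msg": "disk full"}))
```

Because the stored bytes carry no structure of their own, whoever reads them has to apply the matching decoding step, which is what from_record does in this sketch.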

Read more on the Kafka Topics, Partitions & Offsets page.

What is a Kafka Producer?

Once a topic is created in Kafka, the next step is to send data into the topic. Applications that send data into a topic are known as Kafka producers. There are many ways to produce events to Kafka, but applications typically integrate with Kafka client libraries, which are available for Java, Python, Go, and many other languages.

Note that Kafka producers are deployed outside Kafka and only interact with Apache Kafka by sending data directly into the Kafka topics.
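The relationship between a producer and the cluster can be sketched with a toy in-memory model. This is not the API of any real Kafka client library: ToyCluster, ToyProducer, and their methods are hypothetical names invented for this illustration, and string values encoded as UTF-8 are an assumption.

```python
from collections import defaultdict


class ToyCluster:
    """Stand-in for a Kafka cluster: topic name -> append-only list of (key, value) bytes."""

    def __init__(self):
        self.topics = defaultdict(list)

    def append(self, topic, key, value):
        self.topics[topic].append((key, value))
        # Return the position of the new record within the topic.
        return len(self.topics[topic]) - 1


class ToyProducer:
    """Lives in the application, outside the cluster; it only ever sends records."""

    def __init__(self, cluster):
        self.cluster = cluster

    def send(self, topic, key, value):
        # Serialize to bytes before sending (UTF-8 is an assumed encoding).
        return self.cluster.append(topic, key.encode("utf-8"), value.encode("utf-8"))


cluster = ToyCluster()
producer = ToyProducer(cluster)
first = producer.send("logs", "web-1", "service started")
second = producer.send("logs", "web-1", "request handled")
```

The producer holds no data of its own in this model. Everything it sends ends up in the topic, in the order it was sent.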

Read more in Kafka Producers.

What is a Kafka Consumer?

Once a topic has been created and data has been produced into it, applications can make use of the data stream. Applications that pull event data from one or more Kafka topics are known as Kafka consumers. There are many ways to consume events from Kafka, but applications typically integrate with Kafka client libraries, which are available for Java, Python, Go, and many other languages. By default, consumers only read data that was produced after they first connected to the topic.

Note that Kafka consumers are deployed outside Kafka and only interact with Apache Kafka by reading data directly from Kafka topics.
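The default "only new data" behavior is easier to see in a small model. The sketch below is a toy, not a real Kafka client: ToyConsumer and its start parameter are hypothetical, and the topic is a plain list. In the real consumer, the comparable choice is made through the auto.offset.reset setting, whose values include latest (the default) and earliest.

```python
class ToyConsumer:
    """Pulls records from a topic (here a plain list) and remembers its own position."""

    def __init__(self, topic, start="latest"):
        if start not in ("latest", "earliest"):
            raise ValueError("start must be 'latest' or 'earliest'")
        self.topic = topic
        # "latest": begin at the current end of the topic, so only new records are seen.
        # "earliest": begin at position 0 and read everything in the topic.
        self.position = len(topic) if start == "latest" else 0

    def poll(self):
        # Return every record appended since the last poll, then advance the position.
        records = self.topic[self.position:]
        self.position = len(self.topic)
        return records


topic = ["event-1", "event-2"]            # produced before any consumer connects
default_consumer = ToyConsumer(topic)     # models the default behavior
replay_consumer = ToyConsumer(topic, start="earliest")

topic.append("event-3")                   # produced after both consumers connected

new_only = default_consumer.poll()        # contains only "event-3"
everything = replay_consumer.poll()       # contains all three events
```

Note that the consumer pulls data by calling poll. The topic never pushes records to it, and reading does not remove anything from the topic.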

Read more in Kafka Consumers.

What is Kafka Streams?

Once we have produced data from external systems into Kafka, we may want to process it using stream processing applications. Stream processing applications make use of streaming data stores like Apache Kafka to provide real-time analytics.

For example, let's assume we have a Kafka topic named twitter_tweets that contains a stream of all tweets on Twitter. From this topic, we may want to:

  • Filter only tweets that have over 10 likes or replies, to capture important tweets

  • Count the number of tweets received for each hashtag every minute

  • Combine the two to get trending topics and hashtags in real-time!

An illustrated example of how Apache Kafka and Kafka Streams can support stream processing applications for real-time analytics and other use cases.
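The three steps listed above can be sketched in plain Python. This is not Kafka Streams code (Kafka Streams is a Java library), and it runs over a fixed list rather than a continuous stream. The tweet field names, the sample data, and the reading of "over 10 likes or replies" as "more than 10 likes, or more than 10 replies" are assumptions made for this illustration.

```python
from collections import Counter, defaultdict

# Hypothetical tweet events; "ts" is seconds since an arbitrary start time.
tweets = [
    {"ts": 5,  "likes": 25, "replies": 0,  "hashtags": ["kafka"]},
    {"ts": 20, "likes": 2,  "replies": 1,  "hashtags": ["kafka", "java"]},
    {"ts": 50, "likes": 0,  "replies": 14, "hashtags": ["kafka"]},
    {"ts": 70, "likes": 40, "replies": 3,  "hashtags": ["flink"]},
]

WINDOW_SECONDS = 60


def is_important(tweet):
    # Step 1: keep tweets with more than 10 likes or more than 10 replies.
    return tweet["likes"] > 10 or tweet["replies"] > 10


def count_hashtags_per_window(events):
    """Step 2: group events into one-minute windows and count hashtags in each."""
    windows = defaultdict(Counter)
    for tweet in events:
        # Round the timestamp down to the start of its one-minute window.
        window_start = tweet["ts"] - tweet["ts"] % WINDOW_SECONDS
        windows[window_start].update(tweet["hashtags"])
    return dict(windows)


# Step 3: combine both, counting hashtags only over the important tweets.
important = [t for t in tweets if is_important(t)]
trending = count_hashtags_per_window(important)
```

A stream processing library runs this kind of logic continuously as new events arrive and manages the window state for us, which is the part that becomes complicated when written by hand with producers and consumers.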

To perform topic-level transformations within Apache Kafka, we can use streaming libraries built for this purpose instead of writing complicated producer and consumer code.

One such library is Kafka Streams, a stream processing framework that is released alongside Apache Kafka. Alternatives to Kafka Streams that you may have heard of include Apache Spark and Apache Flink.

What is Kafka Connect?

To get data into Apache Kafka, we have seen that we need Kafka producers. Over time, it became clear that many companies share the same types of data sources (databases, systems, etc.), so standardized open-source code for those sources benefits everyone. The same reasoning applies to Kafka consumers.

Kafka Connect is a tool that allows us to integrate popular systems with Kafka. It allows us to reuse existing components to source data into Kafka and sink data out of Kafka into other data stores.
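The source and sink roles can be illustrated with a toy model. This is not the Kafka Connect API, and real connectors are typically set up through configuration rather than code like this. The ListSource and DictSink classes, the sample rows, and the use of an id field as the lookup key are all hypothetical.

```python
class ListSource:
    """Hypothetical source component: copies rows from an external system into a topic."""

    def __init__(self, rows):
        self.rows = rows

    def run(self, topic):
        for row in self.rows:
            topic.append(row)


class DictSink:
    """Hypothetical sink component: copies topic records into an external store."""

    def __init__(self, store):
        self.store = store

    def run(self, topic):
        for record in topic:
            # Index each record by its "id" field (an assumption of this sketch).
            self.store[record["id"]] = record


database_rows = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]
topic = []          # stand-in for a Kafka topic
search_index = {}   # stand-in for a downstream data store

ListSource(database_rows).run(topic)   # source side: external system -> topic
DictSink(search_index).run(topic)      # sink side: topic -> external system
```

Neither component knows about the other. They only share the topic, so the same source or sink can be reused wherever the same kind of external system appears.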

Examples of popular Kafka connectors include:

  • Kafka Connect Source Connectors (producers): Databases (through the Debezium connector), JDBC, Couchbase, GoldenGate, SAP HANA, Blockchain, Cassandra, DynamoDB, FTP, IoT, MongoDB, MQTT, RethinkDB, Salesforce, Solr, SQS, Twitter, etc…

  • Kafka Connect Sink Connectors (consumers): S3, Elasticsearch, HDFS, JDBC, SAP HANA, DocumentDB, Cassandra, DynamoDB, HBase, MongoDB, Redis, Solr, Splunk, Twitter

An overview of how Apache Kafka with Kafka Connect helps to stream data between sources and sinks.
