# Kafka Connect CLI tutorial

*Learn how to use Kafka Connect in standalone mode*

Kafka Connect provides a scalable way to move data between Kafka and external systems. This tutorial demonstrates running a connector in standalone mode using real-time Wikipedia changes.

**What you'll learn:**
- What Kafka Connect is and when to use it
- How to configure a standalone connector
- How to set up the required properties files
- How to verify data flowing into Kafka

> **Live data stream**
> This tutorial uses the [Wikipedia recent changes stream](https://stream.wikimedia.org/v2/stream/recentchange) - a real-time feed of edits happening across Wikipedia.

## What is Kafka Connect?

Kafka Connect is a framework for streaming data between Kafka and external systems using reusable connectors. Instead of writing custom producer/consumer code for common integrations, you can use pre-built connectors.

![Flowchart: source systems (database, S3, APIs) feed source connectors, which write to Kafka topics, which sink connectors read and deliver to sink systems (Elasticsearch, data warehouse, S3)](https://www.conduktor.io/assets/kafka/diagrams/kafka-connect-cli-tutorial.svg)

```mermaid
flowchart LR
    subgraph Sources["Source Systems"]
        DB[(Database)]
        S3A[S3]
        API[APIs]
    end

    subgraph Connect["Kafka Connect"]
        SC["Source<br/>Connectors"]
        SK["Sink<br/>Connectors"]
    end

    subgraph Kafka["Kafka"]
        T[Topics]
    end

    subgraph Sinks["Sink Systems"]
        ES[Elasticsearch]
        DW[(Data Warehouse)]
        S3B[S3]
    end

    DB & S3A & API --> SC --> T --> SK --> ES & DW & S3B
```

| Connector type | Direction | Examples |
|----------------|-----------|----------|
| Source | External → Kafka | Debezium, JDBC, S3, MongoDB, Twitter |
| Sink | Kafka → External | Elasticsearch, S3, JDBC, HDFS, Splunk |

Find connectors on [Confluent Hub](https://www.confluent.io/hub/).

## How to use Kafka Connect in standalone mode?

To run Kafka Connect in standalone mode, follow these steps:

*   Download a Kafka Connect connector, either from GitHub or from [Confluent Hub](https://www.confluent.io/hub/)
*   Create a configuration file for your connector
*   Use the `connect-standalone.sh` CLI to start the connector

### Example: Kafka Connect standalone with Wikipedia data

Create the Kafka topic `wikipedia.recentchange` with 3 partitions:

```
kafka-topics --bootstrap-server localhost:9092 --topic wikipedia.recentchange --create --partitions 3 --replication-factor 1
```

Also create the dead letter queue topic `wikipedia.dlq` to catch any errors:

```
kafka-topics --bootstrap-server localhost:9092 --topic wikipedia.dlq --create --partitions 3 --replication-factor 1
```
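The two commands above differ only in the topic name, so they can be wrapped in a small function. This is a minimal sketch, assuming `kafka-topics` is on your `PATH` and the broker listens on `localhost:9092`; the names `create_topic`, `KAFKA_TOPICS` and `BOOTSTRAP_SERVER` are illustrative and not part of Kafka. The `--if-not-exists` flag makes the command safe to re-run.

```shell
#!/usr/bin/env bash
# Illustrative names (assumptions, not part of Kafka):
# KAFKA_TOPICS, BOOTSTRAP_SERVER, create_topic.
KAFKA_TOPICS="${KAFKA_TOPICS:-kafka-topics}"
BOOTSTRAP_SERVER="${BOOTSTRAP_SERVER:-localhost:9092}"

# Create a topic with the same settings used in this tutorial
# (3 partitions, replication factor 1). Does nothing if it already exists.
create_topic() {
  "$KAFKA_TOPICS" --bootstrap-server "$BOOTSTRAP_SERVER" \
    --create --if-not-exists \
    --topic "$1" --partitions 3 --replication-factor 1
}
```

With this in place, `create_topic wikipedia.recentchange` followed by `create_topic wikipedia.dlq` would create both topics.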

Download the release JAR and configuration from the [kafka-connect-sse v1.0 release](https://github.com/simplesteph/kafka-connect-sse/releases/download/v1.0/kafka-connect-sse.zip) and unzip the archive into `kafka_2.13-2.8.1/connectors/kafka-connect-sse`:

```
 ~/kafka_2.13-2.8.1/connectors  ls -R
kafka-connect-sse

./kafka-connect-sse:
connector.properties                            kafka-connect-sse-1.0-jar-with-dependencies.jar
```

Edit the configuration file `connectors/kafka-connect-sse/connector.properties` with the following properties:

```
name=sse-source-connector
tasks.max=1
connector.class=com.github.cjmatta.kafka.connect.sse.ServerSentEventsSourceConnector
topic=wikipedia.recentchange
sse.uri=https://stream.wikimedia.org/v2/stream/recentchange
errors.tolerance=all
errors.deadletterqueue.topic.name=wikipedia.dlq
```
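A misspelled or missing key in this file is easy to overlook, so it can help to check the file before starting the worker. The helper below is a rough sketch; `check_required_keys` is a hypothetical name, and it only checks that each key is present, not that its value is valid.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report keys that are missing from a .properties file.
# Usage: check_required_keys FILE KEY...
# Returns 0 if every key is present, 1 otherwise.
check_required_keys() {
  local file="$1"; shift
  local key missing=0
  for key in "$@"; do
    # Compare the text before the first "=" (trimmed) with the wanted key.
    if ! awk -F= -v k="$key" '
           { name = $1; gsub(/^[ \t]+|[ \t]+$/, "", name) }
           name == k { found = 1 }
           END { exit !found }
         ' "$file"; then
      echo "missing: $key"
      missing=1
    fi
  done
  return "$missing"
}
```

For the configuration in this tutorial, a call could look like `check_required_keys connector.properties name connector.class topic sse.uri`.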

Go to your Kafka installation directory (where your `bin` and `config` folders are) and edit the `config/connect-standalone.properties` file so that it contains:

```
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=true
value.converter.schemas.enable=true
offset.flush.interval.ms=10000

# EDIT BELOW IF NEEDED
bootstrap.servers=localhost:9092
offset.storage.file.filename=/tmp/connect.offsets
plugin.path=/Users/stephanemaarek/kafka_2.13-2.8.1/connectors
```

The last three lines are the most important to make everything work.

In particular, `plugin.path` tells Kafka Connect which folder holds the connectors you downloaded earlier.

**This must be an absolute path (not relative, and no shortcut with `~`) to your `connectors` directory.**

If this path is wrong, Kafka Connect cannot load the connector class and will stop shortly after starting.
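The absolute-path requirement can be checked before starting the worker. The following is a minimal sketch with a hypothetical function name, `check_plugin_path`; it assumes `plugin.path` is written on a single line and treats the value as a comma-separated list of directories.

```shell
#!/usr/bin/env bash
# Hypothetical helper: verify that every entry of plugin.path in a worker
# properties file is an absolute path to an existing directory.
# Usage: check_plugin_path FILE
check_plugin_path() {
  local file="$1" value entry status=0
  # Take everything after the first "=" on the plugin.path line.
  value=$(awk -F= '$1 == "plugin.path" { sub(/^[^=]*=/, ""); print; exit }' "$file")
  if [ -z "$value" ]; then
    echo "plugin.path is not set"
    return 1
  fi
  local IFS=','   # split the value on commas
  for entry in $value; do
    case "$entry" in
      /*) [ -d "$entry" ] || { echo "not a directory: $entry"; status=1; } ;;
      *)  echo "not an absolute path: $entry"; status=1 ;;
    esac
  done
  return "$status"
}
```

Running `check_plugin_path config/connect-standalone.properties` from the Kafka installation directory prints nothing and returns 0 when the path is usable.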

Next, start the Kafka Connect standalone worker, passing the worker configuration first and the connector configuration second:

```
connect-standalone ~/kafka_2.13-2.8.1/config/connect-standalone.properties ~/kafka_2.13-2.8.1/connectors/kafka-connect-sse/connector.properties
```

Data should now be flowing into the `wikipedia.recentchange` topic. Verify it with a console consumer:

```
kafka-console-consumer --bootstrap-server localhost:9092 --topic wikipedia.recentchange
```

## Standalone vs distributed mode

| Mode | Use case | Scalability |
|------|----------|-------------|
| Standalone | Development, testing, single tasks | Single worker |
| Distributed | Production, high availability | Multiple workers |

Standalone mode runs on a single machine with no fault tolerance. For production deployments, use distributed mode with multiple workers.
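One practical difference is how connectors are submitted: a distributed worker does not take connector properties files on the command line; connectors are created through the worker's REST API as JSON of the form `{"name": ..., "config": {...}}`. The sketch below converts a simple `connector.properties` file into that shape. `properties_to_connector_json` is a hypothetical helper, and it assumes values contain no double quotes or backslashes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn a simple key=value properties file into the JSON
# body expected when creating a connector through the Connect REST API.
# Assumption: values contain no double quotes or backslashes (no escaping).
properties_to_connector_json() {
  awk -F= '
    /^[ \t]*(#|$)/ { next }                      # skip comments and blank lines
    {
      key = $1
      value = substr($0, index($0, "=") + 1)     # keep any "=" inside the value
      if (key == "name") { name = value; next }  # name goes at the top level
      sep = (config == "") ? "" : ","
      config = config sep sprintf("\"%s\":\"%s\"", key, value)
    }
    END { printf "{\"name\":\"%s\",\"config\":{%s}}\n", name, config }
  ' "$1"
}
```

Assuming a distributed worker is listening on the default REST port 8083, the output could be submitted with `properties_to_connector_json connector.properties | curl -s -X POST -H "Content-Type: application/json" --data @- http://localhost:8083/connectors`.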

> **See it in practice with Conduktor**
> [Conduktor Console](https://docs.conduktor.io/guide/manage-kafka/kafka-resources/kafka-connect) provides a visual interface for managing Kafka Connect clusters, deploying connectors and monitoring connector health and throughput.

## Next steps

- [Start programming with Kafka](https://www.conduktor.io/kafka/kafka-programming-tutorials) to build producers and consumers in code
- [Monitor Kafka clusters](https://www.conduktor.io/kafka/kafka-monitoring-and-operations) including Connect workers
- Explore [Confluent Hub](https://www.confluent.io/hub/) for more available connectors
