Apache Kafka – Introduction

Today let’s talk about Kafka and how to quickly start working with that technology.

Quick introduction to Kafka

In the simplest words, Kafka is a pub-sub system. What does that mean? It means that two communicating parties do not send messages directly to each other. Instead, one of them (the producer) sends a message to some agreed place, and the second one (the consumer) subscribes to that same place for messages. In the end, a message sent by the producer is forwarded to the consumer. A big advantage of this approach is simple scalability: we can add many producers and consumers, and as long as they use the same agreed place, everything should be fine.

Speaking of messages: for Kafka a message is just a byte array without any meaning or structure. A message can also have some metadata attached to it; we call that metadata a key. The key does not have a meaning for Kafka either, but its hash may be used to select where the message will be stored.
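
To make this concrete, here is a minimal sketch that sends a single, key-less byte-array value. It assumes the kafka-clients Java library is on the classpath, a broker on localhost:9092 and a topic named test (as in the quick start below); the payload text is made up.

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.kafka.common.serialization.ByteArraySerializer
import java.util.Properties

object BytesProducerSketch extends App {
  val props = new Properties()
  props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
  props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[ByteArraySerializer].getName)
  props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[ByteArraySerializer].getName)

  val producer = new KafkaProducer[Array[Byte], Array[Byte]](props)

  // For Kafka the value is just bytes; no key is attached to this record.
  val value = "hello kafka".getBytes("UTF-8")
  producer.send(new ProducerRecord[Array[Byte], Array[Byte]]("test", value))
  producer.close()
}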

The fact that Kafka does not try to understand messages might be considered an advantage: we do not need to send our data in any forced format. At the same time we often do want to use some format for messages (like JSON) in order to enforce consistency. That format is usually referred to as a schema.
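
As a tiny, dependency-free illustration (the case class and its field names are made up): agreeing that every value is a JSON document with a fixed structure is already a simple schema.

object JsonSchemaSketch extends App {
  // Made-up entity; the "schema" is the agreed JSON structure of the value.
  final case class UserUpdate(id: Long, name: String)

  // Hand-rolled JSON to keep the sketch self-contained; a real app would use a JSON library.
  def toJson(u: UserUpdate): String =
    s"""{"id":${u.id},"name":"${u.name}"}"""

  println(toJson(UserUpdate(42L, "alice"))) // {"id":42,"name":"alice"}
}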

Kafka usually does not send each message separately. It rather sends them in batches to reduce the network transfer overhead.
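
Batching is controlled on the producer side. Here is a short sketch of the two most relevant settings; the values below are arbitrary examples, not recommendations.

import org.apache.kafka.clients.producer.ProducerConfig
import java.util.Properties

object BatchingConfigSketch extends App {
  val props = new Properties()
  props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
  // Collect up to 32 KB of records per partition into one batch...
  props.put(ProducerConfig.BATCH_SIZE_CONFIG, "32768")
  // ...and wait up to 10 ms for a batch to fill up before sending it anyway.
  props.put(ProducerConfig.LINGER_MS_CONFIG, "10")
  // These properties would be passed to a KafkaProducer as in the other sketches.
}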

The place where the producer sends messages and the consumer later gets them is called a topic. A topic can be further split into many partitions. This allows Kafka to scale up: firstly, because each partition can be hosted on a different server, and secondly, because messages sent to the topic will be distributed between partitions based on the hash of the key.
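
Here is a sketch of that behaviour, assuming the same local broker and test topic; the key user-1 is made up. All five records share the key, so the default partitioner, which hashes the key, puts them into the same partition and preserves their order.

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer
import java.util.Properties

object KeyedProducerSketch extends App {
  val props = new Properties()
  props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
  props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
  props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)

  val producer = new KafkaProducer[String, String](props)

  // Same key => same partition, so these updates stay in order for one consumer.
  for (i <- 1 to 5) {
    producer.send(new ProducerRecord[String, String]("test", "user-1", s"update-$i"))
  }
  producer.close()
}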

From the consumer perspective, an offset is an important concept. The offset tells us which message will be read next. The offset is kept by the consumer, but Kafka itself also stores it. Because of that, if the consumer dies, after a restart it can continue from the last message instead of starting from the beginning. Each message gets a unique offset within the scope of a partition. Consecutive offsets have increasing values, but we cannot assume what the exact next value will be.
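
Below is a hedged consumer sketch showing offsets in practice. The group id intro-group is made up; enable.auto.commit lets Kafka store the offsets for us, and auto.offset.reset=earliest makes a brand-new group start from the beginning of the topic.

import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}
import org.apache.kafka.common.serialization.StringDeserializer
import java.time.Duration
import java.util.{Collections, Properties}
import scala.jdk.CollectionConverters._

object OffsetConsumerSketch extends App {
  val props = new Properties()
  props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
  props.put(ConsumerConfig.GROUP_ID_CONFIG, "intro-group")
  props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)
  props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)
  props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")
  props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "true")

  val consumer = new KafkaConsumer[String, String](props)
  consumer.subscribe(Collections.singletonList("test"))

  // Every record carries the offset it was assigned within its partition.
  val records = consumer.poll(Duration.ofSeconds(1))
  for (record <- records.asScala) {
    println(s"partition=${record.partition()} offset=${record.offset()} value=${record.value()}")
  }
  consumer.close()
}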

A single instance of a Kafka server is called a broker. The role of the broker is to receive messages, assign them offsets and persist them on disk. The broker also serves messages to consumers. Consumers, on their side, usually do not work alone but in groups. Each such group (called a consumer group) splits reading from the partitions between its members. If one of the consumers crashes, the others rebalance and take over its partitions.

On the other hand, Kafka brokers can also work in bigger groups; such groups are called Kafka clusters. Among all of the brokers in a cluster, one of them acts as the cluster controller. The responsibility of the controller is to assign partitions to brokers and to watch for broker crashes. Each partition is owned by a single broker (the partition leader), but more brokers can be assigned to a single partition (the data will be replicated between them), so if the partition leader crashes, another broker will start acting as the leader.

The last important concept to know is retention. Kafka is able to store messages, so if for some reason a consumer cannot read, the messages will wait. There are two main retention policies. One of them uses an amount of time to decide whether given messages should still be stored; after that time the messages are deleted (in fact this happens for groups of messages rather than single messages, but that is not important right now). The other policy is to keep messages as long as they do not exceed some amount of disk space.

Kafka can also perform log compaction internally, so older messages that have the same key within a partition can be removed and only the newest one will still exist on the broker. This is one of the reasons why we cannot assume offset values.
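
Both retention and compaction are configured per topic. Below is a sketch using the AdminClient; the topic names events-by-time and entity-state and their settings are made up (retention.bytes would be the size-based counterpart of retention.ms).

import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}
import org.apache.kafka.common.config.TopicConfig
import java.util.Properties
import scala.jdk.CollectionConverters._

object TopicConfigSketch extends App {
  val props = new Properties()
  props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
  val admin = AdminClient.create(props)

  // Time-based retention: keep messages for 7 days, then delete them.
  val timeLimited = new NewTopic("events-by-time", 3, 1.toShort)
    .configs(Map(TopicConfig.RETENTION_MS_CONFIG -> (7L * 24 * 60 * 60 * 1000).toString).asJava)

  // Compacted topic: only the newest message per key survives compaction.
  val compacted = new NewTopic("entity-state", 3, 1.toShort)
    .configs(Map(TopicConfig.CLEANUP_POLICY_CONFIG -> TopicConfig.CLEANUP_POLICY_COMPACT).asJava)

  admin.createTopics(java.util.Arrays.asList(timeLimited, compacted)).all().get()
  admin.close()
}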

Quick start

Now that we have the most basic knowledge about Kafka, let's try to use it. On the Kafka webpage there is a quick start guide https://kafka.apache.org/quickstart that we can use to try Kafka out. After unpacking the archive we need to start Kafka. In order to do that, first we need to start Zookeeper (it gives the Kafka cluster the ability to coordinate) and then Kafka itself:

bin/zookeeper-server-start.sh config/zookeeper.properties
 bin/kafka-server-start.sh config/server.properties
We need to have something to write to and read from - so let's create a topic:
 bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
Now we can check that the topic was created:
 bin/kafka-topics.sh --list --zookeeper localhost:2181
Now let's start a consumer listening on that topic:
 bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning
And finally let's send some messages - first we will start the producer:
 bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test

Now whatever we write in the producer console will be sent to the broker, and we can see the consumer reading those messages.

What next?

Using console consumers/producers is nice because we can quickly check that our configuration is correct. But it would be even more fun to use Kafka in Scala apps. So next time we will take a look at a simple Producer/Consumer in Scala.

2 Replies to “Apache Kafka – Introduction”

    1. In general, in situations where you would like to have a scalable messaging system with persistence and replication of messages, especially when you can process groups of messages in parallel.

      For example, you might have some entity in your system that is updated very often. Each entity has some kind of unique key, and you might send each update of that entity to Kafka with the id of the entity as the message key. Kafka will distribute the messages across partitions, so multiple consumers can process them in parallel; but since you use the entity id as the key, all updates for the same entity will be assigned to the same partition, so you are sure that updates of this entity will be processed in order by one consumer. Besides that you have a lot of failover safety: if one broker crashes, other brokers have copies of its partitions; if a consumer crashes, one of the other consumers in the group will take care of the partitions assigned to the crashed one.

      Kafka holds information about offsets for each consumer group that reads from a topic, so you might have multiple consumer groups reading the same topic in parallel (like multiple microservices interested in updates of your entity, where each service would like to process them without having a dependency on the others).

      There are also some libraries that integrate with Kafka and give you more stream-like/reactive semantics for processing information from one or more topics, like Kafka Streams or Akka Streams. Having those, you can use Kafka in any scenario where you would like some kind of stream-processing/reactive application; in this case Kafka will serve as a source of data to process.
