Apache Kafka is a distributed event streaming platform designed to handle real-time data feeds with high throughput and fault tolerance.
Note - Event streaming is the practice of capturing data as continuous streams of events and processing them as soon as a change happens.
To understand how Kafka works internally, let's break it down into its key components and their interactions:
Producers - Producers are responsible for publishing data to Kafka. When a producer sends a message, it specifies a topic and optionally a partition key. Kafka uses the partition key to determine which partition the message should be written to. If no key is provided, the producer distributes messages across partitions in a round-robin fashion, so that load is spread evenly.
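To make this concrete, here is a minimal producer sketch using Kafka's Java client. The broker address localhost:9092, the topic name "orders", and the keys and values are illustrative placeholders, not part of any real setup.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class ProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Keyed message: every record with key "user-42" lands on the same partition.
            producer.send(new ProducerRecord<>("orders", "user-42", "order created"));
            // Un-keyed message: the producer's partitioner spreads records across partitions
            // (round-robin or sticky batching, depending on the client version).
            producer.send(new ProducerRecord<>("orders", "anonymous order created"));
        }
    }
}
```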
Consumers - Consumers read data from Kafka topics. They subscribe to one or more topics and receive messages as they are published. Consumers read messages from oldest to newest using the offset, a unique identifier of each message's position within a partition. Kafka allows multiple consumers to read from the same topic in parallel, enabling horizontal scalability.
(Image source: https://www.cloudkarafka.com/blog/2016-11-30-part1-kafka-for-beginners-what-is-apache-kafka.html)
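A matching consumer sketch with the Java client; the group id "order-processors" and the topic "orders" are the same illustrative placeholders as above.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class ConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "order-processors");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders"));
            while (true) {
                // Poll returns whatever has been published since the last fetch.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```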
Topics - A topic is a named stream of data. Messages in Kafka are organized into topics, which are essentially feeds of messages. Each topic can be split into multiple partitions, which allows for parallelism and scalability. Topics are also highly fault-tolerant because their partitions can be replicated across multiple Kafka brokers.
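Topics can be created programmatically with the Java AdminClient. In this sketch the topic name "orders", 6 partitions, and a replication factor of 3 are arbitrary example values.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import java.util.Collections;
import java.util.Properties;

public class CreateTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions for parallelism, each copied to 3 brokers for fault tolerance.
            NewTopic orders = new NewTopic("orders", 6, (short) 3);
            admin.createTopics(Collections.singletonList(orders)).all().get();
        }
    }
}
```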
Partitioning - When a message is produced to Kafka, it is assigned to a partition based on the specified partition key (if provided) or by the producer's partitioning algorithm. Kafka guarantees that messages with the same partition key always go to the same partition (as long as the partition count does not change), which preserves ordering within a partition.
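Conceptually, key-based partition selection boils down to hashing the key and taking it modulo the number of partitions. The sketch below uses String.hashCode purely for illustration; Kafka's actual default partitioner hashes the serialized key bytes with murmur2.

```java
public class PartitionSketch {
    // Illustrative only: Kafka's real default partitioner uses murmur2 over the key bytes.
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        // The same key always maps to the same partition for a fixed partition count.
        System.out.println(partitionFor("user-42", 6));
        System.out.println(partitionFor("user-42", 6)); // identical to the line above
    }
}
```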
Brokers - Kafka brokers are the servers responsible for storing and managing the topic partitions. They are the core components of the Kafka cluster. Each broker can handle multiple partitions across different topics.
(Image source: https://www.cloudkarafka.com/blog/2016-11-30-part1-kafka-for-beginners-what-is-apache-kafka.html)
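As a small illustration, the Java AdminClient can list the brokers that currently make up a cluster; the broker address is again a placeholder.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.common.Node;
import java.util.Properties;

public class ListBrokersSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // Each Node returned here is one broker in the cluster.
            for (Node broker : admin.describeCluster().nodes().get()) {
                System.out.printf("broker id=%d at %s:%d%n", broker.id(), broker.host(), broker.port());
            }
        }
    }
}
```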
Replication - Kafka provides fault tolerance through replication. Each partition can be replicated across multiple brokers. For every partition, exactly one replica is designated as the leader and the others are followers. The leader handles all writes for the partition, while the followers simply replicate the leader's data. Because copies of each partition exist on other brokers, the data survives even if a broker fails. When a leader goes down, the cluster controller automatically elects a new leader from the remaining in-sync replicas.
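A sketch of inspecting leaders and replicas for the hypothetical "orders" topic with the Java AdminClient (the allTopicNames() accessor assumes a reasonably recent client version).

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;
import java.util.Collections;
import java.util.Properties;

public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc = admin.describeTopics(Collections.singletonList("orders"))
                    .allTopicNames().get().get("orders");
            for (TopicPartitionInfo p : desc.partitions()) {
                // leader() is the broker serving writes; isr() lists the in-sync replicas.
                System.out.printf("partition=%d leader=%d replicas=%s isr=%s%n",
                        p.partition(), p.leader().id(), p.replicas(), p.isr());
            }
        }
    }
}
```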
ZooKeeper - Historically, Kafka used Apache ZooKeeper for cluster coordination, leader election, and metadata management. However, recent versions of Kafka are replacing the ZooKeeper dependency with KRaft, a built-in Raft-based protocol in which the brokers manage cluster metadata themselves.
Offsets - Each partition contains an ordered stream of messages. Every message within a partition is assigned an incremental ID that represents its position within the partition; this ID is referred to as an offset.
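Offsets can also be used to re-read data from a specific position. A minimal sketch, assuming the "orders" topic from earlier and using the Java client's assign/seek API (partition 0 and offset 42 are arbitrary example values):

```java
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class SeekSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition partition0 = new TopicPartition("orders", 0);
            consumer.assign(Collections.singletonList(partition0));
            consumer.seek(partition0, 42L); // jump to offset 42 and replay from there
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            records.forEach(r -> System.out.printf("offset=%d value=%s%n", r.offset(), r.value()));
        }
    }
}
```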
Consumer Groups - Consumers are organized into consumer groups. Within a group, each partition is assigned to exactly one consumer, so each message in a topic partition is processed by only one member of the group. This allows multiple consumers to work together to process a large volume of data in parallel.
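For example, running two copies of the consumer sketch above with the same group.id would split the topic's partitions between them. The group's current assignment can be inspected with the AdminClient; "order-processors" is the placeholder group from earlier.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.ConsumerGroupDescription;
import org.apache.kafka.clients.admin.MemberDescription;
import java.util.Collections;
import java.util.Properties;

public class GroupSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            ConsumerGroupDescription group = admin
                    .describeConsumerGroups(Collections.singletonList("order-processors"))
                    .all().get().get("order-processors");
            for (MemberDescription member : group.members()) {
                // Each partition appears under exactly one member of the group.
                System.out.printf("member=%s partitions=%s%n",
                        member.consumerId(), member.assignment().topicPartitions());
            }
        }
    }
}
```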
Commit Logs - Kafka stores messages in a distributed commit log. Each partition is essentially an ordered, immutable sequence of messages. Messages are appended to the end of the log and assigned a sequential offset.
Message Retention - Kafka allows configuring retention policies for topics, specifying how long messages should be kept or how much data a partition may hold. This allows Kafka to handle use cases ranging from real-time processing to data storage and replay.
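Retention is configured per topic. A minimal sketch creating a hypothetical "orders-archive" topic that keeps messages for seven days or until a partition reaches roughly 1 GB; retention.ms and retention.bytes are standard topic-level configs, while the topic name and sizes are example values.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;

public class RetentionSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic archive = new NewTopic("orders-archive", 3, (short) 3)
                    // Delete old segments after 7 days, or once a partition exceeds ~1 GB,
                    // whichever limit is hit first.
                    .configs(Map.of(
                            "retention.ms", String.valueOf(7L * 24 * 60 * 60 * 1000),
                            "retention.bytes", String.valueOf(1024L * 1024 * 1024)));
            admin.createTopics(Collections.singletonList(archive)).all().get();
        }
    }
}
```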
Overall, Kafka's architecture is designed for high scalability, fault tolerance, and real-time processing of streaming data. It's built to handle massive volumes of data with low latency and high throughput, making it a popular choice for building data pipelines, real-time analytics, and event-driven architectures.