Kafka Architecture
Core Components
Broker:
Kafka runs as a cluster of servers called brokers. Each broker stores data, serves clients, and handles read/write operations. Brokers are identified by unique IDs and coordinate through ZooKeeper.
Topic:
A logical channel where data (messages) is published and consumed. Topics are divided into partitions for parallelism and scalability. Each partition is an ordered, immutable log of messages.
Partition:
A topic is split into multiple partitions, distributed across brokers. Partitions enable parallel processing and load balancing. Each partition has a leader (hosted on one broker) that serves reads and writes, and replicas on other brokers for fault tolerance.
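As a minimal sketch of how partitions and replicas are declared, a topic can be created with the Java AdminClient; the topic name, counts, and broker address below are placeholder assumptions, not values from this document:

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;
    import java.util.List;
    import java.util.Properties;

    public class CreateTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            try (AdminClient admin = AdminClient.create(props)) {
                // 3 partitions, replication factor 2: each partition gets a
                // leader on one broker and a follower replica on another.
                NewTopic topic = new NewTopic("TopicA", 3, (short) 2);
                admin.createTopics(List.of(topic)).all().get(); // block until created
            }
        }
    }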
Producer:
Clients that publish messages to Kafka topics. Producers write to the leader replica of the target partition; Kafka handles replication to the followers.
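A minimal producer sketch with the official Java client; the broker address, topic, key, and value are placeholders. The key controls which partition the message lands in:

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;
    import java.util.Properties;

    public class SimpleProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.ACKS_CONFIG, "all"); // wait for all in-sync replicas

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // The key ("user-42") is hashed to pick the target partition,
                // so all messages with the same key land in the same partition.
                producer.send(new ProducerRecord<>("TopicA", "user-42", "hello"),
                    (metadata, e) -> {
                        if (e == null) {
                            System.out.printf("wrote to partition %d at offset %d%n",
                                metadata.partition(), metadata.offset());
                        }
                    });
            } // close() flushes any buffered messages
        }
    }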
Consumer:
Clients that subscribe to topics and read messages. Consumers belong to consumer groups for load balancing; within a group, each partition is consumed by exactly one consumer. Consumers track their progress using offsets (their position within the partition).
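A matching consumer sketch, again with placeholder broker, topic, and group names. Joining a group triggers partition assignment, and committed offsets record progress:

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;

    public class SimpleConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "demo-group"); // consumer group
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
            props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("TopicA")); // group coordinator assigns partitions
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> r : records) {
                        System.out.printf("partition=%d offset=%d value=%s%n",
                            r.partition(), r.offset(), r.value());
                    }
                    consumer.commitSync(); // persist offsets to track progress
                }
            }
        }
    }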
ZooKeeper:
A distributed coordination service that Kafka uses for:
Managing broker metadata (e.g., which broker leads a partition).
Tracking cluster state and configuration.
Handling leader election and failover.
Kafka has been removing the ZooKeeper dependency: newer versions offer KRaft mode, which moves metadata management into Kafka itself.
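As a rough sketch of what a ZooKeeper-free setup looks like, a minimal single-node KRaft server.properties might contain something like the following; the node ID, hostnames, and ports are placeholder assumptions:

    # Hypothetical single-node KRaft configuration (no ZooKeeper)
    process.roles=broker,controller
    node.id=1
    controller.quorum.voters=1@localhost:9093
    listeners=PLAINTEXT://localhost:9092,CONTROLLER://localhost:9093
    controller.listener.names=CONTROLLER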
Data Flow
Producers send messages to a topic, optionally specifying a key that determines the target partition; keyless messages are spread across partitions.
Messages are appended to the leader partition and replicated to follower replicas for durability.
Consumers in a group read from their assigned partitions, pulling messages in order and tracking position with offsets.
Kafka retains messages for a configurable period (or size limit), allowing replay or late consumption.
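For example, retention can be adjusted per topic with the stock kafka-configs.sh tool; the topic name and values here are illustrative (604800000 ms is 7 days, 1073741824 bytes is 1 GiB):

    kafka-configs.sh --bootstrap-server localhost:9092 --alter \
      --entity-type topics --entity-name TopicA \
      --add-config retention.ms=604800000,retention.bytes=1073741824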
Key Architectural Features
Scalability: Add brokers or partitions to handle more data or traffic.
Fault Tolerance: Replicas keep data available if a broker fails; leaders are re-elected automatically.
High Throughput: Partitioning and batching enable millions of messages per second (see the tuning sketch after this list).
Durability: Messages are persisted to disk, with configurable retention policies.
Low Latency: Efficient log-based storage and zero-copy I/O.
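As an illustration of the batching/durability knobs behind these properties, a producer might set configuration like the following; the values are placeholders for discussion, not recommendations:

    # Hypothetical producer tuning: trade a little latency for throughput
    linger.ms=10            # wait up to 10 ms to fill larger batches
    batch.size=65536        # 64 KB batches amortize per-request overhead
    compression.type=lz4    # compress whole batches on the wire
    acks=all                # durability: wait for all in-sync replicas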
Example Workflow
A producer sends a message to TopicA, which has 3 partitions.
The message (based on its key) goes to Partition1, hosted on Broker1 (the leader).
Broker1 replicates the message to Broker2 and Broker3 (followers).
A consumer group with two consumers subscribes to TopicA:
Consumer1 reads from Partition1 and Partition2.
Consumer2 reads from Partition3.
ZooKeeper manages cluster metadata, keeping brokers coordinated (in KRaft mode, Kafka's own controller quorum takes over this role).
Additional Components
Kafka Connect: For integrating Kafka with external systems (e.g., databases, S3).
Kafka Streams: A client library for building real-time stream processing applications (see the sketch after this list).
Schema Registry: Manages data schemas for serialization (often used with Avro).
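A minimal Kafka Streams sketch, assuming String-serialized input on TopicA; the application ID and the output topic TopicA-upper are hypothetical names:

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;
    import java.util.Properties;

    public class UppercaseStream {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "demo-app");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            // Read each record from TopicA, uppercase it, write it back out.
            KStream<String, String> source = builder.stream("TopicA");
            source.mapValues(v -> v.toUpperCase()).to("TopicA-upper");

            KafkaStreams streams = new KafkaStreams(builder.build(), props);
            streams.start();
        }
    }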
This architecture makes Kafka ideal for use cases like event streaming, log aggregation, and real-time analytics. For deeper insights, explore Kafka's documentation or tools like Confluent for enterprise setups.