Kafka-Utils Documentation

Release 0.5.3

Yelp Inc.

November 09, 2016



Kafka-Utils is a library containing tools to interact with Kafka clusters and manage them. It provides utilities
for listing all clusters, balancing the partition distribution across brokers and replication groups, managing
consumer groups, performing rolling restarts of a cluster, and running cluster health checks.
For more information about Apache Kafka, see the official Kafka documentation.

How to install

$ pip install kafka-utils

List available clusters.

$ kafka-utils
Cluster type sample_type:
Cluster name: cluster-1
broker list: cluster-elb-1:9092

2.1 Configuration
Kafka-Utils reads the cluster configuration needed to access Kafka clusters from yaml files. Each cluster is identified
by type and name. Multiple clusters of the same type should be listed in the same type.yaml file. The yaml files
are read from $KAFKA_DISCOVERY_DIR, $HOME/.kafka_discovery and /etc/kafka_discovery, in that
order; a location listed earlier overrides the ones listed after it.
Sample configuration for sample_type cluster at /etc/kafka_discovery/sample_type.yaml:

---
  clusters:
    cluster-1:
      broker_list:
        - "cluster-elb-1:9092"
      zookeeper: ",,"
    cluster-2:
      broker_list:
        - "cluster-elb-2:9092"
      zookeeper: ",,"
  local_config:
    cluster: cluster-1

For example the kafka-cluster-manager command:

$ kafka-cluster-manager --cluster-type sample_type stats

will pick up the default cluster cluster-1 from the local_config at /etc/kafka_discovery/sample_type.yaml and display statistics for that cluster.
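
The lookup order described in this section can be sketched in a few lines of Python. This is only an illustration of the search order, not the kafka-utils implementation: the function names discovery_dirs and find_config are hypothetical, and the sketch simply returns the first matching file instead of merging settings from several files.

```python
import os


def discovery_dirs(environ=None):
    """Return the candidate config directories, highest priority first."""
    environ = os.environ if environ is None else environ
    dirs = []
    # $KAFKA_DISCOVERY_DIR has the highest priority when it is set.
    if environ.get("KAFKA_DISCOVERY_DIR"):
        dirs.append(environ["KAFKA_DISCOVERY_DIR"])
    # Then the per-user directory.
    if environ.get("HOME"):
        dirs.append(os.path.join(environ["HOME"], ".kafka_discovery"))
    # The system-wide directory is always the last fallback.
    dirs.append("/etc/kafka_discovery")
    return dirs


def find_config(cluster_type, environ=None):
    """Return the path of the first existing <cluster_type>.yaml, or None."""
    for directory in discovery_dirs(environ):
        path = os.path.join(directory, cluster_type + ".yaml")
        if os.path.isfile(path):
            return path
    return None
```

Calling find_config("sample_type") on a machine that only has /etc/kafka_discovery/sample_type.yaml would return that path.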

2.2 Cluster Manager

This tool provides a set of commands to modify the cluster topology and to get metrics for different states
of the cluster. These include balancing the cluster, decommissioning brokers, and evaluating metrics for the current
state of the cluster. Each command is described below.

2.2.1 Replication group parser

The tool supports the grouping of brokers in replication groups. kafka-cluster-manager will try to distribute
replicas of the same partition across different replication groups. The user can use this feature to map replication groups
to failure zones, so that a balanced cluster will be more resilient to zone failures.
By default all brokers are considered as part of a single replication group. Custom replication group parsers can be
defined by extending the class ReplicationGroupParser as shown in the example below:
from kafka_utils.kafka_cluster_manager.cluster_info.replication_group_parser \
    import ReplicationGroupParser


class SampleGroupParser(ReplicationGroupParser):

    def get_replication_group(self, broker):
        """Extract the replication group from a Broker instance.
        Suppose each broker hostname is in the form broker-rack<n>, this
        function will return "rack<n>" as replication group
        """
        if broker.inactive:
            # Can't extract replication group from inactive brokers because they
            # don't have metadata
            return None
        hostname = broker.metadata['host']
        return hostname.rsplit('-', 1)[1]

Create a file named sample_parser.py in a directory that also contains an __init__.py file, for example $HOME/parser:

$HOME/parser
  |-- __init__.py
  |-- sample_parser.py

To use the custom parser:

$ kafka-cluster-manager --cluster-type sample_type --group-parser $HOME/parser:sample_parser rebalanc
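
To see what the sample parser's hostname logic returns without a running cluster, the same rsplit expression can be tried on a stand-in object. FakeBroker below is a hypothetical test double that only mimics the two attributes the parser reads (inactive and metadata); it is not a kafka-utils class.

```python
class FakeBroker(object):
    """Hypothetical stand-in exposing only what the parser reads."""

    def __init__(self, host, inactive=False):
        self.inactive = inactive
        # Inactive brokers have no metadata, as noted in the parser above.
        self.metadata = None if inactive else {'host': host}


def get_replication_group(broker):
    """Same logic as SampleGroupParser.get_replication_group."""
    if broker.inactive:
        return None
    # Split on the last '-' only, so extra dashes in the prefix are harmless.
    return broker.metadata['host'].rsplit('-', 1)[1]


print(get_replication_group(FakeBroker('broker-rack1')))      # rack1
print(get_replication_group(FakeBroker('kafka-east-rack2')))  # rack2
print(get_replication_group(FakeBroker('broker-rack3', inactive=True)))  # None
```

Note that a hostname without any dash would raise an IndexError, so a production parser may want to handle that case explicitly.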

2.2.2 Cluster rebalance

This command re-distributes partitions across the cluster to bring it into a more balanced state. The goal is to
balance load across the cluster based on the distribution of replicas across replication groups (availability zones
or racks) and the distribution of partitions and leaderships across brokers. The imbalance of a cluster is
characterized in 4 different layers.
Note: The tool is very conservative while rebalancing the cluster, ensuring that large assignments are executed in
smaller chunks, controlling the number of partition movements and preferred-leader changes.

Replica distribution
Uniform distribution of replicas across replication groups.
$ kafka-cluster-manager --cluster-type sample_type rebalance --replication-groups

Partition distribution
Uniform distribution of partitions across groups and brokers.
$ kafka-cluster-manager --cluster-type sample_type rebalance --brokers

Broker as leaders distribution

Some brokers might be elected as leaders for more partitions than others. This creates load-imbalance for these
brokers. Balancing this layer ensures the uniform election of brokers as leaders.
Note: Rebalancing this layer does not move any partitions across brokers.
It only re-elects leaders for partitions so that leadership is spread uniformly across brokers. The tool does
not take partition size into account.
$ kafka-cluster-manager --cluster-type sample_type rebalance --leaders
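
Leader imbalance can be illustrated with a small sketch. It assumes the usual Kafka convention that the first replica in a partition's replica list is its preferred leader. The function names and the max-minus-min imbalance measure are illustrative choices, not the metric kafka-utils itself computes.

```python
from collections import Counter


def leader_counts(assignment, brokers):
    """Count how many partitions each broker leads.

    assignment maps (topic, partition) to a replica list whose first
    entry is treated as the preferred leader (an assumption).
    """
    counts = Counter({broker: 0 for broker in brokers})
    for replicas in assignment.values():
        counts[replicas[0]] += 1
    return dict(counts)


def leader_imbalance(counts):
    """Spread between the most and least loaded leader."""
    return max(counts.values()) - min(counts.values())


before = {('t', 0): [1, 2], ('t', 1): [1, 3], ('t', 2): [1, 2], ('t', 3): [2, 3]}
# Same replica sets, only the order (and so the leader) changes.
after = {('t', 0): [1, 2], ('t', 1): [3, 1], ('t', 2): [2, 1], ('t', 3): [2, 3]}

print(leader_imbalance(leader_counts(before, [1, 2, 3])))  # 3
print(leader_imbalance(leader_counts(after, [1, 2, 3])))   # 1
```

In the example no replica moves to another broker; only the order inside each replica list changes, which matches the note that this layer does not move partitions.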

Topic-partition distribution
Uniform distribution of partitions of the same topic across brokers.
The command can balance one or more of these layers, except for the topic-partition imbalance layer,
which is balanced implicitly by replica or partition rebalancing.
kafka_utils.kafka_cluster_manager.cluster_topology provides APIs to create a cluster-topology
object based on the distribution of topics, partitions, brokers and replication-groups across the cluster.
Rebalancing all layers
Rebalance all layers for the given cluster. This command generates a plan with a maximum of 10 partition movements
and 25 leader-only changes, rebalancing all the layers discussed above, before sending the plan to Zookeeper.
$ kafka-cluster-manager --group-parser $HOME/parser:sample_parser --apply \
    --cluster-type sample_type rebalance --replication-groups --brokers --leaders \
    --max-partition-movements 10 --max-leader-changes 25
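
The difference between a partition movement and a leader-only change can be made concrete with a sketch that compares two assignments. How kafka-utils counts these changes internally is not described here, so the counting rules below (one movement per replica placed on a new broker, one leader-only change when the replica set is unchanged but its first entry differs) are assumptions, and classify_changes and within_limits are hypothetical helpers.

```python
def classify_changes(current, proposed):
    """Return (partition_movements, leader_only_changes).

    Both arguments map (topic, partition) to an ordered replica list.
    """
    movements = 0
    leader_only = 0
    for partition, old in current.items():
        new = proposed.get(partition, old)
        if set(new) != set(old):
            # At least one replica lands on a broker that did not hold it.
            movements += len(set(new) - set(old))
        elif new[0] != old[0]:
            # Same brokers, different preferred leader: no data is copied.
            leader_only += 1
    return movements, leader_only


def within_limits(current, proposed, max_movements=10, max_leader_changes=25):
    """Check a plan against the limits used in the command above."""
    movements, leader_only = classify_changes(current, proposed)
    return movements <= max_movements and leader_only <= max_leader_changes


current = {('t', 0): [1, 2], ('t', 1): [2, 3]}
proposed = {('t', 0): [2, 1], ('t', 1): [2, 4]}
print(classify_changes(current, proposed))  # (1, 1)
```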

2.2.3 Brokers decommissioning

This command decommissions a given list of brokers. The key idea is to move all partitions
from the brokers being decommissioned to other brokers, either in the same replication group (preferred) or
in other replication groups, while keeping the cluster balanced as described above.

Note: While decommissioning brokers, at least n active brokers must remain, where n
is the maximum replication factor of any partition.
$ kafka-cluster-manager --cluster-type sample_type decommission 123456 123457 123458
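
The condition in the note above can be checked before running the command. The following is a minimal sketch of that check only; can_decommission is a hypothetical helper, and the assignment format (a mapping from (topic, partition) to a replica list) is an assumption made for the example.

```python
def can_decommission(assignment, active_brokers, to_remove):
    """Return True if enough brokers remain to host every replica.

    The number of remaining active brokers must be at least the largest
    replication factor found in the assignment.
    """
    remaining = set(active_brokers) - set(to_remove)
    max_replication_factor = max(len(r) for r in assignment.values())
    return len(remaining) >= max_replication_factor


assignment = {('sample_topic', 0): [1, 2, 3], ('sample_topic', 1): [2, 3, 4]}
print(can_decommission(assignment, [1, 2, 3, 4], [4]))     # True
print(can_decommission(assignment, [1, 2, 3, 4], [3, 4]))  # False
```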

2.2.4 Set Replication Factor

This command provides the ability to increase or decrease the replication-factor of a topic. Replicas are added or
removed in such a way that the balance of the cluster is maintained. Additionally, when the replication-factor is
decreased, any out-of-sync replicas will be removed first.
$ kafka-cluster-manager --cluster-type sample_type set_replication_factor --topic sample_topic 3

2.2.5 Stats
This command provides statistics for the current imbalance state of the cluster. It also provides imbalance statistics
of the cluster if a given partition-assignment plan were to be applied to the cluster. The details include the imbalance
value of each of the above layers for the overall cluster, each broker and across each replication-group.
$ kafka-cluster-manager --group-parser $HOME/parser:sample_parser --cluster-type \
    sample_type stats

2.2.6 Store assignments

Dump the current cluster-topology in json format.
$ kafka-cluster-manager --group-parser $HOME/parser:sample_parser --cluster-type \
    sample_type store_assignments

2.3 Consumer Manager

This Kafka tool provides the ability to view and manipulate consumer offsets for a specific consumer group. For a
given cluster, it provides the following functionality:
Manipulating consumer-groups: Listing consumer-groups subscribed to the cluster. Copying, deleting and
renaming of the group.
Manipulating offsets: For a given consumer-group, fetching current offsets, low and high watermarks for topics
and partitions subscribed to the group. Setting, advancing, rewinding, saving and restoring of current-offsets.
Manipulating topics: For a given consumer-group and cluster, listing and unsubscribing topics.
Offset storage choice: Supports Kafka 0.8.2 and 0.9.0, using offsets stored in either Zookeeper or Kafka.
Version 0 and 2 of the Kafka Protocol are supported for committing offsets.

2.3.1 Subcommands

2.3.2 Listing consumer groups

The list_groups command shows all of the consumer groups that exist in the cluster.
$ kafka-consumer-manager --cluster-type=test list_groups
Consumer Groups:

If list_groups is called with the --storage option, then the groups will only be fetched from the specified storage, either Zookeeper or Kafka.

2.3.3 Listing topics

For information about the topics a consumer group is subscribed to, use the list_topics subcommand.
$ kafka-consumer-manager --cluster-type=test list_topics group3
Consumer Group ID: group3
Topic: topic_foo
Partitions: [0, 1, 2, 3, 4, 5]
Topic: topic_bar
Partitions: [0, 1, 2]

2.3.4 Getting consumer offsets

The offset_get subcommand gets information about a specific consumer group.
The most basic usage is to call offset_get with a consumer group id.
$ kafka-consumer-manager --cluster-type test --cluster-name my_cluster offset_get my_group
Cluster name: my_cluster, consumer group: my_group
Topic Name: topic1
Partition ID: 0
High Watermark: 787656
Low Watermark: 787089
Current Offset: 787645

The offsets for all topics in the consumer group will be shown by default. A single topic can be specified using the
--topic option. If a topic is specified, then a list of partitions can also be specified using the --partitions option.
By default, the offsets will be fetched from both Zookeeper and Kafka's internal offset storage. A specific offset
storage location can be specified using the --storage option.
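
The three values reported by offset_get are enough to work out how far a consumer is behind. The sketch below uses the numbers from the sample output above; partition_stats is a hypothetical helper and the key names lag and retained are chosen for this example only.

```python
def partition_stats(low_watermark, high_watermark, current_offset):
    """Derive simple figures from the values printed by offset_get."""
    return {
        # Messages produced but not yet consumed by the group.
        'lag': high_watermark - current_offset,
        # Messages currently retained in the partition.
        'retained': high_watermark - low_watermark,
    }


stats = partition_stats(low_watermark=787089,
                        high_watermark=787656,
                        current_offset=787645)
print(stats['lag'])       # 11
print(stats['retained'])  # 567
```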

2.3.5 Manipulating consumer offsets

The offsets for a consumer group can also be saved into a json file.

$ kafka-consumer-manager --cluster-type test --cluster-name my_cluster offset_save my_group my_offset

Cluster name: my_cluster, consumer group: my_group
Consumer offset data saved in json-file my_offsets.json

The saved offsets file can then be used to restore the consumer group.

$ kafka-consumer-manager --cluster-type test --cluster-name my_cluster offset_restore my_offsets.json

Restored to new offsets {u'topic1': {0: 425447}}

The offsets can also be set directly using the offset_set command. This command takes a group id, and a set of
topics, partitions, and offsets.

$ kafka-consumer-manager --cluster-type test --cluster-name my_cluster offset_set my_group topic1.0.3
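
Judging from the example above, each offset argument appears to have the form topic.partition.offset. Under that assumption, such an argument could be split as sketched below; parse_offset_spec is a hypothetical helper and not part of kafka-utils. Splitting from the right keeps topic names that themselves contain dots intact.

```python
def parse_offset_spec(spec):
    """Split 'topic.partition.offset' into (topic, partition, offset).

    The format is inferred from the documentation example and is an
    assumption; rsplit is used because topic names may contain dots.
    """
    topic, partition, offset = spec.rsplit('.', 2)
    return topic, int(partition), int(offset)


print(parse_offset_spec('topic1.0.3'))      # ('topic1', 0, 3)
print(parse_offset_spec('my.topic.2.100'))  # ('my.topic', 2, 100)
```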

There is also an offset_advance command, which will advance the current offset to the same value as the high
watermark of a topic, and an offset_rewind command, which will rewind to the low watermark.
If the offset needs to be modified for a consumer group that does not already exist, then the --force option can be used.
This option can be used with offset_set, offset_rewind, and offset_advance.

2.3.6 Copying or renaming consumer group

Consumer groups can have metadata copied into a new group using the copy_group subcommand.
$ kafka-consumer-manager --cluster-type=test copy_group my_group1 my_group2

They can be renamed using rename_group.

$ kafka-consumer-manager --cluster-type=test rename_group my_group1 my_group2

When the group is copied, if a topic is specified using the --topic option, then only the offsets for that topic will
be copied. If a topic is specified, then a set of partitions of that topic can also be specified using the --partitions option.

2.3.7 Deleting or unsubscribing consumer groups

A consumer group can be deleted using the delete_group subcommand.
$ kafka-consumer-manager --cluster-type=test delete_group my_group

A consumer group can be unsubscribed from topics using the unsubscribe_topics subcommand. If a single topic
is specified using the --topic option, then the group will be unsubscribed from only that topic.

2.4 Rolling Restart

The kafka-rolling-restart script can be used to safely restart an entire cluster, one server at a time. The script finds all
the servers in a cluster, checks their health status and executes the restart.

2.4.1 Cluster health

The health of the cluster is defined in terms of broker availability and under-replicated partitions. kafka-rolling-restart
will check that all brokers are responding to JMX requests, and that the total number of under-replicated partitions is
zero. If both conditions are fulfilled, the cluster is considered healthy and the next broker will be restarted.
The JMX metrics are accessed via Jolokia, which must be running on all brokers.
Note: If a broker is not registered in Zookeeper when the tool is executed, it will not appear in the list of known
brokers and it will be ignored.

2.4.2 Parameters
The parameters specific to kafka-rolling-restart are:
--check-interval INTERVAL: the number of seconds between each check. Default 10.
--check-count COUNT: the number of consecutive checks that must result in cluster healthy before restarting the next server. Default 12.
--unhealthy-time-limit LIMIT: the maximum time in seconds that a cluster can be unhealthy for. If
the limit is reached, the script will terminate with an error. Default 600.
--jolokia-port PORT: The Jolokia port. Default 8778.
--jolokia-prefix PREFIX: The Jolokia prefix. Default jolokia/.
--no-confirm: If specified, the script will not ask for confirmation.
--skip N: Skip the first N servers. Useful to recover from a partial rolling restart. Default 0.
--verbose: Turn on verbose output.
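
The interaction between --check-interval, --check-count and --unhealthy-time-limit can be sketched as a waiting loop that runs before each restart. This is not the script's actual code: wait_for_healthy is a hypothetical function, is_healthy stands in for the Jolokia-based check, and treating the time limit as the accumulated time spent in unhealthy checks is an assumption.

```python
import time


def wait_for_healthy(is_healthy, check_interval=10, check_count=12,
                     unhealthy_time_limit=600, sleep=time.sleep):
    """Wait until check_count consecutive checks report a healthy cluster.

    Returns True once the cluster is considered stable, or False when the
    accumulated unhealthy time reaches unhealthy_time_limit seconds.
    """
    healthy_streak = 0
    unhealthy_time = 0
    while healthy_streak < check_count:
        sleep(check_interval)
        if is_healthy():
            healthy_streak += 1
        else:
            # Any failed check resets the streak of healthy results.
            healthy_streak = 0
            unhealthy_time += check_interval
            if unhealthy_time >= unhealthy_time_limit:
                return False
    return True
```

With the defaults, a healthy cluster is confirmed after 12 checks spaced 10 seconds apart, that is about two minutes per broker. The sleep parameter is only there so the loop can be exercised without real delays.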

2.4.3 Examples
Restart the generic dev cluster, checking the JMX metrics every 30 seconds, and restarting the next broker after 5
consecutive checks have confirmed the health of the cluster:

$ kafka-rolling-restart --cluster-type generic --cluster-name dev --check-interval 30 --check-count 5

Restart the generic prod cluster. The script will terminate with an error if the cluster is unhealthy for more than 900 seconds:
$ kafka-rolling-restart --cluster-type generic --cluster-name prod --unhealthy-time-limit 900

2.5 Kafka Check

2.5.1 Checking in-sync replicas
This Kafka tool checks the in-sync replicas for each topic-partition in the cluster.
The min_isr command checks whether the number of in-sync replicas for a partition is greater than or equal to
the minimum number of in-sync replicas configured for the topic the partition belongs to. A topic-specific
min.insync.replicas overrides the given default.
The parameters for the min_isr check are:
--default_min_isr DEFAULT_MIN_ISR: Default min.isr value for topics that have no setting in
Zookeeper.
--data-path DATA_PATH: Path to the Kafka data folder.
--controller-only: If this parameter is specified, the command does nothing and succeeds on non-controller brokers. If --broker-id is also set to -1, the broker id is computed from the given data path.
$ kafka-check --cluster-type=sample_type min_isr
OK: All replicas in sync.

In case of min isr violations:

$ kafka-check --cluster-type=sample_type min_isr --default_min_isr 3
isr=2 is lower than min_isr=3 for sample_topic:0
CRITICAL: 1 partition(s) have the number of replicas in sync that is lower
than the specified min ISR.
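
The comparison performed by min_isr can be sketched as follows. The data structures are assumptions made for the example (a list of partition dictionaries and a mapping of topic-specific settings), and min_isr_violations is a hypothetical helper rather than the tool's own code.

```python
def min_isr_violations(partitions, topic_min_isr, default_min_isr):
    """Return (topic, partition, isr_size, required) for each violation.

    partitions: list of dicts with 'topic', 'partition' and 'isr' keys.
    topic_min_isr: topic-specific min.insync.replicas values, which take
    precedence over default_min_isr.
    """
    violations = []
    for p in partitions:
        required = topic_min_isr.get(p['topic'], default_min_isr)
        if len(p['isr']) < required:
            violations.append(
                (p['topic'], p['partition'], len(p['isr']), required))
    return violations


partitions = [
    {'topic': 'sample_topic', 'partition': 0, 'isr': [1, 2]},
    {'topic': 'other_topic', 'partition': 0, 'isr': [1, 2]},
]
# other_topic has its own setting of 2, so only sample_topic violates 3.
print(min_isr_violations(partitions, {'other_topic': 2}, default_min_isr=3))
```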

2.5.2 Checking under replicated partitions

This Kafka tool checks and reports the number of under-replicated partitions for all brokers in the cluster.
The under_replicated command checks whether the number of under-replicated partitions is equal to zero. It will
report the aggregated result of under-replicated partitions of each broker, if any.
The parameters specific to the under_replicated check are:
--first-broker-only: If this parameter is specified, the command checks for under-replicated partitions only if the given broker is the first broker in the broker list fetched from Zookeeper. Otherwise, it does
nothing and succeeds. If --broker-id is also set to -1, the broker id is computed from the given data path.
--minimum-replication MINIMUM_REPLICATION: Minimum number of in-sync replicas for an
under-replicated partition. If an under-replicated partition has fewer in-sync replicas than this value,
the check reports that topic-partition.
$ kafka-check --cluster-type=sample_type under_replicated
OK: No under replicated partitions.

When the given broker is not the first in the broker list in Zookeeper:

$ kafka-check --cluster-type=sample_type --broker-id 3 under_replicated --first-broker-only
OK: Provided broker is not the first in broker-list.

When some partitions are under-replicated:

$ kafka-check --cluster-type=sample_type under_replicated

CRITICAL: 2 under replicated partitions.
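
A partition is commonly called under-replicated when its in-sync replica set is smaller than its assigned replica set. The sketch below applies that definition and, optionally, a threshold in the spirit of --minimum-replication. The exact semantics of that option in kafka-check are an assumption here, and under_replicated is a hypothetical helper working on made-up partition dictionaries.

```python
def under_replicated(partitions, minimum_replication=None):
    """Return (topic, partition) pairs that are under-replicated.

    partitions: list of dicts with 'topic', 'partition', 'replicas', 'isr'.
    If minimum_replication is given, only partitions whose in-sync replica
    count is below that value are returned (assumed semantics).
    """
    result = []
    for p in partitions:
        if len(p['isr']) >= len(p['replicas']):
            continue  # fully replicated
        if minimum_replication is None or len(p['isr']) < minimum_replication:
            result.append((p['topic'], p['partition']))
    return result


partitions = [
    {'topic': 't', 'partition': 0, 'replicas': [1, 2, 3], 'isr': [1, 2, 3]},
    {'topic': 't', 'partition': 1, 'replicas': [1, 2, 3], 'isr': [1, 2]},
    {'topic': 't', 'partition': 2, 'replicas': [1, 2, 3], 'isr': [3]},
]
print(under_replicated(partitions))                         # [('t', 1), ('t', 2)]
print(under_replicated(partitions, minimum_replication=2))  # [('t', 2)]
```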

2.6 Corruption Check

The kafka-corruption-check script performs a check on the log files stored on the Kafka brokers. This tool finds all
the log files modified in the specified time range and runs DumpLogSegments on them. The output is collected and
filtered, and all information related to corrupted messages will be reported to the user.
Even though this tool executes the log check with a low ionice priority, it can slow down the cluster because of the
high number of I/O operations required. Consider decreasing the batch size to reduce the additional load.

2.6.1 Parameters
The parameters specific to kafka-corruption-check are:
--minutes N: check the log files modified in the last N minutes.
--start-time START_TIME: check the log files modified after START_TIME. Example format:
--start-time "2015-11-26 11:00:00"
--end-time END_TIME: check the log files modified before END_TIME. Example format:
--end-time "2015-11-26 12:00:00"
--data-path: the path to the data files on the Kafka broker.
--java-home: the JAVA_HOME on the Kafka broker.
--batch-size BATCH_SIZE: the number of files that will be checked in parallel on each broker. Default:
--check-replicas: if set it will also check the data on replicas. Default: false.
--verbose: enable verbose output.
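
The time-range parameters can be turned into a pair of datetime bounds as sketched below. The format string matches the example values shown above; time_range and in_range are hypothetical helpers, and the behaviour when only one bound is given (open start, or end defaulting to the current time) is an assumption.

```python
from datetime import datetime, timedelta

# Matches the example values, e.g. "2015-11-26 11:00:00".
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def time_range(minutes=None, start_time=None, end_time=None, now=None):
    """Return (start, end) datetimes for the files to check."""
    now = now or datetime.now()
    if minutes is not None:
        # --minutes N: everything modified in the last N minutes.
        return now - timedelta(minutes=minutes), now
    start = (datetime.strptime(start_time, TIME_FORMAT)
             if start_time else datetime.min)
    end = datetime.strptime(end_time, TIME_FORMAT) if end_time else now
    return start, end


def in_range(mtime, start, end):
    """True if a file modification time falls inside the range."""
    return start <= mtime <= end


start, end = time_range(start_time="2015-11-26 11:00:00",
                        end_time="2015-11-26 12:00:00")
print(in_range(datetime(2015, 11, 26, 11, 30), start, end))  # True
```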

2.6.2 Examples
Check all the files (leaders only) in the generic dev cluster that were modified in the last 30 minutes:

$ kafka-corruption-check --cluster-type generic --cluster-name dev --data-path /var/kafka-logs --minu

Filtering leaders
Broker: 0, leader of 9 over 13 files
Broker: 1, leader of 4 over 11 files
Starting 2 parallel processes
Broker: broker0.example.org, 9 files to check
Broker: broker1.example.org, 4 files to check
Processes running:
broker0.example.org: file 0 of 9
broker0.example.org: file 5 of 9
ERROR Host: broker0.example.org: /var/kafka-logs/test_topic-0/00000000000000003363.log
ERROR Output: offset: 3371 position: 247 isvalid: false payloadsize: 22 magic: 0 compresscodec: NoCom
broker1.example.org: file 0 of 4

In this example, one corrupted file was found in broker 0.
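
The filtering step can be illustrated with a short sketch that scans DumpLogSegments-style output for messages flagged as invalid. The regular expression is based only on the sample ERROR line above, so other Kafka versions may print a different layout; corrupted_offsets is a hypothetical helper, not the function used by kafka-corruption-check.

```python
import re

# Layout taken from the sample output: "offset: N position: M isvalid: false".
INVALID_MESSAGE = re.compile(r'offset: (\d+) position: (\d+) isvalid: false')


def corrupted_offsets(output):
    """Return (offset, position) for each message marked isvalid: false."""
    return [(int(m.group(1)), int(m.group(2)))
            for m in INVALID_MESSAGE.finditer(output)]


sample = (
    "offset: 3370 position: 210 isvalid: true payloadsize: 22 magic: 0\n"
    "offset: 3371 position: 247 isvalid: false payloadsize: 22 magic: 0\n"
)
print(corrupted_offsets(sample))  # [(3371, 247)]
```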

Check all the files modified after the specified date, in both leaders and replicas:

$ kafka-corruption-check [...] --start-time "2015-11-26 11:00:00" --check-replicas

Check all the files that were modified in the specified range:
$ kafka-corruption-check [...] --start-time "2015-11-26 11:00:00" --end-time "2015-11-26 12:00:00"
