Before you begin
The instructions in this tutorial use the tutorial environment, which includes a selection of topics, each with a live stream of events, created to allow you to explore features in IBM Event Automation. Following the setup instructions to deploy the demo environment gives you a complete instance of Event Automation that you can use to follow this tutorial for yourself.
Versions
This tutorial uses the following versions of Event Automation capabilities. Screenshots might differ from the current interface if you are using a newer version.
- Event Streams 11.6.1
Choosing a monitoring platform
The focus of this tutorial is to illustrate the types of monitoring that can be useful for Event Streams, and the sorts of metrics that you can use to enable this.
Grafana is used in this tutorial to display metrics, but it is not required for monitoring Event Streams. There are many alternative options available that can be used to visualize these metrics or to trigger alerts and notifications when they change.
Choosing the best monitoring platform depends on your specific use case and is beyond the scope of this tutorial. For an overview of available options, see Monitoring Kafka cluster health.
Instructions
-
Set up monitoring in the tutorial environment.
This installs the Grafana community operator and creates two sample Grafana dashboards to help you get started.
-
Access the tutorial dashboards.
The following sections of this tutorial will explain how to use these dashboards.
Monitoring use cases
There are many reasons for an Event Streams administrator to use metrics. This tutorial focuses on two common use cases:
-
Health monitoring - Ensuring that the Event Streams cluster is healthy, and detecting early warnings of potential problems that need attention.
-
Performance monitoring - Assessing the workload the cluster is currently supporting, and identifying signs that the cluster is struggling to support the workload.
There is naturally overlap between these two use cases. Some metrics could be helpful for both, and the objective of this tutorial is not to try to classify the available metrics.
Monitoring is not a single one-size-fits-all activity. The goal of this tutorial is to encourage you to consider what your objective is, and to select metrics to monitor that support that objective.
Note: The focus of this tutorial is on use cases for cluster administrators. Other users also monitor Kafka metrics. For example, users responsible for applications that use an Event Streams cluster must monitor metrics specific to their application and the Kafka topics that their application uses.
Monitoring cluster health
A common use case for Event Streams metrics is ensuring that the cluster is functioning correctly, and identifying any issues that require attention.
The demo playbook creates a sample Grafana dashboard called Event Streams (Health), which provides insights into relevant health metrics.
View the metrics displayed in this dashboard to assess the status of your cluster.
Kafka topics
Some metrics indicate the health of the topic partitions within the Event Streams cluster. The first section of the demo dashboard gives examples of the kinds of values that a cluster administrator must monitor for changes.
Offline partitions
Metric: kafka.controller:type=KafkaController,name=OfflinePartitionsCount
An “offline” partition is a topic partition that doesn’t have an active leader.
Without an active leader, client applications cannot produce to or consume from a topic partition. The partition remains unavailable until a leader is restored.
The number of partitions in this state should normally be zero.
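If this panel reports a non-zero value, you can list the affected partitions from inside one of the broker pods. The following is a minimal sketch, assuming the Strimzi-based broker image (Kafka tools under /opt/kafka/bin) and an internal listener on localhost:9092; if your listeners require TLS or SCRAM authentication, add an appropriate --command-config file.
# List topic partitions that currently have no active leader (offline partitions).
oc exec -n event-automation my-kafka-cluster-kafka-0 -- \
  /opt/kafka/bin/kafka-topics.sh \
  --bootstrap-server localhost:9092 \
  --describe --unavailable-partitions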
Unclean leader elections
Metric: kafka.controller:type=ControllerStats,name=UncleanLeaderElectionsPerSec
A leader election is the process for choosing which replica is used by producer and consumer client applications.
Normally, an in-sync replica is chosen. However, if there are no in-sync replicas to choose from, the last resort is to select an out-of-sync replica as the new leader.
This results in the loss of any events that had not yet been replicated to the new leader, so the number of unclean leader elections should be zero.
Under-replicated partitions
Metric: kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions
An under-replicated topic partition is one where some replicas are not keeping up with the partition leader.
Ideally, this metric should be zero.
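To see which partitions are affected when this value is non-zero, you can run the same kafka-topics.sh check with a different filter (the same path, listener, and authentication assumptions apply as in the earlier sketch).
# List topic partitions with replicas that are not in sync with the leader.
oc exec -n event-automation my-kafka-cluster-kafka-0 -- \
  /opt/kafka/bin/kafka-topics.sh \
  --bootstrap-server localhost:9092 \
  --describe --under-replicated-partitions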
Fetcher lag
Metric: kafka.server:type=FetcherLagMetrics,name=ConsumerLag
Fetcher lag measures the number of events that follower replicas have yet to retrieve from the topic partition leader. It is a measure of whether replication is keeping up and how far behind the topic replicas are falling.
The lag should be less than the maximum batch size used by producer applications, indicating that the cluster is able to keep up with the rate of traffic it is handling.
Kafka brokers
Some metrics indicate the health of the Kafka brokers in the Event Streams cluster. The second section of the demo dashboard provides examples of the kinds of values that a cluster administrator must track for changes.
Active controllers
Metric: kafka.controller:type=KafkaController,name=ActiveControllerCount
A Kafka cluster must always have one broker acting as the controller. Several critical cluster operations depend on the controller, such as partition leadership management and replica assignment. If there is no active controller, the cluster might become unstable.
If there is more than one active controller, this is described as a split brain: conflicting decisions made by multiple controllers put data integrity at risk.
The value for this metric must always be one (you might briefly see zero or two while the cluster transfers control to a different broker, but it should quickly return to one).
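As a rough cross-check of this metric, you can ask the cluster which broker currently holds the controller role. This sketch assumes the cluster runs in KRaft mode, and reuses the path and listener assumptions from the earlier examples; the LeaderId field in the output identifies the active controller.
# Describe the metadata quorum; LeaderId identifies the active controller.
oc exec -n event-automation my-kafka-cluster-kafka-0 -- \
  /opt/kafka/bin/kafka-metadata-quorum.sh \
  --bootstrap-server localhost:9092 \
  describe --status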
Number of online brokers
Metric: kafka.controller:type=KafkaController,name=ActiveBrokerCount
Kafka’s high-availability design allows applications to continue functioning even after a broker fails. As a result, it is important to monitor that all brokers are currently online, because the loss of a broker cannot always be detected from application behavior.
The Event Streams cluster created by the tutorial playbook has three brokers, so this metric should report three.
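You can cross-check this metric against Kubernetes itself by listing the broker pods. The grep pattern below is a simple filter on the pod names created by the tutorial playbook; adjust it if your cluster name differs.
# The three broker pods (my-kafka-cluster-kafka-0/1/2) should all be Running and Ready.
oc get pods -n event-automation | grep kafka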
Kafka pods
Kubernetes is also a useful source of metrics about the Kafka brokers that make up the Event Streams cluster. The third section of the demo dashboard gives examples of the kinds of values that a cluster administrator can use.
Broker uptime
Metric: kube_pod_container_state_started
If the start time for a Kafka broker is more recent than the last time the broker pod was deliberately modified or restarted, this indicates an unexpected restart.
Restart count
Metric: kube_pod_container_status_restarts_total
If the Kubernetes probes registered by Event Streams fail, Kubernetes restarts the Kafka container. The restart count should be zero for a healthy broker.
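Both the uptime and the restart count are also visible directly from Kubernetes, which is useful for confirming what the dashboard gauges show.
# The AGE and RESTARTS columns show broker pod uptime and restart counts.
oc get pods -n event-automation | grep my-kafka-cluster-kafka

# Restart count for a single broker container.
oc get pod my-kafka-cluster-kafka-0 -n event-automation \
  -o jsonpath='{.status.containerStatuses[0].restartCount}{"\n"}'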
Disks
Some metrics indicate the health of the disks used by Kafka brokers. The fourth section of the demo dashboard gives examples of the kinds of values that a cluster administrator must monitor.
Disk space remaining
Metric: kubelet_volume_stats_available_bytes
Disk space for Kafka brokers must be closely monitored. If Kafka fills the disk space available to it, it leads to error conditions that are difficult to resolve.
An administrator must take action long before the disk becomes full. Monitoring the remaining disk space for each broker is a key part of this.
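You can also check the remaining space directly from inside a broker pod. This is a sketch that assumes the data volume is mounted at /var/lib/kafka/data, which is the layout used by Strimzi-based brokers; adjust the path if your storage configuration differs.
# Show free space on the Kafka data volume for broker 0 (mount path is an assumption).
oc exec -n event-automation my-kafka-cluster-kafka-0 -- \
  df -h /var/lib/kafka/data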
Network
Some metrics indicate the network activity by Kafka brokers. The fifth section of the demo dashboard gives examples of the kinds of values that a cluster administrator must monitor.
Network request errors
Metric: kafka.network:type=RequestMetrics,name=ErrorsPerSec
A healthy Kafka cluster should report zero network errors.
If sustained network errors are reported, this will likely impact consumer and producer applications and needs attention.
Using the dashboard
The health dashboard will likely not contain a lot of interesting data for your tutorial cluster. Most of the values displayed should be zero.
To see the dashboard in action, kill the Java process for one of the Kafka brokers and see the impact described by the metrics in the dashboard.
Run the following command to kill one of the brokers (for example, broker 0):
oc exec -it \
-n event-automation \
my-kafka-cluster-kafka-0 -- \
pkill --signal 9 --parent 1 java
You should see a brief increase in under-replicated partitions, as the replicas hosted on broker 0 fall behind the leader partitions on broker 1 and broker 2. This should quickly return to zero after the replacement broker 0 starts.
If broker 0 was the controller of your Event Streams cluster, you will also see a brief change in the number of active controllers as control transfers to one of the other brokers.
The restart of broker 0 is shown in the Kafka broker 0 gauge.
You will likely see a few network errors as broker 1 and broker 2 report failed requests to the killed broker 0. These errors should quickly stop as they successfully retry their requests against the replacement broker 0.
Monitoring cluster performance
A common use case for Event Streams metrics is to understand the current level of traffic that the cluster is supporting, and to identify if the scaling of the Event Streams cluster and underlying infrastructure is insufficient to support the workload.
The demo playbook creates a sample Grafana dashboard called Event Streams (Performance) to show the sorts of metrics that can be used for this.
View the metrics displayed in this dashboard.
Messages
Some metrics describe the message throughput that the Event Streams cluster is handling. The first section of the demo dashboard gives examples of the kinds of values that a cluster administrator can use to get insight into this.
Messages in
Metric: kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec
This metric tracks the number of messages per second that the cluster is receiving. This is a good summary of the level of traffic that the cluster is handling.
The data generator in the tutorial environment generates a very low level of traffic for the Event Streams cluster, which should be visible in the graph.
Note: If you killed one of the Kafka brokers in the previous step, you should see this reflected in the graph as a brief dip in the rate of messages.
Produce requests
Metric: kafka.server:type=BrokerTopicMetrics,name=TotalProduceRequestsPerSec
This metric measures the number of produce requests the cluster receives per second. Efficient applications use batching to produce multiple messages in a single request, so this value is typically lower than the number of messages per second.
This is a good summary of the level of traffic that the cluster is handling.
The data generator in the tutorial environment generates a very low level of traffic for the Event Streams cluster, which should be visible in the graph.
Note: If you killed one of the Kafka brokers in the previous step, you should see this reflected in the graph as a brief dip in the rate of produce requests.
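If you want to see the relationship between the message rate and the produce request rate for yourself, Kafka's producer performance tool lets you control batching explicitly. This is only a sketch: the topic name BATCH.TEST is hypothetical, and the same path, listener, and authentication assumptions apply as in the earlier examples. Larger linger.ms and batch.size values pack more messages into each produce request.
# Produce 100,000 small messages, allowing up to 50 ms of batching per request.
# BATCH.TEST is a hypothetical topic; create it first or substitute an existing test topic.
oc exec -n event-automation my-kafka-cluster-kafka-0 -- \
  /opt/kafka/bin/kafka-producer-perf-test.sh \
  --topic BATCH.TEST \
  --num-records 100000 \
  --record-size 100 \
  --throughput -1 \
  --producer-props bootstrap.servers=localhost:9092 linger.ms=50 batch.size=65536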
Topics
Some metrics indicate the performance of topic replication performed by the Kafka brokers. The second section of the demo dashboard gives examples of the kinds of values that a cluster administrator can use to get insight into this.
Under minimum-in-sync replicas
Metric: kafka.server:type=ReplicaManager,name=UnderMinIsrPartitionCount
This metric tracks the number of topic partitions that have fewer in-sync replicas than their minimum in-sync replica configuration.
Producer applications cannot produce to these partitions, and will be blocked until the issue is resolved. As a result, the number of partitions in this state must be zero for a healthy cluster.
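When this panel is non-zero, the affected partitions can be listed with the same kafka-topics.sh tool used earlier (same path, listener, and authentication assumptions).
# List topic partitions with fewer in-sync replicas than their configured minimum.
oc exec -n event-automation my-kafka-cluster-kafka-0 -- \
  /opt/kafka/bin/kafka-topics.sh \
  --bootstrap-server localhost:9092 \
  --describe --under-min-isr-partitions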
Replicas per broker
Metric: kafka.server:type=ReplicaManager,name=PartitionCount
In a healthy cluster, each of the Kafka brokers must handle a roughly equal proportion of the work for the cluster. One indicator of this is the number of topic partition replicas held on each broker.
The distribution will never be exactly identical across all brokers; however, avoid having one broker responsible for a disproportionately high number of topic partition replicas.
Tip: Instructions for how to re-distribute topic replicas more evenly can be found in Optimizing Kafka cluster with Cruise Control.
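For a quick, rough indication of how evenly work is spread, you can count how many partition leaders each broker currently holds. This uses leadership as a proxy for the replica distribution shown in the dashboard, and reuses the path, listener, and authentication assumptions from the earlier kafka-topics.sh examples.
# Count partition leaders per broker ID; a heavily skewed count suggests imbalance.
oc exec -n event-automation my-kafka-cluster-kafka-0 -- \
  /opt/kafka/bin/kafka-topics.sh \
  --bootstrap-server localhost:9092 --describe \
  | grep -o 'Leader: [0-9]*' | sort | uniq -c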
Brokers
Some metrics indicate the amount of work that each of the Kafka brokers is individually doing. The third section of the demo dashboard gives examples of the kinds of values that a cluster administrator can use to get insight into this.
Messages in
Metric: kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec
In a healthy cluster, each of the Kafka brokers must handle a roughly equal proportion of the work for the cluster. One indicator of this is the number of messages per second received by each broker.
The distribution will never be exactly identical across all brokers; however, avoid having one broker responsible for a disproportionately high share of the messages for the cluster.
Tip: Instructions for how to re-distribute topic replicas more evenly can be found in Optimizing Kafka cluster with Cruise Control.
Bytes in
Metric: kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec
In a healthy cluster, each of the Kafka brokers must handle a roughly equal proportion of the work for the cluster. One indicator of this is the amount of data per second received by each broker.
The distribution will never be exactly identical across all brokers; however, avoid having one broker responsible for a disproportionately high share of the data for the cluster.
Tip: Instructions for how to re-distribute topic replicas more evenly can be found in Optimizing Kafka cluster with Cruise Control.
Java (JVM)
Kafka brokers are Java applications, so the Java Virtual Machine is also a useful source of metrics about the Kafka brokers that make up the Event Streams cluster. The fourth section of the demo dashboard gives examples of the kinds of values that a cluster administrator can use.
Garbage collection time
Metric: jvm_gc_collection_seconds_count
Heap memory usage
Metric: jvm_memory_used_bytes
Kafka pods
Kubernetes is also a useful source of metrics about the Kafka brokers that make up the Event Streams cluster. The fifth section of the demo dashboard gives examples of the kinds of values that a cluster administrator can use.
CPU
Metric: container_cpu_user_seconds_total
Memory
Metric: container_memory_usage_bytes
A cluster administrator can monitor these metrics to identify if they need to modify the resource limits specified in the EventStreams custom resource to better fit the current workload demands.
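A quick way to compare actual usage against the configured limits is sketched below. The oc adm top command requires cluster metrics to be available, and the custom resource name my-kafka-cluster is an assumption based on the tutorial pod names; adjust both if your environment differs.
# Current CPU and memory usage of the broker pods.
oc adm top pods -n event-automation | grep my-kafka-cluster-kafka

# Open the EventStreams custom resource to review or adjust resource limits.
oc edit eventstreams my-kafka-cluster -n event-automation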
Disks
Some metrics indicate the ability of the disks used by Kafka brokers to support the current workload. The sixth section of the demo dashboard gives examples of the kinds of values that a cluster administrator must monitor.
Log flush rate
Metric: kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs
This metric tracks the rate and latency of log flushes: how often, and how long it takes (in milliseconds), for messages to be written from in-memory log buffers to physical disk.
Higher flush latency slows down request times for producer applications, and can potentially lead to data loss if a broker crashes before a flush has completed.
The visualization in the demo dashboard shows the 99th percentile value, so 99% of the measured flush latency is under the value displayed here.
Request handler pool
Metric: kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent
This metric is a measure of the amount of time the request handler threads in the Kafka broker are spending idle. It is an indicator of how busy the broker is when processing requests from client applications and other brokers.
A high value indicates the request threads are mostly idle, and that the broker is under a low workload. A low value indicates the request handlers are fully utilized and potentially struggling to keep up with a high workload.
Network
Some metrics describe the network traffic that the Event Streams cluster is handling. The final section of the demo dashboard gives examples of the kinds of values that a cluster administrator can use to get insight into this.
Bytes in
Metric: kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec
Bytes out
Metric: kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec
Note: These metrics only describe network traffic to or from client applications. Separate metrics can be used to monitor inter-broker network traffic (for example, replication).
Using the dashboard
Note: The tutorial setup generates a very small and slow workload, so the performance dashboard does not show you what to expect for a busy cluster. To see the dashboard reflect a busier Event Streams cluster, you can run some workload applications. Follow the instructions for starting Kafka workload applications. This creates a new Kafka topic (WORKLOAD) and starts several high-throughput Kafka producers and consumers.
You can use the two sample dashboards that you previously created to monitor the impact of your applications:
-
You can see that there is an increase in messages handled by the cluster.
-
You can also see that the Kafka brokers have to work harder to handle the increased traffic, such as through an increase in CPU and memory usage, and an increase in the amount of time spent garbage collecting.
-
As there is (intentionally) no quota applied to the Kafka credentials used by the workload applications, the Kafka brokers will likely reach their maximum CPU and memory capacity.
-
You might see increases in other infrastructure-related values, such as the time taken to flush the log to disk, because of the additional disk activity caused by the workload applications. You can also monitor the disk space used by the new WORKLOAD Kafka topic as it rapidly reduces the disk space remaining on the Kafka brokers.
-
You can increase the impact of the workload applications by restarting one of the Kafka brokers while they are running. Use the following command:
oc delete pods \
-n event-automation \
my-kafka-cluster-kafka-0
Because the Kafka cluster is working much harder than the first time you restarted a broker, the restarted broker falls further behind and takes longer to catch up with what it missed, and you will likely see several network request errors reported while it does so.
-
You might also see other results, such as replication for some topic partitions failing to keep up, and an increase in the replica fetcher lag.
This type of insight and analysis is why it is essential to monitor a production Event Streams cluster.
Important: After you have finished using the dashboards while Event Streams is supporting a high-throughput workload, it is recommended that you follow the instructions to stop and remove the workload application resources.
Selecting metrics
The description of each metric used in this tutorial includes the metric MBean name (for example, kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec), as documented in the Kafka documentation about monitoring. You can refer to the Kafka documentation for more information about these metrics.
The demo dashboards used in this tutorial are not intended to be an exhaustive representation of the metrics available for Event Streams administrators. The intent of this tutorial is to encourage you to consider what your objective is for monitoring your Event Streams cluster. Use the examples described here as a starting point for exploring the monitoring tables and selecting the appropriate metrics for your use case.
You can use the Grafana installation provided by the demo playbook to explore and experiment with the metrics that are available in the Grafana dashboard. The Explore section for the data source is a useful way to try this.
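If you prefer working from the raw data, you can also look at the Prometheus-format metrics that the dashboards are built from. This sketch assumes the JMX Prometheus exporter is exposed on port 9404 inside the Kafka container and that curl is available in the image; both are typical of Strimzi-based brokers but worth verifying in your environment.
# Dump the raw Prometheus metrics from broker 0 and filter for a metric of interest.
oc exec -n event-automation my-kafka-cluster-kafka-0 -- \
  curl -s http://localhost:9404/metrics | grep -i underreplicated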
After you have identified the metrics you want to collect, you can follow the instructions for configuring monitoring to collect these metrics in your (non-demo) Event Streams clusters.