The following sections describe the alarm rule metrics for each component type.
- Broker
- Kafka Network
- Partition
- Node
- ZooKeeper
- Schema Registry
- Consumer Group
- Topic
- Connect
- CMPS
- Connector
- Data Mirroring
Broker Metrics
| Metric | Description |
|---|---|
| Number of brokers in a cluster | An alarm is triggered when the number of online brokers meets the settings in the metric details |
| Abnormal broker status (not running) | An alarm is triggered when the broker state is anything other than running |
| Abnormal number of active controller | An alarm is triggered when there is no active controller broker |
| Broker disk skewed | Disk usage skew is calculated by comparing the broker with the highest disk usage against the broker with the lowest disk usage. If the skew meets the settings in the metric details, disk usage is deemed unbalanced and an alarm is triggered |
| Increasing producer request failures | An alarm is triggered when the producer request failure rate meets the settings in the metric details |
| Kafka Broker instance DOWN | An alarm is triggered when all broker instances (servers) are down |
| Produce messages | An alarm is triggered when the number of messages produced meets the metric detail settings |
| Produce bytes | An alarm is triggered when the size (in Bytes) of messages produced meets the metric detail settings |
| Consume bytes | An alarm is triggered when the size (in Bytes) of messages consumed by the consumer meets the metric detail settings |
| Consumer lag | An alarm is triggered when the consumer lag (the difference between the offset of the latest data written by the producer and the offset of the latest data read by the consumer) meets the metric detail settings |
| KRaft controller broker DOWN | An alarm is triggered when the controller broker goes down in a KRaft cluster |
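
The disk skew comparison described for "Broker disk skewed" can be sketched as follows. The source does not say whether the comparison is a difference or a ratio, so this sketch assumes a simple difference in percentage points. The function name `disk_skew` and the sample broker names are hypothetical.

```python
from typing import Mapping


def disk_skew(usage_by_broker: Mapping[str, float]) -> float:
    """Spread between the most used and least used broker disk.

    Assumption: usage values are percentages and skew is max minus min.
    """
    if not usage_by_broker:
        raise ValueError("at least one broker is required")
    values = list(usage_by_broker.values())
    return max(values) - min(values)


# Hypothetical disk usage (%) per broker.
usage = {"broker-1": 82.0, "broker-2": 47.5, "broker-3": 60.0}
skew = disk_skew(usage)
print(f"disk skew: {skew} percentage points")
```

A rule would then compare `skew` against whatever threshold is configured in the metric details.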
Kafka Network Metrics
| Metric | Description |
|---|---|
| Remaining network resource | An alarm is triggered when the idle rate, calculated as the ratio of idle network resources to total network resources, meets the metric detail settings |
| Request latency - Fetch follower | An alarm is triggered when the time it takes for the follower replica of the partition to receive a response after sending a replication request meets the metric detail settings |
| Request latency - Fetch consumer | An alarm is triggered when the time it takes for the consumer to receive a response after sending a consumption request meets the metric detail settings |
| Request latency - Produce | An alarm is triggered when the time it takes for the client to receive a response after sending a produce request meets the metric detail settings |
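
As an illustration of the idle rate ratio and of what "meets the metric detail settings" might look like in practice, here is a minimal sketch. The comparison operators, the threshold, and the names `idle_rate` and `should_alarm` are assumptions; the source does not list the available rule conditions.

```python
import operator

# Hypothetical set of comparison operators a rule might offer.
COMPARATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def idle_rate(idle: float, total: float) -> float:
    """Ratio of idle network resources to total network resources."""
    if total <= 0:
        raise ValueError("total must be positive")
    return idle / total


def should_alarm(value: float, comparator: str, threshold: float) -> bool:
    """Return True when the observed value satisfies the configured condition."""
    return COMPARATORS[comparator](value, threshold)


rate = idle_rate(idle=12.0, total=80.0)
alarm = should_alarm(rate, "<", 0.2)  # alarm when less than 20% is idle
print(rate, alarm)
```

The same `should_alarm` pattern would apply to the latency metrics, with a latency value in place of the ratio.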
Partition Metrics
| Metric | Description |
|---|---|
| Broker partition skewed | An alarm is triggered when the distribution of partition counts between the broker with the most partitions and the broker with the fewest partitions meets the metric detail settings |
| Leader partition skewed | An alarm is triggered when the distribution of leader partition counts between the broker with the most leader partitions and the broker with the fewest leader partitions meets the metric detail settings |
| Number of the offline-partitions | An alarm is triggered when the number of offline partitions meets the metric detail settings |
| Number of the partitions in a broker | An alarm is triggered when the total number of partitions within a broker meets the metric detail settings |
| Number of the partitions in a cluster | An alarm is triggered when the total number of partitions meets the metric detail settings |
| Number of the under-min-ISR partitions | An alarm is triggered when the number of partitions whose ISR (in-sync replicas: replicas synchronized with the leader partition) count is below the required minimum meets the metric detail settings |
| Number of the under-replicated partitions | An alarm is triggered when the number of under-replicated partitions (partitions with fewer in-sync replicas than assigned replicas) meets the metric detail settings |
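
The partition counts and the leader skew in this table can be derived from per-partition replica state. The sketch below is illustrative only: the `PartitionState` type, the use of `None` to model a partition without a leader, and skew as "max minus min" are assumptions, not the product's implementation.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class PartitionState:
    leader: Optional[int]  # leader broker id; None models an offline partition
    replicas: List[int]    # broker ids assigned to hold a replica
    isr: List[int]         # broker ids currently in sync


def count_offline(parts: Sequence[PartitionState]) -> int:
    return sum(1 for p in parts if p.leader is None)


def count_under_replicated(parts: Sequence[PartitionState]) -> int:
    return sum(1 for p in parts if len(p.isr) < len(p.replicas))


def count_under_min_isr(parts: Sequence[PartitionState], min_isr: int) -> int:
    return sum(1 for p in parts if len(p.isr) < min_isr)


def leader_skew(parts: Sequence[PartitionState], broker_ids: Sequence[int]) -> int:
    """Assumed skew: leaders on the busiest broker minus leaders on the idlest."""
    leaders = Counter(p.leader for p in parts if p.leader is not None)
    counts = [leaders[b] for b in broker_ids]  # Counter yields 0 for absent brokers
    return max(counts) - min(counts)


# Hypothetical four-partition topic on brokers 1, 2, and 3.
parts = [
    PartitionState(leader=1, replicas=[1, 2, 3], isr=[1, 2, 3]),
    PartitionState(leader=2, replicas=[2, 3, 1], isr=[2]),
    PartitionState(leader=1, replicas=[1, 3, 2], isr=[1, 3]),
    PartitionState(leader=None, replicas=[3, 1, 2], isr=[]),
]
print(count_offline(parts), count_under_replicated(parts),
      count_under_min_isr(parts, min_isr=2), leader_skew(parts, [1, 2, 3]))
```

Note that a partition can be under-replicated without being under-min-ISR, as the third sample partition shows.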
Node Metrics
| Metric | Description |
|---|---|
| CPU usage | An alarm is triggered when CPU usage (%) meets the metric detail settings |
| Node disk usage | An alarm is triggered when disk usage (%) on the node meets the metric detail settings |
| Node DOWN | An alarm is triggered when all nodes are down |
| Memory usage | An alarm is triggered when memory usage (%) meets the metric detail settings |
| Node disk usage per mount point | An alarm is triggered when disk usage (%) per mount point on the node meets the metric detail settings |
ZooKeeper Metrics
| Metric | Description |
|---|---|
| ZooKeeper connection status warning | An alarm is triggered when the connection between the broker and ZooKeeper is interrupted |
| ZooKeeper instance DOWN | An alarm is triggered when all instances (servers) of the registered ZooKeeper are down |
Schema Registry Metrics
| Metric | Description |
|---|---|
| Schema Registry instance DOWN | An alarm is triggered when all instances (servers) of the registered Schema Registry Cluster are down |
Consumer Group Metrics
| Metric | Description |
|---|---|
| Consumer group lag | An alarm is triggered when the lag in the consumer group (the difference between the offset of the latest data written by the producer and the offset of the latest data read by the consumer) meets the metric detail settings |
| Consumer group status is not STABLE | An alarm is triggered when one or more partitions within a consumer group are detected to be in a delayed, paused, or rewound state. In these cases, the consumer group's status is deemed abnormal |
| Number of consumer instances in a consumer group | An alarm is triggered when the number of consumer instances in the consumer group meets the metric detail settings |
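
The lag definition above is an offset difference per partition. A minimal sketch of aggregating it for a consumer group follows. Whether the product sums lag across partitions or uses another aggregate is not stated, so the summation here is an assumption, as are the function name `group_lag`, the partition keys, and treating a missing committed offset as 0.

```python
from typing import Mapping


def group_lag(end_offsets: Mapping[str, int], committed: Mapping[str, int]) -> int:
    """Total lag for a consumer group, summed over partitions.

    end_offsets: latest offset written by producers, per partition.
    committed:   latest offset consumed by the group, per partition.
    Assumption: a partition with no committed offset counts from offset 0.
    """
    total = 0
    for partition, end in end_offsets.items():
        consumed = committed.get(partition, 0)
        total += max(end - consumed, 0)  # never report negative lag
    return total


# Hypothetical offsets for a two-partition topic.
end_offsets = {"orders-0": 1500, "orders-1": 900}
committed = {"orders-0": 1320, "orders-1": 900}
print(group_lag(end_offsets, committed))
```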
Topic Metrics
| Metric | Description |
|---|---|
| Topic message-in/sec | An alarm is triggered when the number of messages produced per second for the topic meets the metric detail settings |
| Topic byte-in/sec | An alarm is triggered when the size (in Bytes) of messages produced per second for the topic meets the metric detail settings |
| Topic byte-out/sec | An alarm is triggered when the size (in Bytes) of messages consumed per second from the topic meets the metric detail settings |
| Increasement of topic message-in in the last T minutes | An alarm is triggered when the increase in the number of messages produced for the topic in the last T minutes meets the metric detail settings |
| Increasement of topic byte-in in the last T minutes | An alarm is triggered when the increase in the size (in Bytes) of messages produced for the topic in the last T minutes meets the metric detail settings |
| Increasement of topic byte-out in the last T minutes | An alarm is triggered when the increase in the size (in Bytes) of messages consumed from the topic in the last T minutes meets the metric detail settings |
| Increasement of topic message-in in the last T hours | An alarm is triggered when the increase in the number of messages produced for the topic in the last T hours meets the metric detail settings |
| Increasement of topic byte-in in the last T hours | An alarm is triggered when the increase in the size (in Bytes) of messages produced for the topic in the last T hours meets the metric detail settings |
| Increasement of topic byte-out in the last T hours | An alarm is triggered when the increase in the size (in Bytes) of messages consumed from the topic in the last T hours meets the metric detail settings |
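
One way to read "increase in the last T minutes" is as the growth of a cumulative counter over a trailing window. The sketch below assumes that reading, with samples given as `(timestamp_seconds, cumulative_count)` pairs sorted by time. The function name and sample data are hypothetical, and the product may compute the increase differently.

```python
from typing import List, Tuple


def increase_over_window(
    samples: List[Tuple[int, int]], now: int, window_seconds: int
) -> int:
    """Difference between the last and first counter sample inside the window.

    Assumption: samples are sorted by timestamp and the counter never resets.
    Returns 0 when fewer than two samples fall inside the window.
    """
    start = now - window_seconds
    in_window = [value for ts, value in samples if start <= ts <= now]
    if len(in_window) < 2:
        return 0
    return in_window[-1] - in_window[0]


# Hypothetical message-in counter sampled once per minute.
samples = [(0, 100), (60, 160), (120, 250), (180, 400), (240, 420)]
increase = increase_over_window(samples, now=240, window_seconds=2 * 60)  # T = 2 minutes
print(increase)
```

For the "last T hours" variants, the same function would apply with `window_seconds = T * 3600`.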
Connect Metrics
| Metric | Description |
|---|---|
| Connect instance DOWN | An alarm is triggered when all instances (servers) of the registered Connect Cluster are down |
CMPS Metrics
| Metric | Description |
|---|---|
| Cluster Message Consumption Per Second | An alarm is triggered when the cluster message consumption per second meets the metric detail settings |
| Cluster Message Consumption Per Second (Consumer Group Level) | An alarm is triggered when the cluster message consumption per second for a consumer group meets the metric detail settings |
| Cluster Message Consumption Per Second (Consumer Group - Topic Level) | An alarm is triggered when the cluster message consumption per second for a consumer group and topic meets the metric detail settings |
Connector Metrics
| Metric | Description |
|---|---|
| Task status abnormal (failed) | An alarm is triggered when the task status changes to failed |
| Messages polled per second (poll) [Source Connector] | An alarm is triggered when the number of messages polled per second by the source connector meets the metric detail settings |
| Messages written per second (write) [Source Connector] | An alarm is triggered when the number of messages written per second by the source connector meets the metric detail settings |
| Messages read per second (read) [Sink Connector] | An alarm is triggered when the number of messages read per second by the sink connector meets the metric detail settings |
| Messages sent per second (send) [Sink Connector] | An alarm is triggered when the number of messages sent per second by the sink connector meets the metric detail settings |
| Number of records failed by connector | An alarm is triggered when the number of records the connector failed to process meets the metric detail settings |
| Number of write failures to DLT (Dead Letter Topic) | The number of failed attempts to write records that the connector could not process to the DLT. An alarm is triggered when this number meets the metric detail settings |
Data Mirroring Metrics
| Metric | Description |
|---|---|
| Messages processed per second by topic (bytes) | An alarm is triggered when the size (in bytes) of messages replicated per second for a topic meets the metric detail settings |
| Mirroring job lag | An alarm is triggered when the mirroring job lag meets the metric detail settings |