Top Kafka Metrics to Monitor in Production

October 6, 2025
Tags:
Observability
OpenTelemetry
Kafka Monitoring

Apache Kafka is the backbone of many modern, event-driven systems, powering everything from user activity streams to fraud detection pipelines. 

But in production, it’s not enough for Kafka to simply be “up and running.” Engineers need to know if messages are backing up, if replication is falling behind, or if controller elections are destabilizing the cluster. These aren’t just metrics. They’re early warning signals for potential downtime, data loss, or cascading failures across the stack.

This guide dives into the most important Kafka broker metrics that SRE and DevOps teams should track to ensure environment reliability and performance at scale. 

Why Kafka Broker Metrics Matter

While both client-side (producer/consumer) and server-side metrics are important, broker-level metrics offer the most reliable window into Kafka’s real-time health and stability.

The broker is the core engine managing message ingestion, replication, controller elections, and serving reads to consumers. If it slows down, stalls, or fails, the entire pipeline gets disrupted.

From an operations perspective, broker metrics answer key questions such as:

  • Are we ingesting and replicating messages fast enough during peak load?
  • Are any partitions under-replicated or without a leader?
  • Is the controller stable, or are we experiencing frequent failovers?

Kafka metrics provide real-time operational visibility into the cluster’s performance and fault tolerance.

By monitoring these metrics closely, teams can catch early signs of trouble, whether it’s replication lag, rising message backlogs, or controller instability, and take corrective action before it affects production workflows.

💡 Methods to Monitor Kafka Metrics with OpenTelemetry

OpenTelemetry (OTel) provides flexible ways to instrument and collect Kafka metrics. Depending on your environment and requirements, you can choose from three primary methods to instrument from the broker side:

1. JMX Receiver – Exposes Kafka’s built-in JMX metrics, collected via the OTel Collector. Good for detailed broker-level metrics.
2. OTel Java Agent – Auto-instruments Kafka applications (producers/consumers) to export metrics without manual code changes.
3. Kafka Metrics Receiver – Purpose-built OTel component that directly collects Kafka metrics without JMX overhead. Lightweight and simpler for cluster-level monitoring.

Each method has trade-offs in terms of setup complexity, performance, and coverage. To learn more, check out our detailed Kafka Monitoring Blog.
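
To make the options concrete, here is a minimal OTel Collector configuration sketch that wires the JMX receiver (option 1) and the Kafka Metrics Receiver (option 3) into a Prometheus exporter. The broker address, JMX port, jar path, and exporter endpoint are placeholders for illustration; adjust them to your environment and check the receiver docs for your Collector version.

```yaml
# otel-collector.yaml -- minimal sketch; hostnames, ports, and paths are assumptions.
receivers:
  kafkametrics:                      # lightweight, no JMX needed
    brokers: ["kafka:9092"]          # placeholder broker address
    protocol_version: 2.0.0
    scrapers: [brokers, topics, consumers]
    collection_interval: 30s
  jmx:                               # detailed broker-level JMX metrics
    jar_path: /opt/opentelemetry-jmx-metrics.jar   # JMX metrics gatherer jar
    endpoint: kafka:9999             # broker's JMX port
    target_system: kafka
    collection_interval: 30s

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"         # scrape target for Prometheus

service:
  pipelines:
    metrics:
      receivers: [kafkametrics, jmx]
      exporters: [prometheus]
```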

In short, broker metrics serve as the first line of defense for maintaining performance, resilience, and overall system stability at scale. To provide deeper operational clarity, we’ve divided them into specific categories.

Categories of Kafka Metrics

Kafka metrics can be grouped into categories that highlight workload patterns, replication status, controller stability, and request processing efficiency, helping engineers understand how well Kafka is sustaining reliability and performance under varying conditions.

1. Traffic/Throughput

Kafka clusters often experience unpredictable traffic patterns, especially when handling spikes in producer or consumer activity. 

Throughput metrics provide foundational visibility into the volume of messages flowing through the system, helping teams understand whether the cluster is keeping pace with demand or experiencing ingestion bottlenecks.

Consider a scenario where your team is onboarding a new high-throughput service that pushes millions of events per minute into Kafka. Or perhaps you're troubleshooting delays during peak load hours. In both cases, throughput metrics help determine whether Kafka is sustaining ingestion and delivery performance, or if scaling is required to avoid backpressure.

Here are the core metrics to track in this category:

  • kafka_message_count_total: Tracks the total number of messages received by the broker, a key indicator of overall ingestion volume.

  • kafka_network_io_bytes_total{direction="in"}: Measures inbound network traffic, reflecting producer write load and helping validate whether producers are pushing data as expected.

  • kafka_network_io_bytes_total{direction="out"}: Captures outbound traffic, signaling consumer read activity and offering insights into downstream processing throughput.
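
Since all three are monotonically increasing counters, they are typically consumed as rates. The sketch below shows hypothetical Prometheus recording rules built on the metric names above; the exact names your pipeline emits may differ depending on the receiver and exporter you chose.

```yaml
# Illustrative Prometheus recording rules for Kafka throughput.
groups:
  - name: kafka-throughput
    rules:
      - record: kafka:messages_in:rate5m            # broker-wide ingestion rate
        expr: sum(rate(kafka_message_count_total[5m]))
      - record: kafka:bytes_in:rate5m               # producer write load
        expr: sum(rate(kafka_network_io_bytes_total{direction="in"}[5m]))
      - record: kafka:bytes_out:rate5m              # consumer read activity
        expr: sum(rate(kafka_network_io_bytes_total{direction="out"}[5m]))
```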

2. Request Load

Beyond traffic volume, it is equally important to understand how producers and consumers are interacting with the Kafka environment. 

Request load metrics shed light on the frequency, type, and health of requests handled by brokers. These metrics reveal whether client activity is evenly distributed or if the cluster is straining under excessive request pressure, which can lead to degraded performance or failures.

In production, teams often face issues where services start timing out, retries increase, or latency spikes appear without a clear root cause. Request load metrics reveal whether the problem originates from producers overloading brokers with publish requests, consumers hammering brokers with excessive fetches, or brokers themselves failing to keep up with client demand. They help teams quickly isolate whether scaling, load redistribution, or client tuning is needed to restore stability.

Here are the core metrics to track in this category:

  • kafka_request_count_total{type="produce"}: Count of producer requests, reflecting the publishing rate of new data into the cluster.


  • kafka_request_count_total{type="fetch"}: Count of consumer fetch requests, signaling read demand and how aggressively consumers are polling brokers.

  • kafka_request_failed_total: Number of failed requests, exposing client-broker communication errors or request-level issues impacting reliability.
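
As a starting point, the failure count can be turned into a ratio-based alert. The rule below is a hedged sketch: the 5% threshold and 10-minute window are arbitrary defaults, not recommendations, and the metric names assume the conventions used in this article.

```yaml
# Illustrative alert: fire when more than 5% of client requests fail.
groups:
  - name: kafka-request-load
    rules:
      - alert: KafkaRequestFailureRateHigh
        expr: |
          sum(rate(kafka_request_failed_total[5m]))
            / sum(rate(kafka_request_count_total[5m])) > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "More than 5% of Kafka client requests are failing"
```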



3. Partitions & Replication

Partition and replication metrics are critical for ensuring Kafka’s durability, fault tolerance, and balanced workload distribution. They provide visibility into whether data is evenly spread across brokers and whether replicas are in sync to guarantee high availability during failures.

While onboarding a new high-volume service, a surge in partition creation and replication activity can expose uneven broker workloads or performance hotspots. Replication delays may also occur if follower brokers struggle to keep pace with leaders, leading to under-replicated partitions. Tracking these metrics enables teams to quickly detect imbalances, validate replication health, and take corrective actions like rebalancing or scaling.

Here are the core metrics to track:

  • kafka_partition_count: Total number of partitions managed by the broker, representing workload distribution.

  • kafka_partition_underReplicated: Partitions missing replicas, signaling replication lag or failures in maintaining redundancy.

  • kafka_partition_offline: Partitions without an active leader, which should always remain at zero to avoid unavailability.

  • kafka_replica_fetcher_lag: Measures the lag between leader and follower replicas, indicating how quickly replicas are catching up with the leader.
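
Because under-replicated and offline partitions should normally sit at zero, they translate directly into alerts. The rules below are a sketch under that assumption; tune the `for` durations to your availability targets.

```yaml
# Illustrative replication-health alerts.
groups:
  - name: kafka-replication
    rules:
      - alert: KafkaUnderReplicatedPartitions
        expr: sum(kafka_partition_underReplicated) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "One or more partitions are under-replicated"
      - alert: KafkaOfflinePartitions
        expr: sum(kafka_partition_offline) > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Partitions have no active leader"
```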

4. Controller Health

In Kafka, the controller plays a central role in managing partition leadership and broker coordination. 

These metrics are crucial for ensuring that the cluster remains stable, since any instability in the controller directly impacts availability, partition assignments, and overall reliability of message delivery.

Consider a case where your team is troubleshooting recurring issues: consumer groups face frequent rebalances, or producers see intermittent write failures. You notice partition leadership is shifting too often, or more than one broker is acting as the controller. In such cases, controller health metrics reveal whether the instability is caused by unclean leader elections, failed broker nodes, or competing controllers, all issues that disrupt service reliability.

Here are the core metrics to track:

  • kafka_controller_active_count: Number of active controllers in the cluster. This should always be exactly one; anything else signals instability in cluster coordination.

  • kafka_leaderElection_unclean_count_total: Counts the number of unclean leader elections. A rising value indicates controller instability and potential risk of data loss during failover events.
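
Both conditions map onto simple alerts: exactly one active controller, and no unclean elections. A hedged sketch, again assuming the metric names above:

```yaml
# Illustrative controller-health alerts.
groups:
  - name: kafka-controller
    rules:
      - alert: KafkaActiveControllerCountNotOne
        expr: sum(kafka_controller_active_count) != 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Cluster does not have exactly one active controller"
      - alert: KafkaUncleanLeaderElection
        expr: increase(kafka_leaderElection_unclean_count_total[15m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Unclean leader election(s) detected in the last 15 minutes"
```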

5. Latency

Kafka clusters often face performance challenges where even small delays in request handling can ripple through and impact downstream systems. 

Latency metrics provide critical visibility into how responsive the brokers are, helping teams identify whether Kafka is introducing delays that slow down overall event processing.

Let us say your downstream applications begin reporting slow reads, or data pipelines lag behind in processing events during peak hours. Developers and SREs often struggle to identify whether the bottleneck lies in Kafka’s request handling or in the consumers themselves. Latency metrics expose whether requests are slowing at the median or whether tail-latency spikes are creating unpredictable delays, insights that guide whether to optimize cluster resources or scale out.

Here are the core metrics to track:

  • kafka_request_time_50p_milliseconds: Median latency per request type, indicating the typical responsiveness of Kafka.

  • kafka_request_time_99p_milliseconds: 99th percentile latency, capturing worst-case request handling delays.
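
Tail latency is usually the more actionable of the two. The sketch below alerts on p99 produce latency; the `type` label and the 500 ms threshold are assumptions for illustration and should be adapted to your workload and SLOs.

```yaml
# Illustrative tail-latency alert on produce requests.
groups:
  - name: kafka-latency
    rules:
      - alert: KafkaProduceTailLatencyHigh
        expr: kafka_request_time_99p_milliseconds{type="produce"} > 500
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "p99 produce latency above 500 ms on {{ $labels.instance }}"
```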

Up next, we’ll focus on which metrics actually matter most for operational visibility.

Top 10 Kafka Metrics Every Team Should Monitor

Kafka generates hundreds of metrics, but only a handful truly determine whether your environment is stable, reliable, and performing as expected.

These metrics provide actionable insights into Kafka’s availability, replication integrity, and request processing performance, enabling teams to proactively maintain reliability at scale.

| Metric Name | Type | Description | Critical Insight |
| --- | --- | --- | --- |
| kafka_message_count_total | Counter | Total messages received by the broker. | Tracks overall message ingestion volume. |
| kafka_network_io_bytes_total{direction="in"} | Counter | Inbound traffic in bytes. | Monitors producer write load. |
| kafka_network_io_bytes_total{direction="out"} | Counter | Outbound traffic in bytes. | Monitors consumer read activity. |
| kafka_request_count_total{type="produce"} | Counter | Count of producer requests. | Indicates producer activity levels. |
| kafka_request_count_total{type="fetch"} | Counter | Count of consumer fetch requests. | Checks consumer read demand. |
| kafka_request_failed_total | Counter | Number of failed requests. | Detects client-broker communication issues. |
| kafka_partition_underReplicated | Gauge | Partitions missing replicas. | Highlights replication issues. |
| kafka_partition_offline | Gauge | Partitions without a leader. | Signals potential data unavailability. |
| kafka_controller_active_count | Gauge | Number of active controllers. | Ensures cluster stability. |
| kafka_leaderElection_unclean_count_total | Counter | Count of unclean leader elections. | Indicates controller instability. |
| kafka_consumer_lag | Gauge | Difference between latest offset and consumer offset. | Identifies consumers falling behind. |
| kafka_log_flush_rate | Counter | Rate of log flush operations. | Monitors disk I/O pressure. |
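
Consumer lag from the table above is often the first metric teams alert on, since it directly reflects whether downstream processing is keeping up. A hedged sketch, with an arbitrary threshold and assuming the metric carries consumer group and topic labels:

```yaml
# Illustrative consumer-lag alert; the 10,000-message threshold is arbitrary.
groups:
  - name: kafka-consumers
    rules:
      - alert: KafkaConsumerLagHigh
        expr: max by (group, topic) (kafka_consumer_lag) > 10000
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Consumer group {{ $labels.group }} is falling behind on {{ $labels.topic }}"
```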

Analyzing these metrics side by side in a unified view helps teams identify correlations and detect anomalies faster, but turning them into actionable insights requires effective visualization.

Visualizing Kafka Metrics

The collected Kafka metrics can be exposed to and queried in Prometheus. Below is a comparison of what you can expect depending on the method used:

1. JMX Receiver and Java Agent

Both of these approaches provide the full set of Kafka metrics covering broker, topic, partition, and consumer group level visibility. Since they rely on the underlying JMX metrics, the output is identical for both methods.

2. Kafka Metrics Receiver

Instrumenting Kafka via the kafkametrics receiver is more focused and lightweight. It collects only broker-level, consumer, and topic metrics. While it’s simpler to configure and less resource-intensive, it does not provide the same depth as the JMX-based methods.

Here’s a side-by-side comparison of broker, topic, and consumer-level Kafka metrics available through the Kafka Metrics Receiver and the JMX/Java Agent.

In addition to these metrics, this method also surfaces JVM-level metrics (heap usage, garbage collection, thread activity). For details on the complete set of available metrics, refer to the documentation.

Conclusion

Ensuring Kafka stays performant under pressure isn’t about tracking hundreds of metrics; it’s about watching the right ones.

By keeping an eye on broker-level signals like replication lag, consumer throughput, request latency, and controller stability, engineers can catch problems before they turn into incidents.

The goal isn’t just visibility. It’s faster debugging, lower downtime, and smoother operations at scale. 

To learn more about monitoring Kafka in production, refer to the Kafka monitoring guide.

Isha Bhardwaj