Reporting metrics about lag and latency are very important to alert when:
- a consumer is blocked in error
- to understand the throughput of consumer and the need to scale
Reporting metrics about latency is not trivial because:
- Kafka exposes lots of metrics at consumer/producer level but few at the cluster level, it always requires additional tools to report lags
- Even if the lag can be given by Kafka, the latency requires to read the current Record that contains timestamp (Nuxeo code).
- It has a cost to get lag and latency for each stream and each consumer group. Technically this means getting lag/latency for each partition of each topic for each group and aggregates results.
Also latency should not be reported by each Nuxeo because these metrics are about cluster,
and the frequency should be low to not create overhead: something like every minute.
The lag and latency can already be displayed using stream.sh, these metrics can also be tracked with the traker option that persists lag and latency over time into a stream.
Building a new stream.sh monitor command that works exactly like the traker but instead of writing metrics to a stream publish them to graphite is the easiest way to have metrics reported in a reliable way. (the tracker command already supports failover)