[NXP-26248] stream.sh must be able to expose latency to graphite - Nuxeo Issue Tracker

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 10.3
Fix Version/s: 10.3
Component/s: Streams

Release Notes Description:
Nuxeo Stream can be monitored using a Graphite stack.
For this, you need to run a provided collector that publishes metrics about the Nuxeo Stream consumers.

For instance, to submit stats about the Bulk Service every minute:

# when using Kafka
${NUXEO_HOME}/bin/stream.sh monitor -k --codec avro -l ALL -i 60 --host graphite-server --port 2003
# when using Chronicle Queue
${NUXEO_HOME}/bin/stream.sh monitor --chronicle ./nxserver/data/streams/bulk --codec avro -l ALL -i 60 --host graphite-server --port 2003

When using Kafka, you can run this command on different nodes to ensure failover; only one instance will publish metrics.

To set up a Graphite/Grafana stack with a Nuxeo dashboard for testing purposes, use Docker Compose:
https://github.com/nuxeo/nuxeo/tree/master/nuxeo-runtime/nuxeo-runtime-metrics
Tags:
- nxcore
Backlog priority:
700
Sprint:
nxcore 10.10.2
Story Points:
3

Description

Reporting metrics about lag and latency is very important in order to:

- alert when a consumer is blocked in error
- understand the throughput of consumers and the need to scale

Reporting metrics about latency is not trivial because:

- Kafka exposes many metrics at the consumer/producer level but few at the cluster level; reporting lag always requires additional tools.
- Even if the lag can be given by Kafka, computing the latency requires reading the current record, which contains a timestamp (Nuxeo code).
- Getting lag and latency for each stream and each consumer group has a cost. Technically, this means getting lag/latency for each partition of each topic for each group and aggregating the results.
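
The per-partition aggregation described above can be sketched as follows. This is only an illustration, not the Nuxeo implementation: the input format (one line per partition) and the function name `aggregate_lag_latency` are hypothetical, and summing the lag while keeping the highest latency per group and stream is an assumption about how results could be combined.

```shell
#!/bin/sh
# Hypothetical input: one whitespace-separated line per partition:
#   <group> <stream> <partition> <lag> <latency_ms>
# Assumption: lag is summed over partitions, latency is the maximum over partitions.
aggregate_lag_latency() {
  awk '
    {
      key = $1 " " $2          # one result per (group, stream) pair
      lat = $5 + 0             # force a numeric comparison
      lag[key] += $4
      if (!(key in latency) || lat > latency[key]) latency[key] = lat
    }
    END {
      # Output: <group> <stream> <total lag> <max latency_ms>
      for (key in lag) print key, lag[key], latency[key]
    }
  ' | sort
}
```

For example, feeding it two partitions of one stream and one partition of another yields a single line per group and stream. The cost grows with the number of groups, topics and partitions, which is one reason to keep the reporting interval low.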

Also, latency should not be reported by each Nuxeo node because these metrics describe the cluster,
and the reporting frequency should be low to avoid overhead: something like every minute.

The lag and latency can already be displayed using stream.sh; these metrics can also be tracked with the tracker option, which persists lag and latency over time into a stream.

Building a new stream.sh monitor command that works exactly like the tracker but, instead of writing metrics to a stream, publishes them to Graphite is the easiest way to have metrics reported reliably (the tracker command already supports failover).
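
To illustrate what publishing to Graphite involves: Graphite's plaintext protocol accepts one `<metric path> <value> <timestamp>` line per data point, by default on port 2003. The sketch below only formats such lines. The function name `graphite_lines` and the metric path layout (`nuxeo.streams.<group>.<stream>.lag`) are hypothetical and are not necessarily what stream.sh monitor emits.

```shell
#!/bin/sh
# Format data points for Graphite's plaintext protocol:
#   <metric path> <value> <unix timestamp>
# The metric path layout below is an assumption made for this example.
graphite_lines() {
  group=$1 stream=$2 lag=$3 latency_ms=$4 ts=$5
  prefix="nuxeo.streams.${group}.${stream}"
  printf '%s.lag %s %s\n' "$prefix" "$lag" "$ts"
  printf '%s.latency %s %s\n' "$prefix" "$latency_ms" "$ts"
}

# Sending the lines needs a reachable Graphite server, for example with nc
# (options for closing the connection vary between nc implementations):
#   graphite_lines bulk command 15 1200 "$(date +%s)" | nc graphite-server 2003
```

Keeping the formatting separate from the network call makes it possible to check the output locally before pointing it at a real server.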

Issue Links

is related to

NXP-26338 Add missing metrics on Grafana dashboard

Resolved

NXP-28508 Expose Nuxeo Stream latency metrics to Datadog

Resolved

NXP-26416 Nuxeo Stream should expose latency to Prometheus

Resolved

People

Assignee:

Benoit Delbosc

Reporter:

Benoit Delbosc

Participants:

Benoit Delbosc, Jenkins

Votes:

0

Watchers:

2

Dates

Created:

2018-11-19 15:48

Updated:

2020-01-13 16:47

Resolved:

2018-11-29 11:31
