Nuxeo Platform / NXP-26248

stream.sh must be able to expose latency to graphite


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 10.3
    • Fix Version/s: 10.3
    • Component/s: Streams
    • Release Notes Description:

      Nuxeo Stream can be monitored using a Graphite stack.
      For this, you need to run a provided collector that publishes metrics about the Nuxeo Stream consumers.

      For instance, to submit stats about the Bulk Service every minute:

      # when using Kafka
      ${NUXEO_HOME}/bin/stream.sh monitor -k --codec avro -l ALL -i 60 --host graphite-server --port 2003
      # when using Chronicle Queue
      ${NUXEO_HOME}/bin/stream.sh monitor --chronicle ./nxserver/data/streams/bulk --codec avro -l ALL -i 60 --host graphite-server --port 2003
      

      When using Kafka, you can run this command on multiple nodes to ensure failover; only one instance will publish metrics.

      To set up a Graphite/Grafana stack with a Nuxeo dashboard for testing purposes, use Docker Compose:
      https://github.com/nuxeo/nuxeo/tree/master/nuxeo-runtime/nuxeo-runtime-metrics
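      The collector publishes to the Graphite plaintext port (2003 in the commands above). As a reference, here is a minimal sketch of what such a publish looks like; the metric name and the helper functions are illustrative assumptions, not the actual names used by stream.sh:

```python
import socket
import time

def format_metric(name, value, timestamp):
    """Render one line of Graphite's plaintext protocol: '<name> <value> <ts>\\n'."""
    return f"{name} {value} {int(timestamp)}\n"

def publish_metric(host, port, name, value, timestamp=None):
    """Send a single metric to a Graphite server over TCP (plaintext port 2003)."""
    ts = time.time() if timestamp is None else timestamp
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(format_metric(name, value, ts).encode("ascii"))

# Illustrative metric name only; the names actually published by stream.sh may differ:
# publish_metric("graphite-server", 2003, "nuxeo.streams.bulk.lag", 42)
```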

    • Tags:
    • Backlog priority:
      700
    • Sprint:
      nxcore 10.10.2
    • Story Points:
      3

      Description

      Reporting metrics about lag and latency is very important in order to:

      • alert when a consumer is blocked in error
      • understand consumer throughput and the need to scale

      Reporting latency metrics is not trivial because:

      • Kafka exposes many metrics at the consumer/producer level but few at the cluster level; additional tools are always required to report lag.
      • Even if Kafka can provide the lag, computing latency requires reading the current record, which contains the timestamp (Nuxeo code).
      • Getting lag and latency for each stream and each consumer group has a cost: technically it means getting lag/latency for each partition of each topic for each group, then aggregating the results.
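      The per-partition aggregation described above can be sketched as follows; this is a minimal illustration with assumed offset/timestamp inputs, not the actual Nuxeo implementation:

```python
import time

def compute_lag_and_latency(partitions, now=None):
    """Aggregate lag and latency across the partitions of a topic for one consumer group.

    Each partition is a dict with:
      - 'end_offset':       latest offset available in the partition
      - 'committed_offset': last offset committed by the consumer group
      - 'record_timestamp': timestamp (seconds) of the current (next unprocessed)
                            record, or None when the partition has no lag
    """
    now = time.time() if now is None else now
    total_lag = 0
    max_latency = 0.0
    for p in partitions:
        lag = p["end_offset"] - p["committed_offset"]
        total_lag += lag
        # Latency requires reading the current record to get its timestamp.
        if lag > 0 and p["record_timestamp"] is not None:
            max_latency = max(max_latency, now - p["record_timestamp"])
    return total_lag, max_latency
```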

      Also, latency should not be reported by each Nuxeo node, because these metrics concern the whole cluster,
      and the reporting frequency should be low to avoid overhead: something like every minute.

      The lag and latency can already be displayed using stream.sh; these metrics can also be tracked with the tracker option, which persists lag and latency over time into a stream.

      Building a new stream.sh monitor command that works exactly like the tracker, but publishes metrics to Graphite instead of writing them to a stream, is the easiest way to report metrics reliably (the tracker command already supports failover).

        Attachments

          Issue Links

            Activity

              People

              • Votes:
                0
              • Watchers:
                2

                Dates

                • Created:
                  Updated:
                  Resolved:

                  Time Tracking

                  • Estimated:
                    Not Specified
                  • Remaining:
                    0m
                  • Logged:
                    2d