Nuxeo Platform / NXP-27471

Expose stream processor failures as metrics

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 10.10
    • Fix Version/s: 11.1, 2021.0
    • Component/s: Streams

      Description

      Since NXP-27164 there is a probe that reports stream processor failures. Because probes are exposed through the runningstatus as a health status, a single record that creates a systematic processing error will block the entire system.
      A proper retry policy mitigates temporary failures (a service that is unavailable, or a failure that requires human intervention), but it does not help with a buggy record that fails systematically.
      So instead of activating a probe for the stream processor, we could expose metrics on processing in error that can be used as a warning in a monitoring dashboard.
      This way ops can choose when to restart a Nuxeo node, instead of having nodes automatically blacklisted or restarted.

      The solution is to expose a counter metric that is incremented when a processing terminates because of an error. In addition, even when the probe is disabled, it would be useful for the stream processor probe output to list which processing is failing.
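
      As a rough illustration of the proposal above, the following Java sketch counts computations that terminate on error and records which ones are failing, so that a dashboard could warn on the counter rather than a health check taking the node out. It uses only the JDK. `StreamFailureMetrics`, `runComputation`, `failureCount` and `failingComputations` are hypothetical names chosen for this example; they are not part of the Nuxeo API and do not describe the actual fix.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical sketch of a failure counter; not the Nuxeo implementation. */
class StreamFailureMetrics {
    // Total number of computations that terminated because of an error.
    private final LongAdder failures = new LongAdder();
    // Last error seen per failing computation, keyed by computation name.
    private final Map<String, String> lastError = new ConcurrentHashMap<>();

    /** Runs a computation; a runtime error is counted instead of propagated. */
    void runComputation(String name, Runnable computation) {
        try {
            computation.run();
        } catch (RuntimeException e) {
            failures.increment();
            lastError.put(name, e.toString());
        }
    }

    /** Value a monitoring dashboard could chart or alert on. */
    long failureCount() {
        return failures.sum();
    }

    /** Sorted snapshot of failing computations, e.g. for a probe output. */
    Map<String, String> failingComputations() {
        return new TreeMap<>(lastError);
    }

    public static void main(String[] args) {
        StreamFailureMetrics metrics = new StreamFailureMetrics();
        // "indexing" and "audit" are made-up computation names.
        metrics.runComputation("indexing", () -> { });
        metrics.runComputation("audit", () -> {
            throw new IllegalStateException("buggy record");
        });
        System.out.println("failures=" + metrics.failureCount());
        System.out.println("failing=" + metrics.failingComputations());
    }
}
```

      In this sketch the failing computation stays visible through `failingComputations()` independently of any health check, which mirrors the idea of listing failing processing even when the probe is disabled.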

        Attachments

        1. Grafana Stream failure counter.png (3 kB)
        2. screenshot-1.png (17 kB)
        3. screenshot-2.png (21 kB)
        4. screenshot-3.png (30 kB)
        5. screenshot-4.png (45 kB)


                  Time Tracking

                  • Estimated: 0m
                  • Remaining: 0m
                  • Logged: 6h
