Nuxeo Platform / NXP-28094

Add Nuxeo Stream probe to health check by default

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 10.10
    • Fix Version/s: 11.1, 2021.0
    • Component/s: Streams
    • Release Notes Description:
      The default Nuxeo health check used by the runningstatus REST endpoint now includes a probe that checks Nuxeo Stream processors.
      When a processor fails (after retries are exhausted and because the policy is set to stop on failure),
      the health check reports a failure message immediately but returns a failure code only after a delay of 36h.
      This delay leaves time to fix the problem and to choose when to restart the Nuxeo instance; the alert itself can be triggered by other metrics or by the error log.

    • Team:
      PLATFORM
    • Sprint:
      nxplatform 11.1.20, nxplatform 11.1.21
    • Story Points:
      3

      Description

      Today the Nuxeo Stream probe is not activated in the default health check run by the runningstatus endpoint.

      The reason is that, with the probe active, any processing failure results in application downtime. This is too drastic: when a processor fails, the application can continue to work, the cause of the failure can be resolved, and a rolling restart can be done. The problem can therefore be resolved without data loss or downtime, provided the intervention happens within the stream retention period (4 days for CQ, 7 days for Kafka). Activating the current Nuxeo Stream probe would prevent this kind of intervention.

      To solve this, we could use a delayed probe: instead of returning a failure as soon as it happens, the probe reports the failure only after a configurable period, for instance 2 days.
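
      The delayed-probe idea can be sketched as follows. This is only an illustration in Python, not the Nuxeo implementation: the names DelayedProbe, ProbeResult, and delay_seconds are hypothetical, and the sketch assumes the wrapped probe is a plain callable that returns True when the stream processors are fine. It also assumes the delay is reset if the underlying probe recovers.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ProbeResult:
    healthy: bool  # drives the status code returned by the health check
    message: str   # always describes the real state, even during the delay


class DelayedProbe:
    """Hypothetical wrapper: a failure turns unhealthy only after a delay."""

    def __init__(self, probe: Callable[[], bool], delay_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._probe = probe
        self._delay = delay_seconds
        self._clock = clock  # injectable so the delay can be tested
        self._first_failure: Optional[float] = None

    def run(self) -> ProbeResult:
        if self._probe():
            # Assumption: a recovered probe clears the pending failure.
            self._first_failure = None
            return ProbeResult(True, "ok")
        now = self._clock()
        if self._first_failure is None:
            self._first_failure = now  # remember when the failure started
        elapsed = now - self._first_failure
        if elapsed >= self._delay:
            return ProbeResult(False, f"failure for {elapsed:.0f}s, delay expired")
        remaining = self._delay - elapsed
        # Still "healthy" for the control plane, but the message reports the failure.
        return ProbeResult(True, f"failure detected, unhealthy in {remaining:.0f}s")
```

      During the delay the result stays healthy while the message already mentions the failure, so monitoring can alert on the message or on other metrics before the control plane terminates the instance.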

      Nuxeo instances will continue to run for 2 days before being terminated by the control plane.

      During these 2 days, the operations team can still monitor errors or metrics or use the original probe, fix the cause of the failure, and restart without any downtime.

      If the failure is ignored, Nuxeo is stopped before the stream retention period expires, so no data is lost and a manual intervention is forced.
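
      The constraint between the probe delay and the retention period can be checked with simple arithmetic. The sketch below is an assumption-laden illustration: the RETENTION table only restates the 4-day (CQ) and 7-day (Kafka) figures quoted above, and the function name margin_before_data_loss is hypothetical.

```python
from datetime import timedelta

# Retention figures as quoted in this issue; actual deployments may differ.
RETENTION = {
    "cq": timedelta(days=4),
    "kafka": timedelta(days=7),
}


def margin_before_data_loss(backend: str, probe_delay: timedelta) -> timedelta:
    """Time left to fix and restart after the delayed probe stops the instance."""
    retention = RETENTION[backend]
    if probe_delay >= retention:
        # The instance would keep running past retention: records could expire
        # unprocessed before anyone is forced to intervene.
        raise ValueError(
            f"probe delay {probe_delay} must be shorter than retention {retention}"
        )
    return retention - probe_delay
```

      With the 2-day delay proposed here, this leaves roughly 2 days of margin on CQ and 5 days on Kafka, counted from the first unprocessed record.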


      Time Tracking

      • Original Estimate: 0m
      • Time Spent: 6h
      • Remaining Estimate: 30m
