[NXP-28094] Add Nuxeo Stream probe to health check by default - Nuxeo Issue Tracker

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 10.10
Fix Version/s: 11.1, 2021.0
Component/s: Streams

Release Notes Description:

The default Nuxeo health check used by the runningstatus REST endpoint now includes a probe that checks Nuxeo Stream Processors.
When a processor fails (after retries are exhausted and because the policy is set to stop on failure),
the health check reports a failure message immediately but returns a failure code only after a delay of 36h.
This delay gives operations time to fix the problem and choose when to restart the Nuxeo instance. In the meantime, the alert can be triggered by other metrics or by the error log.
Epic Link:
Stream Scalability
Tags:
- nxplatform
- resilience&scalability
Team:
PLATFORM
Sprint:
nxplatform 11.1.20, nxplatform 11.1.21
Story Points:
3

Description

Today the Nuxeo Stream probe is not activated in the default health check run by the runningstatus endpoint.

The reason is that any processing failure would result in application downtime. This is too drastic: when a processor fails, the application can continue to work, the cause of the failure can be resolved, and a rolling restart can be done. The problem can therefore be resolved without data loss or downtime, provided the intervention happens within the stream retention period (4 days for CQ, 7 days for Kafka). Activating the current Nuxeo Stream probe would prevent this kind of intervention.

To solve this, we could have a delayed probe: instead of returning a failure at the time it happens, it would report the failure after a configurable period, for instance 2 days.

Nuxeo instances would continue to run for 2 days before being terminated by the control plane.

During these 2 days, operations can still monitor errors or metrics, or use the original probe, then fix the cause of the failure and restart without any downtime.

But if the failure is ignored, Nuxeo will be stopped before the end of the stream retention period, so no data is lost and a manual intervention is forced.
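
To make the delayed probe idea concrete, here is a minimal sketch in plain Java. It is not the Nuxeo implementation: the class name DelayedFailureProbe, its Result type and the check method are hypothetical, and the processor state is passed in as a boolean instead of being read from Nuxeo Stream. The sketch only shows the intended behaviour: the message reports the failure immediately, while the success flag seen by the control plane flips only once the configured delay has elapsed.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical sketch of a delayed probe; names and structure are assumptions, not Nuxeo code. */
class DelayedFailureProbe {

    /** Outcome of one probe run: a flag for the control plane and a message for operators. */
    static final class Result {
        final boolean success;
        final String message;

        private Result(boolean success, String message) {
            this.success = success;
            this.message = message;
        }
    }

    private final Duration delay;

    // Time of the first observed failure, null while no failure has been seen.
    private Instant firstFailure;

    DelayedFailureProbe(Duration delay) {
        this.delay = delay;
    }

    /**
     * @param processorFailed whether a stream processor is currently in failure (assumed input)
     * @param now             current time, passed in so the behaviour is easy to test
     */
    synchronized Result check(boolean processorFailed, Instant now) {
        if (!processorFailed) {
            // Assumption of this sketch: a healthy state clears any pending failure.
            firstFailure = null;
            return new Result(true, "No stream processor failure");
        }
        if (firstFailure == null) {
            firstFailure = now;
        }
        Instant deadline = firstFailure.plus(delay);
        // The failure code is returned only once the delay has fully elapsed.
        boolean expired = !now.isBefore(deadline);
        String message = "Stream processor failure detected at " + firstFailure
                + (expired ? ", delay expired" : ", failure code will be returned at " + deadline);
        return new Result(!expired, message);
    }
}
```

With a probe built as new DelayedFailureProbe(Duration.ofHours(36)), a failing processor yields a Result whose message mentions the failure from the first call, while success stays true until 36 hours after that first call. The delay has to stay below the stream retention period (4 days for CQ, 7 days for Kafka) for the instance to be stopped before any data is lost.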

Issue Links

is related to

NXDOC-2002 Update Nuxeo Stream Error Handling with new HealthCheck

Resolved

NXP-28329 Add Nuxeo Stream probe to health check by default for 10.10

Resolved

NXP-27471 Expose stream processor failures as metrics

Resolved

Is referenced in

PR for master: #4165

People

Assignee:

Benoit Delbosc

Reporter:

Benoit Delbosc

Participants:

Benoit Delbosc, Jenkins, Support Tech User

Votes:

0

Watchers:

3

Dates

Created:

2019-10-01 09:49

Updated:

2020-12-17 16:35

Resolved:

2019-10-29 07:57

Time Tracking

Estimated:

Remaining:

30m

Logged: