-
Type: Improvement
-
Status: Resolved
-
Priority: Minor
-
Resolution: Fixed
-
Affects Version/s: 10.10
-
Component/s: Streams
-
Release Notes Description:
-
Epic Link:
-
Team: PLATFORM
-
Sprint: nxplatform 11.1.20, nxplatform 11.1.21
-
Story Points: 3
Today, the Nuxeo Stream probe is not activated in the default health check run by the runningstatus endpoint.
The reason is that any processing failure would then result in application downtime, which is too drastic. When a processor fails, the application can continue to work while the cause of the failure is resolved, and a rolling restart can then be done. The problem can therefore be resolved without data loss or downtime, provided the intervention happens within the stream retention period (4 days for CQ, 7 days for Kafka). Activating the current Nuxeo Stream probe would prevent this kind of intervention.
To solve this, we could have a delayed probe: instead of returning a failure at the time it happens, it reports the failure only after a configurable period, for instance 2 days.
Nuxeo instances will then continue to run for 2 days before being terminated by the control plane.
During these 2 days, operations can still monitor errors or metrics, or use the original probe, then fix the cause of the failure and restart without any downtime.
But if the failure is ignored, Nuxeo will be stopped before the end of the stream retention period, so no data is lost and a manual intervention is forced.
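The delayed reporting described above could be sketched roughly as follows. This is only an illustration of the idea, not the actual Nuxeo implementation: the class name DelayedFailureProbe and its methods are hypothetical, the failure timestamp is assumed to be kept in memory, and the 2-day delay is the example value from this ticket.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch: a probe that stays healthy for a configurable
// delay after the first observed stream processor failure.
class DelayedFailureProbe {

    private final Duration delay;

    // Time of the first observed failure, null while none has been seen.
    private Instant firstFailure;

    DelayedFailureProbe(Duration delay) {
        this.delay = delay;
    }

    // Remember the first failure only; later failures do not move the deadline.
    synchronized void recordFailure(Instant when) {
        if (firstFailure == null) {
            firstFailure = when;
        }
    }

    // Healthy when no failure was recorded, or while the delay has not yet elapsed.
    synchronized boolean isHealthy(Instant now) {
        if (firstFailure == null) {
            return true;
        }
        return now.isBefore(firstFailure.plus(delay));
    }
}
```

With a delay of Duration.ofDays(2), the probe keeps reporting success for 2 days after the first failure, which leaves a margin before the 4-day (CQ) or 7-day (Kafka) retention period expires. Any delay chosen in practice would need to stay below the retention period of the stream backend in use.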
- is related to
-
NXDOC-2002 Update Nuxeo Stream Error Handling with new HealthCheck
- Resolved
-
NXP-28329 Add Nuxeo Stream probe to health check by default for 10.10
- Resolved
-
NXP-27471 Expose stream processor failures as metrics
- Resolved
- Is referenced in