- Type: Bug
- Status: Resolved
- Priority: Minor
- Resolution: Fixed
- Affects Version/s: 10.10
- Fix Version/s: 10.10-HF22, 11.1, 2021.0
- Component/s: Streams
- Release Notes Summary: StreamStatus probe detects all abnormal computation termination.
- Backlog priority: 900
- Team: PLATFORM
- Sprint: nxplatform 11.1.27
- Story Points: 3
Since NXP-27471 and NXP-28094, the streamProbe detects failures that occur during processing (in the computation's user code), and a metric can be used for alerting.
There are still code paths where failure is not reported as such:
1. In computation code, when asking for termination after an unrecoverable error: calling askForTermination performs an intentional termination, so the probe doesn't report any failure. To fix this, an exception must be raised so that the fallback policy can be applied and the probe reports the failure.
For instance, this is the case in AbstractBulkComputation if the KVStore is not readable:
2020-01-06T12:06:57,906 ERROR [myActionComputationPool-00] [org.nuxeo.ecm.core.bulk.action.computation.AbstractBulkComputation] Stopping processing, unknown command: 5d20a75d-5ae3-4cf3-8cfb-45f459f883e9, offset: bulkDatasetExport-00:+78456167596035, record: Record{watermark=206874717140025344, wmDate=2020-01-06 11:40:35.602, flags=[DEFAULT], key='5d20a75d-5ae3-4cf3-8cfb-45f459f883e9:1', data.length=160, data="....%'..Y.H5d20a75d-5ae3-4cf3-8cfb-45f459f883e9.Hd4365bec-2831-4be1-a5d4-15eb43bb68adH08097371-9c40-46e4-abaf-84352ed5a797Hb667"}.
2. In the ComputationRunner code, errors are not reported as a failure by the probe. For instance, when Kafka is not reachable or the consumer position cannot be committed.
We need to make sure that abnormal termination is reported as a failure by the probe.
Note that this is different from NXP-28524, which focuses on improving resiliency when Kafka is not reachable.
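The two code paths above can be illustrated with a minimal, self-contained sketch. It does not use the Nuxeo API: Context, Computation, Runner and the failures counter are hypothetical stand-ins, and the real ComputationRunner and probe may work differently. The sketch only shows why a requested termination stays invisible to a failure counter, while an exception raised by the computation or by the commit step is counted.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class ProbeSketch {

    /** Hypothetical stand-in for a computation context; not the Nuxeo API. */
    static class Context {
        boolean terminationRequested;

        void askForTermination() {
            terminationRequested = true;
        }
    }

    /** Hypothetical computation contract, assumed for illustration. */
    interface Computation {
        void processRecord(Context context, String record);
    }

    /** Hypothetical runner that exposes a failure counter a probe could read. */
    static class Runner {
        final AtomicInteger failures = new AtomicInteger();

        /** Processes one record; returns true when the run ended abnormally. */
        boolean run(Computation computation, String record, Runnable commit) {
            Context context = new Context();
            try {
                computation.processRecord(context, record);
                if (context.terminationRequested) {
                    // Requested termination: looks intentional, nothing is counted.
                    return false;
                }
                commit.run(); // stands for committing the consumer position
                return false;
            } catch (RuntimeException e) {
                // Any exception, from user code or from the commit, is a failure.
                failures.incrementAndGet();
                return true;
            }
        }
    }

    public static void main(String[] args) {
        Runner runner = new Runner();

        // Path 1, before the fix: the error is swallowed by a requested termination.
        runner.run((ctx, rec) -> ctx.askForTermination(), "unknown-command", () -> { });
        System.out.println("after askForTermination: " + runner.failures.get());

        // Path 1, after the fix: raising an exception makes the failure visible.
        runner.run((ctx, rec) -> {
            throw new IllegalStateException("Unknown command: " + rec);
        }, "unknown-command", () -> { });
        System.out.println("after exception: " + runner.failures.get());

        // Path 2: a failing commit in the runner itself is counted as well.
        runner.run((ctx, rec) -> { }, "ok", () -> {
            throw new IllegalStateException("Cannot commit consumer position");
        });
        System.out.println("after commit failure: " + runner.failures.get());
    }
}
```

In this sketch the counter only increases when an exception reaches the runner, which is why the description asks for an exception instead of a call to askForTermination on unrecoverable errors.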
- is related to: NXP-27471 Expose stream processor failures as metrics (Resolved)