Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: None
Component/s: Streams
Release Notes Description:
Epic Link:
Tags:
Sprint: nxplatform 11.1.11, nxplatform 11.1.12
Story Points: 2
When a stream processor runs computations with a policy configured to stop on failure (the default, continueOnFailure=false),
a computation that fails systematically on a record, even after retries, blocks the entire stream processor on all nodes.
A workaround is needed to unblock this situation.
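For context, this stop-on-failure policy is built programmatically in lib-stream. Below is a minimal sketch, assuming the ComputationPolicyBuilder API of org.nuxeo.lib.stream.computation and the Failsafe RetryPolicy it relies on (exact names may differ between Nuxeo versions):

import java.util.concurrent.TimeUnit;

import org.nuxeo.lib.stream.computation.ComputationPolicy;
import org.nuxeo.lib.stream.computation.ComputationPolicyBuilder;

import net.jodah.failsafe.RetryPolicy;

public class DefaultPolicySketch {
    public static ComputationPolicy stopOnFailure() {
        return new ComputationPolicyBuilder()
                // retry a failing record a few times before giving up
                .retryPolicy(new RetryPolicy().withMaxRetries(3)
                                              .withDelay(1, TimeUnit.SECONDS))
                // false is the default: once retries are exhausted the
                // processor stops, blocking the stream on all nodes
                .continueOnFailure(false)
                .build();
    }
}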
Today the possible solutions are:
1. Fix the computation code so it can handle or skip the record without error.
2. Use stream.sh position to move the computation's position in the stream past the failing record.
3. Add a contribution that changes the computation policy to continueOnFailure=true so that failing records are skipped.
Solution 1 is the best approach, but it requires a long service interruption to analyze, code, test, release, and restart all nodes.
Solution 2 requires inspecting the log to find the offset of the failing record, for instance:
2019-06-06T11:40:38,655 ERROR [AbstractComputation] Computation: default fails last record: default-00:+12, after retries.
This indicates that the computation "default" fails to process the record on stream "default", partition 0, at offset 12. The computation's position can then be moved to the end of the partition or after a given timestamp (see ./bin/stream.sh help position). The problem is that moving the position to the end or past a timestamp may skip more than one record; moreover, if multiple consumers fail on the record, the operation has to be repeated for each failure, which is somewhat complex.
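Under the hood, repositioning a consumer group is a seek-and-commit on the log. A minimal sketch of what stream.sh position amounts to, assuming the lib-stream LogManager/LogTailer API (signatures vary across Nuxeo versions):

import org.nuxeo.lib.stream.computation.Record;
import org.nuxeo.lib.stream.log.LogManager;
import org.nuxeo.lib.stream.log.LogPartition;
import org.nuxeo.lib.stream.log.LogTailer;

public class RepositionSketch {
    public static void moveToEnd(LogManager manager) {
        // group = computation name ("default"), log = stream "default", partition 0
        try (LogTailer<Record> tailer =
                manager.createTailer("default", LogPartition.of("default", 0))) {
            tailer.toEnd();  // jump past every pending record, possibly more than one
            tailer.commit(); // persist the new position for the consumer group
        }
    }
}

This also shows why the method is coarse: toEnd() discards every record pending for that group, not just the failing one.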
With solution 3 there is a risk of skipping other failures than the one analyzed. It also only works for stream processors contributed through an extension point, which is not the case for the StreamWorkManager (or for some computations of the Bulk Service): those are created dynamically and can be changed only by patching code.
An easier way to bypass a problematic record should be provided. The procedure should look like:
- restart a Nuxeo node with a special option that skips the single failing record
- remove the option and do a rolling restart of all Nuxeo nodes.
This action should be done within the stream retention period (default is 4 days for Chronicle Queue and 7 days for Kafka).
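A sketch of how such an option could be wired into the policy. Both the skipFirstFailures builder method and the nuxeo.stream.recovery.skipFirstFailures property name are assumptions used for illustration here, not necessarily the shipped implementation:

import org.nuxeo.lib.stream.computation.ComputationPolicy;
import org.nuxeo.lib.stream.computation.ComputationPolicyBuilder;

public class RecoverySketch {
    public static ComputationPolicy buildPolicy() {
        // hypothetical startup option: -Dnuxeo.stream.recovery.skipFirstFailures=1
        int skip = Integer.getInteger("nuxeo.stream.recovery.skipFirstFailures", 0);
        return new ComputationPolicyBuilder()
                .continueOnFailure(false)  // keep the safe stop-on-failure default
                .skipFirstFailures(skip)   // hypothetical: tolerate the first N failures
                .build();
    }
}

Once the blocked record has been skipped, removing the option and doing a rolling restart restores the normal stop-on-failure behavior.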
is related to:
- NXP-28023 Fix Kafka TestStreamProcessor.testComputationRecoveryPolicy (Resolved)
- NXDOC-1936 Add a Nuxeo Stream section about error handling (Resolved)
- NXP-28043 Backport Stream Processor probe to the runningstatus (Resolved)
- NXP-27471 Expose stream processor failures as metrics (Resolved)