Affects Version/s: None
Release Notes Description:
There is a new option to recover from systematic stream processor failure.
First, add nuxeo.stream.recovery.skipFirstFailures=1 to a single Nuxeo node; its processors will then skip the first failing record instead of terminating.
Second, once the problematic record is skipped, remove the option from nuxeo.conf and perform a rolling restart of the other Nuxeo nodes in order to restore all processor threads.
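The two steps above can be sketched as a shell session. The nuxeo.conf location and the restart commands are assumptions to adapt to your install; here a temporary file stands in for nuxeo.conf so the snippet is self-contained:

```shell
# Stand-in for the real nuxeo.conf (actual location depends on your
# install, e.g. $NUXEO_HOME/bin/nuxeo.conf -- an assumption here).
NUXEO_CONF=$(mktemp)

# Step 1: enable recovery on a single node, then restart that node
# (e.g. ./bin/nuxeoctl restart) so it skips the failing record.
echo "nuxeo.stream.recovery.skipFirstFailures=1" >> "$NUXEO_CONF"
grep "skipFirstFailures" "$NUXEO_CONF"

# Step 2: once the record is skipped, remove the option again and
# rolling-restart the other nodes to restore all processor threads.
sed -i '/^nuxeo\.stream\.recovery\.skipFirstFailures=/d' "$NUXEO_CONF"

rm -f "$NUXEO_CONF"
```

Remember the option must stay on a single node only; leaving it in place would silently skip every future failure.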
Sprint: nxplatform 11.1.11, nxplatform 11.1.12
When a stream processor uses computations with a policy configured to stop on failure (the default, continueOnFailure=false),
a computation that generates a systematic failure after retries for a record will block the entire stream processor on all nodes.
A workaround is needed to unblock the situation.
Today the possible solutions are:
1. Fix the computation code so it can handle or skip the record without error.
2. Use stream.sh position to move the computation's position in the stream past the failing record.
3. Add a contribution to change the computation policy to continueOnFailure=true so records in failure are skipped.
Solution 1 is the best approach, but it requires a long service interruption to analyze, code, test, release, and restart all nodes.
Solution 2 requires inspecting the log to find the offset of the failing record; for instance, a log entry may indicate that computation "default" fails to process the record on stream "default", partition "0", offset "12". It is possible to move the computation's position to the end of the partition, or after a given timestamp, using ./bin/stream.sh help position. The problem is that this method may skip more than one record (by moving the position to the end or past a timestamp); moreover, if there are multiple consumers of the record, the operation has to be repeated for each failure, which is complex.
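A solution 2 session might look like the following. Only the help command comes from the text above; the flag names in the commented line are assumptions and must be confirmed against the help output before use:

```shell
# List what the position command actually supports:
./bin/stream.sh help position
# Hypothetical invocation to move computation "default" to the end of
# its input stream -- flag names are assumptions, not confirmed options:
# ./bin/stream.sh position --computation default --to-end
```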
With solution 3 there is a risk of skipping failures other than the one analyzed. It also works only for stream processors contributed via an extension point; this is not the case for the StreamWorkManager (or for some computations of the Bulk Service), which are created dynamically and can be changed only by patching the code.
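For solution 3, a contribution overriding the policy might look like the following sketch. The processor name, class, and attribute values are illustrative assumptions; check the streamProcessor extension point documentation for your Nuxeo version:

```xml
<!-- Illustrative only: makes this processor skip failing records. -->
<extension target="org.nuxeo.runtime.stream.service" point="streamProcessor">
  <streamProcessor name="myProcessor" class="org.example.MyProcessor"
                   defaultConcurrency="2" defaultPartitions="2">
    <policy name="default" maxRetries="3" delay="1s"
            continueOnFailure="true" />
  </streamProcessor>
</extension>
```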
An easier way to bypass a problematic record should be provided. The procedure should be:
- restart a Nuxeo node with a special option; this node will skip the single failing record
- remove the option and do a rolling restart of all Nuxeo nodes.
This action should be done within the stream retention period (default is 4 days for Chronicle Queue and 7 days for Kafka).