  Nuxeo Platform / NXP-27529

Provide a recovery procedure for systematic failure in a stream processor


    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 11.1, 2021.0
    • Component/s: Streams
    • Release Notes Description:

      There is a new option to recover from systematic stream processor failure.
      First, add nuxeo.stream.recovery.skipFirstFailures=1 to nuxeo.conf on a single Nuxeo node; its processors will then skip the first record in failure instead of terminating.
      Second, once the problematic record has been skipped, remove the option from nuxeo.conf and perform a rolling restart of the other Nuxeo nodes in order to restore all processor threads.

    • Sprint:
      nxplatform 11.1.11, nxplatform 11.1.12
    • Story Points:
      2

      Description

      When a stream processor uses computations whose policy is configured to stop on failure (the default, continueOnFailure=false),
      a computation that keeps failing on a record, even after retries, blocks the entire stream processor on all nodes.
      There must be some workaround to unlock the situation.
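      To make the failure mode concrete, a computation built on the nuxeo-lib-stream API looks roughly like the sketch below (class name and failing condition are hypothetical, not taken from this ticket): any exception escaping processRecord is retried according to the ComputationPolicy, and once retries are exhausted with continueOnFailure=false the whole processor terminates.

      import org.nuxeo.lib.stream.computation.AbstractComputation;
      import org.nuxeo.lib.stream.computation.ComputationContext;
      import org.nuxeo.lib.stream.computation.Record;

      // Hypothetical computation named "default" with one input stream and no output stream.
      public class MyComputation extends AbstractComputation {

          public MyComputation() {
              super("default", 1, 0); // name, number of input streams, number of output streams
          }

          @Override
          public void processRecord(ComputationContext context, String inputStreamName, Record record) {
              if (cannotHandle(record)) {
                  // A systematic failure: this record fails again on every retry and on every node.
                  throw new IllegalStateException("Cannot process record: " + record);
              }
              // ... normal processing ...
              context.askForCheckpoint();
          }

          private boolean cannotHandle(Record record) {
              return false; // placeholder for whatever makes one specific record unprocessable
          }
      }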

      Today the possible solutions are:
      1. Fix the computation code so it can handle or skip the record without error.
      2. Play with stream.sh position to change the computation position in the stream in order to skip the record.
      3. Add a contribution to change the computation policy to continueOnFailure=true so records in failure are skipped.

      Solution 1 is the best approach, but it requires a long service interruption to analyze, code, test, release, and restart all nodes.

      Solution 2 requires inspecting the log to find the offset of the record in failure, for instance:

      2019-06-06T11:40:38,655 ERROR [AbstractComputation] Computation: default fails last record: default-00:+12, after retries.
      

      This indicates that computation "default" fails to process the record on stream "default", partition "0", offset "12". It is possible to move the computation's position to the end of the partition or after a given timestamp (see ./bin/stream.sh help position). The problem is that this method may skip more than one record (by moving the position to the end or after a timestamp); also, if there are multiple consumers of the record, the operation has to be repeated for each failure, which is a bit complex.

      With solution 3 there is a risk of skipping failures other than the one analyzed. It also works only for stream processors contributed through an extension point; this is not the case for the StreamWorkManager (or for some computations of the Bulk Service), whose processors are created dynamically and whose policy can be changed only by patching code.
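      For those dynamically created processors, skip-on-failure behavior can only be introduced where the policy is built in code. A minimal sketch, assuming the ComputationPolicyBuilder API from nuxeo-lib-stream (the retry count and class name are illustrative):

      import org.nuxeo.lib.stream.computation.ComputationPolicy;
      import org.nuxeo.lib.stream.computation.ComputationPolicyBuilder;

      import net.jodah.failsafe.RetryPolicy;

      public class SkipOnFailurePolicy {

          // Retry a failing record a few times, then skip it (continueOnFailure = true)
          // instead of terminating the processor.
          public static ComputationPolicy newPolicy() {
              return new ComputationPolicyBuilder()
                      .retryPolicy(new RetryPolicy().withMaxRetries(3))
                      .continueOnFailure(true)
                      .build();
          }
      }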

      An easier way to bypass a problematic record should be provided; the procedure should be:

      • restart a Nuxeo node with a special option; this skips the single record in failure
      • remove the option and do a rolling restart on all Nuxeo nodes.
        This action should be done within the stream retention period (default is 4 days for Chronicle Queue and 7 days for Kafka).
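      As implemented (see the release notes above), the special option is a nuxeo.conf property set on one node only, for example:

      # nuxeo.conf on a single Nuxeo node: skip the first record still failing after retries
      nuxeo.stream.recovery.skipFirstFailures=1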


    People

    • Votes: 0
    • Watchers: 2


    Time Tracking

    • Estimated: 0m
    • Remaining: 0m
    • Logged: 1d