  Nuxeo Platform / NXP-27529

Provide a recovery procedure for systematic failure in a stream processor


    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 11.1, 2021.0
    • Component/s: Streams
    • Release Notes Description:

      There is a new option to recover from systematic stream processor failure.
      First, add nuxeo.stream.recovery.skipFirstFailures=1 to nuxeo.conf on a single Nuxeo node; its processors will then skip the first record in failure instead of terminating.
      Second, once the problematic record has been skipped, remove the option from nuxeo.conf and perform a rolling restart of the other Nuxeo nodes in order to restore all processor threads.

    • Sprint:
      nxplatform 11.1.11, nxplatform 11.1.12
    • Story Points:
      2

      Description

      When a stream processor uses computations whose policy is configured to stop on failure (the default, continueOnFailure=false),
      a computation that keeps failing on a record, even after retries, blocks the entire stream processor on all nodes.
      There must be some workaround to unlock the situation.
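      To make the failure mode concrete, a computation built on the nuxeo-lib-stream API looks roughly like the sketch below (class name and failing condition are hypothetical, not taken from this ticket): any exception escaping processRecord is retried according to the ComputationPolicy, and once retries are exhausted with continueOnFailure=false the whole processor terminates.

      import org.nuxeo.lib.stream.computation.AbstractComputation;
      import org.nuxeo.lib.stream.computation.ComputationContext;
      import org.nuxeo.lib.stream.computation.Record;

      // Hypothetical computation named "default" with one input stream and no output stream.
      public class MyComputation extends AbstractComputation {

          public MyComputation() {
              super("default", 1, 0); // name, number of input streams, number of output streams
          }

          @Override
          public void processRecord(ComputationContext context, String inputStreamName, Record record) {
              if (cannotHandle(record)) {
                  // A systematic failure: this record fails again on every retry and on every node.
                  throw new IllegalStateException("Cannot process record: " + record);
              }
              // ... normal processing ...
              context.askForCheckpoint();
          }

          private boolean cannotHandle(Record record) {
              return false; // placeholder for whatever makes one specific record unprocessable
          }
      }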

      Today the possible solutions are:
      1. Fix the computation code so it can handle or skip the record without error.
      2. Play with stream.sh position to change the computation position in the stream in order to skip the record.
      3. Add a contribution to change the computation policy to continueOnFailure=true so records in failure are skipped.

      Solution 1 is the best approach, but it requires a long service interruption to analyze, code, test, release, and restart all nodes.

      Solution 2 requires inspecting the log to find the offset of the record in failure, for instance:

      2019-06-06T11:40:38,655 ERROR [AbstractComputation] Computation: default fails last record: default-00:+12, after retries.
      

      This indicates that computation "default" fails to process the record on stream "default", partition "0", offset "12". It is possible to move the computation's position to the end of the partition or after a given timestamp (see ./bin/stream.sh help position). The problem is that this method may skip more than one record (by moving the position to the end or after a timestamp); also, if there are multiple consumers of the record, the operation has to be repeated for each failure, which is a bit complex.

      With solution 3 there is a risk of skipping failures other than the one analyzed. It also works only for stream processors contributed through an extension point; this is not the case for the StreamWorkManager (or for some computations of the Bulk Service), whose processors are created dynamically and whose policy can be changed only by patching code.
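      For those dynamically created processors, skip-on-failure behavior can only be introduced where the policy is built in code. A minimal sketch, assuming the ComputationPolicyBuilder API from nuxeo-lib-stream (the retry count and class name are illustrative):

      import org.nuxeo.lib.stream.computation.ComputationPolicy;
      import org.nuxeo.lib.stream.computation.ComputationPolicyBuilder;

      import net.jodah.failsafe.RetryPolicy;

      public class SkipOnFailurePolicy {

          // Retry a failing record a few times, then skip it (continueOnFailure = true)
          // instead of terminating the processor.
          public static ComputationPolicy newPolicy() {
              return new ComputationPolicyBuilder()
                      .retryPolicy(new RetryPolicy().withMaxRetries(3))
                      .continueOnFailure(true)
                      .build();
          }
      }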

      An easier way to bypass a problematic record should be provided; the procedure should be:

      • restart a Nuxeo node with a special option; this skips the single record in failure
      • remove the option and do a rolling restart on all Nuxeo nodes.
        This action should be done within the stream retention period (default is 4 days for Chronicle Queue and 7 days for Kafka).
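      As implemented (see the release notes above), the special option is a nuxeo.conf property set on one node only, for example:

      # nuxeo.conf on a single Nuxeo node: skip the first record still failing after retries
      nuxeo.stream.recovery.skipFirstFailures=1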


    People

    • Votes: 0
    • Watchers: 2


    Time Tracking

    • Estimated: 0m
    • Remaining: 0m
    • Logged: 1d