  1. Nuxeo Platform
  2. NXP-27148

Store Work in failure in DLQ for repair purpose


    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 9.10
    • Fix Version/s: 9.10-HF34, 10.10-HF10, 11.1
    • Component/s: Streams
    • Release Notes Description:

      After retries, Works in failure are stored in a dead letter queue (DLQ) stream named dlq-work.
      This DLQ is enabled by default on both WorkManager implementations (the default one and StreamWorkManager).
      Works in this DLQ can be re-executed for repair purposes using the following automation operation:

      curl -X POST "http://localhost:8080/nuxeo/site/automation/WorkManager.RunWorkInFailure" -u Administrator:Administrator -H 'content-type: application/json+nxrequest' -d '{"params":{},"context":{}}'
      

      This returns a JSON result with the total number of Works re-executed and the number that were successfully executed:

      {"total":3,"success":3}
      

      Note that in cluster mode, when NOT using Kafka, you need to run this automation operation on each Nuxeo node.
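
For a cluster without Kafka, the per-node call could be scripted. The following is a minimal Python sketch, not an official Nuxeo client: it builds the same POST request as the curl command above and sums the per-node JSON results. The function names are hypothetical, and the "failed" field is an assumption derived as total minus success.

```python
import base64
import json
import urllib.request

# Path taken from the curl command in the release notes.
OPERATION_PATH = "/nuxeo/site/automation/WorkManager.RunWorkInFailure"


def build_request(base_url, user, password):
    """Build the same POST request as the curl command, for one node."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        base_url.rstrip("/") + OPERATION_PATH,
        data=json.dumps({"params": {}, "context": {}}).encode(),
        headers={
            "Content-Type": "application/json+nxrequest",
            "Authorization": "Basic " + token,
        },
        method="POST",
    )


def run_on_node(base_url, user, password):
    """Call the operation on one node and return the decoded JSON result."""
    with urllib.request.urlopen(build_request(base_url, user, password)) as resp:
        return json.load(resp)


def summarize(results):
    """Sum per-node results such as {"total": 3, "success": 3}.

    "failed" is an assumption of this sketch: total minus success.
    """
    total = sum(r["total"] for r in results)
    success = sum(r["success"] for r in results)
    return {"total": total, "success": success, "failed": total - success}
```

Given a hypothetical list of node base URLs, `summarize([run_on_node(n, "Administrator", "Administrator") for n in nodes])` would report cluster-wide counts. A non-zero "failed" value suggests that the root cause has not been fixed on at least one node.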

    • Backlog priority:
      700
    • Sprint:
      nxplatform 11.1.11, nxplatform 11.1.12, nxplatform 11.1.13
    • Story Points:
      5

      Description

      A Work that is still in failure after retries is skipped, which can cause a consistency problem.
      For instance, an indexing Work that keeps failing after retries is skipped, resulting in a discrepancy between the documents in the repository and those that are indexed.

      A retry policy is not enough when the cause of the failure requires manual intervention: fixing a misconfiguration, restarting a service, freeing a full disk, re-deploying, etc.

      A possible solution is to have a Dead Letter Queue (DLQ) stream to store Work in failure.

      Exposing a metric that counts the Works in failure will also be useful for monitoring and alerting; see NXP-27673.

      The repair procedure consists of running a stream processor to re-execute the stored dead Works.
      This could be exposed through REST (an automation operation).
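
The retry, DLQ, and repair flow described above can be illustrated with a toy in-memory model. This Python sketch is not Nuxeo's implementation: the class, its method names, and the choice to put a Work back in the DLQ when it fails again during repair are all assumptions made for illustration.

```python
from collections import deque


class ToyWorkManager:
    """In-memory illustration only; not Nuxeo's WorkManager."""

    def __init__(self, max_retries=3):
        self.max_retries = max_retries
        self.dlq = deque()  # stands in for the dlq-work stream

    def _attempt(self, work):
        """Run a work up to max_retries + 1 times; return True on success."""
        for _ in range(self.max_retries + 1):
            try:
                work()
                return True
            except Exception:
                continue  # retry policy: simply try again
        return False

    def execute(self, work):
        """Execute a work; on final failure, store it in the DLQ instead of skipping it."""
        ok = self._attempt(work)
        if not ok:
            self.dlq.append(work)
        return ok

    def run_works_in_failure(self):
        """Re-execute every work currently in the DLQ and report the counts.

        Assumption of this toy model: a work that fails again goes back to the DLQ.
        """
        pending = list(self.dlq)
        self.dlq.clear()
        success = sum(1 for work in pending if self.execute(work))
        return {"total": len(pending), "success": success}
```

In this model, a Work that fails while an external service is down stays in the DLQ. Once the service is repaired manually, calling `run_works_in_failure()` replays it and returns counts shaped like the result of the automation operation.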

      Some Works possibly should not be retried; a filtering option may be added later.

      Note that without Kafka, the repair procedure needs to be executed on each node, because the DLQ is stored in local Chronicle Queue storage.

                  Time Tracking

                  Estimated: Not Specified
                  Remaining: 0m
                  Logged: 1w