[NXP-27148] Store Work in failure in DLQ for repair purpose - Nuxeo Issue Tracker

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 9.10
Fix Version/s: 9.10-HF34, 10.10-HF10, 11.1, 2021.0
Component/s: Streams

Release Notes Description:
After retries, Works that are still in failure are stored in a dead letter queue (DLQ) stream named dlq-work.
This DLQ is activated by default on both WorkManager implementations (default and StreamWorkManager).
Works in this DLQ can be re-executed for repair purposes using the following automation operation:

curl -X POST "http://localhost:8080/nuxeo/site/automation/WorkManager.RunWorkInFailure" -u Administrator:Administrator -H 'content-type: application/json+nxrequest' -d '{"params":{},"context":{}}'

This returns a JSON result with the total number of Works re-executed and the number that were successfully executed:

{"total":3,"success":3}

Note that in cluster mode, when NOT using Kafka, you need to run this automation operation on each Nuxeo node.
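
For a cluster without Kafka, the per-node requirement can be scripted. The following is a minimal sketch, not an official tool: it reuses the curl call shown above, and the node URLs, the NODES, NUXEO_USER and NUXEO_PASSWORD variables, and the helper names json_int and run_dlq_repair are assumptions made for illustration. It also assumes the reply has the one-line {"total":N,"success":N} shape shown above.

```shell
#!/bin/sh
# Sketch: run the DLQ repair operation on every node of a cluster without Kafka.
# NODES, NUXEO_USER and NUXEO_PASSWORD are hypothetical variables for this example.
NODES="${NODES:-http://node1:8080 http://node2:8080}"
NUXEO_USER="${NUXEO_USER:-Administrator}"
NUXEO_PASSWORD="${NUXEO_PASSWORD:-Administrator}"

# Extract an integer field such as "total" from the one-line JSON result.
# Usage: json_int FIELD JSON
json_int() {
  printf '%s' "$2" | sed -n "s/.*\"$1\" *: *\([0-9][0-9]*\).*/\1/p"
}

# Call WorkManager.RunWorkInFailure on each node and print one summary line per node.
# Returns non-zero if a request fails, a reply cannot be parsed,
# or some Works were not successfully re-executed.
run_dlq_repair() {
  status=0
  for node in $NODES; do
    result=$(curl -sS -X POST "$node/nuxeo/site/automation/WorkManager.RunWorkInFailure" \
      -u "$NUXEO_USER:$NUXEO_PASSWORD" \
      -H 'content-type: application/json+nxrequest' \
      -d '{"params":{},"context":{}}') || { echo "$node: request failed"; status=1; continue; }
    total=$(json_int total "$result")
    success=$(json_int success "$result")
    echo "$node: total=$total success=$success"
    # Flag nodes where total and success differ or the reply was not understood.
    [ -n "$total" ] && [ "$total" = "$success" ] || status=1
  done
  return $status
}
```

After setting NODES to the actual node URLs, calling run_dlq_repair prints one line per node. A non-zero exit status suggests that at least one node still has Works that could not be repaired.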
Epic Link:
Resiliency
Tags:
Backlog priority:
700
Sprint:
nxplatform 11.1.11, nxplatform 11.1.12, nxplatform 11.1.13
Story Points:
5

Description

A Work that is still in failure after retries is skipped, which can lead to a consistency problem.
For instance, an indexing Work that keeps failing after retries is skipped, resulting in a discrepancy between the documents in the repository and those that are indexed.

When the cause of the failure requires manual intervention (fixing a misconfiguration, restarting a service, freeing a full disk, re-deploying ...),
a retry policy is not enough.

A possible solution is to have a Dead Letter Queue (DLQ) stream to store Work in failure.

Exposing a metric that counts the Works in failure would also be useful for monitoring and alerting. See ~~NXP-27673~~.

The repair procedure consists of running a stream processor to re-execute the stored dead Works;
this could be exposed through REST (an automation operation).

It is possible that some Works should not be retried; we may add a filtering option later.

Note that without Kafka, the repair procedure needs to be executed on each node, because the DLQ is stored in local Chronicle Queue storage.

Attachments

Issue Links

is related to

NXP-27674 Fix Stream processor drainAndStop timeout

Resolved

NXP-30450 Support filtering when reprocessing DLQ Work

Resolved

NXP-27687 Add a DLQ fallback to computation policy

Open

NXP-27672 Add a management API to repair Works in failure

Resolved

NXP-27673 Add metrics on Works DLQ usage

Resolved

NXDOC-2295 Write documentation on DLQ

Resolved

NXDOC-1936 Add a Nuxeo Stream section about error handling

Resolved

NXP-26033 Improve Work retry policy

Open

Activity

People

Assignee:

Benoit Delbosc

Reporter:

Benoit Delbosc

Participants:

Benoit Delbosc, Gethin James, Jenkins

Reviewers:

Florent Guillaume, Kevin Leturc

Votes:

1

Watchers:

6

Dates

Created:

2019-04-02 14:43

Updated:

2021-05-31 10:29

Resolved:

2019-07-03 10:07

Time Tracking

Estimated:

Not Specified

Remaining:

Logged: