- Type: Improvement
- Status: Resolved
- Priority: Minor
- Resolution: Fixed
- Affects Version/s: 9.10
- Fix Version/s: 9.10-HF34, 10.10-HF10, 11.1, 2021.0
- Component/s: Streams
- Release Notes Description:
- Epic Link:
- Tags:
- Backlog priority: 700
- Sprint: nxplatform 11.1.11, nxplatform 11.1.12, nxplatform 11.1.13
- Story Points: 5
A Work that still fails after all retries is skipped, which can lead to a consistency problem.
For instance, an indexing Work that fails after retries is skipped, resulting in a discrepancy between the documents in the repository and the ones that are indexed.
When the cause of the failure requires manual intervention (fixing a misconfiguration, restarting a service, freeing a full disk, re-deploying, ...), a retry policy is not enough.
A possible solution is to have a Dead Letter Queue (DLQ) stream to store Works that are in failure.
Exposing a metric that counts the Works in failure will also be useful for monitoring and alerting; see NXP-27673.
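The retry-then-DLQ behavior can be sketched as follows. This is a minimal, self-contained illustration, not the Nuxeo Stream API: the `Work` interface, the in-memory queue standing in for the DLQ stream, and the `failureCounter` metric are all hypothetical names chosen for the example.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.atomic.AtomicLong;

// Sketch: run a work item with bounded retries; once retries are exhausted,
// park the work in a dead letter queue instead of silently skipping it,
// and increment a failure counter usable for monitoring/alerting.
public class DeadLetterSketch {
    static final int MAX_RETRIES = 3;
    static final Queue<String> deadLetterQueue = new ArrayDeque<>();
    static final AtomicLong failureCounter = new AtomicLong();

    interface Work {
        void run(String id) throws Exception;
    }

    static void execute(String workId, Work work) {
        for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
            try {
                work.run(workId);
                return; // success: nothing to record
            } catch (Exception e) {
                if (attempt == MAX_RETRIES) {
                    // retries exhausted: divert to the DLQ instead of dropping
                    deadLetterQueue.add(workId);
                    failureCounter.incrementAndGet();
                }
            }
        }
    }

    public static void main(String[] args) {
        execute("index-doc-1", id -> { /* succeeds */ });
        execute("index-doc-2", id -> { throw new Exception("disk full"); });
        System.out.println(deadLetterQueue);      // [index-doc-2]
        System.out.println(failureCounter.get()); // 1
    }
}
```

The key design point is that a permanently failing Work leaves a durable trace (the DLQ entry) and a visible signal (the counter) rather than vanishing after the last retry.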
The repair procedure consists of running a stream processor that re-executes the stored dead Works; this could be exposed through REST (an automation operation).
Since some Works should possibly not be retried, a filtering option may be added later.
Note that without Kafka the repair procedure needs to be executed on each node, the DLQ being stored in local Chronicle Queue storage.
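The repair pass described above could look like the following sketch: drain the stored dead Works, re-execute the ones that match an optional filter, and put back anything that fails again. The method names and predicates are illustrative assumptions, not the actual management API (tracked in NXP-27672 and NXP-30450).

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.Predicate;

// Sketch of a DLQ repair pass with optional filtering. A work id that fails
// re-execution (or is filtered out) is returned to the queue for a later pass.
public class DlqRepairSketch {
    static final Queue<String> deadLetterQueue = new ArrayDeque<>();

    // filter: which dead works to attempt (some may not be safe to retry)
    // reexecute: returns true if the re-executed work succeeded
    static int repair(Predicate<String> filter, Predicate<String> reexecute) {
        int repaired = 0;
        for (int n = deadLetterQueue.size(); n > 0; n--) {
            String id = deadLetterQueue.poll();
            if (!filter.test(id)) {
                deadLetterQueue.add(id); // excluded from this pass, keep it
                continue;
            }
            if (reexecute.test(id)) {
                repaired++;              // success: entry leaves the DLQ
            } else {
                deadLetterQueue.add(id); // still failing: keep for next pass
            }
        }
        return repaired;
    }

    public static void main(String[] args) {
        deadLetterQueue.add("index-doc-2");
        deadLetterQueue.add("fulltext-doc-7");
        int repaired = repair(id -> id.startsWith("index-"), id -> true);
        System.out.println(repaired);        // 1
        System.out.println(deadLetterQueue); // [fulltext-doc-7]
    }
}
```

With Kafka the DLQ is a shared topic, so one such pass covers the cluster; with Chronicle Queue the storage is per-node, which is why the pass must run on each node.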
- is related to:
  - NXP-27674 Fix Stream processor drainAndStop timeout (Resolved)
  - NXP-30450 Support filtering when reprocessing DLQ Work (Resolved)
  - NXP-27687 Add a DLQ fallback to computation policy (Open)
  - NXP-27672 Add a management API to repair Works in failure (Resolved)
  - NXP-27673 Add metrics on Works DLQ usage (Resolved)
  - NXDOC-2295 Write documentation on DLQ (Resolved)
  - NXDOC-1936 Add a Nuxeo Stream section about error handling (Resolved)
  - NXP-26033 Improve Work retry policy (Open)