[NXP-32166] Avoid retries on scroll that has already downstream records - Nuxeo Issue Tracker

Details

Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2021.50, 2023.8
Component/s: Streams

Release Notes Summary:
Scrolls with downstream records are not retried anymore.
Tags:
- nxplatform
- platform-review
Team:
PLATFORM
Sprint:
nxplatform #106, nxplatform #107
Story Points:
2

Description

The observed behavior is the following:

A bulk command is submitted
The scroll executes the NXQL query; when the threshold of 1m doc ids is reached, it flushes the records downstream with a warning:
```
Scroller records threshold reached (1000000) for action: {} on command: {}, flushing records downstreams
```
The scroll continues while downstream records are processed by the bulk action
The scroller node is shut down before the scroll completes.
Kafka rebalances the bulk command partition, and the bulk command is assigned to another scroll computation
The new scroll eventually completes

The result is that:

We have duplicate processing for at least 1m docs
The bulk command is marked as completed too early because the completion is based on the number of processed docs which contains duplicates
We get a large number of warnings from the BBC (big bulk command > 50k items) detector:
```
BBC: 32320547-472a-4d44-8235-d321dbf5ac42 command completed:....
```
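
To make the early completion concrete, here is a small worked example. It assumes that the completion rule is simply "processed counter has reached the total", which is a simplification of the description above; the class name, method names and the 3m total are hypothetical and chosen only for illustration.

```java
/** Hypothetical illustration of premature completion caused by duplicate records. */
class DuplicateCompletionDemo {

    /** Assumed completion rule: done once the processed counter reaches the total. */
    static boolean isCompleted(long processed, long total) {
        return processed >= total;
    }

    /** Distinct docs processed if every duplicated record was counted twice (worst case). */
    static long distinctProcessed(long processed, long duplicates) {
        return processed - duplicates;
    }

    public static void main(String[] args) {
        long total = 3_000_000L;      // docs matched by the query (made-up figure)
        long duplicates = 1_000_000L; // records flushed by the first scroll, then produced again
        long processed = 3_000_000L;  // counter value, duplicates included

        System.out.println("completed=" + isCompleted(processed, total));
        System.out.println("distinct=" + distinctProcessed(processed, duplicates));
    }
}
```

In this scenario the command is reported as completed while up to 1m documents may still be waiting to be processed.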

Another case resulting in the same behavior is when the scroll takes longer than the Kafka poll interval: Kafka then assigns the bulk command to another scroller computation, resulting in multiple scrolls of the same command running concurrently.

We want to avoid these cases: it is better to abort the bulk command than to attempt this kind of retry.
This can be done by checking, before starting the scroll, whether a bulk status already exists for the bulk command. If it does and it reports processed documents, we should log a warning and abort the command.
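
The check described above could be sketched as follows. This is a minimal illustration, not the actual fix: `ScrollGuard`, `Status` and `Decision` are hypothetical stand-ins and do not correspond to the Nuxeo `BulkStatus` class or to any real Nuxeo API.

```java
import java.util.Optional;

/** Hypothetical sketch: decide whether a scroll may start for a bulk command. */
class ScrollGuard {

    /** Outcome of the pre-scroll check. */
    enum Decision { START_SCROLL, ABORT }

    /** Minimal stand-in for an existing bulk status (not the Nuxeo class). */
    static final class Status {
        final String commandId;
        final long processed;

        Status(String commandId, long processed) {
            this.commandId = commandId;
            this.processed = processed;
        }
    }

    /** Abort when a status already exists and reports processed documents. */
    static Decision decide(Optional<Status> existing) {
        if (existing.isPresent() && existing.get().processed > 0) {
            Status status = existing.get();
            // Warn instead of retrying: downstream records were already produced.
            System.err.println("WARN command " + status.commandId + " already has "
                    + status.processed + " processed docs, aborting instead of re-scrolling");
            return Decision.ABORT;
        }
        // No status yet, or nothing processed: scrolling cannot create duplicates.
        return Decision.START_SCROLL;
    }

    public static void main(String[] args) {
        System.out.println(decide(Optional.empty()));
        System.out.println(decide(Optional.of(new Status("cmd-1", 0L))));
        System.out.println(decide(Optional.of(new Status("cmd-1", 1_000_000L))));
    }
}
```

The sketch treats a status with zero processed documents as safe to scroll, which matches the condition "there are processed documents" in the description.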

Note that a query timeout during the scroll is already handled: it results in an invalid query, without retry.
The flush threshold is configured by nuxeo.core.bulk.scroller.produceImmediateThreshold, which defaults to 1m.

Issue Links

is related to

NXP-32348 Add a flushed flag to the BulkStatus avro schema

Open

People

Assignee:

Nour Al Kotob

Reporter:

Benoit Delbosc

Participants:

Benoit Delbosc, Jenkins, Nour Al Kotob

Votes:

0

Watchers:

2

Dates

Created:

2023-11-15 15:09

Updated:

2024-02-28 13:50

Resolved:

2024-02-23 08:20