- Type: Bug
- Status: Resolved
- Priority: Blocker
- Resolution: Fixed
- Affects Version/s: None
- Component/s: Streams
- Release Notes Summary: Scrolls with downstream records are not retried anymore.
- Tags:
- Team: PLATFORM
- Sprint: nxplatform #106, nxplatform #107
- Story Points: 2
The observed behavior is the following:
- A bulk command is submitted
- While the scroll is executing the NXQL query, the threshold of 1m doc ids is reached and the scroller flushes the records downstream with a warning:
Scroller records threshold reached (1000000) for action: {} on command: {}, flushing records downstreams
- The scroll continues while downstream records are processed by the bulk action
- The scroller node is shut down before the scroll has completed
- Kafka rebalances the bulk command partition and the bulk command is assigned to another scroll computation, which restarts the scroll
- The new scroll eventually completes
The result is that:
- We have duplicate processing for at least 1m docs
- The bulk command is marked as completed too early because the completion is based on the number of processed docs which contains duplicates
- We get a large number of warnings from the BBC (big bulk command > 50k items) detector:
BBC: 32320547-472a-4d44-8235-d321dbf5ac42 command completed:....
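The premature completion can be illustrated with a toy model. This is only a sketch under stated assumptions, not the actual Nuxeo code: the class and method names are hypothetical, document ids are plain integers, the restarted scroll is assumed to re-emit ids in the same order as the first one, and completion is modeled as "processed count >= total".

```java
import java.util.HashSet;
import java.util.Set;

// Toy model (hypothetical, not Nuxeo code) of a completion check that
// counts processed records, duplicates included.
public class PrematureCompletion {

    /**
     * Returns the number of distinct documents actually processed at the
     * moment the command is marked as completed.
     *
     * @param total   number of documents matched by the query
     * @param flushed number of ids flushed downstream by the first scroll
     *                before its node was shut down
     */
    static int distinctWhenCompleted(int total, int flushed) {
        Set<Integer> distinct = new HashSet<>();
        long processed = 0;

        // Records flushed by the first, interrupted scroll.
        for (int id = 0; id < flushed; id++) {
            distinct.add(id);
            processed++;
        }

        // The restarted scroll emits every id again, from the beginning.
        for (int id = 0; id < total; id++) {
            distinct.add(id);
            processed++;
            if (processed >= total) {
                // Completion is decided on the raw counter.
                return distinct.size();
            }
        }
        return distinct.size();
    }

    public static void main(String[] args) {
        // 10 documents, 3 of them flushed before the restart.
        System.out.println(distinctWhenCompleted(10, 3) + " of 10 distinct docs processed at completion");
    }
}
```

In this model the counter reaches the total while the last flushed-sized slice of documents is still pending, which matches the "completed too early" symptom described above.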
Another case resulting in the same behavior is when the scroll takes longer than the Kafka poll interval: Kafka then assigns the bulk command to another scroller computation, resulting in multiple scrolls of the same command running concurrently.
We want to avoid these cases. It is better to abort the bulk command than to attempt this kind of retry.
This can be done by checking, before starting the scroll, whether a bulk status already exists for the bulk command. If it does and it reports processed documents, we should log a warning and abort the command.
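A minimal sketch of that pre-scroll check follows. It is not the actual fix: `ScrollGuard`, `Decision`, and the map standing in for the bulk status store are hypothetical names introduced for illustration, and the status is reduced to a processed-documents counter.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical guard evaluated by a scroll computation before it starts
// scrolling. The map stands in for the stored bulk status and holds the
// number of processed documents per command id.
public class ScrollGuard {

    enum Decision { START_SCROLL, ABORT }

    static Decision beforeScroll(Map<String, Long> processedByCommand, String commandId) {
        Long processed = processedByCommand.get(commandId);
        if (processed != null && processed > 0) {
            // A previous scroll already pushed records downstream:
            // retrying would produce duplicates, so warn and abort.
            System.err.println("WARN: command " + commandId + " already has "
                    + processed + " processed documents, aborting instead of retrying the scroll");
            return Decision.ABORT;
        }
        // No status yet, or nothing processed: scrolling is safe.
        return Decision.START_SCROLL;
    }

    public static void main(String[] args) {
        Map<String, Long> store = new HashMap<>();
        System.out.println(beforeScroll(store, "cmd-1")); // no status yet
        store.put("cmd-1", 1_000_000L);
        System.out.println(beforeScroll(store, "cmd-1")); // records already processed
    }
}
```

A status that exists with zero processed documents still allows the scroll to start in this sketch, following the "and there are processed documents" condition above.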
Note that the case of a query timeout during the scroll is already handled: it results in an invalid query, without retry.
The flush threshold is configured by nuxeo.core.bulk.scroller.produceImmediateThreshold, which is 1m by default.
- is related to: NXP-32348 Add a flushed flag to the BulkStatus avro schema (Open)