  Nuxeo Platform / NXP-31080

Avoid Record overflow during bulk indexing of huge fulltext

    Details

    • Release Notes Summary:
      Record overflow is avoided during bulk indexing of huge fulltext
    • Team:
      PLATFORM
    • Sprint:
      nxplatform #65
    • Story Points:
      3

      Description

      The bulk indexing materializes the query into a stream. The Elasticsearch indexing request contains the JSON representation of the document, including the fulltext field; if this exceeds the max record size (1MB), there is a record overflow (a minimal sketch of this size check follows the log excerpt below).

      WARN Indexing request for doc: 0ae55b22-df2d-46a4-86b5-da86b695e66f, is too large: 1503580, max record size: 900000
      
      // then ERROR
      bulk/index: Error during checkpoint, processing will be duplicated: bulk/index: CHECKPOINT FAILURE: Resuming with possible duplicate processing.
      
      "org.nuxeo.lib.stream.computation.log.ComputationRunner$CheckPointException","cause":{"commonElementCount":10,"localizedMessage":"Unable to send record: ProducerRecord(topic=nuxeo-bulk-bulkIndex, partition=6, headers=RecordHeaders(headers = [], isReadOnly = true), key=84b52c6c-4049-40d8-a0bf-5855bd2edcbe:5094-6, value=\\xC3\\x01\\x98\\xD4\\xE8s\\x
      ....
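
      For illustration, a minimal sketch of the kind of size check behind the WARN above, assuming the check is done on the serialized JSON of the Elasticsearch indexing request; the class and method names are hypothetical, only the sizes come from the log:

      import java.nio.charset.StandardCharsets;

      // Hypothetical illustration, not the actual Nuxeo code.
      class RecordSizeCheck {
          // Value reported by the WARN above, below the 1MB limit of a stream record.
          static final int MAX_RECORD_SIZE = 900_000;

          /** True when the serialized indexing request fits in a stream record. */
          static boolean fits(String esRequestJson) {
              // For the document in the log: 1503580 > 900000, hence the overflow.
              return esRequestJson.getBytes(StandardCharsets.UTF_8).length <= MAX_RECORD_SIZE;
          }
      }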
      

      Obviously the retry mechanism doesn't help: a retry re-serializes the same oversized record.

      There must be an overflow filter by default to avoid this, like for the csvExport (see NXP-30796).
      Also, we should not dump the entire record in case of error; this messes up the DD logs.
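
      A rough sketch of what such a default overflow filter could look like; the class and method names are hypothetical, and the fallback strategy (re-emitting the request without the fulltext field) is an assumption, not necessarily what the csvExport does in NXP-30796:

      import java.nio.charset.StandardCharsets;

      // Hypothetical sketch, not the actual Nuxeo API.
      public class IndexRequestOverflowFilter {

          protected final int maxRecordSizeBytes;

          public IndexRequestOverflowFilter(int maxRecordSizeBytes) {
              this.maxRecordSizeBytes = maxRecordSizeBytes;
          }

          /**
           * Returns a request that fits in a stream record: the original JSON when it is
           * small enough, otherwise a caller-provided fallback without the fulltext field.
           */
          public String filter(String docId, String json, String jsonWithoutFulltext) {
              int size = json.getBytes(StandardCharsets.UTF_8).length;
              if (size <= maxRecordSizeBytes) {
                  return json;
              }
              // Log only ids and sizes, never the record payload, to keep the DD logs readable.
              System.err.printf("Dropping fulltext for doc %s: request size %d > max record size %d%n",
                      docId, size, maxRecordSizeBytes);
              return jsonWithoutFulltext;
          }
      }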
