  Nuxeo Platform / NXP-31080

Avoid Record overflow during bulk indexing of huge fulltext

    Details

    • Release Notes Summary:
      Record overflow is avoided during bulk indexing of huge fulltext
    • Team:
      PLATFORM
    • Sprint:
      nxplatform #65
    • Story Points:
      3

      Description

      The bulk indexing materializes the query into a stream. The Elasticsearch indexing request contains the JSON representation of the document, including the fulltext field; if this exceeds the max record size (1MB), there is a record overflow (a minimal sketch of this size check follows the log excerpt below).

      WARN Indexing request for doc: 0ae55b22-df2d-46a4-86b5-da86b695e66f, is too large: 1503580, max record size: 900000
      
      // then ERROR
      bulk/index: Error during checkpoint, processing will be duplicated: bulk/index: CHECKPOINT FAILURE: Resuming with possible duplicate processing.
      
      "org.nuxeo.lib.stream.computation.log.ComputationRunner$CheckPointException","cause":{"commonElementCount":10,"localizedMessage":"Unable to send record: ProducerRecord(topic=nuxeo-bulk-bulkIndex, partition=6, headers=RecordHeaders(headers = [], isReadOnly = true), key=84b52c6c-4049-40d8-a0bf-5855bd2edcbe:5094-6, value=\\xC3\\x01\\x98\\xD4\\xE8s\\x
      ....
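
      For illustration, a minimal sketch of the kind of size check behind the WARN above, assuming the check is done on the serialized JSON of the Elasticsearch indexing request; the class and method names are hypothetical, only the sizes come from the log:

      import java.nio.charset.StandardCharsets;

      // Hypothetical illustration, not the actual Nuxeo code.
      class RecordSizeCheck {
          // Value reported by the WARN above, below the 1MB limit of a stream record.
          static final int MAX_RECORD_SIZE = 900_000;

          /** True when the serialized indexing request fits in a stream record. */
          static boolean fits(String esRequestJson) {
              // For the document in the log: 1503580 > 900000, hence the overflow.
              return esRequestJson.getBytes(StandardCharsets.UTF_8).length <= MAX_RECORD_SIZE;
          }
      }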
      

      Obviously the retry mechanism doesn't help: a retry re-serializes the same oversized record.

      There must be an overflow filter by default to avoid this, like for the csvExport (see NXP-30796).
      Also, we should not dump the entire record in case of error; this messes up the DD logs.
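
      A rough sketch of what such a default overflow filter could look like; the class and method names are hypothetical, and the fallback strategy (re-emitting the request without the fulltext field) is an assumption, not necessarily what the csvExport does in NXP-30796:

      import java.nio.charset.StandardCharsets;

      // Hypothetical sketch, not the actual Nuxeo API.
      public class IndexRequestOverflowFilter {

          protected final int maxRecordSizeBytes;

          public IndexRequestOverflowFilter(int maxRecordSizeBytes) {
              this.maxRecordSizeBytes = maxRecordSizeBytes;
          }

          /**
           * Returns a request that fits in a stream record: the original JSON when it is
           * small enough, otherwise a caller-provided fallback without the fulltext field.
           */
          public String filter(String docId, String json, String jsonWithoutFulltext) {
              int size = json.getBytes(StandardCharsets.UTF_8).length;
              if (size <= maxRecordSizeBytes) {
                  return json;
              }
              // Log only ids and sizes, never the record payload, to keep the DD logs readable.
              System.err.printf("Dropping fulltext for doc %s: request size %d > max record size %d%n",
                      docId, size, maxRecordSizeBytes);
              return jsonWithoutFulltext;
          }
      }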
