Nuxeo Platform / NXP-14416

Text extractors for MSOffice XML format can be very expensive

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 5.6.0-HF34, 5.8.0-HF12, 5.9.3
    • Fix Version/s: 5.6.0-HF35, 5.8.0-HF13, 5.9.4
    • Component/s: Convert

      Description

      When MSOffice files are stored inside the repository, the FullText extractor uses Converters to extract the text content.

      For the XML-based MSOffice formats (xlsx, pptx, docx), these converters (XLX2TextConverter, DOCX2TextConverter ...) rely on Apache POI.

      Unfortunately, when files are huge (for example an Excel file with 400 000 lines), POI appears to struggle with the work:

      • a lot of heap allocation, hence a lot of GC
      • long processing with heavy CPU usage
      • potentially ends in "GC overhead limit exceeded"

      NB: for the exact same reasons, these files cannot be loaded in OpenOffice either!

      POI actually does much more than a simple text extraction, which may explain the overhead when handling big files.

      As a workaround, we can have a fallback for big files and use a simple SAX Parser that will extract the text nodes.

      The fulltext extraction may be slightly less precise than what POI provides, but it will be far faster.
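
The fallback idea can be illustrated with a minimal sketch. This is not the code attached to this issue: the class name SaxTextFallback and its helpers are hypothetical, and it assumes only that the OOXML file is a zip archive of XML parts. It streams each XML part through a SAX parser and keeps only the text nodes, so no document model is built in memory.

```java
import java.io.FilterInputStream;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration of a SAX-based text fallback for OOXML files.
public class SaxTextFallback {

    /** Collects text nodes, separated by single spaces. */
    static class TextHandler extends DefaultHandler {
        final StringBuilder text = new StringBuilder();
        private final StringBuilder current = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            // A single text node may be delivered in several chunks.
            current.append(ch, start, length);
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            flush();
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            flush();
        }

        private void flush() {
            String chunk = current.toString().trim();
            if (!chunk.isEmpty()) {
                text.append(chunk).append(' ');
            }
            current.setLength(0);
        }
    }

    /** Prevents the XML parser from closing the underlying zip stream. */
    static class NonClosing extends FilterInputStream {
        NonClosing(InputStream in) {
            super(in);
        }

        @Override
        public void close() {
            // Intentionally empty: the caller closes the zip stream.
        }
    }

    /** Streams every .xml entry of the zip package through a SAX parser. */
    public static String extractText(InputStream ooxml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        TextHandler handler = new TextHandler();
        try (ZipInputStream zip = new ZipInputStream(ooxml)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.getName().endsWith(".xml")) {
                    factory.newSAXParser().parse(new NonClosing(zip), handler);
                }
            }
        }
        return handler.text.toString().trim();
    }
}
```

Because the sketch keeps every text node of every part, an xlsx file will also yield the numeric cell values and shared-string indexes stored in the sheet parts. That is one way the result can be less precise than a POI extraction.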

      Example for an Excel file that contains more than 100 000 lines:

      POI extraction:

      text len=7312201
      processing time=34s

      SAX extraction:

      text len=6498229
      processing time=2s
      
