[NXP-14416] Text extractors for MSOffice XML format can be very expensive - Nuxeo Issue Tracker

When MSOffice files are stored in the repository, the fulltext extractor uses Converters to extract their text content.

For the XML-based MSOffice formats (xlsx, pptx, docx), these converters (XLX2TextConverter, DOCX2TextConverter ...) rely on Apache POI.

Unfortunately, when files are huge (for example an Excel file with 400 000 lines), POI appears to struggle to complete the extraction.

NB: for the exact same reasons, you cannot load these files in OpenOffice!

POI actually does much more than simple text extraction, which may explain the overhead when handling big files.

As a workaround, we can add a fallback for big files that uses a simple SAX parser to extract the text nodes.

The fulltext extraction may be slightly less precise than what POI provides, but it will be far faster.
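
A minimal sketch of what such a SAX fallback could look like, using only the JDK. This is not the code attached to this issue: the class name `SaxTextExtractor` and the method `extractText` are hypothetical, and the sketch assumes that collecting the character data of every `.xml` entry in the OOXML zip is good enough. That is cruder than POI (for instance, raw cell values and document properties are included as-is), which is one way the result could end up less precise.

```java
import java.io.FilterInputStream;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, for illustration only.
public class SaxTextExtractor {

    /** Appends every text node to a shared buffer, one space between elements. */
    static class TextHandler extends DefaultHandler {
        private final StringBuilder out;

        TextHandler(StringBuilder out) {
            this.out = out;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            out.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            // Keep the text of adjacent elements from running together.
            if (out.length() > 0 && out.charAt(out.length() - 1) != ' ') {
                out.append(' ');
            }
        }
    }

    /** Streams through an OOXML file (a zip) and returns the text of its XML parts. */
    public static String extractText(InputStream ooxml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);

        StringBuilder text = new StringBuilder();
        try (ZipInputStream zip = new ZipInputStream(ooxml)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                // Assumption: only the .xml parts carry text worth indexing.
                if (entry.isDirectory() || !entry.getName().endsWith(".xml")) {
                    continue;
                }
                // The parser may close its input; shield the zip stream from that.
                InputStream shielded = new FilterInputStream(zip) {
                    @Override
                    public void close() {
                        // Intentionally left open for the next entry.
                    }
                };
                SAXParser parser = factory.newSAXParser();
                parser.parse(shielded, new TextHandler(text));
            }
        }
        return text.toString().trim();
    }
}
```

Calling `SaxTextExtractor.extractText(inputStream)` on an xlsx, docx or pptx stream returns one string with all text nodes. Because the file is read entry by entry and no document model is built, memory use stays roughly proportional to the extracted text rather than to the full workbook structure.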

Example for an Excel file that contains more than 100 000 lines:

POI extraction:

text len=7312201
processing time=34s

SAX extraction:

text len=6498229
processing time=2s

is related to

NXP-30294 Possible OOM on XLS fulltext extraction with POI