When MSOffice files are stored inside the repository, the FullText extractor uses Converters to extract the text content.
In the case of XML-based MSOffice formats (xlsx, pptx, docx), these converters (XLX2TextConverter, DOCX2TextConverter ...) rely on Apache POI.
Unfortunately, when files are huge (like an Excel file with 400 000 lines), POI appears to struggle with the work:
- a lot of heap allocation => a lot of GC
- long processing with high CPU usage
- potentially ends in "GC overhead limit exceeded"
NB: for the exact same reasons, you cannot load these files in OpenOffice!
POI actually does much more than simple text extraction, which may explain the overhead when handling big files.
As a workaround, we can add a fallback for big files that uses a simple SAX parser to extract the text nodes.
The fulltext extraction may be slightly less precise than what POI provides, but it will be far faster.
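To illustrate the fallback idea, here is a minimal sketch of a SAX handler that streams through one XML part and keeps only the text nodes. The class name SaxTextExtractor and the method extractText are hypothetical and not part of the existing converters. The sketch also assumes the caller has already opened the relevant XML entry from the Office zip archive (for example the shared strings part of an xlsx file).

```java
import java.io.IOException;
import java.io.InputStream;
import javax.xml.XMLConstants;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: streams an XML part and collects its text nodes.
class SaxTextExtractor {

    // Accumulates character data and flushes it at every element boundary.
    static class TextHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private final StringBuilder current = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            // SAX may deliver one text node in several chunks, so buffer them.
            current.append(ch, start, length);
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            flush();
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            flush();
        }

        // Appends the buffered text node, separated by a single space.
        private void flush() {
            String chunk = current.toString().trim();
            current.setLength(0);
            if (!chunk.isEmpty()) {
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(chunk);
            }
        }

        String getText() {
            return text.toString();
        }
    }

    // Checked exceptions are wrapped only to keep the sketch short.
    static String extractText(InputStream xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            SAXParser parser = factory.newSAXParser();
            TextHandler handler = new TextHandler();
            parser.parse(xml, handler);
            return handler.getText();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Text extraction failed", e);
        }
    }
}
```

Because the handler only holds the extracted text and the current text node, memory use grows with the amount of text rather than with a full in-memory document model. A real fallback would still need to decide which zip entries to parse and from what file size to switch away from POI.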
Example for an Excel file that contains more than 100 000 lines:
POI Extraction :