The current implementation of the any2text transform is a real performance bottleneck. It tries to convert any blob to PDF over a socket connection to OpenOffice and then uses PDFBox to extract the text. It is not rare for the whole process to take 25s just to extract a few lines of text for the search engine.
We should use dedicated extractors instead. For ODF and OpenOffice.org v1 files, unzipping the file and reading the text content of content.xml would be much faster. This is precisely what the Jackrabbit text extractor does:
http://jackrabbit.apache.org/api/1.3/org/apache/jackrabbit/extractor/OpenOfficeTextExtractor.html
We should also reuse as many of the other Jackrabbit extractors as possible:
http://jackrabbit.apache.org/api/1.3/org/apache/jackrabbit/extractor/
wrapping them as *2text transforms and using them instead of any2text.
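To illustrate the unzip-and-read approach described above, here is a minimal sketch using only the JDK (java.util.zip and SAX). It is not the Jackrabbit implementation, and the class name OdfTextSketch and method extractText are hypothetical names chosen for this example. It assumes the document text lives in the content.xml entry of the archive, and it resolves external entities to an empty source so that the DTD referenced by OpenOffice.org v1 files is never fetched. Wrapping it as a *2text transform is left out.

```java
import java.io.InputStream;
import java.io.StringReader;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Jackrabbit or of the existing transform API.
class OdfTextSketch {

    /** Returns the character data of content.xml, or "" if the archive has no such entry. */
    static String extractText(InputStream odfStream) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(odfStream)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("content.xml".equals(entry.getName())) {
                    // The zip stream is now positioned on the entry's data.
                    return parseText(zip);
                }
            }
        }
        return "";
    }

    private static String parseText(InputStream xml) throws Exception {
        final StringBuilder text = new StringBuilder();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.newSAXParser().parse(xml, new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                // Assumption: text:p and text:h elements mark paragraph boundaries.
                if ("p".equals(localName) || "h".equals(localName)) {
                    text.append('\n');
                }
            }

            @Override
            public InputSource resolveEntity(String publicId, String systemId) {
                // Never load external DTDs such as office.dtd.
                return new InputSource(new StringReader(""));
            }
        });
        return text.toString().trim();
    }
}
```

Because nothing here talks to an OpenOffice process or goes through PDF, the cost is a single pass over the zip stream and one SAX parse.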
1. Add Text transformer plugins based on JackRabbit | Resolved | Thierry Delprat
2. Include Text Transformers based on Aperture | Closed | Thierry Delprat
3. Refactor Transformer API | Resolved | Thierry Delprat
4. Refactor blob extractor used by full text indexing | Resolved | Thierry Delprat
5. Fix XML transformer | Resolved | Thierry Delprat