The current implementation of the any2text transform is a real performance bottleneck. It tries to convert any blob to PDF over a socket connection to OpenOffice and then uses PDFBox to extract the text. It is not rare for the whole process to take 25 s just to extract a few lines of text for the search engine.
We should use dedicated extractors instead. For ODF and OpenOffice.org v1 files, unzipping the file and reading the text content of content.xml would be much faster. This is precisely what the Jackrabbit text extractor does:
http://jackrabbit.apache.org/api/1.3/org/apache/jackrabbit/extractor/OpenOfficeTextExtractor.html
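To illustrate the unzip-and-read approach, here is a minimal sketch using only the JDK. It is not the Jackrabbit implementation. The class name OdfTextSketch and the method extractText are hypothetical, and the sketch assumes that a newline after each paragraph or heading element is good enough for indexing:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper for illustration; not part of the existing code base. */
public class OdfTextSketch {

    /** Returns the character data of content.xml, or "" if the entry is missing. */
    public static String extractText(InputStream odf)
            throws IOException, SAXException, ParserConfigurationException {
        try (ZipInputStream zip = new ZipInputStream(odf)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("content.xml".equals(entry.getName())) {
                    // The zip stream is now positioned on the entry's data.
                    return parse(zip);
                }
            }
        }
        return "";
    }

    private static String parse(InputStream xml)
            throws IOException, SAXException, ParserConfigurationException {
        final StringBuilder text = new StringBuilder();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.newSAXParser().parse(xml, new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                // Assumption: break lines after paragraphs (text:p) and headings (text:h).
                if ("p".equals(localName) || "h".equals(localName)) {
                    text.append('\n');
                }
            }

            @Override
            public InputSource resolveEntity(String publicId, String systemId) {
                // OpenOffice.org v1 files declare a DTD; resolve it to an empty
                // document so the parser never tries to fetch it.
                return new InputSource(new StringReader(""));
            }
        });
        return text.toString();
    }
}
```

No OpenOffice process and no PDF round trip are involved, so the cost is one zip scan plus one SAX pass over content.xml.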
We should also reuse as many of the other Jackrabbit extractors as possible:
http://jackrabbit.apache.org/api/1.3/org/apache/jackrabbit/extractor/
They should be wrapped as *2text transforms and used instead of any2text.
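One possible way to route blobs to dedicated transforms is a lookup by MIME type, sketched below. Every name here (TextTransformRegistry, TextTransform, register, toText) is hypothetical and does not reflect the actual transform API. Keeping a generic fallback for unregistered types is also an assumption, not something this ticket requires:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical registry for illustration; not the real transform service. */
public class TextTransformRegistry {

    /** Hypothetical stand-in for a *2text transform. */
    public interface TextTransform {
        String toText(InputStream blob) throws IOException;
    }

    private final Map<String, TextTransform> byMimeType = new HashMap<>();
    private final TextTransform fallback;

    /** The fallback (for example the generic any2text) handles unregistered types. */
    public TextTransformRegistry(TextTransform fallback) {
        this.fallback = fallback;
    }

    public void register(String mimeType, TextTransform transform) {
        byMimeType.put(mimeType, transform);
    }

    /** Uses the dedicated transform when one is registered, else the fallback. */
    public String toText(String mimeType, InputStream blob) throws IOException {
        TextTransform transform = byMimeType.get(mimeType);
        return (transform != null ? transform : fallback).toText(blob);
    }
}
```

With this shape, each wrapped extractor is registered once under the MIME types it supports, and the slow generic path is only taken for formats that have no dedicated extractor.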