Nuxeo Platform / NXP-2067

Get rid of the any2text transform and replace it with optimised text extractors

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.1.4, 5.2 M1
    • Component/s: Preview

      Description

      The current implementation of the any2text transform is a real performance bottleneck. It tries to convert any blob to PDF over a socket connection to OpenOffice.org and then uses PDFBox to extract the text. It is not rare for the whole process to take 25 s just to extract a few lines of text for the search engine.

      We should use dedicated extractors instead. For ODF and OpenOffice.org v1 files, unzipping the file and reading the text content of content.xml would be much faster. This is precisely what the Jackrabbit text extractor does:

      http://jackrabbit.apache.org/api/1.3/org/apache/jackrabbit/extractor/OpenOfficeTextExtractor.html
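
      To illustrate the unzip-and-read approach, here is a minimal sketch using only the JDK. It is not the Jackrabbit implementation; the class name OdfTextSketch and its methods are hypothetical. It assumes the document is a zip archive with a content.xml entry, and it treats elements whose local name is "p" or "h" as paragraph boundaries.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: pull plain text out of content.xml without any PDF round trip.
public class OdfTextSketch {

    /** Returns the text of content.xml, or "" if the archive has no such entry. */
    public static String extractText(InputStream document) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(document)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("content.xml".equals(entry.getName())) {
                    return parseText(readAll(zip));
                }
            }
        }
        return "";
    }

    // Copies the current zip entry into memory so the parser cannot close the zip stream.
    private static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    private static String parseText(byte[] xml) throws Exception {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                // Assumption: paragraphs and headings end with a line break.
                if ("p".equals(localName) || "h".equals(localName)) {
                    text.append('\n');
                }
            }

            @Override
            public InputSource resolveEntity(String publicId, String systemId) {
                // Never fetch an external DTD (v1 files may declare one).
                return new InputSource(new StringReader(""));
            }
        };
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.newSAXParser().parse(new ByteArrayInputStream(xml), handler);
        return text.toString().trim();
    }
}
```

      The whole extraction is a single pass over one zip entry, with no external process or socket involved, which is where the expected speed-up over the PDF conversion comes from.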

      We should also reuse as many of the other Jackrabbit extractors as possible:

      http://jackrabbit.apache.org/api/1.3/org/apache/jackrabbit/extractor/

      They should be wrapped as *2text transforms and used instead of any2text.
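
      One way to picture the replacement is a registry that dispatches on MIME type to a dedicated extractor. The sketch below is only an illustration: TextTransformRegistry and its nested Extractor interface are hypothetical and are neither the Nuxeo transform API nor the Jackrabbit TextExtractor interface. Returning an empty string for unregistered types is likewise an assumption, not something this ticket specifies.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: one dedicated "*2text" extractor per MIME type.
public class TextTransformRegistry {

    /** Hypothetical stand-in for a dedicated extractor wrapped as a transform. */
    public interface Extractor {
        String extract(InputStream in) throws IOException;
    }

    private final Map<String, Extractor> byMimeType = new HashMap<>();

    public void register(String mimeType, Extractor extractor) {
        byMimeType.put(mimeType, extractor);
    }

    public boolean supports(String mimeType) {
        return byMimeType.containsKey(mimeType);
    }

    /** Runs the extractor registered for the MIME type, or returns "" if there is none. */
    public String toText(String mimeType, InputStream in) throws IOException {
        Extractor extractor = byMimeType.get(mimeType);
        if (extractor == null) {
            // Assumption: unknown types yield no text rather than a slow generic conversion.
            return "";
        }
        return extractor.extract(in);
    }
}
```

      Each wrapped extractor would be registered under the MIME types it handles, so the caller never needs a catch-all conversion step.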

            People

            • Votes: 0
              Watchers: 0

              Dates

              • Created:
                Updated:
                Resolved:

                Time Tracking

                Estimated: 2d 2h
                Remaining: 2d 2h
                Logged: Not Specified