[NXP-2067] get rid of any2text transform and replace it by optimised text extractors - Nuxeo Issue Tracker

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 5.1.4, 5.2 M1
Component/s: Preview

Description

The current implementation of the any2text transform is a real performance bottleneck. It tries to convert any blob to PDF over a socket connection to OpenOffice and then uses PDFBox to extract the text. It is not rare for the whole process to take 25 seconds just to extract a few lines of text for the search engine.

We should use dedicated extractors instead. For ODF and OpenOffice.org v1 files, unzipping the archive and reading the text content of content.xml would be much faster. This is precisely what the Jackrabbit text extractor does:

http://jackrabbit.apache.org/api/1.3/org/apache/jackrabbit/extractor/OpenOfficeTextExtractor.html
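A minimal sketch of the unzip-and-read idea, using only the JDK. It is not the Jackrabbit implementation: the class name OdfTextSketch and the helper sampleOdf are hypothetical, and treating the end of text:p and text:h elements as line breaks is an assumption made for readability. External DTD references (as found in OpenOffice.org v1 files) are resolved to an empty source so that parsing never goes to the network.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class OdfTextSketch {

    /** Returns the character data of content.xml, or "" if the archive has no such entry. */
    public static String extractText(InputStream odf) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(odf)) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                if ("content.xml".equals(e.getName())) {
                    // Copy the entry into memory so the XML parser cannot close the zip stream.
                    ByteArrayOutputStream xml = new ByteArrayOutputStream();
                    byte[] buf = new byte[8192];
                    for (int n = zip.read(buf); n != -1; n = zip.read(buf)) {
                        xml.write(buf, 0, n);
                    }
                    return parse(xml.toByteArray());
                }
            }
        }
        return "";
    }

    private static String parse(byte[] xml) throws Exception {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                // Assumption: paragraphs and headings each end a line of extracted text.
                if ("text:p".equals(qName) || "text:h".equals(qName)) {
                    text.append('\n');
                }
            }

            @Override
            public InputSource resolveEntity(String publicId, String systemId) {
                // Never fetch external DTDs; substitute an empty document.
                return new InputSource(new StringReader(""));
            }
        };
        // The default factory is not namespace-aware, so qName keeps its "text:" prefix.
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml), handler);
        return text.toString().trim();
    }

    /** Hypothetical helper: builds an in-memory zip holding only content.xml. */
    public static byte[] sampleOdf(String contentXml) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            zip.putNextEntry(new ZipEntry("content.xml"));
            zip.write(contentXml.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<office:document-content xmlns:office=\"urn:example:office\""
                + " xmlns:text=\"urn:example:text\"><office:body>"
                + "<text:h>Title</text:h><text:p>First line.</text:p>"
                + "</office:body></office:document-content>";
        System.out.println(extractText(new ByteArrayInputStream(sampleOdf(xml))));
    }
}
```

No OpenOffice process or PDF conversion is involved here: the cost is one pass over the zip stream plus one SAX parse, which is where the expected speed-up over any2text would come from.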

We should also reuse as many of the other Jackrabbit extractors as possible:

http://jackrabbit.apache.org/api/1.3/org/apache/jackrabbit/extractor/

wrapped as *2text transforms and use them instead of any2text.
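One way to picture the "*2text instead of any2text" arrangement is a lookup from MIME type to a dedicated extractor, with the generic path kept only as a fallback. The sketch below is purely illustrative: TextTransformRegistry and TextTransform are hypothetical names, not Nuxeo or Jackrabbit APIs, and the plain-text transform merely stands in for a real wrapped extractor. It assumes Java 9 or later for InputStream.readAllBytes.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class TextTransformRegistry {

    /** Hypothetical per-format transform; not a Nuxeo or Jackrabbit interface. */
    public interface TextTransform {
        String toText(InputStream blob) throws Exception;
    }

    private final Map<String, TextTransform> byMimeType = new HashMap<>();
    private final TextTransform fallback;

    /** The fallback plays the role of the slow generic conversion. */
    public TextTransformRegistry(TextTransform fallback) {
        this.fallback = fallback;
    }

    public void register(String mimeType, TextTransform transform) {
        byMimeType.put(mimeType, transform);
    }

    /** Uses the dedicated transform when one is registered, else the fallback. */
    public String extract(String mimeType, InputStream blob) throws Exception {
        return byMimeType.getOrDefault(mimeType, fallback).toText(blob);
    }

    public static void main(String[] args) throws Exception {
        TextTransformRegistry registry = new TextTransformRegistry(blob -> "");
        // Stand-in for a wrapped extractor: plain text is decoded as UTF-8.
        registry.register("text/plain",
                blob -> new String(blob.readAllBytes(), StandardCharsets.UTF_8));
        InputStream in = new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(registry.extract("text/plain", in));
    }
}
```

Keeping a fallback means formats without a dedicated extractor still get indexed, while the common formats skip the expensive conversion.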

Sub-Tasks

There are no Sub-Tasks for this issue.

People

Assignee:

Thierry Delprat

Reporter:

Olivier Grisel

Participants:

Olivier Grisel, Thierry Delprat

Dates

Created:

2008-02-12 18:21

Updated:

2008-03-06 02:03

Resolved:

2008-03-06 02:03

Time Tracking

Estimated:

2d 2h

Remaining:

2d 2h

Logged:

Not Specified