[NXP-4805] Improve VCS/PostgreSQL fulltext storage - Nuxeo Issue Tracker

XML

Word

Printable

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Duplicate
Affects Version/s: 5.3 GA
Fix Version/s: None
Component/s: Core VCS
Environment:
VCS/PostgreSQL

Tags:
- performances
- postgresql
Impact type:

Content model Change

Description

The extracted fulltext of attached files is stored as a tsvector in the fulltext.binarytext field,
the text metadata of the document is stored as tsvector in the fulltext.simpletext field,
a trigger concatenates the 2 fields into fulltext.fulltext as a tsvector.

There are several drawback for this implementation:

on disk the tsvector field takes ~2 times the size of the text (assuming there are no stop word)
the tsvector is duplicated (fulltext.fulltext = fulltext.binarytext + fulltext.simpletext)
the tsvector computation has a cost and impact mass import where we can post process fulltext indexing
we don't keep the clear fulltext wich prevent to rebuild the fulltext index using another analyzer easily (it requires some code to process again the fulltext extraction)

A first simple solution is to store the fulltext as varchar in the fulltext.binarytext and fulltext.simpletext, then a trigger will build the tsvector of binary + simple text.

Another solution according to Florent is to remove the fulltext.fulltext column and add a gin index on both field:
CREATE INDEX fulltext_idx ON fulltext USING gin(TO_TSVECTOR('fr',COALESCE(simpletext,'') || ' ' || COALESCE(binarytext,'')));
then request fulltext like:
... WHERE TO_TSVECTOR('fr', COALESCE(simpletext,'') || ' ' || COALESCE(binarytext,'')) @@ TO_TSQUERY('fr', ?)
and make sure the gin index is well used.

In this case rebuilding the index is going to be a very long operation.

Attachments

Issue Links

duplicates

NXP-5689 VCS: store PostgreSQL fulltext in clear text

Resolved

Activity

People

Assignee:

Unassigned

Reporter:

Benoit Delbosc

Participants:

Benoit Delbosc, Florent Guillaume

Votes:

0 Vote for this issue

Watchers:

0 Start watching this issue

Dates

Created:

2010-02-26 11:21

Updated:

2011-07-20 09:52

Resolved:

2011-07-20 09:52

Time Tracking

Estimated:

2d 2h

Remaining:

2d 2h

Logged:

Not Specified