-
Type: Improvement
-
Status: Resolved
-
Priority: Minor
-
Resolution: Duplicate
-
Affects Version/s: 5.3 GA
-
Fix Version/s: None
-
Component/s: Core VCS
-
Environment:VCS/PostgreSQL
-
Tags:
-
Impact type:Content model Change
The extracted fulltext of attached files is stored as a tsvector in the fulltext.binarytext field,
the text metadata of the document is stored as tsvector in the fulltext.simpletext field,
a trigger concatenates the 2 fields into fulltext.fulltext as a tsvector.
There are several drawback for this implementation:
- on disk the tsvector field takes ~2 times the size of the text (assuming there are no stop word)
- the tsvector is duplicated (fulltext.fulltext = fulltext.binarytext + fulltext.simpletext)
- the tsvector computation has a cost and impact mass import where we can post process fulltext indexing
- we don't keep the clear fulltext wich prevent to rebuild the fulltext index using another analyzer easily (it requires some code to process again the fulltext extraction)
A first simple solution is to store the fulltext as varchar in the fulltext.binarytext and fulltext.simpletext, then a trigger will build the tsvector of binary + simple text.
Another solution according to Florent is to remove the fulltext.fulltext column and add a gin index on both field:
CREATE INDEX fulltext_idx ON fulltext USING gin(TO_TSVECTOR('fr',COALESCE(simpletext,'') || ' ' || COALESCE(binarytext,'')));
then request fulltext like:
... WHERE TO_TSVECTOR('fr', COALESCE(simpletext,'') || ' ' || COALESCE(binarytext,'')) @@ TO_TSQUERY('fr', ?)
and make sure the gin index is well used.
In this case rebuilding the index is going to be a very long operation.
- duplicates
-
NXP-5689 VCS: store PostgreSQL fulltext in clear text
- Resolved