Uploaded image for project: 'Nuxeo Platform'
  1. Nuxeo Platform
  2. NXP-4805

Improve VCS/PostgreSQL fulltext storage

    XMLWordPrintable

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: 5.3 GA
    • Fix Version/s: None
    • Component/s: Core VCS
    • Environment:
      VCS/PostgreSQL

      Description

      The extracted fulltext of attached files is stored as a tsvector in the fulltext.binarytext field,
      the text metadata of the document is stored as tsvector in the fulltext.simpletext field,
      a trigger concatenates the 2 fields into fulltext.fulltext as a tsvector.

      There are several drawback for this implementation:

      • on disk the tsvector field takes ~2 times the size of the text (assuming there are no stop word)
      • the tsvector is duplicated (fulltext.fulltext = fulltext.binarytext + fulltext.simpletext)
      • the tsvector computation has a cost and impact mass import where we can post process fulltext indexing
      • we don't keep the clear fulltext wich prevent to rebuild the fulltext index using another analyzer easily (it requires some code to process again the fulltext extraction)

      A first simple solution is to store the fulltext as varchar in the fulltext.binarytext and fulltext.simpletext, then a trigger will build the tsvector of binary + simple text.

      Another solution according to Florent is to remove the fulltext.fulltext column and add a gin index on both field:
      CREATE INDEX fulltext_idx ON fulltext USING gin(TO_TSVECTOR('fr',COALESCE(simpletext,'') || ' ' || COALESCE(binarytext,'')));
      then request fulltext like:
      ... WHERE TO_TSVECTOR('fr', COALESCE(simpletext,'') || ' ' || COALESCE(binarytext,'')) @@ TO_TSQUERY('fr', ?)
      and make sure the gin index is well used.

      In this case rebuilding the index is going to be a very long operation.

        Attachments

          Issue Links

            Activity

              People

              • Votes:
                0 Vote for this issue
                Watchers:
                0 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved:

                  Time Tracking

                  Estimated:
                  Original Estimate - 2 days, 2 hours
                  2d 2h
                  Remaining:
                  Remaining Estimate - 2 days, 2 hours
                  2d 2h
                  Logged:
                  Time Spent - Not Specified
                  Not Specified