Nuxeo Platform / NXP-25279

Make the raw binary text available for processing

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 10.1
    • Fix Version/s: 10.3
    • Component/s: Core, Elasticsearch
    • Release Notes Summary:
      Makes the raw binary text available to other processes, such as ML enrichment
    • Impact type:
      API change
    • Upgrade notes:
      The binaryTextUpdated event now carries two properties that identify exactly what was updated:

      • systemProperty contains the name of the updated property
      • systemPropertyValue contains the value
    • Sprint:
      nxFG 10.3.6
    • Story Points:
      2
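
      The upgrade note above can be illustrated with a minimal, self-contained sketch of how a consumer might read the two properties. It does not use the Nuxeo API: the event context is modeled as a plain Map, the class and method names are hypothetical, and the property name "fulltextBinary" is an assumption used only for illustration. Only the event name binaryTextUpdated and the keys systemProperty and systemPropertyValue come from this ticket.

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: models the event context as a Map instead of using Nuxeo classes.
final class BinaryTextUpdatedSketch {

    // Names taken from the upgrade notes of this ticket.
    static final String EVENT_NAME = "binaryTextUpdated";
    static final String SYSTEM_PROPERTY = "systemProperty";
    static final String SYSTEM_PROPERTY_VALUE = "systemPropertyValue";

    /**
     * Returns the updated text when the event is binaryTextUpdated and it
     * concerns the expected system property; otherwise returns empty.
     */
    static Optional<String> updatedText(String eventName,
                                        Map<String, Serializable> props,
                                        String expectedProperty) {
        if (!EVENT_NAME.equals(eventName)) {
            return Optional.empty(); // some other event
        }
        if (!expectedProperty.equals(props.get(SYSTEM_PROPERTY))) {
            return Optional.empty(); // a different system property was updated
        }
        Object value = props.get(SYSTEM_PROPERTY_VALUE);
        return value instanceof String ? Optional.of((String) value) : Optional.empty();
    }

    public static void main(String[] args) {
        Map<String, Serializable> props = new HashMap<>();
        props.put(SYSTEM_PROPERTY, "fulltextBinary"); // assumed property name
        props.put(SYSTEM_PROPERTY_VALUE, "Raw Text, with punctuation.");

        updatedText(EVENT_NAME, props, "fulltextBinary")
                .ifPresent(text -> System.out.println("Raw text received: " + text));
    }
}
```

      In a real listener, the same two keys would presumably be read from the event context's properties; checking systemProperty first lets a consumer ignore updates it does not care about.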

      Description

      FulltextExtractorWork extracts the binary text, then runs the fulltext parser, which by default removes punctuation and lowercases the text. This is fine for the database.

      Elasticsearch and AI processing would work better with the raw text before pre-processing. Can we make the StringBlob available?

      FulltextExtractorWork line 153
      StringBlob stringBlob = blobsToStringBlob(blobs, docId);
      String text = fulltextParser.parse(stringBlob.getString(), null, stringBlob.getMimeType(), docLocation);
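
      To make the difference between the raw and parsed text concrete, here is a small sketch. The normalize method is a hypothetical stand-in that only approximates the behavior described above (strip punctuation, lowercase); it is not Nuxeo's fulltext parser, and the sample sentence is invented.

```java
import java.util.Locale;

// Hypothetical illustration, not Nuxeo code.
final class RawVersusParsedText {

    // Approximates the described default parsing: every run of characters that
    // are neither letters nor digits becomes a single space, then lowercase.
    static String normalize(String raw) {
        return raw.replaceAll("[^\\p{L}\\p{N}]+", " ")
                  .trim()
                  .toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String raw = "Dr. Smith paid $1,200.50 on 2018-07-01!";
        System.out.println("raw:    " + raw);
        // prints "parsed: dr smith paid 1 200 50 on 2018 07 01"
        System.out.println("parsed: " + normalize(raw));
    }
}
```

      The parsed form loses sentence boundaries, casing, amounts, and date formatting, which is the kind of information an Elasticsearch analyzer or an ML enrichment step would likely want to see in the raw text.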
      

      People

      • Votes: 0
      • Watchers: 3


      Time Tracking

      • Original Estimate: Not Specified
      • Remaining Estimate: 0m
      • Time Spent: 1h