- Type: Improvement
- Status: Resolved
- Priority: Minor
- Resolution: Fixed
- Affects Version/s: 10.1
- Fix Version/s: 10.3
- Component/s: Core, Elasticsearch
- Release Notes Summary: Makes the raw binary text available to different processes, such as ML enrichment
- Impact type: API change
- Upgrade notes:
- Sprint: nxFG 10.3.6
- Story Points: 2
FulltextExtractorWork extracts the binary text and then runs the fulltext parser, which by default removes punctuation and lowercases the text. This is fine for the database.
Elasticsearch and AI processing would work better with the raw text, before this pre-processing is applied. Can we make the StringBlob available?
FulltextExtractorWork, line 153:
StringBlob stringBlob = blobsToStringBlob(blobs, docId);
String text = fulltextParser.parse(stringBlob.getString(), null, stringBlob.getMimeType(), docLocation);
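
To illustrate the request, here is a minimal, self-contained sketch of keeping the raw text alongside the parsed text. It does not use the Nuxeo API: `RawTextSketch`, `ExtractedText`, `normalize` and `extract` are hypothetical names, and `normalize` is only a rough stand-in for the default parser behaviour described above (punctuation removed, text lowercased), not the actual parser implementation.

```java
import java.util.Locale;

public class RawTextSketch {

    /** Hypothetical holder that exposes both forms of the extracted text. */
    static final class ExtractedText {
        final String raw;        // text as extracted from the binary, untouched
        final String normalized; // text after parser-style pre-processing

        ExtractedText(String raw, String normalized) {
            this.raw = raw;
            this.normalized = normalized;
        }
    }

    /**
     * Assumption: approximates the default parser by replacing runs of
     * punctuation and whitespace with a single space, then lowercasing.
     */
    static String normalize(String raw) {
        return raw.replaceAll("[\\p{Punct}\\s]+", " ")
                  .trim()
                  .toLowerCase(Locale.ROOT);
    }

    /** Keeps the raw text instead of discarding it once parsing is done. */
    static ExtractedText extract(String raw) {
        return new ExtractedText(raw, normalize(raw));
    }

    public static void main(String[] args) {
        ExtractedText t = extract("Hello, World!");
        System.out.println(t.raw);        // could feed Elasticsearch or ML enrichment
        System.out.println(t.normalized); // could feed the database fulltext index
    }
}
```

With something along these lines, consumers that benefit from punctuation and case could read the raw value, while the database would continue to receive the normalized one.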