[NXP-25279] Make the raw binary text available for processing - Nuxeo Issue Tracker

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 10.1
Fix Version/s: 10.3
Component/s: Core, Elasticsearch

Release Notes Summary:
Makes the raw binary text available to other processes, such as ML enrichment.
Tags:
Impact type: API change
Upgrade notes:
The binaryTextUpdated event now contains two properties that indicate exactly what was updated:

systemProperty contains the name of the property updated

systemPropertyValue contains the value
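As an illustration of how a consumer might read these two properties, here is a minimal plain-Java sketch. It models the event's properties as a `Map` rather than using the Nuxeo event API. The class `BinaryTextEventReader`, the method `updatedValue`, and the property name `"fulltextBinary"` are assumptions made for the example. Only the keys `systemProperty` and `systemPropertyValue` come from the notes above.

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not part of Nuxeo: the event's properties are
// modelled as a plain Map so the sketch stays self-contained.
class BinaryTextEventReader {

    // Property keys named in the upgrade notes.
    static final String SYSTEM_PROPERTY = "systemProperty";
    static final String SYSTEM_PROPERTY_VALUE = "systemPropertyValue";

    /**
     * Returns the updated value when the event properties report that
     * expectedProperty is the property that was updated, else empty.
     */
    static Optional<String> updatedValue(Map<String, Serializable> props, String expectedProperty) {
        Object name = props.get(SYSTEM_PROPERTY);
        Object value = props.get(SYSTEM_PROPERTY_VALUE);
        if (expectedProperty.equals(name) && value instanceof String) {
            return Optional.of((String) value);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Map<String, Serializable> props = new HashMap<>();
        // "fulltextBinary" is an assumed property name, used only for illustration.
        props.put(SYSTEM_PROPERTY, "fulltextBinary");
        props.put(SYSTEM_PROPERTY_VALUE, "Hello, World!");
        System.out.println(updatedValue(props, "fulltextBinary").orElse("(no matching update)"));
    }
}
```

In a real listener the same two keys would be looked up on the event's context. Checking the property name first keeps the consumer from reacting to updates it does not care about.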
Sprint: nxFG 10.3.6
Story Points: 2

Description

FulltextExtractorWork extracts the binary text and then runs the fulltext parser, which, by default, removes punctuation and lowercases the text. This is fine for the database.

Elasticsearch and AI processing would work better with the raw text before pre-processing. Can we make the StringBlob available?

FulltextExtractorWork, line 153:

StringBlob stringBlob = blobsToStringBlob(blobs, docId);
String text = fulltextParser.parse(stringBlob.getString(), null, stringBlob.getMimeType(), docLocation);
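To make the difference between raw and parsed text concrete, the sketch below applies a simplified normalization: punctuation is replaced by spaces and the result is lowercased. It is a stand-in written for illustration, not Nuxeo's actual fulltext parser. The class `FulltextNormalizationSketch`, the method `normalize`, and the sample sentence are all hypothetical.

```java
import java.util.Locale;

// Hypothetical stand-in for the default parsing described above;
// the real parser's rules may differ.
class FulltextNormalizationSketch {

    /** Replaces ASCII punctuation with spaces, collapses whitespace, and lowercases. */
    static String normalize(String raw) {
        return raw.replaceAll("\\p{Punct}", " ")
                  .replaceAll("\\s+", " ")
                  .trim()
                  .toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String raw = "Dr. Smith's report, v2.0: APPROVED!";
        System.out.println("raw:    " + raw);
        // prints "parsed: dr smith s report v2 0 approved"
        System.out.println("parsed: " + normalize(raw));
    }
}
```

Sentence boundaries, casing, and tokens such as version numbers are lost in the parsed form. That lost information is what Elasticsearch analyzers or ML enrichment could use if given the raw text.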

Attachments

Issue Links

is related to

NXP-25277 binaryTextUpdated event needs the Binary Text

Resolved

NXP-25716 Simplify fulltext extraction

Resolved

is required by

NXP-25838 Update binary fulltext stream handling

Resolved

Activity

People

Assignee:

Florent Guillaume

Reporter:

Gethin James

Participants:

Florent Guillaume, Gethin James, Jenkins

Votes:

0

Watchers:

3

Dates

Created:

2018-07-02 09:46

Updated:

2019-01-22 13:54

Resolved:

2018-09-21 23:17

Time Tracking

Estimated:

Not Specified

Remaining:

Logged: