The system takes an NXQL query as input (or any other representation of a list of documents with the required fields), passes each document through a processing pipeline, and stores the collected data in a binary or TFRecord format.
The pipeline should support both Python and Java processes.
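As a rough illustration of the per-document pipeline, here is a minimal Java sketch. It assumes each document is represented as a map of field names to values and that every processing step is a function from one such map to another. The class name DocumentPipeline and this representation are hypothetical, not part of Nuxeo.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Hypothetical sketch: applies an ordered list of steps to each document. */
public class DocumentPipeline {

    // Assumption: a document is a map of field name -> value.
    private final List<UnaryOperator<Map<String, Object>>> steps;

    public DocumentPipeline(List<UnaryOperator<Map<String, Object>>> steps) {
        this.steps = List.copyOf(steps);
    }

    /** Runs one document through every step, in order. */
    public Map<String, Object> process(Map<String, Object> document) {
        // Work on a copy so the caller's map is never modified.
        Map<String, Object> current = new LinkedHashMap<>(document);
        for (UnaryOperator<Map<String, Object>> step : steps) {
            current = step.apply(current);
        }
        return current;
    }

    /** Runs every document of the input list through the pipeline. */
    public List<Map<String, Object>> processAll(Iterable<Map<String, Object>> documents) {
        List<Map<String, Object>> results = new ArrayList<>();
        for (Map<String, Object> document : documents) {
            results.add(process(document));
        }
        return results;
    }
}
```

The collected results would then be handed to whatever component writes the output file.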
Case study: the data should be extracted from Nuxeo using Bulk operations and stored in S3 so that it can be used with SageMaker. The output format is TFRecord.
A reference to this data, together with its statistics (histograms, inputs, outputs), should be stored in an AI_Corpus document.
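For the TFRecord output, the file-level framing can be produced from plain Java. In a TFRecord file, each record consists of the payload length (64-bit little-endian), a masked CRC32C of the length, the payload bytes, and a masked CRC32C of the payload. The sketch below shows only that framing. The class name TFRecordFramer is hypothetical, java.util.zip.CRC32C requires Java 9 or later, and the payload would normally be a serialized tf.train.Example protobuf, which this sketch does not build.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.zip.CRC32C;

/** Hypothetical helper: wraps a raw payload in TFRecord framing. */
public class TFRecordFramer {

    // Constant added to the rotated CRC by the TFRecord format.
    private static final int MASK_DELTA = 0xa282ead8;

    /** Rotates the CRC right by 15 bits and adds the constant (32-bit wraparound). */
    public static int mask(int crc) {
        return ((crc >>> 15) | (crc << 17)) + MASK_DELTA;
    }

    /** Computes the masked CRC32C of the given bytes. */
    public static int maskedCrc32c(byte[] data) {
        CRC32C crc = new CRC32C();
        crc.update(data, 0, data.length);
        return mask((int) crc.getValue());
    }

    /** Returns length + crc(length) + payload + crc(payload), all little-endian. */
    public static byte[] frame(byte[] payload) {
        byte[] length = ByteBuffer.allocate(8)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putLong(payload.length)
                .array();
        ByteBuffer record = ByteBuffer.allocate(8 + 4 + payload.length + 4)
                .order(ByteOrder.LITTLE_ENDIAN);
        record.put(length);
        record.putInt(maskedCrc32c(length));
        record.put(payload);
        record.putInt(maskedCrc32c(payload));
        return record.array();
    }
}
```

Appending the result of frame(...) for each processed document to a single stream yields one TFRecord file, which can then be uploaded to S3.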
- for text: store as a UTF-8 encoded string.
- for image: resize to 299x299x3 (RGB) and store as floats with values between 0 and 1.
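The two encodings above could be sketched in Java as follows, using only the standard library. The class name FeatureEncoders is hypothetical. Two choices are assumptions, since the list does not specify them: the image is stretched to 299x299 without preserving its aspect ratio, and the floats are laid out row by row with interleaved R, G, B channels.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: encodes text and image fields for the output file. */
public class FeatureEncoders {

    public static final int SIZE = 299;

    /** Text is stored as its UTF-8 bytes. */
    public static byte[] encodeText(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    /** Resizes to 299x299 RGB and scales each channel from 0..255 to 0..1. */
    public static float[] encodeImage(BufferedImage source) {
        BufferedImage resized = new BufferedImage(SIZE, SIZE, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = resized.createGraphics();
        try {
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                    RenderingHints.VALUE_INTERPOLATION_BILINEAR);
            // Assumption: stretch to the target size, ignoring aspect ratio.
            g.drawImage(source, 0, 0, SIZE, SIZE, null);
        } finally {
            g.dispose();
        }
        float[] pixels = new float[SIZE * SIZE * 3];
        int i = 0;
        for (int y = 0; y < SIZE; y++) {
            for (int x = 0; x < SIZE; x++) {
                int rgb = resized.getRGB(x, y);
                pixels[i++] = ((rgb >> 16) & 0xFF) / 255f; // red
                pixels[i++] = ((rgb >> 8) & 0xFF) / 255f;  // green
                pixels[i++] = (rgb & 0xFF) / 255f;         // blue
            }
        }
        return pixels;
    }
}
```

A source image can be loaded with javax.imageio.ImageIO.read(...) before being passed to encodeImage.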
INFO: No Python here for now. Any pre-processing that is needed will be done at the Dataset level when training a model.