When creating a new custom model, the first step is to collect a corpus from the existing Nuxeo database.
That step should begin with a quick collection of statistics on the data, giving feedback about its quality and, where possible, flagging that a model cannot be trained with it.
These statistics feed into the UI to help the user create a model definition (docType, inputs, outputs):
- Required for defining output fields
  - If possible with percentiles
  - Aggregation (bucketing/histogram)
  - Rules: score (green, yellow, red) and comment on values
    - Unbalanced data?
    - Enough data for each value?
    - All the same tag!
- Required for defining input fields
  - Null fields
  - Total number of documents
  - Rule: score on possible use for training
- WHERE clause
- Generic service
- Generic REST API
- Define rules for quality of output
- Define rules for quality of input
- Get statistics for input
- Get statistics for output
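
The quality rules listed above could be sketched as follows. This is only an illustration in Python, not the Nuxeo service itself: the function names, the returned dictionary shape, and all thresholds (`MIN_DOCS_PER_VALUE`, `MAX_DOMINANT_SHARE`, `MAX_NULL_SHARE_GREEN`, `MAX_NULL_SHARE_YELLOW`) are assumptions, since the rules are not yet defined. Field values are assumed to have already been fetched as a plain list, with `None` standing for a null field.

```python
from collections import Counter

# Hypothetical thresholds; the real rules still have to be defined.
MIN_DOCS_PER_VALUE = 10       # fewer documents than this for a value is "not enough data"
MAX_DOMINANT_SHARE = 0.8      # a single value above this share means "unbalanced data"
MAX_NULL_SHARE_GREEN = 0.1    # up to this share of nulls, an input field is green
MAX_NULL_SHARE_YELLOW = 0.5   # up to this share of nulls, an input field is yellow


def score_output_field(values):
    """Build a histogram of an output field and score it green/yellow/red."""
    counts = Counter(v for v in values if v is not None)
    total = sum(counts.values())
    result = {"total": total, "histogram": dict(counts)}
    if total == 0:
        result.update(score="red", comment="no values")
    elif len(counts) == 1:
        result.update(score="red", comment="all documents have the same tag")
    elif min(counts.values()) < MIN_DOCS_PER_VALUE:
        result.update(score="yellow", comment="not enough data for some values")
    elif max(counts.values()) / total > MAX_DOMINANT_SHARE:
        result.update(score="yellow", comment="unbalanced data")
    else:
        result.update(score="green", comment="ok")
    return result


def score_input_field(values):
    """Count null fields and documents, then score the field's use for training."""
    values = list(values)
    total = len(values)
    nulls = sum(1 for v in values if v is None)
    result = {"total": total, "nulls": nulls}
    if total == 0:
        result.update(score="red", comment="no documents")
        return result
    null_share = nulls / total
    if null_share <= MAX_NULL_SHARE_GREEN:
        result.update(score="green", comment="ok")
    elif null_share <= MAX_NULL_SHARE_YELLOW:
        result.update(score="yellow", comment="many null fields")
    else:
        result.update(score="red", comment="mostly null fields")
    return result
```

For example, `score_output_field(["a"] * 50)` is scored red because every document carries the same tag, while `score_input_field([1, None, 2, 3])` is scored yellow because a quarter of the fields are null. Percentiles and the WHERE clause are left out of this sketch.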