Goal
The ultimate goal is to provide a more efficient and standard way of doing mass imports into the Nuxeo repository.
For that, the idea is to start from nuxeo-importer-queues, so that we get a better decoupling between readers and writers and an importer able to handle large-scale data sets.
Background on the importer
We currently provide several importers that can be used to migrate data to Nuxeo or run daily imports.
These importers are basically sample code that can be adapted to run imports leveraging:
- thread-pooling
- batching (import several documents inside a given transaction)
- event processing filtering (enable bulk mode or skip some events)
Currently, the results "from the field" are not so great:
- people have a hard time understanding and using the code
- we usually end up fixing the code for them
- they sometimes prefer using a simple REST-based importer
- but performance is then significantly slower
The bottom line is that there is no single good way of importing data:
- depending on whether the data is hierarchical or not
- file tree with parent/child constraints vs flat CSV import
- depending on whether the data needs complex pre-processing
- simple file tree vs complex XML envelopes
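To make the first distinction concrete: hierarchical sources add an ordering constraint that flat sources do not have, since a parent folder must exist before its children can be imported, while CSV rows are independent and can be pushed in any order. A minimal sketch of such an ordering step (the helper name is made up for illustration, this is not a Nuxeo API):

```python
# Illustrative only: hierarchical data must be produced parents-first,
# whereas flat CSV rows carry no such constraint.

def order_for_import(paths):
    """Order paths so every parent is emitted before its children."""
    # Sorting by path depth (then lexicographically) guarantees that
    # "/a" comes before "/a/b", which comes before "/a/b/c.txt".
    return sorted(paths, key=lambda p: (p.count("/"), p))

tree = ["/a/b/c.txt", "/a", "/a/b", "/a/d.txt"]
print(order_for_import(tree))  # parents first: ['/a', '/a/b', '/a/d.txt', '/a/b/c.txt']
```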
One big mistake in the current importer design is the tight coupling between the read part and the write part.
So, we should move to a solution where we completely decouple the producer part from the consumer part.
Having this separation will bring several gains:
- we should be able to have a unique / generic consumer that does the actual import into Nuxeo
- have a highly optimized importer
- we can run separately the producer and the consumer
- this means we can more easily re-run the import without being forced to re-run all the pre-processing
- we can give to the people "working on the import process" something that is mainly decoupled from Nuxeo
- this means we do not need to train them to be Nuxeo developers
For that, the idea is to have a simple staging area:
Source data => Producer => Queue(s) => Consumer => Import Data in Nuxeo
This flow is similar to what we have inside nuxeo-importer-queues, but now we want to clearly split the importer flow into 2 sub-parts and have the queue system externalized.
This basically means that the part people will need to tweak depends only on the target queue system and on a common data format.
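The staging flow above can be sketched with a plain in-memory list standing in for the external queue system. The JSON message fields used here ("path", "type", "properties") are an assumed common data format for the sake of the example, not an official Nuxeo schema:

```python
import json

def produce(source_rows):
    """Producer: pre-process source data into queue messages in a common JSON format."""
    return [json.dumps({"path": r["path"], "type": "File",
                        "properties": {"dc:title": r["title"]}})
            for r in source_rows]

def consume(queue, repository):
    """Generic consumer: read staged messages and 'import' them into the repository."""
    for msg in queue:
        doc = json.loads(msg)
        repository[doc["path"]] = doc

source = [{"path": "/import/a", "title": "A"},
          {"path": "/import/b", "title": "B"}]
staged = produce(source)  # pre-processing runs once; its output is kept in the queue
repo = {}
consume(staged, repo)     # the consumer can be re-run from 'staged' at any time,
                          # without re-running the producer's pre-processing
```

Note how the producer knows nothing about the repository and the consumer knows nothing about the source format: each side only depends on the queue and the shared message format, which is the decoupling gain described above.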
Once we have the event system aligned on Kafka, we should pretty much already have the infrastructure for the consumer part.
After all, we will have something that reads from Kafka to start workers: there is no real reason this could not match the importer/consumer requirements.
With that in mind, we could consider that importing data into Nuxeo means:
- write Documents data into a set of Kafka queues associated with a specific "import event"
- let the default Nuxeo Consumer do the job or contribute a custom one
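A hypothetical sketch of the second bullet, assuming a registry where custom consumers can be contributed per "import event" and a default consumer handles everything else. The event names, registry, and message shape are invented for illustration; this is not the actual Nuxeo or Kafka API:

```python
# Dispatch sketch: messages carry an event name; a contributed consumer for
# that event wins, otherwise the default Nuxeo-side consumer does the job.

def default_consumer(doc):
    # Stand-in for the generic, highly optimized import path.
    return ("default", doc["path"])

CONSUMERS = {}  # registry of contributed custom consumers, keyed by event name

def register_consumer(event, fn):
    CONSUMERS[event] = fn

def handle(message):
    consumer = CONSUMERS.get(message["event"], default_consumer)
    return consumer(message["doc"])

# A custom consumer contributed for one specific import event
register_consumer("legacyImport", lambda doc: ("custom", doc["path"].upper()))

print(handle({"event": "documentImport", "doc": {"path": "/a"}}))  # -> ('default', '/a')
print(handle({"event": "legacyImport", "doc": {"path": "/a"}}))    # -> ('custom', '/A')
```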