Goal
The ultimate goal is to provide a more efficient and standard way of doing mass imports into the Nuxeo repository.
For that, the idea is to start from nuxeo-importer-queues, so that we get a better decoupling between readers and writers and an importer able to handle large-scale data sets.
Background on the importer
We currently provide several importers that can be used to migrate data to Nuxeo or run daily imports.
These importers are basically sample code that can be adapted to run imports leveraging:
- thread-pooling
- batching (import several documents inside a given transaction)
- event processing filtering (enable bulk mode or skip some events)
Currently, the results "from the field" are not so great:
- people have a hard time understanding and using the code
- we usually end up fixing the code for them
- they sometimes prefer using a simple REST-based importer
- but performance is then significantly slower
The bottom line is that there is no single good way of importing data:
- depending on whether the data is hierarchical or not
- file tree with parent/child constraints vs flat CSV import
- depending on whether the data needs complex pre-processing
- simple file tree vs complex XML envelopes
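To make the first distinction concrete: hierarchical sources add an ordering constraint that flat sources do not have, since a parent folder must exist before its children can be imported, while CSV rows are independent and can be pushed in any order. A minimal sketch of such an ordering step (the helper name is made up for illustration, this is not a Nuxeo API):

```python
# Illustrative only: hierarchical data must be produced parents-first,
# whereas flat CSV rows carry no such constraint.

def order_for_import(paths):
    """Order paths so every parent is emitted before its children."""
    # Sorting by path depth (then lexicographically) guarantees that
    # "/a" comes before "/a/b", which comes before "/a/b/c.txt".
    return sorted(paths, key=lambda p: (p.count("/"), p))

tree = ["/a/b/c.txt", "/a", "/a/b", "/a/d.txt"]
print(order_for_import(tree))  # parents first: ['/a', '/a/b', '/a/d.txt', '/a/b/c.txt']
```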
One big mistake in the current importer design is the tight coupling between the read part and the write part.
So, we should move to a solution where we completely decouple the producer part from the consumer part.
Having this separation will bring several gains:
- we should be able to have a unique / generic consumer that does the actual import into Nuxeo
- have a highly optimized importer
- we can run separately the producer and the consumer
- this means we can more easily re-run the import without being forced to re-run all the pre-processing
- we can give to the people "working on the import process" something that is mainly decoupled from Nuxeo
- this means we do not need to train them to be Nuxeo developers
For that, the idea is to have a simple staging area:
Source data => Producer => Queue(s) => Consumer => Import Data in Nuxeo
This flow is similar to what we have inside nuxeo-importer-queues, but now we want to clearly split the importer flow into 2 sub-parts and have the queue system externalized.
This basically means that the part people will need to tweak depends only on the target queue system and on a common data format.
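The staging flow above can be sketched with a plain in-memory list standing in for the external queue system. The JSON message fields used here ("path", "type", "properties") are an assumed common data format for the sake of the example, not an official Nuxeo schema:

```python
import json

def produce(source_rows):
    """Producer: pre-process source data into queue messages in a common JSON format."""
    return [json.dumps({"path": r["path"], "type": "File",
                        "properties": {"dc:title": r["title"]}})
            for r in source_rows]

def consume(queue, repository):
    """Generic consumer: read staged messages and 'import' them into the repository."""
    for msg in queue:
        doc = json.loads(msg)
        repository[doc["path"]] = doc

source = [{"path": "/import/a", "title": "A"},
          {"path": "/import/b", "title": "B"}]
staged = produce(source)  # pre-processing runs once; its output is kept in the queue
repo = {}
consume(staged, repo)     # the consumer can be re-run from 'staged' at any time,
                          # without re-running the producer's pre-processing
```

Note how the producer knows nothing about the repository and the consumer knows nothing about the source format: each side only depends on the queue and the shared message format, which is the decoupling gain described above.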
Once we have the event system aligned on Kafka, we should pretty much already have the infrastructure for the consumer part.
After all, we will have something that reads from Kafka to start workers: there is no real reason this could not match the importer/consumer requirements.
With that in mind, we could consider that importing data into Nuxeo means:
- write Documents data into a set of Kafka queues associated with a specific "import event"
- let the default Nuxeo Consumer do the job or contribute a custom one
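A hypothetical sketch of the second bullet, assuming a registry where custom consumers can be contributed per "import event" and a default consumer handles everything else. The event names, registry, and message shape are invented for illustration; this is not the actual Nuxeo or Kafka API:

```python
# Dispatch sketch: messages carry an event name; a contributed consumer for
# that event wins, otherwise the default Nuxeo-side consumer does the job.

def default_consumer(doc):
    # Stand-in for the generic, highly optimized import path.
    return ("default", doc["path"])

CONSUMERS = {}  # registry of contributed custom consumers, keyed by event name

def register_consumer(event, fn):
    CONSUMERS[event] = fn

def handle(message):
    consumer = CONSUMERS.get(message["event"], default_consumer)
    return consumer(message["doc"])

# A custom consumer contributed for one specific import event
register_consumer("legacyImport", lambda doc: ("custom", doc["path"].upper()))

print(handle({"event": "documentImport", "doc": {"path": "/a"}}))  # -> ('default', '/a')
print(handle({"event": "legacyImport", "doc": {"path": "/a"}}))    # -> ('custom', '/A')
```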