  1. Nuxeo Platform
  2. NXP-20214

Build new Nuxeo Importer framework based on nuxeo-importer-queues

    Details

    • Type: Task
    • Status: Resolved
    • Priority: Minor
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: 8.10
    • Component/s: Importer
    • Tags:
    • Sprint:
      nxAN Sprint 9.1.1

      Description

      Goal

      The ultimate goal is to provide a more efficient and standard way of performing mass imports into the Nuxeo repository.

      To that end, the idea is to start from nuxeo-importer-queues, so that readers and writers are better decoupled and the importer can handle large-scale data sets.

      Background on the importer

      We currently provide several importers that can be used to migrate data to Nuxeo or run daily imports.

      nuxeo-platform-importer

      These importers are essentially sample code that can be adapted to run imports leveraging:

      • thread-pooling
      • batching (import several documents inside a given transaction)
      • event processing filtering (enable bulk mode or skip some events)
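
      As an illustration of the thread-pooling and batching points above, here is a minimal sketch in plain Java. It is not the nuxeo-platform-importer code: the class `BatchingSketch`, its methods, and the idea of counting one "commit" per batch are assumptions made for this example, and the transaction itself is only represented by a comment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: groups documents into batches and imports each batch on a thread pool.
final class BatchingSketch {

    /** Splits items into consecutive batches of at most batchSize elements. */
    static <T> List<List<T>> toBatches(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }

    /** Imports all documents and returns the number of simulated commits (one per batch). */
    static int importAll(List<String> docs, int batchSize, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger commits = new AtomicInteger();
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (List<String> batch : toBatches(docs, batchSize)) {
                futures.add(pool.submit(() -> {
                    // A real importer would open a transaction here, create every
                    // document of the batch, then commit once for the whole batch.
                    commits.incrementAndGet();
                }));
            }
            for (Future<?> future : futures) {
                future.get(); // wait for the batch and surface any failure
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
        return commits.get();
    }
}
```

      With five documents and a batch size of two, this sketch produces three batches, so three commits instead of five.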

      Currently, the results "from the field" are not great:

      • people have a hard time understanding and using the code
        • we usually end up fixing the code for them
      • they sometimes prefer using a simple REST-based importer
        • but its performance is significantly worse

      The bottom line is that there is no single good way of importing data:

      • it depends on whether the data is hierarchical or not
        • file tree with parent/child constraints vs flat CSV import
      • it depends on whether the data needs complex pre-processing
        • simple file tree vs complex XML envelopes

      One big mistake in the current importer design is the tight coupling between the read and write parts.

      So we should move to a solution where the producer and consumer parts are completely decoupled.

      This separation will bring several gains:

      • we should be able to have a single, generic consumer that does the actual import into Nuxeo
        • so we can have a highly optimized importer
      • we can run the producer and the consumer separately
        • this means we can more easily re-run the import without being forced to re-run all the pre-processing
      • we can give the people "working on the import process" something that is mostly decoupled from Nuxeo
        • this means we do not need to train them to be Nuxeo developers

      For that, the idea is to have a simple staging area:

          Source data => Producer => Queue(s) => Consumer => Import Data in Nuxeo
      

      This flow is similar to what we have inside nuxeo-importer-queues, but now we want to clearly split the importer flow into two sub-parts and externalize the queue system.

      This basically means that the part people will need to tweak depends only on the target queue system and on a common data format.
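
      To make the split concrete, here is a minimal sketch of the flow in plain Java, assuming an in-memory `BlockingQueue` in place of the external queue system. Everything in it is hypothetical and chosen for illustration: the `DocumentMessage` class standing in for the common data format, its fields, the `END` sentinel, and the hard-coded parent path. The consumer only records the paths it would import; a real one would create documents in the repository.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch of: Source data => Producer => Queue => Consumer => Import
final class ImportPipelineSketch {

    /** Assumed common data format: what the consumer needs to create one document. */
    static final class DocumentMessage {
        /** Sentinel telling the consumer that the producer is done. */
        static final DocumentMessage END =
                new DocumentMessage("", "", "", Collections.<String, String>emptyMap());

        final String parentPath;
        final String name;
        final String type;
        final Map<String, String> properties;

        DocumentMessage(String parentPath, String name, String type, Map<String, String> properties) {
            this.parentPath = parentPath;
            this.name = name;
            this.type = type;
            this.properties = properties;
        }
    }

    /** Producer side: knows the source data and the queue, nothing about the repository. */
    static void produce(List<String> fileNames, BlockingQueue<DocumentMessage> queue)
            throws InterruptedException {
        for (String fileName : fileNames) {
            queue.put(new DocumentMessage("/default-domain/workspaces", fileName, "File",
                    Collections.singletonMap("dc:title", fileName)));
        }
        queue.put(DocumentMessage.END);
    }

    /** Consumer side: generic, reads messages until the sentinel and "imports" them. */
    static List<String> consume(BlockingQueue<DocumentMessage> queue) throws InterruptedException {
        List<String> imported = new ArrayList<>();
        while (true) {
            DocumentMessage message = queue.take();
            if (message == DocumentMessage.END) {
                return imported;
            }
            // A real consumer would create the document in the repository here.
            imported.add(message.parentPath + "/" + message.name);
        }
    }

    /** Runs the producer on its own thread and the consumer on the calling thread. */
    static List<String> run(List<String> fileNames) {
        BlockingQueue<DocumentMessage> queue = new LinkedBlockingQueue<>();
        Thread producer = new Thread(() -> {
            try {
                produce(fileNames, queue);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        try {
            List<String> imported = consume(queue);
            producer.join();
            return imported;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

      In this sketch the producer and the consumer share nothing but the queue and the message class, which is the property the staging area is meant to provide. An in-memory queue does not survive a restart, so re-running the import without re-running the pre-processing would require a persistent queue.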

      Once the event system is aligned on Kafka, we should already have most of the infrastructure for the consumer part.
      After all, we will have something that reads from Kafka to start workers: there is no real reason why this could not match the importer/consumer requirements.

      With that in mind, we could consider that importing data into Nuxeo means:

      • writing document data into a set of Kafka queues associated with a specific "import event"
      • letting the default Nuxeo consumer do the job, or contributing a custom one
