[NXP-13136] Fix importer synchronization issues - Nuxeo Issue Tracker

Details

Type: Sub-task
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 5.8.0-HF01, 5.9.1
Component/s: Excel Export

Description

About Readers

The importer framework works with a Reader/Writer pipe pattern.

The default implementation provides several types of Readers:

FileSourceNode:
- simply reads binary files in a folder tree
FileWithMetadataSourceNode:
- reads binary files in a folder tree
- expects metadata to be defined globally in a metadata.properties file in each folder
FileWithIndividualMetadasSourceNode:
- reads binary files in a folder tree
- expects metadata to be defined file by file, in a *.properties file for each file and folder

The two implementations that support metadata also support inheritance: the metadata for a given file is the merge of

metadata at file level
metadata defined in parent folder
metadata defined in grand parent folder
... up to the root
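
The inheritance merge described above can be sketched in a few lines of Java. The class InheritedMetadata and its merge method are hypothetical names, not part of the importer API. The sketch also assumes that the most specific level (the file) overrides values inherited from parent folders, which the description does not state explicitly.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not part of the importer API. */
final class InheritedMetadata {

    /**
     * Merges metadata levels ordered from the root folder down to the file.
     * Assumption: later (more specific) levels override keys set by earlier ones.
     */
    static Map<String, String> merge(List<Map<String, String>> rootToFile) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map<String, String> level : rootToFile) {
            merged.putAll(level);
        }
        return merged;
    }
}
```

With a root folder defining a source and a language, and a file redefining only the language, the merged result keeps the inherited source and takes the language from the file.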

The typical use case is an initial data migration from a shared file system to a Nuxeo repository, where files can be qualified on a per-folder and possibly per-file basis.

Synchronization between threads

The current implementation uses a global shared structure to collect the metadata.

To avoid corruption, this structure uses a ReadWriteLock to prevent concurrent reads and writes.

This means that each thread that reads the children of a folder collects the metadata files and then locks the shared structure, blocking all other threads that want to do the same.

Having a lot of threads, big folders (around 1000 children), and a perfectly balanced tree makes the problem very visible:

the children metadata fetch is long
all threads progress at the same rhythm (same folder size), so they all wait for the same lock at the same time
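
To make the contention pattern concrete, here is a minimal sketch of a shared structure guarded by a ReadWriteLock. SharedMetadataStore, register and lookup are hypothetical names chosen for illustration; the actual importer code may be organized differently. The point is only that every worker must take the same exclusive write lock after listing a folder.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical stand-in for the global shared metadata structure. */
final class SharedMetadataStore {

    private final Map<String, Map<String, String>> byPath = new HashMap<>();
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    /** Called by a worker once it has collected a folder's metadata. */
    void register(Map<String, Map<String, String>> folderMetadata) {
        // Exclusive lock: all other readers and writers wait here.
        lock.writeLock().lock();
        try {
            byPath.putAll(folderMetadata);
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Returns the metadata registered for a path, or null if there is none. */
    Map<String, String> lookup(String path) {
        // Shared lock: readers run concurrently, but never alongside a writer.
        lock.readLock().lock();
        try {
            return byPath.get(path);
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

When all workers finish listing equally sized folders at about the same moment, they all call register together and are served one at a time.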

So far, this had not been detected as a bottleneck:

usually the database is the bottleneck
for a typical filesystem initial import
- folders are much smaller, because humans can hardly manage trees with such big folders
- for most people, an initial data import is fine if it runs at 50 doc/s
the performance tests done on the CI chain use the RandomFileImporter, which does not have this kind of synchronization issue

Sync-free metadata management

On recent hardware (such as SSDs for PGSQL), the database and I/O are no longer the bottleneck.

For testing purposes, we created a sync-free SourceNode that manages metadata but not inheritance.

FileWithNonHeritedIndividalMetaDataSourceNode

Removing inheritance avoids the need for any shared structure, which in turn removes the need for any synchronization.

Removing the synchronization drastically improves performance: from 50 doc/s to 500 doc/s.
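
The idea can be sketched as follows: each node loads only its own properties file and touches no shared state, so no lock is needed. LocalMetadataReader is a hypothetical class, and the sidecar naming convention (file name followed by .properties) is an assumption made for this example, not necessarily what FileWithNonHeritedIndividalMetaDataSourceNode does.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

/** Hypothetical sketch; the sidecar naming convention is assumed. */
final class LocalMetadataReader {

    /**
     * Loads only the file's own sidecar (for example report.pdf.properties).
     * Returns empty Properties when no sidecar exists. No shared state is used.
     */
    static Properties read(Path file) {
        Properties props = new Properties();
        Path sidecar = file.resolveSibling(file.getFileName() + ".properties");
        if (Files.isRegularFile(sidecar)) {
            try (Reader in = Files.newBufferedReader(sidecar, StandardCharsets.UTF_8)) {
                props.load(in);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return props;
    }
}
```

Because the result depends only on the file passed in, any number of threads can call read concurrently without coordination. The trade-off is the one stated above: values defined in parent folders are no longer inherited.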

Lessons learned

The importer also synchronizes on other shared resources:

counters
- local counters are managed thread by thread
- but they are merged into global counters at commit time
- this uses some synchronization
http log
- the http log is a simple buffer of lines accessed by all threads
- there is some synchronization here too

These synchronizations were not a bottleneck at the time of implementation (we ran some tests to verify that), but this may be different now:

because we have done a lot of optimizations in the core and in VCS since then
because the hardware is better (especially more RAM and faster disks)

This means we should review the system to make it as lock-free as possible:

plug the http log into logback?
use metrics for counters?
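
As one possible direction for the counters, here is a minimal sketch of a global counter that needs no explicit lock. It uses the JDK's LongAdder (Java 8 and later) as a stand-in rather than the Metrics library mentioned above; ImportCounters and its methods are hypothetical names.

```java
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical counter holder; names are illustrative only. */
final class ImportCounters {

    // LongAdder spreads updates over internal cells to reduce contention.
    private final LongAdder created = new LongAdder();

    /** Called by any worker thread, without taking a lock. */
    void documentCreated() {
        created.increment();
    }

    /** Sums the internal cells; exact once all writers have finished. */
    long createdCount() {
        return created.sum();
    }
}
```

Workers can then update the global counter directly instead of merging thread-local counters under a lock at commit time. Note that sum() is not an atomic snapshot while updates are still in flight, which is usually acceptable for progress reporting.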

Issue Links

is required by

NXP-13730 Missing Scan Importer Marketplace Package

Resolved

People

Assignee: Thierry Delprat
Reporter: Thierry Delprat
Participants: Jenkins, Thierry Delprat

Dates

Created: 2013-11-08 09:40
Updated: 2014-02-10 15:51
Resolved: 2013-11-17 23:39