- Type: Sub-task
- Status: Resolved
- Priority: Minor
- Resolution: Fixed
- Affects Version/s: None
- Fix Version/s: 5.8.0-HF01, 5.9.1
- Component/s: Excel Export
About Readers
The importer framework works with a Reader/Writer pipe pattern.
The default implementation provides several types of Readers:
- FileSourceNode:
  - simply reads binary files in a folder tree
- FileWithMetadataSourceNode:
  - reads binary files in a folder tree
  - expects metadata to be defined globally in a metadata.properties file in each folder
- FileWithIndividualMetadasSourceNode:
  - reads binary files in a folder tree
  - expects metadata to be defined individually in a *.properties file for each file and folder
The two implementations that support metadata also support inheritance: the metadata for a given file is the merge of
- metadata at file level
- metadata defined in parent folder
- metadata defined in grand parent folder
- ... up to the root
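The merge described above can be sketched as follows. This is a minimal illustration, not the importer's actual code: the class name `InheritedMetadata`, the `merge` method and the property names are hypothetical, and the rule that a more specific level overrides its ancestors is an assumption, since the text only says the levels are merged.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the importer API: it only illustrates the merge order.
final class InheritedMetadata {

    /**
     * Merges metadata levels ordered from the root folder down to the file itself.
     * Assumption: a later (more specific) level overrides the values of its ancestors.
     */
    static Map<String, String> merge(List<Map<String, String>> levelsFromRootToFile) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map<String, String> level : levelsFromRootToFile) {
            merged.putAll(level); // same key: the deeper level wins
        }
        return merged;
    }

    public static void main(String[] args) {
        // Illustrative property names only.
        Map<String, String> root = Map.of("source", "legacy-share");
        Map<String, String> folder = Map.of("subject", "contracts", "source", "hr-share");
        Map<String, String> file = Map.of("title", "Contract 42");
        System.out.println(merge(List.of(root, folder, file)));
    }
}
```

With inheritance, resolving one file requires the metadata of every ancestor folder, which is why some structure holding the folder-level metadata has to be reachable from all importer threads.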
The typical use case is an initial data migration from a shared file system to a Nuxeo repository, where files can be qualified per folder and possibly per file.
Synchronization between threads
The current implementation uses a global shared structure to collect the metadata.
To avoid corruption, this structure is protected by a ReadWriteLock that prevents concurrent reads and writes.
This means that each thread reading the children of a folder collects the metadata files and then locks the shared structure, blocking all other threads that want to do the same.
With a lot of threads, big folders (around 1000 children) and a perfectly balanced tree, the problem becomes very visible:
- fetching the children's metadata takes a long time
- all threads progress at the same rhythm (same folder size), so they all wait for the same lock at the same time
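To make the contention pattern concrete, here is a minimal sketch of a shared structure guarded by a `ReentrantReadWriteLock`. The class `SharedMetadataStore` and its methods are hypothetical stand-ins; the importer's real class and method names may differ.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for the global shared structure described above.
final class SharedMetadataStore {

    private final Map<String, Map<String, String>> byPath = new HashMap<>();
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    /** Called by an importer thread once it has collected a folder's metadata. */
    void register(String path, Map<String, String> metadata) {
        lock.writeLock().lock(); // exclusive: all other readers and writers wait here
        try {
            byPath.put(path, metadata);
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Readers may run concurrently with each other, but never with a writer. */
    Map<String, String> lookup(String path) {
        lock.readLock().lock();
        try {
            return byPath.getOrDefault(path, Map.of());
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

When every thread finishes scanning a folder of the same size at roughly the same moment, all of them reach the write lock together and are served one at a time.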
So far, this had not been detected as a bottleneck:
- usually the database is the bottleneck
- for a typical filesystem initial import:
  - folders are much smaller, because humans can hardly manage trees with such big folders
  - for most people, an initial data import is fine if it runs at 50 doc/s
- the performance tests run on the CI chain use the RandomFileImporter, which does not have this kind of synchronization issue
Sync free meta-data management
On recent hardware (such as SSDs for PGSQL), the database and I/O are no longer the bottleneck.
For testing purposes, we created a sync-free SourceNode that manages metadata but not inheritance:
FileWithNonHeritedIndividalMetaDataSourceNode
Removing inheritance avoids any shared structure, which in turn removes the need for synchronization.
Removing the synchronization drastically improves performance: from 50 doc/s to 500 doc/s.
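A sync-free reader can be sketched as below: each thread loads only the file's own properties and touches no shared state. This is an assumption-laden illustration, not the code of the SourceNode named above; the class `LocalMetadataReader` and the sidecar naming scheme (`<file name>.properties`) are hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

// Hypothetical sketch: the sidecar naming scheme is an assumption.
final class LocalMetadataReader {

    /** Reads only the file's own sidecar: no parent lookup, no shared state, no lock. */
    static Properties read(Path file) throws IOException {
        Properties props = new Properties();
        Path sidecar = file.resolveSibling(file.getFileName() + ".properties");
        if (Files.isRegularFile(sidecar)) {
            try (Reader in = Files.newBufferedReader(sidecar, StandardCharsets.UTF_8)) {
                props.load(in);
            }
        }
        return props; // empty when the file has no sidecar
    }
}
```

Note that in the `java.util.Properties` format a colon is a key/value separator, so a key such as `dc:title` has to be written `dc\:title` in the file.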
Lessons learned
The importer includes additional synchronization on other shared resources:
- counters
  - local counters are managed thread by thread
  - but they are merged into global counters at commit time
  - this uses some synchronization
- http log
  - the http log is a simple buffer of lines accessed by all threads
  - there is some synchronization here too
These synchronization points were not a bottleneck at the time of implementation (we ran some tests to verify that), but this may be different now:
- because we have made a lot of optimizations in the core and in VCS since then
- because the hardware is better (especially more RAM and faster disks)
This means we should review the system to make it as lock-free as possible:
- plug the http log into logback?
- use metrics for counters?
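As one possible direction for the counters, the JDK's `LongAdder` (Java 8 and later) accumulates values without a lock. This is not the metrics library mentioned above, only a sketch of the lock-free idea; the class `ImportCounters` and its method names are hypothetical.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical replacement for a synchronized global counter; not the importer's actual code.
final class ImportCounters {

    private final LongAdder importedDocs = new LongAdder();

    /** Called by any importer thread at commit time; does not block other threads. */
    void addCommitted(long docsInBatch) {
        importedDocs.add(docsInBatch);
    }

    /** Sums the internal cells; intended for reporting, not for coordination. */
    long total() {
        return importedDocs.sum();
    }
}
```

The trade-off is that `sum()` is not an atomic snapshot while updates are in flight, which is usually acceptable for throughput statistics.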
- is required by: NXP-13730 Missing Scan Importer Marketplace Package (Resolved)