- Type: Sub-task
- Status: In Progress
- Priority: Minor
- Resolution: Unresolved
- Affects Version/s: None
- Fix Version/s: QualifiedToSchedule
- Component/s: Importer
Directory Tree and Threading
The default importer is targeting a simple use case: import a complete filesystem tree inside a Nuxeo repository.
On most computers you have several CPUs and several cores: this means you can import more documents per second by using several threads.
However, when importing a tree, threading must be considered carefully:
- Each thread will be associated with a Transaction (remember we import several documents before doing a commit),
- Each transaction is isolated from others (MVCC mode).
This means that a new thread should be created only when a new branch becomes accessible in the source filesystem. At least, that is what the default ImporterThreadingPolicy (DefaultMultiThreadingPolicy) does.
As a result, if you import a big folder with a flat structure, you will have only one importer thread, even if the configuration allows more.
To be sure to leverage multi-threading, you can either:
- ensure the source filesystem is a tree with at least two levels,
- change the importer threading policy.
Flat folder importer
To make an efficient flat folder importer, we need to change the way the importer walks the filesystem and allocates threads.
The target structure should be something like:
- 1 reader thread
  - does the getChildren in a lazy way (the default File.listFiles won't work)
  - maybe uses the Java 7 NIO tree walker
  - pushes files (or just paths) to be imported into a queue
- 1 queue
  - stores the paths of the files to be imported
- a ThreadPool with n threads that
  - consume the queue
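The structure above could be sketched roughly as follows. This is only an illustration in plain Java, not the importer's actual code: the class FlatFolderImportSketch, the methods importFlatFolder and importFile, the queue capacity and the END marker are all hypothetical names and values. It assumes Files.newDirectoryStream is lazy enough for the listing, and it uses the calling thread as the single reader.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: none of these names come from the Nuxeo importer API.
class FlatFolderImportSketch {

    // Marker object telling a consumer that no more paths will arrive.
    private static final Path END = Paths.get("");

    // Placeholder for the real work (create the document, commit every N documents).
    static void importFile(Path file) {
        // intentionally empty in this sketch
    }

    // Reads one flat folder lazily and imports its files with nbThreads consumers.
    // Returns the number of files handed to importFile.
    static int importFlatFolder(Path folder, int nbThreads)
            throws IOException, InterruptedException {
        // Bounded queue: the reader blocks when the consumers fall behind.
        BlockingQueue<Path> queue = new ArrayBlockingQueue<>(1000);
        AtomicInteger imported = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(nbThreads);

        // n consumer threads: each one takes paths until it sees the END marker.
        for (int i = 0; i < nbThreads; i++) {
            pool.execute(() -> {
                try {
                    for (Path p = queue.take(); p != END; p = queue.take()) {
                        importFile(p);
                        imported.incrementAndGet();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        // 1 reader (the calling thread): iterate the folder entry by entry
        // instead of loading the whole listing into an array first.
        try (DirectoryStream<Path> children = Files.newDirectoryStream(folder)) {
            for (Path child : children) {
                if (Files.isRegularFile(child)) {
                    queue.put(child);
                }
            }
        } finally {
            // One END marker per consumer so that every thread terminates.
            for (int i = 0; i < nbThreads; i++) {
                queue.put(END);
            }
            pool.shutdown();
        }
        pool.awaitTermination(1, TimeUnit.HOURS);
        return imported.get();
    }

    public static void main(String[] args) throws Exception {
        int count = importFlatFolder(Paths.get(args[0]), 4);
        System.out.println("Imported " + count + " files");
    }
}
```

With this layout the number of importer threads no longer depends on the shape of the source tree. Each consumer would still own its own transaction, as in the tree-based importer.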