
    Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.8.0-HF01, 5.9.1
    • Component/s: Excel Export

      Description

      About Readers

      The importer framework works with a Reader/Writer pipe pattern.

      The default implementation provides several types of Readers (a simplified sketch follows the list):

      • FileSourceNode:
        • simply reads binary files in a folder tree
      • FileWithMetadataSourceNode:
        • reads binary files in a folder tree
        • expects metadata to be defined globally in a metadata.properties file in each folder
      • FileWithIndividualMetadasSourceNode:
        • reads binary files in a folder tree
        • expects metadata to be defined file by file in a *.properties file for each file and folder
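
      As a rough illustration (not the actual Nuxeo SourceNode API, which carries more methods), a plain-file Reader can be sketched as a node that simply walks a folder tree:

      import java.io.File;
      import java.util.ArrayList;
      import java.util.List;

      interface SimpleSourceNode {
          String getName();
          boolean isFolderish();
          List<SimpleSourceNode> getChildren();
      }

      // Reads binary files from a folder tree, in the spirit of FileSourceNode.
      class SimpleFileSourceNode implements SimpleSourceNode {
          private final File file;

          SimpleFileSourceNode(File file) {
              this.file = file;
          }

          public String getName() {
              return file.getName();
          }

          public boolean isFolderish() {
              return file.isDirectory();
          }

          public List<SimpleSourceNode> getChildren() {
              List<SimpleSourceNode> children = new ArrayList<>();
              File[] listed = file.listFiles();
              if (listed != null) {
                  for (File child : listed) {
                      children.add(new SimpleFileSourceNode(child));
                  }
              }
              return children;
          }
      }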

      The two implementations that support metadata also support inheritance: this means that the metadata for a given file will be the merge of (see the sketch after this list):

      • metadata at the file level
      • metadata defined in the parent folder
      • metadata defined in the grand-parent folder
      • ... up to the root
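
      A minimal sketch of this merge, assuming metadata is represented as simple key/value maps (the name mergeInheritedMetadata is illustrative, not the importer's actual API):

      import java.util.LinkedHashMap;
      import java.util.List;
      import java.util.Map;

      class MetadataMerge {
          // Levels are passed from the root folder down to the file: later,
          // more specific entries override the ones inherited from ancestors.
          static Map<String, String> mergeInheritedMetadata(List<Map<String, String>> rootToFile) {
              Map<String, String> merged = new LinkedHashMap<>();
              for (Map<String, String> level : rootToFile) {
                  merged.putAll(level);
              }
              return merged;
          }
      }

      Calling it with the chain root folder, then intermediate folders, then the file itself lets the most specific level win on conflicting keys.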

      The typical use case is an initial data migration from a shared filesystem to a Nuxeo repository, where files can be qualified on a per-folder and, optionally, a per-file basis.

      Synchronization between threads

      The current implementation uses a global shared structure to collect the metadata.

      To avoid corruption, this structure uses a ReadWriteLock to prevent concurrent reads and writes.

      This means that each thread that reads the children collects the metadata files and then locks the shared structure, blocking all other threads that want to do the same.
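
      A simplified sketch of the pattern (illustrative only, not the actual importer code): a shared map guarded by a ReentrantReadWriteLock, where every write blocks all other readers and writers:

      import java.util.HashMap;
      import java.util.Map;
      import java.util.concurrent.locks.ReadWriteLock;
      import java.util.concurrent.locks.ReentrantReadWriteLock;

      class SharedMetadataCache {
          private final Map<String, Map<String, String>> metadataByPath = new HashMap<>();
          private final ReadWriteLock lock = new ReentrantReadWriteLock();

          Map<String, String> get(String path) {
              lock.readLock().lock();
              try {
                  return metadataByPath.get(path);
              } finally {
                  lock.readLock().unlock();
              }
          }

          void put(String path, Map<String, String> metadata) {
              lock.writeLock().lock();   // all other readers and writers wait here
              try {
                  metadataByPath.put(path, metadata);
              } finally {
                  lock.writeLock().unlock();
              }
          }
      }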

      Having a lot of threads, big folders (like 1000 children), and a perfectly balanced tree makes the problem very visible:

      • the children metadata fetch is long
      • all threads progress at the same rhythm (same folder size), so they all wait for the same lock at the same time

      So far, this had not been identified as a bottleneck:

      • usually the database is the bottleneck
      • for a typical filesystem initial import
        • folders are much smaller: humans can hardly manage trees with such big folders
        • for most people, an initial data import is fine if it runs at 50 doc/s
      • the performance tests done on the CI chain use the RandomFileImporter, which does not have this kind of synchronization issue

      Sync-free metadata management

      On recent hardware (like an SSD for PGSQL), the database and I/O are no longer the bottleneck.

      For testing purposes, we created a sync-free SourceNode that manages metadata but not inheritance:

      FileWithNonHeritedIndividalMetaDataSourceNode

      Removing inheritance avoids any shared structure, which in turn makes it possible to remove all synchronization.

      Removing the sync drastically improves performance: going from 50 doc/s to 500 doc/s.
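
      A hedged sketch of what sync-free loading can look like: without inheritance, each node only needs its own sidecar .properties file, so there is no shared structure and nothing to lock (the class name and the "<file>.properties" naming convention below are illustrative assumptions, not the actual implementation):

      import java.io.FileInputStream;
      import java.io.IOException;
      import java.io.InputStream;
      import java.nio.file.Path;
      import java.util.Properties;

      class NonInheritedMetadataLoader {
          // Loads "<file>.properties" next to the binary file; returns empty
          // properties when the sidecar file does not exist.
          static Properties loadFor(Path binaryFile) {
              Properties props = new Properties();
              Path sidecar = binaryFile.resolveSibling(binaryFile.getFileName() + ".properties");
              if (sidecar.toFile().exists()) {
                  try (InputStream in = new FileInputStream(sidecar.toFile())) {
                      props.load(in);
                  } catch (IOException e) {
                      throw new RuntimeException("Cannot read metadata for " + binaryFile, e);
                  }
              }
              return props;
          }
      }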

      Lessons learned

      The importer includes additional synchronization on other shared resources:

      • counters
        • local counters are managed thread by thread
        • but they are merged into global counters at commit time
        • this uses some synchronization (see the sketch after this list)
      • http log
        • the http log is a simple buffer of lines accessed by all threads
        • there is some synchronization here too
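
      A simplified sketch of the counter pattern described above, with thread-local accumulation and a synchronized merge at commit time (illustrative only, not the importer's actual classes):

      import java.util.HashMap;
      import java.util.Map;

      class ImportCounters {
          private static final Map<String, Long> GLOBAL = new HashMap<>();

          private final Map<String, Long> local = new HashMap<>();

          void increment(String name) {
              local.merge(name, 1L, Long::sum);   // thread-local, no lock needed
          }

          void commit() {
              synchronized (GLOBAL) {              // the synchronization point
                  for (Map.Entry<String, Long> e : local.entrySet()) {
                      GLOBAL.merge(e.getKey(), e.getValue(), Long::sum);
                  }
              }
              local.clear();
          }
      }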

      These syncs were not a bottleneck at the time of implementation (at least we did some tests to verify that), but this may be different now:

      • because we did a lot of optimizations in the core and in VCS since then
      • because the hardware is better (especially more RAM and faster disks)

      This means we should review the system to make it as lock-free as possible:

      • plug the http log into logback?
      • use metrics for counters? (see the lock-free sketch below)
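
      A hedged sketch of what a lock-free replacement for the counters could look like, using the JDK's LongAdder (which metric libraries such as Dropwizard Metrics also build on):

      import java.util.concurrent.ConcurrentHashMap;
      import java.util.concurrent.ConcurrentMap;
      import java.util.concurrent.atomic.LongAdder;

      class LockFreeCounters {
          private final ConcurrentMap<String, LongAdder> counters = new ConcurrentHashMap<>();

          void increment(String name) {
              // concurrent increments never contend on a shared lock
              counters.computeIfAbsent(name, k -> new LongAdder()).increment();
          }

          long value(String name) {
              LongAdder adder = counters.get(name);
              return adder == null ? 0L : adder.sum();
          }
      }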
