-
Type: Task
-
Status: Resolved
-
Priority: Major
-
Resolution: Fixed
-
Affects Version/s: None
-
Fix Version/s: 8.3
-
Component/s: Nuxeo Drive
-
Impact type: API change, Configuration Change
-
Upgrade notes:
-
Sprint: nxfit 8.3.2
-
Story Points: 5
What this is about
What we call the "remote scan" in Nuxeo Drive is the process of scanning a synchronization root (or part of it) on the Nuxeo server to retrieve the tree structure (metadata of the folders and files) so that Drive can synchronize it locally.
This is basically used in two cases:
- When connecting Drive for the first time to a Nuxeo instance with some synchronized folders (or after resetting the Nuxeo Drive local storage contained in the .nuxeo-drive folder).
- When detecting a new folder in the change summary returned by the remote polling performed every 30 seconds.
Technically, Drive populates its local database with the results of the remote scan before launching the synchronization threads that create the folders and download the files to the file system.
Issues with the previous implementation (Nuxeo < 8.3)
This naive implementation had two main drawbacks:
- It generated too many requests because of the basic algorithm, recursive calls to NuxeoDrive.GetChildren (sketched below), which overloaded the server and saturated the bandwidth.
- It could not fetch more than 1000 children of a given folder because it only fetched the current page of the FOLDER_ITEM_CHILDREN page provider with pageSize=1000. This parameter could be increased, but that would not scale to a large number of children.
So at this point, Nuxeo Drive:
- Could not scale to a large volume of documents.
- Might miss some documents!
- Would take a long time for the first synchronization of a big tree.
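For illustration, here is a minimal Python sketch of that recursive algorithm. It is not the actual Drive code: execute stands for a hypothetical Automation client call and db.insert_remote_state for a hypothetical local-database helper; only the operation name NuxeoDrive.GetChildren comes from the description above.

```python
def naive_remote_scan(execute, folder_id, db):
    """Pre-8.3 style scan: one NuxeoDrive.GetChildren call per folder."""
    # Only the current page of the FOLDER_ITEM_CHILDREN page provider is
    # returned, so at most pageSize (1000) children are ever seen here.
    children = execute("NuxeoDrive.GetChildren", id=folder_id)
    for child in children:
        db.insert_remote_state(child)  # populate Drive's local database
        if child["folder"]:
            # Recurse: one more HTTP request per descendant folder.
            naive_remote_scan(execute, child["id"], db)
```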
What has been improved in Nuxeo 8.3
We are now using a "scroll" API to fetch the descendants of a synchronization root in batches.
This solution:
- Scales to a large volume of documents without increasing the server load.
- Removes the "1000 children" limitation.
- Makes the first synchronization faster.
How does it work
This approach is inspired by the principle of the Elasticsearch Scroll API:
- A first query is performed to get the full list of descendant ids, simply ordered by ecm:uuid, and put it in a cache. It uses queryAndFetch, so it is quite cheap.
- The client performs a query to fetch the next batch of descendants with a given size and repeats this as long as some results are returned (see the sketch below). This request only needs to pop the wanted number of ids from the cached list and load the documents from VCS.
Note that in the VCS implementation (see DocumentBackedFolderItem#getScrollBatch) the documents are sorted by id and not by path, which adds some complexity on the client side to handle documents that are not sorted by path.
The Elasticsearch implementation, however, returns documents sorted by path and will always be used on a synchronization root, see NXP-19586.
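To make the batching concrete, here is a minimal Python sketch of the client-side loop, in the spirit of DocumentBackedFolderItem#scrollDescendants mentioned in the linked issues. The operation name NuxeoDrive.ScrollDescendants, its parameters and the shape of its result are assumptions for illustration, not the exact API; execute and db.insert_remote_state are the same hypothetical helpers as above.

```python
def scroll_remote_scan(execute, root_id, db, batch_size=100):
    """Fetch all descendants of a synchronization root batch by batch."""
    scroll_id = None
    while True:
        # Each call pops `batch_size` ids from the server-side cached list
        # (built by the initial ecm:uuid-ordered query) and loads the documents.
        result = execute(
            "NuxeoDrive.ScrollDescendants",  # assumed operation name
            id=root_id,
            batchSize=batch_size,
            scrollId=scroll_id,
        )
        descendants = result.get("fileSystemItems", [])  # assumed result shape
        if not descendants:
            break  # the cached id list is exhausted
        scroll_id = result["scrollId"]
        for item in descendants:
            db.insert_remote_state(item)  # with VCS, a child may arrive before its parent
```

With the VCS backend the batches are ordered by id, so this loop may see a document before its parent folder and has to account for that when inserting rows; the Elasticsearch-based scan returns them sorted by path, which avoids the issue.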
- depends on
-
NXP-19388 Setup a bench for the Nuxeo Drive optimized remote scan
- Resolved
-
NXP-19191 Drive: Prototype optimization of remote scan execution
- Resolved
-
NXP-19209 Setup a bench for Nuxeo Drive remote scan
- Resolved
- is required by
-
NXP-20338 Infinite loop in Drive remote scan of a non synchronization root with Redis enabled
- Resolved
-
NXDRIVE-722 Fix Drive synchronization when removing then adding back a filter.
- Resolved
-
NXP-19586 Drive: Implement Elasticsearch based batched remote scan
- Resolved
-
NXP-19659 Limit number of threads calling DocumentBackedFolderItem#scrollDescendants
- Resolved
-
NXP-19443 Drive: optimize ScrollDescendants operation by avoiding DocumentModel loading
- Open
-
NXDRIVE-307 Remote scan: use new API to fetch the descendants of a folder
- Resolved
-
NXDRIVE-441 Test improved Remote Scan execution with volume
- Resolved