[NXP-19482] Drive: Optimize remote scan execution by using a scroll API - Nuxeo Issue Tracker

Details

Type: Task
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 8.3
Component/s: Nuxeo Drive

Tags:
- nxfitcurrent
- release_notes_to_add
Impact type:

API change, Configuration Change
Upgrade notes:

Added:

FolderItem#getCanScrollDescendants()

FolderItem#scrollDescendants(String scrollId, int batchSize, long keepAlive)

FileSystemItemManager#scrollDescendants(String id, Principal principal, String scrollId, int batchSize)

The NuxeoDrive.ScrollDescendants Automation operation

The maximum batch size is 1000 by default. To increase it, set the "org.nuxeo.drive.maxDescendantsBatchSize" configuration property.
Sprint:
nxfit 8.3.2
Story Points:
5

Description

What this is about

What we call the "remote scan" in Nuxeo Drive is the scan of a synchronization root (or part of it) on the Nuxeo server to retrieve its tree structure (the metadata of its folders and files) so that Drive can synchronize it locally.
It is used in two cases:

When connecting Drive for the first time to a Nuxeo instance with some synchronized folders (or after resetting the Nuxeo Drive local storage contained in the .nuxeo-drive folder).
When detecting a new folder in the change summary returned by the remote polling performed every 30 seconds.

Technically, Drive populates its local database with the results of the remote scan before launching the synchronization threads to process the creation of the folders and download the files to the file system.

Issues with the previous implementation (Nuxeo < 8.3)

This naive implementation had two main drawbacks:

It generated too many requests because of its basic algorithm, recursive calls to NuxeoDrive.GetChildren, which overloaded the server and saturated the bandwidth.
It could not fetch more than 1000 children of a given folder, because it only fetched the current page of the FOLDER_ITEM_CHILDREN page provider with pageSize=1000. This parameter could be increased, but that would not scale for a large number of children.
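
As an illustration of these two drawbacks, here is a minimal in-memory sketch, not the actual Drive or server code. The class RecursiveScanSketch, its Node type and its counters are hypothetical names; each call to scan() stands for one NuxeoDrive.GetChildren request, and only the first page of 1000 children is read, as described above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the pre-8.3 recursive scan; not Nuxeo code.
class RecursiveScanSketch {

    // Mirrors the pageSize=1000 mentioned above.
    static final int PAGE_SIZE = 1000;

    /** A folder or file in a simulated remote tree. */
    static final class Node {
        final boolean folder;
        final List<Node> children = new ArrayList<>();

        Node(boolean folder) {
            this.folder = folder;
        }
    }

    int requests = 0; // simulated GetChildren calls
    int fetched = 0;  // children actually returned to the client

    /** One simulated request per folder; only the first page is read. */
    void scan(Node folder) {
        requests++;
        int limit = Math.min(PAGE_SIZE, folder.children.size());
        for (int i = 0; i < limit; i++) {
            Node child = folder.children.get(i);
            fetched++;
            if (child.folder) {
                scan(child); // recursion: one more request per subfolder
            }
        }
    }

    public static void main(String[] args) {
        Node root = new Node(true);
        for (int i = 0; i < 1500; i++) {
            root.children.add(new Node(false));
        }
        RecursiveScanSketch sketch = new RecursiveScanSketch();
        sketch.scan(root);
        // 1 request, 1000 children fetched: the last 500 files are missed.
        System.out.println(sketch.requests + " request(s), " + sketch.fetched + " fetched");
    }
}
```

In this model the number of requests grows with the number of folders, and any folder holding more than 1000 children is silently truncated.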

So at this point, Nuxeo Drive:

Could not scale on a large volume of documents.
Might miss some documents!
Would take a long time for the first synchronization of a big tree.

What has been improved in Nuxeo 8.3

We are now using a "scroll" API to fetch the descendants of a synchronization root by batch.

This solution:

Scales to a large volume of documents without increasing the server load.
Removes the "1000 children" limitation.
Makes the first synchronization faster.

How it works

This approach is inspired by the principle of the Elasticsearch Scroll API:

A first query retrieves the full list of descendant ids, simply ordered by ecm:uuid, and puts it in a cache. It uses queryAndFetch, so it is quite cheap.
The client then queries for the next batch of descendants of a given size, and repeats this as long as some results are returned. Each such request only needs to pop the requested number of ids from the cached list and load the corresponding documents from VCS.
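
The two steps above can be sketched roughly as follows. This is a simplified in-memory model, not the Nuxeo implementation: the class DescendantScrollSketch, its Batch type and the fetchAll helper are hypothetical names, a plain map stands in for the cache, sorting strings stands in for ordering by ecm:uuid, and loading the documents is left out. It also assumes that the first call (null scroll id) already returns a first batch.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

// Hypothetical model of the scroll principle; not Nuxeo code.
class DescendantScrollSketch {

    /** One batch: the scroll id to send back, plus the ids of this batch. */
    static final class Batch {
        final String scrollId;
        final List<String> ids;

        Batch(String scrollId, List<String> ids) {
            this.scrollId = scrollId;
            this.ids = ids;
        }
    }

    // scrollId -> descendant ids not yet returned, in sorted order
    private final Map<String, Deque<String>> cache = new HashMap<>();

    /**
     * With a null scrollId: sort all descendant ids and cache them.
     * In every case: pop up to batchSize ids from the cached list.
     */
    Batch scroll(List<String> descendantIds, String scrollId, int batchSize) {
        if (scrollId == null) {
            List<String> sorted = new ArrayList<>(descendantIds);
            Collections.sort(sorted);
            scrollId = UUID.randomUUID().toString();
            cache.put(scrollId, new ArrayDeque<>(sorted));
        }
        Deque<String> remaining = cache.get(scrollId);
        List<String> batch = new ArrayList<>();
        while (remaining != null && !remaining.isEmpty() && batch.size() < batchSize) {
            batch.add(remaining.poll());
        }
        if (remaining != null && remaining.isEmpty()) {
            cache.remove(scrollId); // nothing left: drop the cache entry
        }
        return new Batch(scrollId, batch);
    }

    /** Client side: keep scrolling as long as some results are returned. */
    List<String> fetchAll(List<String> descendantIds, int batchSize) {
        List<String> all = new ArrayList<>();
        Batch batch = scroll(descendantIds, null, batchSize);
        while (!batch.ids.isEmpty()) {
            all.addAll(batch.ids);
            batch = scroll(descendantIds, batch.scrollId, batchSize);
        }
        return all;
    }

    public static void main(String[] args) {
        DescendantScrollSketch sketch = new DescendantScrollSketch();
        // Batches of 2 over 5 ids: [a, b], [c, d], [e], then an empty batch.
        System.out.println(sketch.fetchAll(Arrays.asList("c", "a", "e", "b", "d"), 2));
    }
}
```

In this model the number of requests depends only on the number of descendants and the batch size, not on the shape of the tree.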

Note that the VCS implementation (see DocumentBackedFolderItem#getScrollBatch) returns documents sorted by id and not by path, which adds some complexity on the client side to handle documents not sorted by path.
However, the Elasticsearch implementation returns documents sorted by path and will always be used on a synchronization root, see NXP-19586.

Issue Links

depends on

NXP-19388 Setup a bench for the Nuxeo Drive optimized remote scan

Resolved

NXP-19191 Drive: Prototype optimization of remote scan execution

Resolved

NXP-19209 Setup a bench for Nuxeo Drive remote scan

Resolved

is required by

NXP-20338 Infinite loop in Drive remote scan of a non synchronization root with Redis enabled

Resolved

NXDRIVE-722 Fix Drive synchronization when removing then adding back a filter.

Resolved

NXP-19586 Drive: Implement Elasticsearch based batched remote scan

Resolved

NXP-19659 Limit number of threads calling DocumentBackedFolderItem#scrollDescendants

Resolved

NXP-19443 Drive: optimize ScrollDescendants operation by avoiding DocumentModel loading

Open

NXDRIVE-307 Remote scan: use new API to fetch the descendants of a folder

Resolved

NXDRIVE-441 Test improved Remote Scan execution with volume

Resolved

People

Assignee:

Antoine Taillefer

Reporter:

Antoine Taillefer

Participants:

Antoine Taillefer, Jenkins

Votes:

0

Watchers:

3

Dates

Created:

2016-04-14 13:46

Updated:

2017-01-25 10:54

Resolved:

2016-04-28 14:29