  1. Nuxeo Platform
  2. NXP-19482

Drive: Optimize remote scan execution by using a scroll API

    Details

    • Type: Task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 8.3
    • Component/s: Nuxeo Drive
    • Impact type:
      API change, Configuration Change
    • Upgrade notes:
      Added:

      • FolderItem#getCanScrollDescendants()
      • FolderItem#scrollDescendants(String scrollId, int batchSize, long keepAlive)
      • FileSystemItemManager#scrollDescendants(String id, Principal principal, String scrollId, int batchSize)
      • The NuxeoDrive.ScrollDescendants Automation operation

      To increase the maximum batch size, which defaults to 1000, set the "org.nuxeo.drive.maxDescendantsBatchSize" configuration property.

    • Sprint:
      nxfit 8.3.2
    • Story Points:
      5

      Description

      What this is about

      In Nuxeo Drive, the "remote scan" is the process of scanning a synchronization root (or part of it) on the Nuxeo server to retrieve its tree structure (the metadata of the folders and files) so that Drive can synchronize it locally.
      It is used in two cases:

      • When connecting Drive for the first time to a Nuxeo instance with some synchronized folders (or after resetting the Nuxeo Drive local storage contained in the .nuxeo-drive folder).
      • When detecting a new folder in the change summary returned by the remote polling performed every 30 seconds.

      Technically, Drive populates its local database with the results of the remote scan before launching the synchronization threads to process the creation of the folders and download the files to the file system.

      Issues with the previous implementation (Nuxeo < 8.3)

      The previous implementation was naive and had two main drawbacks:

      • It generated too many requests because of its basic algorithm, recursive calls to NuxeoDrive.GetChildren, which overloaded the server and saturated the bandwidth.
      • It could not fetch more than 1000 children of a given folder, because it only fetched the current page of the FOLDER_ITEM_CHILDREN page provider with pageSize=1000. This parameter could be increased, but that would not scale to a large number of children.

      So at this point, Nuxeo Drive:

      • Could not scale on a large volume of documents.
      • Might miss some documents!
      • Would take a long time for the first synchronization of a big tree.
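
To make the two drawbacks concrete, here is a minimal Python sketch of a recursive scan over an in-memory tree. It is only an illustration: get_children is a hypothetical stand-in for one NuxeoDrive.GetChildren request, and the tree layout and names are assumptions, not Drive's actual client code.

```python
from typing import Dict, List, Tuple

PAGE_SIZE = 1000  # page size of the children page provider, per the description


def get_children(tree: Dict[str, List[str]], folder_id: str,
                 page_size: int = PAGE_SIZE) -> List[str]:
    """Hypothetical stand-in for one GetChildren request: first page only."""
    return tree.get(folder_id, [])[:page_size]


def recursive_scan(tree: Dict[str, List[str]], root_id: str,
                   page_size: int = PAGE_SIZE) -> Tuple[List[str], int]:
    """Scan with one request per folder; returns (found ids, request count)."""
    found: List[str] = []
    requests = 0
    stack = [root_id]  # explicit stack instead of recursion, same traversal
    while stack:
        folder_id = stack.pop()
        requests += 1
        for child_id in get_children(tree, folder_id, page_size):
            found.append(child_id)
            if child_id in tree:  # in this toy model, keys of `tree` are folders
                stack.append(child_id)
    return found, requests


# A root with two folders, one of which holds 1500 files.
tree = {
    "root": ["a", "b"],
    "a": [f"a{i}" for i in range(1500)],
    "b": [],
}
found, requests = recursive_scan(tree, "root")
total = sum(len(children) for children in tree.values())
print(f"{requests} requests, {len(found)} of {total} descendants found")
```

In this toy tree the scan issues one request per folder, and the 500 files beyond the first page of folder "a" are never seen, which is the "might miss some documents" case.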

      What has been improved in Nuxeo 8.3

      We are now using a "scroll" API to fetch the descendants of a synchronization root in batches.

      This solution:

      • Scales to a large volume of documents without increasing the server load.
      • Removes the "1000 children" limitation.
      • Makes the first synchronization faster.

      How it works

      This approach is inspired by the principle of the Elasticsearch Scroll API:

      • A first query retrieves the full list of descendant ids, simply ordered by ecm:uuid, and puts it in a cache. It uses queryAndFetch, so it is quite cheap.
      • The client then requests the next batch of descendants of a given size and repeats this as long as results are returned. Each request only needs to pop the requested number of ids from the cached list and load the corresponding documents from VCS.
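
The two steps above can be sketched with a toy in-memory model. This is not the Nuxeo implementation: the ScrollServer class, its method names and the choice to return the first batch from the call that creates the scroll are assumptions made for illustration, and the keepAlive parameter is left out.

```python
import uuid
from typing import Dict, Iterable, List, Optional, Tuple


class ScrollServer:
    """Toy model of the server side: caches sorted ids, pops them by batch."""

    def __init__(self, descendant_ids: Iterable[str]) -> None:
        self._descendant_ids = list(descendant_ids)
        self._cache: Dict[str, List[str]] = {}
        self.requests = 0

    def scroll_descendants(self, scroll_id: Optional[str],
                           batch_size: int) -> Tuple[str, List[str]]:
        self.requests += 1
        if scroll_id is None:
            # First call: one query for all descendant ids, ordered by id, cached.
            scroll_id = uuid.uuid4().hex
            self._cache[scroll_id] = sorted(self._descendant_ids)
        remaining = self._cache.get(scroll_id)
        if not remaining:
            # Unknown or exhausted scroll: drop it and return an empty batch.
            self._cache.pop(scroll_id, None)
            return scroll_id, []
        # Pop the requested number of ids from the cached list.
        self._cache[scroll_id] = remaining[batch_size:]
        return scroll_id, remaining[:batch_size]


def remote_scan(server: ScrollServer, batch_size: int) -> List[str]:
    """Client loop: keep scrolling as long as some results are returned."""
    descendants: List[str] = []
    scroll_id: Optional[str] = None
    while True:
        scroll_id, batch = server.scroll_descendants(scroll_id, batch_size)
        if not batch:
            break
        descendants.extend(batch)
    return descendants


server = ScrollServer(f"doc-{i:04d}" for i in range(2500))
descendants = remote_scan(server, batch_size=1000)
print(len(descendants), "descendants fetched in", server.requests, "requests")
```

In this model the number of requests depends only on the total number of descendants and the batch size, not on the number of folders in the tree.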

      Note that in the VCS implementation (see DocumentBackedFolderItem#getScrollBatch), the documents are sorted by id and not by path, which adds some complexity on the client side.
      The Elasticsearch implementation, however, returns documents sorted by path and will always be used on a synchronization root; see NXP-19586.
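
As an illustration of the client-side complexity mentioned above, here is one possible way to cope with descendants that arrive sorted by id rather than by path: items whose parent has not been seen yet are buffered and resolved once the parent shows up. This is a hypothetical sketch, not what the Drive client actually does; the (id, parent_id, name) item shape and the resolve_paths function are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Item = Tuple[str, str, str]  # (id, parent_id, name): hypothetical item shape


def resolve_paths(root_id: str, root_path: str,
                  items: Iterable[Item]) -> Dict[str, str]:
    """Compute a path for every item, whatever order the items arrive in."""
    paths: Dict[str, str] = {root_id: root_path}
    orphans: Dict[str, List[Item]] = defaultdict(list)  # parent_id -> waiting items
    for item in items:
        pending = [item]
        while pending:
            item_id, parent_id, name = pending.pop()
            if parent_id in paths:
                paths[item_id] = paths[parent_id] + "/" + name
                # This item may be the parent some buffered items were waiting for.
                pending.extend(orphans.pop(item_id, []))
            else:
                orphans[parent_id].append((item_id, parent_id, name))
    return paths


# Sorted by id, but the file arrives before its parent folders.
items = [
    ("1", "3", "file.txt"),
    ("2", "root", "docs"),
    ("3", "2", "sub"),
]
for item_id, path in sorted(resolve_paths("root", "/root", items).items()):
    print(item_id, path)
```

When the batches are already sorted by path, every parent precedes its children and the buffer stays empty.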
