Nuxeo Platform / NXP-28565

Make orphan binaries GC scalable

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2023.0, 2021.38
    • Component/s: Core, Rest API
    • Release Notes Summary:
      A new Full Garbage Collector is available to clean up orphaned document blobs. It is exposed in the management REST API.
    • Release Notes Description:
      This Full GC implementation leverages the work done for NXP-31594, which contains detailed release notes about limitations.

      See the documentation for more details.

      Note that this process is only available on instances that use both:

      • repositories having the ecm:blobKeys capability (introduced by NXP-29516)
        and
      • blob providers extending BlobStoreBlobProvider, such as S3BlobProvider and LocalBlobProvider (in 2021, the default blob provider is org.nuxeo.ecm.core.blob.binary.DefaultBinaryManager, which is not supported; see NXP-31876)
    • Team:
      PLATFORM
    • Sprint:
      nxplatform #87
    • Story Points:
      5

      Description

      The orphan binaries GC can take a long time (days) when the Nuxeo Platform is storing a large number of binaries (millions) in its file storage.

      Even though the orphan binaries GC is considered a maintenance operation, some users need to run it often, e.g. when the Nuxeo Platform is used as temporary storage and therefore is not provisioned with a large file storage.

      Several improvements can be provided depending on the user's need:

      • split the initial list of digests retrieved from the database and process each part in its own thread,
      • make the orphan binaries GC synchronous (see NXP-28523), which is interesting when the Nuxeo Platform is used as temporary storage,
      • maintain a reverse index referencing the document(s) that use each binary.
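
The first idea in the list above can be sketched as follows. This is a minimal illustration in Python, not Nuxeo code: the names `split`, `mark_chunk` and `gc` are hypothetical, the binary store is modelled as a plain dict keyed by digest, and "marking" is reduced to a set membership check, whereas a real store would do I/O per digest.

```python
from concurrent.futures import ThreadPoolExecutor


def split(items, parts):
    """Split a list into at most `parts` contiguous chunks of near-equal size."""
    size, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        if end > start:  # skip empty chunks when parts > len(items)
            chunks.append(items[start:end])
        start = end
    return chunks


def mark_chunk(chunk, store):
    """Mark phase for one part: keep the digests that exist in the store."""
    return {digest for digest in chunk if digest in store}


def gc(store, referenced_digests, parts=4):
    """Mark referenced digests in parallel, then sweep the unmarked ones.

    `store` is a hypothetical in-memory stand-in for the binary storage.
    Returns the sorted list of digests that were removed.
    """
    chunks = split(list(referenced_digests), parts)
    with ThreadPoolExecutor(max_workers=max(1, len(chunks))) as pool:
        marked = set().union(*pool.map(lambda c: mark_chunk(c, store), chunks))
    orphans = sorted(d for d in store if d not in marked)
    for digest in orphans:
        del store[digest]  # sweep phase
    return orphans
```

Each worker only reads the store and returns its own set, so no locking is needed; the results are merged once all parts are done, and the sweep runs single-threaded afterwards.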

      EDIT

      With NXP-31737, we'll be able to scroll the blob stores of each provider of each repository. Adding a Bulk Action Framework (BAF) action on top of this, leveraging the new APIs available since NXP-31594, will offer a scalable and resilient Full GC.
      For the record, the `defaultConcurrency` and `defaultPartitions` of the BAF processor can be customized to speed up the Full GC process.
      Note that instances populated with data prior to NXP-29516 will need NXP-30070 (i.e. the ecm:blobKeys capability).
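
To illustrate how scrolling, partitions and concurrency interact, here is a rough Python sketch. It is not the BAF API: `scroll`, `partition_of` and `full_gc` are hypothetical names, the blob store is a dict, and the set of referenced keys stands in for an ecm:blobKeys lookup. It also assumes, as is common for partitioned stream processing, that a partition is handled by at most one worker at a time, so raising concurrency above the number of partitions brings no gain.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def scroll(keys, batch_size):
    """Yield blob keys in fixed-size batches, like a paginated scroll."""
    keys = list(keys)
    for i in range(0, len(keys), batch_size):
        yield keys[i:i + batch_size]


def partition_of(key, partitions):
    """Route a key to a partition with a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % partitions


def full_gc(blob_store, referenced_keys, partitions=4, concurrency=2, batch_size=2):
    """Scroll the store, spread keys over partitions, delete unreferenced blobs.

    `blob_store` (dict) and `referenced_keys` are in-memory stand-ins
    for a blob provider and the repository's blob key index.
    """
    referenced = set(referenced_keys)
    buckets = [[] for _ in range(partitions)]
    for batch in scroll(blob_store, batch_size):
        for key in batch:
            buckets[partition_of(key, partitions)].append(key)

    def process(bucket):
        # A blob is orphaned when no document references its key.
        return [key for key in bucket if key not in referenced]

    # Assumption: effective parallelism is capped by the partition count.
    workers = max(1, min(concurrency, partitions))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        orphan_lists = list(pool.map(process, buckets))

    deleted = sorted(key for part in orphan_lists for key in part)
    for key in deleted:
        del blob_store[key]
    return deleted
```

In this model, more partitions give finer-grained units of work and more workers process them in parallel, which is the intuition behind tuning both settings together.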
