  1. Nuxeo Platform
  2. NXP-31594

Clean up orphan binaries after document removal, blob property edition and dispatch


    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2023.0, 2021.35
    • Component/s: Core
    • Release Notes Summary:
      Orphan blobs (binaries) are now deleted from blob stores when a document is deleted, when a document blob property is edited, and when a blob is dispatched to another blob provider
    • Release Notes Description:

      Whenever a:

      • document is removed
      • document blob property is edited
      • document blob property is dispatched to another blob provider

      a domain event referencing the related blob(s) is fired and a record is written to the "source/blob" stream for each blob candidate for deletion.

      This stream is consumed asynchronously and the blob is eventually deleted if it is not referenced by any other documents.

      Note that this process is only available on instances working with:

      • repositories having the "ecm:blobKeys" capability (introduced by NXP-29516) (i.e. MongoDB)
        and
      • blob providers extending BlobStoreBlobProvider such as S3BlobProvider and LocalBlobProvider

      WARNING: In a multi-repository deployment with a custom Blob Dispatcher configuration, the Nuxeo Platform cannot ascertain that each repository has its own distinct binary store path, so binaries referenced in another repository may be deleted (see documentation). In that case, it is recommended to disable this feature with NXP-31794.
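
The hazard described in this warning can be illustrated with a minimal sketch. It assumes the orphan check only looks at the repository that produced the deletion candidate, which is what the warning implies but is not confirmed here. `SharedStoreHazardSketch`, `Repository` and `deleteIfOrphanIn` are hypothetical names for illustration, not Nuxeo APIs.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Illustrative only: two repositories sharing one binary store (an assumption). */
class SharedStoreHazardSketch {

    /** Hypothetical repository, reduced to the set of blob keys its documents reference. */
    public static final class Repository {
        public final Set<String> referencedKeys = new HashSet<>();
    }

    /**
     * Deletes the blob when the given repository alone no longer references it.
     * Other repositories pointing at the same store are not consulted.
     */
    public static boolean deleteIfOrphanIn(Repository repo, Map<String, byte[]> store, String key) {
        if (repo.referencedKeys.contains(key)) {
            return false; // still referenced in this repository, keep the blob
        }
        return store.remove(key) != null; // removed even if another repository needs it
    }
}
```

If repository A and repository B share the same store and only B references a key, running the check against A removes the blob and leaves B with a dangling reference.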

    • Team:
      PLATFORM
    • Sprint:
      nxplatform #79, nxplatform #80, nxplatform #81, nxplatform #82
    • Story Points:
      8

      Description

      Today, when the default blob dispatcher moves a binary from one blob provider to another (following the blob dispatcher rules definition), it performs a copy and does not delete the source binary even if it is orphaned, i.e. no longer referenced by any blob property in the backend.

      In the case of the retention feature, when a document becomes a record, its main blob is moved to a dedicated blob provider (hence blob store), see the recommended configuration https://doc.nuxeo.com/nxdoc/nuxeo-retention-installation-standard/#configure-via-xml-contribution:

      <extension target="org.nuxeo.ecm.core.blob.DocumentBlobManager" point="configuration">
          <blobdispatcher>
            <class>org.nuxeo.ecm.core.blob.DefaultBlobDispatcher</class>
            <property name="records">records</property>
            <property name="default">default</property>
          </blobdispatcher>
      </extension>
      

      Since the blob dispatcher copies the blob from the default provider to the records provider and does not delete the source blob in the default bucket when it becomes unreferenced (orphaned), the main blob is duplicated in both associated blob stores. This increases storage costs unless a full orphaned binaries GC is performed, which is itself costly.

      This is because the default blob store uses the default digest key strategy, so a stored blob is potentially referenced by other documents. Scanning the repository synchronously to check whether the blob can be deleted would be too expensive.

      To improve on the current state, we'd like to write the keys of the blobs that are candidates for deletion as records to a stream.

      A computation consuming this stream will be in charge of querying the database to check that the blob key is not referenced by another document's blob field before proceeding to its removal.
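
The check-then-delete step described above could look roughly like the following sketch. It is not the Nuxeo implementation: the stream, the repository and the blob store are replaced by in-memory stand-ins, and `OrphanBlobCleanupSketch`, `isReferenced` and `processCandidates` are hypothetical names chosen for illustration.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

/** Illustrative only: in-memory stand-ins for the stream, the repository and the blob store. */
class OrphanBlobCleanupSketch {

    // Hypothetical stand-in for the repository: document id -> blob keys it references.
    public final Map<String, Set<String>> blobKeysByDocument = new HashMap<>();

    // Hypothetical stand-in for a blob store: blob key -> content.
    public final Map<String, byte[]> blobStore = new HashMap<>();

    // Hypothetical stand-in for the stream of blob keys that are candidates for deletion.
    public final Queue<String> candidates = new ArrayDeque<>();

    /** Returns true if any document still references the given blob key. */
    public boolean isReferenced(String blobKey) {
        return blobKeysByDocument.values().stream().anyMatch(keys -> keys.contains(blobKey));
    }

    /** Drains the candidate queue, deletes unreferenced blobs and returns the deleted keys. */
    public Set<String> processCandidates() {
        Set<String> deleted = new HashSet<>();
        String key;
        while ((key = candidates.poll()) != null) {
            // Only delete when no other document's blob field points at this key.
            if (!isReferenced(key) && blobStore.remove(key) != null) {
                deleted.add(key);
            }
        }
        return deleted;
    }
}
```

Because the candidate is only a hint, a key that is still referenced is simply skipped, which keeps the producer side cheap and moves the costly lookup to the asynchronous consumer.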

      The implementation could leverage the domain event feature.

      Note that such an improvement could be leveraged when:

      • a document is removed: all the blob keys held in the document blob fields could be added to this stream in order to clean up its binaries.
      • a blob field value of a document is edited: the old blob key, if any, could be added to this stream in order to check whether the associated blob can be deleted.
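
The two cases above can be sketched as a small helper that decides which blob keys become deletion candidates. This is an illustration under assumptions, not the actual implementation: `DeletionCandidates`, `onDocumentRemoved` and `onBlobPropertyEdited` are hypothetical names, and blob keys are modelled as plain strings.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Optional;
import java.util.Set;

/** Illustrative only: selects the blob keys that would be written to the stream. */
class DeletionCandidates {

    /** Document removed: every blob key held in its blob fields becomes a candidate. */
    public static Set<String> onDocumentRemoved(Set<String> blobKeys) {
        return new HashSet<>(blobKeys);
    }

    /** Blob field edited: the old key is a candidate only if it existed and actually changed. */
    public static Optional<String> onBlobPropertyEdited(String oldKey, String newKey) {
        if (oldKey == null || Objects.equals(oldKey, newKey)) {
            return Optional.empty();
        }
        return Optional.of(oldKey);
    }
}
```

For example, `onBlobPropertyEdited("a", "b")` yields `"a"` as a candidate, while an edit that keeps the same key, or one that sets a blob where there was none, yields nothing.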
