Nuxeo Platform
NXP-24679

Glacier Integration


    Details

    • Type: New Feature
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: QualifiedToSchedule
    • Component/s: BlobManager

      Description

      Context and High-Level requirements

      For large ECM or DAM systems holding many big files, binary storage can become a significant cost issue.

      For example, even when deploying in Nuxeo Cloud with S3, storing 500 TB of data can cost a lot of money:

      • we pay for both storage and requests
      • our default policy is to have 2 replicated buckets, doubling the required storage volume

      There are some adjustments we can make at the infrastructure level:

      • do not use replicated buckets; for example, use Glacier as a disaster recovery (DR) solution
      • use a different S3 SLA (Reduced Redundancy, Infrequent Access ...)

      The goal of this ticket is not to talk about these infrastructure options (that would be more the topic of an NCO ticket), but to focus on what application-level strategies we can integrate inside Nuxeo.

      Application Level Strategies

      Tiered storage and BlobDispatcher

      We already have the BlobDispatcher, which can be used to dispatch binaries between several BlobManagers depending on custom business rules.
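Such dispatch rules could be sketched as follows. This is an illustrative Python sketch only; the provider names, the size threshold, and the blob fields are assumptions, and the actual Nuxeo BlobDispatcher is configured through XML contributions rather than code like this.

```python
# Hypothetical BlobDispatcher-style routing: choose a blob provider from
# simple business rules. Names and thresholds are illustrative.

def dispatch_provider(blob):
    """Return the name of the blob provider this blob should be stored in."""
    # Derived assets (thumbnails, renditions) stay on low-latency storage
    if blob.get("is_rendition"):
        return "s3-hot"
    # Large source binaries are candidates for the cold-storage tier
    if blob["length"] > 100 * 1024 * 1024:  # > 100 MB
        return "s3-archive"
    return "s3-hot"
```

With rules like these, a HiRes source video would land in the archive tier while all its renditions stay on fast storage.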

      Using Cold Storage

      Using Glacier seems appealing from a cost point of view, but we cannot really expect to use Glacier as a simple backend for a BinaryManager:

      • Glacier is slow: retrieving content can take minutes to hours
      • Glacier is cheap only if you do not make too many retrieval requests

      In order to integrate Glacier, we basically need a three-layer storage stack:

      • cold storage: Glacier
      • hot storage: S3
      • cache: local FS

      This implies that the application layer always reads and writes to the S3 layer (through a local FS cache, as is already done for the standard S3BinaryManager).

      Depending on an archiving policy, a background job can move files from S3 to Glacier (this could even be an S3 Lifecycle policy).
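The S3 Lifecycle variant can be expressed as a bucket configuration. A minimal sketch of the payload as it would be passed to boto3's `put_bucket_lifecycle_configuration`; the bucket name, key prefix, and 90-day delay are assumptions, and the AWS call is left commented out:

```python
# Sketch of an S3 Lifecycle rule transitioning objects to Glacier after
# 90 days. Bucket name, prefix, and delay are illustrative.
lifecycle = {
    "Rules": [
        {
            "ID": "archive-binaries",
            "Filter": {"Prefix": "binaries/"},  # hypothetical key prefix
            "Status": "Enabled",
            "Transitions": [
                {"Days": 90, "StorageClass": "GLACIER"}
            ],
        }
    ]
}

# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="nuxeo-binaries", LifecycleConfiguration=lifecycle)
```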

      Because Glacier pricing depends on per-archive retrievals, we may want a configurable/contributable policy for packing several files into one archive.
      For example, it could make sense to store all the files attached to the same document, or to a collection of documents, in the same archive.
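Such a packing policy could be sketched as follows; per-document grouping is one illustrative strategy among others, and the blob field names are assumptions:

```python
from collections import defaultdict

def pack_archives(blobs):
    """Group blob digests by owning document, so that all files attached
    to the same document end up in a single Glacier archive."""
    archives = defaultdict(list)
    for blob in blobs:
        archives[blob["doc_id"]].append(blob["digest"])
    return dict(archives)

# Illustrative input: two blobs on doc-1, one on doc-2
blobs = [
    {"doc_id": "doc-1", "digest": "aaa"},
    {"doc_id": "doc-1", "digest": "bbb"},
    {"doc_id": "doc-2", "digest": "ccc"},
]
```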

      Retrievals, placeholders and Blob dispatching

      We want to optimize storage cost, but at the same time we still need to be able to display the assets, in search results for example.

      If we consider one Document containing a Video, we are likely to have:

      • a thumbnail
      • a storyboard
      • a web rendition
      • the HiRes source video

      It would make sense to keep the thumbnail, the storyboard, and maybe even the web rendition in a "low latency" Blobstore:

      • so that we can have a nice display
      • since it does not impact storage volume in a significant manner

      We can do that using the BlobDispatcher feature.

      When it comes to accessing the HiRes video, we need to initiate a retrieval request.

      For the retrieval, we have 2 kinds of retrieval APIs:

      • expedited: result < 5 minutes
      • standard: result < 5h

      Both requests have a cost.
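With the S3 `RestoreObject` API (boto3), the retrieval speed maps to the `Tier` parameter of the restore request. A sketch: the payload builder matches the two speeds above, while the bucket name and key layout are assumptions and the actual AWS call is left commented out:

```python
def build_restore_request(days, tier):
    """Build a RestoreObject payload; 'Expedited' returns in minutes,
    'Standard' in hours, matching the two retrieval speeds above."""
    if tier not in ("Expedited", "Standard", "Bulk"):
        raise ValueError("unknown retrieval tier: %s" % tier)
    return {"Days": days, "GlacierJobParameters": {"Tier": tier}}

# import boto3
# boto3.client("s3").restore_object(
#     Bucket="nuxeo-binaries",       # hypothetical bucket
#     Key="binaries/" + digest,      # hypothetical key layout
#     RestoreRequest=build_restore_request(2, "Expedited"))
```

The `Days` value controls how long the restored copy stays available on the hot tier before it expires again.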

      As a result, we probably want the retrieval requests to be manually initiated:

      • so that we can choose the "speed/cost" of retrieval
      • so that we are sure we do not end up with an automated background process that triggers long and costly retrieval operations

      Because of that, we will need to be able to serve placeholders until the real resource is available (this may just be the low-res version, for example).
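The placeholder logic could look like this sketch; the index layout and field names are assumptions:

```python
def serve_blob(blob_index, digest):
    """Return the hot-storage location of a blob, or its placeholder
    (e.g. a low-res rendition) while a Glacier restore is pending."""
    entry = blob_index[digest]
    if entry["tier"] == "hot":
        return entry["location"]
    # Cold blob: serve the placeholder until the restore completes
    return entry.get("placeholder", "placeholder-lowres")

# Illustrative index: one hot blob, one cold blob with a low-res stand-in
index = {
    "abc": {"tier": "hot", "location": "s3://bucket/abc"},
    "def": {"tier": "cold", "placeholder": "s3://bucket/def-lowres"},
}
```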

      We also probably want a notification system (like SNS => Nuxeo => Email?) to let the user know when the file is actually available.

      Async write

      All write operations will need to go through S3 first.

      Then we can choose between Nuxeo and AWS infrastructure to handle the move from S3 to Glacier:

      • Nuxeo: leverage the Computation framework
      • AWS: leverage S3 Lifecycle rules or an S3-triggered Lambda

      Glacier Meta-data

      Because we may need to pack files into different archives, and because we probably want to avoid accessing Glacier for bad reasons, we are likely to need some external index.
      A simple KV store may do the trick.
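The index would map each blob digest to the archive that contains it, so that packing decisions and retrieval lookups never have to touch Glacier. A minimal in-memory sketch; the field names are assumptions, and in production the dict would be replaced by a real KV store:

```python
class GlacierIndex:
    """Minimal KV index: blob digest -> (archive id, offset, size)."""

    def __init__(self):
        self._kv = {}

    def put(self, digest, archive_id, offset, size):
        """Record where a blob was packed inside a Glacier archive."""
        self._kv[digest] = {"archive": archive_id, "offset": offset, "size": size}

    def locate(self, digest):
        """Find which archive holds a blob, without any Glacier call."""
        return self._kv.get(digest)
```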
