[NXP-24679] Glacier Integration - Nuxeo Issue Tracker

Details

Type: New Feature
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: QualifiedToSchedule
Component/s: BlobManager

Description

Context and High-Level requirements

For large ECM or DAM systems with a large number of big files, the Binary Storage can end up being a cost issue.

For example, even when deploying in Nuxeo Cloud with S3, storing 500 TB of data can cost a lot of money:

we pay for both storage and requests
our default policy is to have 2 replicated buckets (doubling the required storage volume)

There are some adjustments we can do at the infrastructure level:

do not use replicated buckets; for example, use Glacier as the disaster recovery (DR) solution instead
use a different S3 storage class (reduced redundancy, infrequent access ...)

The goal of this ticket is not to talk about these infrastructure options (that would be more the topic of an NCO ticket), but to focus on what application-level strategies we can integrate inside Nuxeo.

Application Level Strategies

Tiered storage and `BlobDispatcher`

We already have the BlobDispatcher, which can dispatch binaries between several BlobManagers depending on custom business rules.

Using Cold Storage

Using Glacier seems appealing from a cost point of view, but we cannot realistically use Glacier as a simple backend of a BinaryManager:

Glacier is slow: retrieving content can take minutes to hours
Glacier is cheap only if you do not make too many retrieval requests

In order to integrate Glacier, we basically need a 3-layer storage:

cold storage: Glacier
hot storage: S3
cache: local FS

This implies that the application layer always reads from and writes to the S3 layer (through a local FS cache, as is already done for the standard S3BinaryManager).
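
The read and write path through the three layers can be sketched as follows. This is only an illustration, not Nuxeo code: `TieredStore`, `BlobNotAvailable` and the dictionaries standing in for the local cache, S3 and Glacier are all hypothetical.

```python
class BlobNotAvailable(Exception):
    """Raised when a blob only exists in cold storage and must be retrieved first."""


class TieredStore:
    """Hypothetical three-layer store; dicts stand in for local FS, S3 and Glacier."""

    def __init__(self):
        self.cache = {}  # local FS cache
        self.hot = {}    # S3
        self.cold = {}   # Glacier

    def write(self, key, data):
        # Writes always land in the hot layer; the cache is filled as a side effect.
        self.hot[key] = data
        self.cache[key] = data

    def read(self, key):
        if key in self.cache:
            return self.cache[key]
        if key in self.hot:
            # Cache miss: read through from the hot layer and remember the result.
            self.cache[key] = self.hot[key]
            return self.cache[key]
        if key in self.cold:
            # The application never reads the cold layer synchronously.
            raise BlobNotAvailable(key)
        raise KeyError(key)
```

In this sketch a blob that has been moved to the cold layer is reported as unavailable instead of blocking the caller, which leaves room for an explicit retrieval step.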

Depending on an archiving policy, we can have a background job that moves files from S3 to Glacier (could even be an S3 Lifecycle policy).
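
As an illustration of what such an archiving policy could look like on the Nuxeo side, the sketch below selects blobs that have not been read for a given period. The function name, the `last_access` mapping and the 90-day default are assumptions made for the example, not existing Nuxeo features.

```python
from datetime import datetime, timedelta


def select_for_archiving(last_access, now, max_idle=timedelta(days=90)):
    """Return the keys whose last access is older than max_idle.

    last_access maps a blob key to the datetime it was last read.
    The 90-day default is an arbitrary placeholder, not a recommendation.
    """
    return sorted(key for key, seen in last_access.items() if now - seen > max_idle)


now = datetime(2018, 3, 21)
last_access = {
    "video-hires": datetime(2017, 6, 1),
    "thumbnail": datetime(2018, 3, 20),
}
print(select_for_archiving(last_access, now))  # ['video-hires']
```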

Because Glacier cost depends on "archives" retrieval, we may want to have a configurable/contributable policy for assembling several files in one archive.
For example, it could make sense to store all the files attached to the same Document or to a collection of documents in the same archive.
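
A contributable packing policy could be as simple as a function that returns the archive a blob belongs to. The sketch below is hypothetical (`plan_archives` and the `doc` and `collection` fields are invented for the example) and only plans the grouping; it does not build or upload any archive.

```python
from collections import defaultdict


def plan_archives(blobs, archive_key):
    """Group blob keys into archives according to a pluggable policy.

    blobs is a list of dicts describing each blob.
    archive_key is the policy: a callable returning the archive a blob goes to.
    """
    archives = defaultdict(list)
    for blob in blobs:
        archives[archive_key(blob)].append(blob["key"])
    return dict(archives)


blobs = [
    {"key": "source.mov", "doc": "doc-1", "collection": "campaign-a"},
    {"key": "scan.tiff", "doc": "doc-2", "collection": "campaign-a"},
    {"key": "subtitles.srt", "doc": "doc-1", "collection": "campaign-a"},
]

# One archive per Document, or one archive per collection of documents.
per_document = plan_archives(blobs, lambda b: b["doc"])
per_collection = plan_archives(blobs, lambda b: b["collection"])
```

With the per-document policy, retrieving `doc-1` costs one archive retrieval for both of its files; with the per-collection policy, a single retrieval brings back all three files.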

*Retrievals, placeholders and Blob dispatching*

We want to optimize storage cost, but we still need to be able to display the assets, for example in the search results.

If we consider one Document containing a Video, we are likely to have:

a thumbnail
a storyboard
a web rendition
the HiRes source video

It would make sense to keep the thumbnail, the storyboard and maybe even the web rendition in a "low latency" Blobstore:

so that we can have a nice display
since it does not impact storage volume in a significant manner

We can do that using the BlobDispatcher feature.
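
The dispatching rule for the video example could look like the sketch below. It does not use the actual BlobDispatcher contract: the `dispatch` function, the blob kind labels and the `hot` and `cold` provider ids are assumptions chosen for the illustration.

```python
# Kinds of blobs that stay in the low-latency store (hypothetical labels).
LOW_LATENCY_KINDS = {"thumbnail", "storyboard", "web-rendition"}


def dispatch(blob_kind):
    """Return the id of the (hypothetical) blob provider for a kind of blob."""
    return "hot" if blob_kind in LOW_LATENCY_KINDS else "cold"


for kind in ("thumbnail", "storyboard", "web-rendition", "hires-source"):
    print(kind, "->", dispatch(kind))
```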

When it comes to accessing the HiRes video, then we need to initiate a retrieval request.

For the retrieval, we have 2 kinds of retrieval APIs:

expedited: result < 5 minutes
standard: result < 5h

Both requests have a cost.

As a result, we probably want the retrieval requests to be manually initiated:

so that we can choose the "speed/cost" of retrieval
so that we are sure we do not end up with an automated background process that triggers long and costly retrieval operations
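
One way to enforce manual initiation is to make the user and the chosen speed mandatory parts of every retrieval request. The sketch below is hypothetical: `Tier`, `RetrievalRequest` and `request_retrieval` are names invented for the example and do not call any AWS API.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    EXPEDITED = "expedited"  # faster and more expensive
    STANDARD = "standard"    # slower and cheaper


@dataclass(frozen=True)
class RetrievalRequest:
    blob_key: str
    tier: Tier
    requested_by: str


def request_retrieval(blob_key, tier, user):
    """Build a retrieval request; refuse any request not tied to a user."""
    if not user:
        raise PermissionError("retrievals must be initiated by a user")
    return RetrievalRequest(blob_key, tier, user)
```

Because a background process has no user to attach to the request, it cannot trigger a retrieval by accident in this model.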

Because of that, we will need to be able to serve placeholders until the real resource is available (this may simply be the low-res version, for example).

We also probably want to have a notification system (like SNS => Nuxeo => Email?) to let the user know when the file is really available.
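
Placeholder serving and notification can be sketched together, assuming some callback is invoked when a retrieval completes (for example, triggered by an SNS message). `RetrievalTracker` and its methods are hypothetical names, and the `notify` callable stands in for whatever sends the email.

```python
class RetrievalTracker:
    """Tracks which cold blobs have been restored and who is waiting for them."""

    def __init__(self, notify):
        self._notify = notify  # callable(user, blob_key), e.g. sends an email
        self._restored = set()
        self._waiting = {}     # blob_key -> list of users to notify

    def resolve(self, blob_key, placeholder_key):
        """Return the key to serve: the real blob if restored, else the placeholder."""
        return blob_key if blob_key in self._restored else placeholder_key

    def wait_for(self, blob_key, user):
        """Register a user to be notified when blob_key becomes available."""
        self._waiting.setdefault(blob_key, []).append(user)

    def on_restored(self, blob_key):
        """Mark the blob as available and notify every waiting user once."""
        self._restored.add(blob_key)
        for user in self._waiting.pop(blob_key, []):
            self._notify(user, blob_key)


sent = []
tracker = RetrievalTracker(lambda user, key: sent.append((user, key)))
tracker.wait_for("video-hires", "alice")
print(tracker.resolve("video-hires", "video-web"))  # video-web
tracker.on_restored("video-hires")
print(tracker.resolve("video-hires", "video-web"))  # video-hires
print(sent)  # [('alice', 'video-hires')]
```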

*Async write*

All write operations will need to go through S3 first.

Then we can choose between Nuxeo and AWS infrastructure to handle the move from S3 to Glacier:

Nuxeo: leverage the computation framework
AWS: leverage an S3 Lifecycle policy or an S3-triggered Lambda
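
For the AWS option, a lifecycle rule could be expressed as below. The dictionary follows the shape accepted as `LifecycleConfiguration` by boto3's `put_bucket_lifecycle_configuration`; the call itself is left out so that the sketch stays self-contained. The helper name, the `hires/` prefix and the 90-day delay are assumptions for the example.

```python
def glacier_transition_rule(prefix, days):
    """Build an S3 lifecycle rule moving objects under prefix to GLACIER after days."""
    return {
        "ID": f"archive-{prefix.strip('/') or 'all'}",
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},
        "Transitions": [{"Days": days, "StorageClass": "GLACIER"}],
    }


# Hypothetical prefix and delay; pick values matching the archiving policy.
lifecycle = {"Rules": [glacier_transition_rule("hires/", 90)]}
print(lifecycle)
```

Note that a lifecycle transition changes the storage class of each object individually, so it does not by itself pack several files into one archive.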

*Glacier Meta-data*

Because we may need to pack files into different archives, and because we want to avoid accessing Glacier unnecessarily, we are likely to need an external index.
A simple KV store may do the trick.
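
A minimal sketch of such an index is shown below, with a plain dictionary standing in for the KV store. `ArchiveIndex`, its method names and the idea of keying entries by blob digest are assumptions made for the illustration.

```python
class ArchiveIndex:
    """Maps a blob digest to the archive holding it; a dict stands in for the KV store."""

    def __init__(self):
        self._kv = {}

    def put(self, digest, archive_id, entry_name):
        """Record that the blob is stored as entry_name inside archive_id."""
        self._kv[digest] = (archive_id, entry_name)

    def locate(self, digest):
        """Return (archive_id, entry_name), or None if the blob was never archived."""
        return self._kv.get(digest)

    def blobs_in(self, archive_id):
        """List the digests stored in one archive, without touching Glacier."""
        return sorted(d for d, (a, _) in self._kv.items() if a == archive_id)


index = ArchiveIndex()
index.put("digest-1", "archive-42", "source.mov")
index.put("digest-2", "archive-42", "subtitles.srt")
print(index.locate("digest-1"))     # ('archive-42', 'source.mov')
print(index.blobs_in("archive-42"))  # ['digest-1', 'digest-2']
```

Answering "which archive holds this blob" and "what else is in that archive" from the index means Glacier is only contacted for an actual retrieval.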

People

Assignee:

Unassigned

Reporter:

Thierry Delprat

Participants:

Florent Guillaume, François Richeboeuf, Thierry Delprat

Votes:

2

Watchers:

5

Dates

Created:

2018-03-21 18:25

Updated:

2019-06-19 10:03