For large ECM or DAM systems holding many big files, the binary storage can become a cost issue.
For example, even when deploying in Nuxeo Cloud with S3, storing 500 TB of data can cost a lot of money:
- we pay for both storage and requests
- our default policy is to use 2 replicated buckets, doubling the required storage volume
There are some adjustments we can make at the infrastructure level:
- do not use replicated buckets and instead, for example, use Glacier as a disaster recovery (DR) solution
- use a different S3 SLA (Reduced Redundancy, Infrequent Access, ...)
The goal of this ticket is not to discuss these infrastructure options (that would be more the topic of an NCO ticket), but to focus on the application-level strategies we can integrate inside Nuxeo.
We already have the BlobDispatcher, which can dispatch binaries between several BlobManagers depending on custom business rules.
Using Glacier seems appealing from a cost point of view, but we cannot really expect to use Glacier as a simple backend for a BinaryManager:
- Glacier is slow: retrieving content can take minutes to hours
- Glacier is cheap only if you do not issue too many retrieval requests
In order to integrate Glacier, we basically need a 3-layer storage:
- cold storage: Glacier
- hot storage: S3
- cache: local FS
This implies that the application layer always reads and writes to the S3 layer (through a local FS cache, as is already done for the standard S3BinaryManager).
Depending on an archiving policy, we can have a background job that moves files from S3 to Glacier (this could even be an S3 Lifecycle policy).
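To illustrate the S3 Lifecycle option, here is a minimal sketch, assuming binaries live under a `binaries/` prefix and should move to Glacier after 90 days; the bucket name, prefix and threshold are all illustrative, not a decided policy:

```python
def glacier_lifecycle_rule(prefix="binaries/", days=90):
    """Build an S3 Lifecycle configuration that transitions objects
    under `prefix` to the GLACIER storage class after `days` days."""
    return {
        "Rules": [
            {
                "ID": "archive-cold-binaries",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [{"Days": days, "StorageClass": "GLACIER"}],
            }
        ]
    }

# Applying it with boto3 (requires AWS credentials and a real bucket):
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="nuxeo-binaries",
#     LifecycleConfiguration=glacier_lifecycle_rule())
```

With this approach the move to Glacier is handled entirely by AWS, with no Nuxeo-side job to schedule.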
Because Glacier cost depends on archive retrievals, we may want to have a configurable/contributable policy for assembling several files into one archive.
For example, it could make sense to store all the files attached to the same Document or to a collection of documents in the same archive.
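As a sketch of the "one archive per Document" idea, the packing step could simply tar together all the blobs of a document before upload; the digest-keyed layout and naming scheme below are assumptions for illustration:

```python
import io
import tarfile

def pack_document_blobs(doc_id, blobs):
    """Pack all blobs of one document into a single in-memory tar
    archive (one Glacier "archive").

    `blobs` maps blob digest -> bytes. Entries are named
    "<doc_id>/<digest>" so a single retrieval restores every file
    attached to the document."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for digest, data in sorted(blobs.items()):
            info = tarfile.TarInfo(name=f"{doc_id}/{digest}")
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()
```

One retrieval request then brings back the whole document (or collection), amortizing the per-archive retrieval cost.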
*Retrievals, placeholders and Blob dispatching*
We want to optimize storage cost, but at the same time we still need to be able to display the assets, in search results for example.
If we consider one Document containing a Video, we are likely to have:
- a thumbnail
- a storyboard
- a web rendition
- the HiRes source video
It would make sense to keep the thumbnail, the storyboard, and maybe even the web rendition in a "low latency" Blobstore:
- so that we can have a nice display
- since they do not impact storage volume in a significant manner
We can do that using the BlobDispatcher feature.
When it comes to accessing the HiRes video, then we need to initiate a retrieval request.
For the retrieval, we have 2 kinds of retrieval APIs:
- expedited: result < 5 minutes
- standard: result < 5h
Both requests have a cost.
As a result, we probably want the retrieval requests to be manually initiated:
- so that we can choose the "speed/cost" of retrieval
- so that we are sure we do not end up with an automated background process that triggers long and costly retrieval operations
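A manually initiated retrieval could look like the following sketch, assuming binaries are stored through S3 with the GLACIER storage class (so the restore goes through `s3.restore_object`); the bucket/key names and the 2-day restore window are illustrative:

```python
def restore_request(bucket, key, tier="Standard", days=2):
    """Build the parameters for s3.restore_object.

    `tier` is chosen by the caller, which is where the speed/cost
    trade-off is made: "Expedited" (minutes) or "Standard" (hours);
    S3 also accepts "Bulk" for the cheapest, slowest option."""
    if tier not in ("Expedited", "Standard", "Bulk"):
        raise ValueError(f"unknown retrieval tier: {tier}")
    return {
        "Bucket": bucket,
        "Key": key,
        "RestoreRequest": {
            # How long the restored copy stays available in S3.
            "Days": days,
            "GlacierJobParameters": {"Tier": tier},
        },
    }

# Issuing the request (requires AWS credentials):
# import boto3
# boto3.client("s3").restore_object(
#     **restore_request("nuxeo-binaries", "ab12cd34", tier="Expedited"))
```

Keeping the tier an explicit parameter makes it impossible for a background process to silently pick the expensive option.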
Because of that, we will need to be able to serve placeholders until the real resource is available (it may just be the low-res version, for example).
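The placeholder logic itself can stay very simple, as in this sketch where plain dicts stand in for the real hot store and rendition store (all names are assumptions):

```python
def serve_blob(digest, hot_store, placeholders):
    """Return (blob, is_placeholder).

    If the blob is in the hot (S3/cache) store, serve it directly;
    otherwise serve the placeholder (e.g. the low-res rendition)
    while the Glacier retrieval is pending."""
    if digest in hot_store:
        return hot_store[digest], False
    return placeholders.get(digest), True
```

The `is_placeholder` flag lets the UI show a "retrieval in progress" indicator next to the low-res asset.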
We also probably want a notification system (like SNS => Nuxeo => Email?) to let the user know when the file is actually available.
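On the AWS side of that chain, S3 can publish an event to an SNS topic when a restore completes; a sketch of the bucket notification configuration, with an illustrative topic ARN:

```python
def restore_notification_config(topic_arn):
    """Build an S3 bucket notification configuration that publishes
    to `topic_arn` when an object restore finishes, so Nuxeo can be
    notified and e-mail the user."""
    return {
        "TopicConfigurations": [
            {
                "TopicArn": topic_arn,
                "Events": ["s3:ObjectRestore:Completed"],
            }
        ]
    }

# Applying it with boto3 (requires AWS credentials and SNS permissions):
# import boto3
# boto3.client("s3").put_bucket_notification_configuration(
#     Bucket="nuxeo-binaries",
#     NotificationConfiguration=restore_notification_config(
#         "arn:aws:sns:us-east-1:123456789012:nuxeo-restore"))
```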
All write operations will need to go through S3 first.
Then we can choose between Nuxeo and AWS infrastructure to handle the move from S3 to Glacier:
- Nuxeo: leverage the computation framework
- AWS: leverage a Lifecycle policy or an S3-fired Lambda
Because we may need to pack files into different archives, and because we probably want to avoid accessing Glacier for bad reasons, we are likely to need some external index.
A simple KV store may do the trick.
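A sketch of what such an index could hold, with a plain dict standing in for the real KV backend (the field names and states are assumptions, not a schema proposal):

```python
class BlobIndex:
    """Maps a blob digest to where it lives and its retrieval state,
    so "where is this blob?" never requires touching Glacier."""

    def __init__(self):
        self._kv = {}

    def record_archived(self, digest, archive_id, offset):
        # Called by the archiving job after packing the blob.
        self._kv[digest] = {"tier": "glacier", "archive": archive_id,
                            "offset": offset, "state": "archived"}

    def mark_restoring(self, digest):
        # Called when a retrieval request is manually initiated.
        self._kv[digest]["state"] = "restoring"

    def lookup(self, digest):
        return self._kv.get(digest)
```

Knowing which archive a digest sits in also makes it cheap to restore all blobs of a document with a single retrieval.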