When uploading a large file, the client-side upload takes a long time because of client bandwidth limitations.
Once uploaded on the server side, the file is pushed to the BlobManager : this can take time too.
The whole process would be more efficient if :
- BlobManager was providing an API to upload chunks
- BatchManager was directly using this API when available
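A chunk-upload API on the BlobManager could look like the following sketch. All names (initUpload, uploadChunk, completeUpload) are hypothetical, not an existing BlobManager interface; the in-memory implementation is only there to make the contract concrete, in particular that chunks may arrive out of order:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.UUID;

// Hypothetical chunk-upload API: names are illustrative only.
public class ChunkUploadSketch {
    private final Map<String, SortedMap<Integer, byte[]>> uploads = new HashMap<>();

    // Start a new upload session and return its id.
    String initUpload() {
        String id = UUID.randomUUID().toString();
        uploads.put(id, new TreeMap<>());
        return id;
    }

    // Chunks may arrive in any order; they are keyed by their index.
    void uploadChunk(String uploadId, int index, byte[] data) {
        uploads.get(uploadId).put(index, data);
    }

    // Completing the upload concatenates the chunks in index order.
    byte[] completeUpload(String uploadId) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] chunk : uploads.remove(uploadId).values()) {
            out.write(chunk);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        ChunkUploadSketch bm = new ChunkUploadSketch();
        String id = bm.initUpload();
        bm.uploadChunk(id, 1, " world".getBytes());
        bm.uploadChunk(id, 0, "hello".getBytes());
        System.out.println(new String(bm.completeUpload(id))); // prints "hello world"
    }
}
```

A BatchManager sitting on top of such an API could forward each HTTP chunk straight to `uploadChunk` instead of buffering the whole stream first.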
The BatchManager stores the stream in temporary files.
These tmp files are used to create Blobs that are associated to a Document and finally moved to the BinaryManager.
The final step involves a copy of the file to write it inside the BinaryManager : storeAndDigest reads the stream, computes its digest and writes it to a file.
With the S3 BinaryManager, this read/digest/write occurs on a temporary store and then the file is copied over to S3.
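Conceptually, the read/digest/write step can be done in a single pass over the stream. The sketch below is not the actual storeAndDigest implementation; it just illustrates the idea with `java.security.DigestInputStream` (MD5 is used here purely for illustration):

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.DigestInputStream;
import java.security.MessageDigest;

public class StoreAndDigestSketch {

    // Single pass: the digest is updated while the bytes are copied to the target file.
    static String storeAndDigest(InputStream in, Path target) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5"); // illustrative choice of algorithm
        try (DigestInputStream din = new DigestInputStream(in, md);
             OutputStream out = Files.newOutputStream(target)) {
            din.transferTo(out);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("blob-", ".bin");
        String digest = storeAndDigest(new ByteArrayInputStream("hello".getBytes()), tmp);
        System.out.println(digest); // prints the MD5 of "hello": 5d41402abc4b2a76b9719d911017c592
    }
}
```

Note that even with a single pass, the digest is only known once the stream has been fully read, which is exactly what makes content addressing awkward for partial uploads (see below).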
Client-side upload is typically limited by client connectivity : S3/NAS access is likely to be faster than the HTTP channel used by the client to upload the file.
So we could leverage this to have the BatchManager write the content directly into the BinaryManager :
- no more file duplication
- no more slow write to S3 at the end of the transaction
Doing so would create several issues :
- BinaryManager would contain temporary streams
- this is actually not really an issue : we have GC for that
- Chunking and upload resume are a problem
- the BinaryManager resolves streams by their digest : you cannot find a stream without its digest
- you don’t have the digest until the upload is finished
There are basically 2 approaches to solve this :
- Change the BinaryManager API to be able to manage chunks and temporary files
- Rely on client side logic
- Make the client directly upload by its own means to the backend (ex: using the S3 API)
- allow the Batch API to simply reference an existing Blob
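For the second approach, the Batch API would accept a reference to content that already lives in the backend, instead of the bytes themselves. The sketch below is purely hypothetical (the BlobRef shape and the addBlobRef call are assumptions, not existing Nuxeo API): the client first uploads with its own means (e.g. the S3 API), then registers the resulting key and digest in the batch:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the "reference an existing Blob" variant of the Batch API.
public class BatchRefSketch {

    // What the client hands over instead of streaming the content:
    // where the bytes already are, their digest, and their length.
    record BlobRef(String providerKey, String digest, long length) {}

    private final Map<String, BlobRef> batch = new HashMap<>();

    // The batch only records the reference; no bytes flow through the server.
    void addBlobRef(String fileIndex, BlobRef ref) {
        batch.put(fileIndex, ref);
    }

    BlobRef get(String fileIndex) {
        return batch.get(fileIndex);
    }

    public static void main(String[] args) {
        BatchRefSketch b = new BatchRefSketch();
        // The client uploaded to S3 itself; the key and digest are illustrative values.
        b.addBlobRef("0", new BlobRef("s3://bucket/upload-123",
                "5d41402abc4b2a76b9719d911017c592", 5));
        System.out.println(b.get("0").providerKey()); // prints "s3://bucket/upload-123"
    }
}
```

The trade-off is that the server must trust (or verify) the client-provided digest and length before attaching the Blob to a Document.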