We want to generate 10B PDF files to serve as attachments to the 10B Nuxeo documents.
We need to find a technical solution for:
- generating the 10B files in a timely manner
- moving them to the cloud in a time- and cost-efficient manner
For this benchmark, we will use PDF files because:
- this is what most people use for the target use cases
- generating PDF is much faster than generating TIFF
Based on these tests:
- generating PDF from scratch is slower than *updating a template*
- *iText* is the fastest Java library
- although I did not test some commercial libraries
- with a 4-core system we can expect about 2,000 PDFs/s
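To put the 2,000 PDFs/s figure in perspective, here is a rough back-of-the-envelope sketch of the total generation time. It assumes a single 4-core machine sustaining that rate for the whole run; the class and method names are only illustrative.

```java
import java.time.Duration;
import java.util.Locale;

public class GenerationTimeEstimate {

    // Wall-clock time needed to generate totalFiles at a sustained rate.
    // Assumption: the rate stays constant for the whole run.
    static Duration estimate(long totalFiles, long filesPerSecond) {
        return Duration.ofSeconds(totalFiles / filesPerSecond);
    }

    public static void main(String[] args) {
        long totalFiles = 10_000_000_000L; // 10B PDFs
        long filesPerSecond = 2_000L;      // rate observed on a 4-core system

        Duration d = estimate(totalFiles, filesPerSecond);
        double days = d.getSeconds() / 86_400.0;
        System.out.printf(Locale.ROOT, "%d s (~%.1f days)%n", d.getSeconds(), days);
    }
}
```

Under these assumptions the run takes 5,000,000 seconds, i.e. roughly 58 days on one such machine.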
The target PDF files will remain small (~4.5 KB):
- this is enough to get a meaningful PDF
- this limits the storage and transfer costs
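As a quick sanity check on the resulting volume, the sketch below multiplies the file count by the target size. It assumes 1 KB = 1,000 bytes and ignores any per-object storage overhead; the names are illustrative.

```java
public class StorageVolumeEstimate {

    // Total raw volume in bytes; multiplyExact fails loudly on overflow.
    static long totalBytes(long fileCount, long bytesPerFile) {
        return Math.multiplyExact(fileCount, bytesPerFile);
    }

    public static void main(String[] args) {
        long fileCount = 10_000_000_000L; // 10B PDFs
        long bytesPerFile = 4_500L;       // ~4.5 KB, assuming 1 KB = 1,000 bytes

        long bytes = totalBytes(fileCount, bytesPerFile);
        long terabytes = bytes / 1_000_000_000_000L; // decimal TB
        System.out.println(bytes + " bytes (~" + terabytes + " TB)");
    }
}
```

With these numbers the raw PDF volume comes out at about 45 TB.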
We want to have thumbnails for the PDF files.
However, generating a thumbnail from the generated PDF is actually very expensive: it is at least *10x slower than generating the PDF* itself.
In addition, this increases the storage volume and makes it harder to upload all the files.
As a result, we will simply skip this step during the import:
- no pre-generation of thumbnail with the PDFs
- no thumbnail generation at import time
- generate the thumbnail "on the fly" when needed
Tests with PDFBox show that we should be able to generate at least 200 thumbnails per second, which should be far more than enough for this use case.
To scale a simple PDF generation to 10B, we need to take care of:
- generation time
- storage cost
The initial idea was to leverage AWS Lambda to scale out the process and generate everything "in the cloud".
Unfortunately, in addition to the coding overhead, there is a significant cost overhead: doing 10B PUT requests on S3 costs a lot.
As a result, we will use an SBE to generate and upload the 10B documents:
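To give an order of magnitude for the request cost, here is a small sketch. The price of $0.005 per 1,000 PUT requests is an assumption (it is not stated in this ticket and depends on region and storage class), so the current S3 pricing should be checked before relying on the result.

```java
import java.math.BigDecimal;

public class S3PutCostEstimate {

    // Request cost in USD for a given number of PUT requests.
    // pricePerThousandUsd is an assumed list price, not a value from this ticket.
    static BigDecimal putCostUsd(long requests, BigDecimal pricePerThousandUsd) {
        return pricePerThousandUsd
                .multiply(BigDecimal.valueOf(requests))
                .divide(BigDecimal.valueOf(1_000L));
    }

    public static void main(String[] args) {
        long requests = 10_000_000_000L;                         // one PUT per PDF
        BigDecimal pricePerThousand = new BigDecimal("0.005");   // assumed price

        System.out.println("PUT cost: $" + putCostUsd(requests, pricePerThousand).toPlainString());
    }
}
```

At that assumed price, the PUT requests alone would cost about $50,000, before any Lambda or storage charges.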
- leverage an EC2 instance inside the SBE to generate the 10B PDFs
- use the SBE itself to populate the target S3 bucket and skip the cost of 10B PUT requests