  1. Nuxeo Platform
  2. NXP-28765

Generate 10B files



    • Type: Task
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Performance



      We want to generate 10B PDF files to serve as attachments to the 10B Nuxeo documents.

      We need to find a technical solution for:

      • generating the 10B files in a timely manner
      • moving them to the cloud in a time- and cost-efficient manner

      PDF Generation

      For this benchmark, we will use PDF files because:

      • this is what most people use for the target use cases
      • generating PDF is much faster than generating TIFF

      Based on these tests:

      • generating a PDF from scratch is slower than updating a template
      • iText is the fastest Java library
        • although I did not test some commercial libraries
      • with a 4-core system we can expect about 2,000 PDFs/s
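
      To put the 2,000 PDFs/s figure in perspective, here is a minimal back-of-the-envelope sketch. It assumes the rate is sustained on a single 4-core machine with no I/O stalls; the class and method names are illustrative only.

```java
class GenerationTimeEstimate {
    /** Days needed to generate totalFiles at a sustained filesPerSecond rate. */
    static double daysToGenerate(long totalFiles, double filesPerSecond) {
        double seconds = totalFiles / filesPerSecond;
        return seconds / 86_400.0; // 86,400 seconds per day
    }

    public static void main(String[] args) {
        long totalFiles = 10_000_000_000L; // 10B files, from the ticket
        double rate = 2_000.0;             // PDFs/s on 4 cores, from the ticket
        System.out.println(daysToGenerate(totalFiles, rate) + " days");
    }
}
```

      Under these assumptions a single 4-core machine would need roughly 58 days, which is why generation time is treated as a scaling concern.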

      The target PDF files will remain small (~4.5 KB):

      • this is enough to get a meaningful PDF
      • this limits the storage and transfer costs
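
      The storage volume implied by this file size can be estimated as follows. This is a rough sketch that assumes decimal units (1 KB = 1,000 bytes, 1 TB = 10^12 bytes) and ignores any per-object overhead; the names are illustrative.

```java
class StorageEstimate {
    /** Total volume in decimal terabytes for totalFiles of kilobytesPerFile each. */
    static double totalTerabytes(long totalFiles, double kilobytesPerFile) {
        double bytes = totalFiles * kilobytesPerFile * 1_000.0; // assumes 1 KB = 1,000 bytes
        return bytes / 1e12;                                    // assumes 1 TB = 10^12 bytes
    }

    public static void main(String[] args) {
        // 10B files at ~4.5 KB each, both figures from the ticket
        System.out.println(totalTerabytes(10_000_000_000L, 4.5) + " TB");
    }
}
```

      With these assumptions the full set comes to about 45 TB.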

      PDF Thumbnails

      We want to have thumbnails for the PDF files.

      However, generating a thumbnail from the generated PDF is actually very expensive: it is at least 10x slower than generating the PDF itself.
      In addition, this increases the storage volume and makes it harder to upload all the files.

      As a result, we will simply skip this step during the import:

      • no pre-generation of thumbnail with the PDFs
      • no thumbnail generation at import time
      • generate the thumbnail "on the fly" when needed

      Tests with PDFBox show that we should be able to generate at least 200 thumbnails per second, which should be far more than enough for this use case.
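
      The "on the fly" approach can be sketched as a lazy lookup that renders a thumbnail only on first request. This is an illustration, not the Nuxeo implementation: the in-memory cache is an added assumption, and the renderer function is a hypothetical stand-in for the actual PDF rendering call.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

class LazyThumbnails {
    // Assumption: rendered thumbnails are kept in memory after first use.
    private final Map<String, byte[]> cache = new ConcurrentHashMap<>();
    // Hypothetical renderer: turns a document id into thumbnail image bytes.
    private final Function<String, byte[]> renderer;

    LazyThumbnails(Function<String, byte[]> renderer) {
        this.renderer = renderer;
    }

    /** Renders the thumbnail on first request, then serves the cached bytes. */
    byte[] thumbnailFor(String docId) {
        return cache.computeIfAbsent(docId, renderer);
    }
}
```

      Nothing is rendered for documents that are never viewed, so the cost of thumbnails scales with actual usage rather than with the 10B total.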

      Scale generation to 10B files

      To scale a simple PDF generation to 10B, we need to take care of:

      • generation time
      • storage cost

      Generation time and scale out

      The initial idea was to leverage AWS Lambda to scale out the process and generate everything "in the cloud".

      Unfortunately, in addition to the coding overhead, there is a significant cost overhead: doing 10B PUT requests on S3 costs a lot.
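
      The order of magnitude of that request cost can be sketched as below. The price of $0.005 per 1,000 PUT requests is an assumption (an S3 Standard list price that varies by region and over time), not a figure from this ticket; the names are illustrative.

```java
class PutCostEstimate {
    /** Request cost in USD for totalPuts at a given price per 1,000 PUT requests. */
    static double putCostUsd(long totalPuts, double usdPerThousandPuts) {
        return (totalPuts / 1_000.0) * usdPerThousandPuts;
    }

    public static void main(String[] args) {
        long totalPuts = 10_000_000_000L;  // one PUT per file, 10B files from the ticket
        double assumedPrice = 0.005;       // ASSUMPTION: USD per 1,000 PUTs, check current pricing
        System.out.println("USD " + putCostUsd(totalPuts, assumedPrice));
    }
}
```

      At the assumed price, the PUT requests alone would cost about $50,000, before any Lambda or storage charges.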


      As a result, we will use an SBE (AWS Snowball Edge) to generate and upload the 10B documents:

      • leverage an EC2 instance inside the SBE to generate the 10B PDFs
      • use the SBE itself to populate the target S3 bucket and skip the cost of 10B PUT requests





              • Assignee:
                tdelprat Thierry Delprat


                • Created: