Affects Version/s: None
Fix Version/s: None
We have done benchmarks in the past in order to validate that our principles for scalability were working:
- we did a benchmark to verify that Application Level Sharding allows handling more documents
- 1B docs with 10 PGSQL based repositories
- we did a benchmark to verify that Infrastructure Level Sharding allows handling more documents
- 1B docs with 1 repository but Sharded MongoDB cluster
- we did a benchmark to verify that we can isolate workloads
- 1B docs with DAM use cases
This gives us confidence in our ability to scale Nuxeo and isolate the workloads to be able to deliver the right performance depending on the actual use cases.
However, what we have not identified yet are the different scalability steps:
- when do we need to start sharding at the infrastructure level
- when do we need to start sharding at the application level
- how far can we go
This is the very goal of the current benchmark campaign.
Because Nuxeo is a very configurable Platform the type of application and the business use cases have a very significant impact on performance and scalability aspects.
We want this benchmark to be realistic of a use case where 10B documents can be used.
The initial idea is to simulate a "bank-like use case" and store Bank Statements for a large number of customers.
For that, we need to define a Studio project that will define the content model.
Statements will be generated as PDF because this is the only practical solution we have at scale(see NXP-28765)
More info related to the content model, the filing plan and the tests:
We want to go through all the "scalability" in a logical order:
- step 1: *non-sharded environment*
- optimize to leverage as much as possible the hardware
- step 2: *sharding at the infrastructure level*
- leverage MongoDB and Elastic sharding capabilities
- see how far we can go without having a cluster that becomes hard to manage
- step 3: *sharding at application level*
- leverage Nuxeo multi-repositories feature
- leverage multiple MongoDB or Elasticsearch clusters
The idea is to progressively grow the volume of documents and move from step 1 to step 3 as needed.
- 100M documents
- 0.5B documents
- 2B documents
- 5B documents
- 10B documents
Sharding is a way to handle scalability, but it comes with a price in terms of hardware resources and management complexity. So, we want to delay as much as possible the need to move from step 1 to the next steps.
For that, we want to leverage as much as possible the hardware so that we can go as far as possible with step 1.
When targetting billions of documents inside the repository one of the obvious limiting factors comes from the capacity of the database to handle the queries in a timely manner.
To mitigate that aspect we can:
- offload more queries to Elasticsearch
- add more indexes inside MongoDB
Indexes are key to have good performance but this is only true if the database can maintain a significant part of the indexes in memory. If the database needs to always read the indexes from the hard drive, unless we are using expensive NVMe storage it will be slow.
This means that one way to improve database performance for Nuxeo is to minimize the indexes size.
Documents are identified and linked together using their UUID, so it is useful to indexes UUID and fields referencing UUIDs (like parentId or links).
As a result, providing a way to reduce how we encode and index Nuxeo UUIDs is a way to increase the number of documents we can handle on given hardware.
Hence the work on NXP-28763.
When full-text indexing is activated, by default, Nuxeo stores the text content extracted from the blob inside the database.
The goal is to allow to rebuild the full-text index without having to go through the full-text extraction process for all the blobs.
This approach has the disadvantage that it makes the DB storage grow faster with the number of documents.
We still believe that storing the full-text is a good trade-off, but we are working on other storage options in order to diminish the pressure on the DB.
The associated task is NXP-26704
The default policy inside Nuxeo is to compute and store all conversions:
- thumbnails are automatically generated and stored
- pictures and video renditions are automatically generated and stored
In the context of repositories with a few million documents, this seems to be a sensible policy.
However, when reaching several billion, we may want to reconsider the trade-off:
- is this worth paying the additional storage if only a small percentage of these renditions will ever be accessed?
- is on-the-fly generation not a better alternative?
The default approach is to use the Nuxeo Document UUID as a sharding key.
This approach has several benefits:
- sharding is completely transparent
- shards are statistically well balanced as long as UUIDs generation is well distributed
However, this approach also comes with some limitations:
- queries cost tends to grow with the number of shards
- sent the query to all nodes + merge and sort results
- adding a new shard requires a cluster-wide re-balancing
Inside MongoDB do not need to rely on the doc UUID to define sharding, we could also use a custom meta-data.
However, for this meta-data to be usable as a sharding key:
- it needs to be present on every single document
- values need to be properly balanced
If we identify such a partitioning scheme at the functional level, it is probably much more efficient to switch to multi-repositories:
- each "logical shard" can be managed and configured separately
- storage speed, DB types, hardware resources
- indexing and storage are sharded using the same axis
As a result, the scale-out steps will be:
- single repository - no sharding
- single repository - one MongoDB collection sharded on UUID
- multi-repositories - multiple MongoDB collections sharded on UUID (on multiple MongoDB Cluster)
For each step in the benchmark, we want to be able to check the level of performance so that we can determine if we can add more documents on the same infrastructure or if we need to scale out first.
For very basic REST API calls that perform CRUD operation, we can learn from the previous benchmarks, and what we see is that:
- the database is the main bottleneck
- some databases more than others
- having more than one Nuxeo node is useful
- because of HTTP threads-pool
- because otherwise, we do not test the cluster invalidation
- but after 2 nodes, the impact of adding Nuxeo nodes is small (if not null)
So, in the context of this benchmark:
- running simple REST CRUD test is meaningful for testing throughput and response time
- we do not need more than 2 Nuxeo nodes
When we have a very large repository a lot of operations can result in a massive amount of asynchronous processing:
- bulk-update on a query
- ACLs update
Theoretically, we could choose either Redis or Kafka as the backend for the asynchronous processing, however, clearly, Kafka is the solution we want to test:
- we know Kafka will not be the bottleneck
- Kafka can handle much bigger throughput than what we need
- Redis is limited by memory and memory is expensive
- benchmarks show us that with Kafka we better handle the back-pressure
- this even impacts positively the CRUD test
The previous benchmarks, especially the DAM one, have demonstrated clearly that we can easily scale out asynchronous processing by adding Nuxeo nodes.
The DAM benchmark also showed that we can handle priorities between the different streams of work.
In the context of this benchmark, we want to validate a "simple configuration" so that we can be sure that "bulk and import" activity will not impact negatively interactive processing.