- Type: Task
- Status: In Progress
- Priority: Minor
- Resolution: Unresolved
- Affects Version/s: None
- Fix Version/s: ADDONS_9.10, ADDONS_10.10
- Component/s: Clustering
- Epic Link:
- Backlog priority: 1,000
Multi-Datacenter: Multi-AZ vs Multi-Region
When discussing HA, people often mention using multiple data centers.
However, "multi-datacenter" on its own is not precise enough.
The main impact on the deployment architecture comes from how the data centers are connected:
- data centers are connected via a high-speed, low-latency network
  - this is a Multi-Availability Zone (Multi-AZ) deployment
  - we can spread the architecture across the data centers
- data centers are connected via a WAN (or a network with significant latency)
  - this is a Multi-Region deployment
  - we need to deploy 2 copies of the architecture and provide asynchronous data replication
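The distinction above can be sketched as a small helper. This is only an illustration: the function name is hypothetical, and the 1 ms default threshold is an assumption that should be tuned to the services actually deployed.

```python
def classify_deployment(latency_ms: float, max_az_latency_ms: float = 1.0) -> str:
    """Classify a two-data-center link by its round-trip latency.

    max_az_latency_ms is an assumed cut-off (hypothetical default of 1 ms),
    below which a cluster can reasonably be stretched across data centers.
    """
    if latency_ms < 0:
        raise ValueError("latency must be non-negative")
    if latency_ms < max_az_latency_ms:
        # Low latency: one architecture spread across the data centers.
        return "multi-az"
    # Significant latency: two copies plus asynchronous replication.
    return "multi-region"


if __name__ == "__main__":
    print(classify_deployment(0.4))
    print(classify_deployment(80.0))
```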
Multi-AZ Deployment
Principles
The main goal of this type of deployment is High Availability: if one zone goes down, the service must continue to run.
Since AZs usually have separate power sources, Internet access and cooling systems, deploying across multiple AZs is good for HA and can also provide some DRP options.
However, since AZs are by definition geographically co-located, this does not really protect against a disaster (earthquake, regional power grid failure, ...), so it may not be seen as a complete DRP solution.
Constraints
The main constraint concerns the network between the data centers, which must be:
- reliable
- high speed (Gb/s)
- low latency (< 1 ms)
An HA architecture relies on HA services. A fault-tolerant service typically runs an odd number of nodes so that it can reach the consensus required for leader election, the minimum being 3 nodes.
Surviving a zone outage therefore requires 3 zones. With only 2 zones, one side necessarily hosts more nodes than the other; when that side goes down, the remaining side cannot elect a leader because it does not hold a majority, so the service becomes unavailable.
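The majority argument can be checked with a few lines of arithmetic. The sketch below assumes a simple strict-majority quorum rule (more than half of all nodes must be alive); the function names are hypothetical and not tied to any specific service.

```python
from typing import Sequence


def has_quorum(alive: int, total: int) -> bool:
    """A strict majority of the configured nodes must be alive."""
    return alive > total // 2


def survives_any_zone_outage(nodes_per_zone: Sequence[int]) -> bool:
    """True if the cluster keeps quorum whichever single zone is lost."""
    total = sum(nodes_per_zone)
    return all(has_quorum(total - lost, total) for lost in nodes_per_zone)


if __name__ == "__main__":
    # 3 nodes over 2 zones: losing the 2-node zone leaves 1 of 3, no majority.
    print(survives_any_zone_outage([2, 1]))
    # 3 nodes over 3 zones: any outage leaves 2 of 3, majority kept.
    print(survives_any_zone_outage([1, 1, 1]))
```

Adding nodes to a 2-zone layout does not help under this rule: whichever zone holds at least half of the nodes remains a single point of failure.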
Deployment Architecture
The goal is to have an Active / Active / Active deployment.
Multi-Region Deployment
Principles
When deploying Nuxeo across multiple regions, the goal should be disaster recovery: the company / application must be able to continue working even if a region goes down completely.
When talking about DRP, several metrics impact the target architecture:
- RTO: Recovery Time Objective
  - maximum time before the system is up again
  - this is basically the maximum downtime
- RPO: Recovery Point Objective
  - maximum amount of data that can be lost
Obviously, if RTO and RPO are large, a simple externalized backup can be a solution.
However, when RTO is below 1h and RPO is a few minutes, we need a dedicated architecture.
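As a rough illustration of that rule of thumb, the helper below mirrors the two conditions stated above. The function name is hypothetical, and the 5-minute RPO limit is an assumed reading of "a few minutes", not a documented threshold.

```python
from datetime import timedelta


def needs_dedicated_dr(
    rto: timedelta,
    rpo: timedelta,
    rto_limit: timedelta = timedelta(hours=1),    # "RTO is below 1h"
    rpo_limit: timedelta = timedelta(minutes=5),  # assumed value for "a few minutes"
) -> bool:
    """True when both objectives are tight enough to rule out a plain backup."""
    return rto < rto_limit and rpo <= rpo_limit


if __name__ == "__main__":
    # Relaxed objectives: an externalized backup may be enough.
    print(needs_dedicated_dr(timedelta(hours=24), timedelta(hours=12)))
    # Tight objectives: a dedicated DR architecture is needed.
    print(needs_dedicated_dr(timedelta(minutes=30), timedelta(minutes=2)))
```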
Limitations: not HA
DRP is about replicating data between data centers in different regions. Because of network latency this replication cannot be synchronous, so one site is always "behind" the other, which means it cannot be used to serve user requests.
Put another way: because replication is asynchronous and subject to network latency, there is a window of data loss in case of failover.
This latency also requires services to be duplicated, because a cluster cannot be stretched across regions.
Another reason to use multiple regions may be to optimize geographical delivery, but for that the Nuxeo approach relies more on CDNs and upload accelerators, which is not the architecture discussed below.
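To make the data-loss window of asynchronous replication concrete, here is a minimal sketch under a simplified model: every write committed on the primary at time t reaches the secondary at t + replication_lag, and the primary is lost at failure_time. The function name and the model itself are assumptions for illustration only.

```python
from typing import List, Sequence


def lost_writes(
    write_times: Sequence[float],
    failure_time: float,
    replication_lag: float,
) -> List[float]:
    """Return the writes committed on the primary but not yet replicated.

    A write at time t is lost when it happened before the failure
    (t <= failure_time) but would only have reached the secondary
    after it (t + replication_lag > failure_time).
    """
    return [
        t for t in write_times
        if t <= failure_time and t + replication_lag > failure_time
    ]


if __name__ == "__main__":
    writes = [0.0, 10.0, 20.0, 28.0, 30.0]  # seconds
    print(lost_writes(writes, failure_time=30.0, replication_lag=5.0))
```

In this model the effective RPO is bounded by the replication lag, which is why the lag has to be monitored against the target RPO.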
Constraints
Since HA is usually also needed, we need to have a mixed architecture with:
- a 3-AZ deployment
- 1 DRP deployment
Deployment Architecture
Focus for this Epic
During our testing, we will evaluate both the Multi-AZ (HA) failover scenario and the Multi-Region (DR, with high latency) failover scenario.