  1. Nuxeo Platform
  2. NXP-14662 Add support for HA Redis
  3. NXP-14689

Ensure proper fallback if Redis is not available

    Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.0
    • Component/s: Redis

      Description

      Overview

      Integrating Redis + Sentinel should allow the WorkManager to handle fallback gracefully.

      This includes :

      • detect Redis fail-over
      • continue working in memory during the fallback
      • use the new Redis master once it is elected

      Nb : a custom addon will be released for 5.8

      About Failover strategy

      There are at least 2 ways of handling Redis fail-over :

      Block/Retry strategy

      The first option is that, when Redis is not available, the code waits and retries later.

      There are 2 types of accessors :

      • the job schedulers :
        • they will wait until Redis is available again before scheduling their job
      • the job runners :
        • they will simply wait for Redis to be back online

      In addition, for job scheduling, we'll manage a timeout : if the Redis master is still not available after the configured timeout, we'll raise an error and roll back the originating transaction.

      This approach assumes the Redis fail-over will be relatively fast : this should be true as long as you have a large Redis cluster with a lot of nodes and you don't experience network partitions.
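
      The block/retry behavior for job schedulers could be sketched as below. This is only an illustration under stated assumptions, not the actual WorkManager code : BlockingRetry, callWithRetry and StoreUnavailableException are hypothetical names, and the Redis call is represented by a plain Callable so that the sketch does not depend on any Jedis API.

```java
import java.time.Duration;
import java.util.concurrent.Callable;

/** Hypothetical helper: retries an operation until it succeeds or a timeout elapses. */
final class BlockingRetry {

    /** Thrown when the store is still unavailable once the timeout has elapsed. */
    static final class StoreUnavailableException extends RuntimeException {
        StoreUnavailableException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    private BlockingRetry() {
    }

    /**
     * Calls the operation, sleeping for 'delay' between failed attempts.
     * Gives up with StoreUnavailableException after 'timeout', so the caller
     * can roll back the originating transaction.
     */
    static <T> T callWithRetry(Callable<T> operation, Duration timeout, Duration delay) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            try {
                return operation.call();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt status
                throw new StoreUnavailableException("Interrupted while calling the store", e);
            } catch (Exception e) {
                long remainingNanos = deadline - System.nanoTime();
                if (remainingNanos <= 0) {
                    throw new StoreUnavailableException("Store still unavailable after " + timeout, e);
                }
                // Never sleep past the deadline, and always sleep at least 1 ms.
                sleep(Math.min(delay.toMillis(), Math.max(1L, remainingNanos / 1_000_000L)));
            }
        }
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new StoreUnavailableException("Interrupted while waiting for the store", e);
        }
    }
}
```

      A scheduler would wrap its Redis write in callWithRetry with the configured timeout and let StoreUnavailableException propagate to trigger the rollback. A job runner could use the same loop with a very long timeout, since it only has to wait for Redis to come back.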

      Fallback scheduling strategy

      Another approach is that, when Redis queuing is not available, we schedule the jobs in an in-memory queue; when Redis is back, we empty the in-memory queue before switching back to Redis.

      This approach makes the overall system more resilient to Redis fail-over, especially when the fail-over is slow. However, there is a trade-off :

      • when in-memory queuing is used, if the Nuxeo node goes down the jobs will be lost
      • for thread-safety reasons, this approach requires significant changes inside the current WorkManager code
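
      The fallback scheduling idea could be sketched as below. This is a simplified illustration and not the WorkManager implementation : FallbackScheduler and RemoteQueue are hypothetical names, jobs are reduced to strings, and thread-safety is handled with coarse synchronization, which the real code could probably not afford.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Hypothetical scheduler that falls back to an in-memory queue when the remote queue fails. */
final class FallbackScheduler {

    /** Hypothetical stand-in for the Redis-backed queue; push fails with a RuntimeException when Redis is down. */
    interface RemoteQueue {
        void push(String job);
    }

    private final RemoteQueue remote;
    private final Queue<String> memory = new ArrayDeque<>();

    FallbackScheduler(RemoteQueue remote) {
        this.remote = remote;
    }

    /** Schedules a job remotely, or in memory if the remote queue is down or a backlog exists. */
    synchronized void schedule(String job) {
        if (memory.isEmpty()) {
            try {
                remote.push(job);
                return;
            } catch (RuntimeException e) {
                // Remote queue unavailable: fall through to the in-memory queue.
            }
        }
        // While a backlog exists, keep queuing in memory so that ordering is preserved.
        memory.add(job);
    }

    /** Tries to flush the backlog to the remote queue; returns true once the backlog is empty. */
    synchronized boolean drain() {
        while (!memory.isEmpty()) {
            try {
                remote.push(memory.peek());
            } catch (RuntimeException e) {
                return false; // still down, keep the remaining jobs in memory
            }
            memory.remove(); // only drop the job once the push has succeeded
        }
        return true;
    }

    synchronized int pendingInMemory() {
        return memory.size();
    }
}
```

      In this sketch drain() would be called periodically, or when a fail-over to a new master is detected. Jobs still held in the in-memory queue are lost if the node stops, which is the trade-off listed above.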

      Implementation choices

      How we will move forward

      In 5.8, we cannot safely implement the second approach, so the strategy we are working on is :

      in 5.8 :

      • we implement the first approach (block)
      • we upgrade the Jedis client
      • we push that as a HotFix

      in 6.0 :

      • we start with the block strategy
      • we'll see in 6.0 or 6.x if we can rework the WorkManager architecture to allow the fallback scheduling approach
