
      Description

What happens on node shutdown:

• Set up a rejection policy so that new jobs are canceled
• Kill idle workers (running workers continue to run)
• Move the scheduled queue content to a suspended list -> this is wrong in cluster mode
• Suspend all running jobs -> no effect if they are already running
• Wait for running jobs to complete, with a timeout of 5 s
• On timeout -> drain the scheduled queue content -> this is wrong because we lose jobs if other nodes have scheduled new ones, but it has no effect in practice because of the unimplemented "iter" method of NuxeoBlockingQueue
  Then interrupt the threads; this does not work if the work is not waiting or blocking
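
For orientation, here is a minimal sketch of that sequence expressed with standard java.util.concurrent primitives. The class and method names are illustrative, not the actual Nuxeo WorkManager code; it only mirrors the steps listed above.

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ShutdownSketch {

    static void shutdown(ThreadPoolExecutor executor) throws InterruptedException {
        // 1. New jobs submitted from now on are rejected (silently dropped).
        executor.setRejectedExecutionHandler(new ThreadPoolExecutor.DiscardPolicy());

        // 2. shutdown() lets idle workers die; running workers keep running.
        executor.shutdown();

        // 5. Wait for running jobs to complete, with a 5 s timeout.
        if (!executor.awaitTermination(5, TimeUnit.SECONDS)) {
            // 6. On timeout, drain whatever is still scheduled...
            List<Runnable> pending = new ArrayList<>();
            executor.getQueue().drainTo(pending);
            // ...then interrupt the worker threads. shutdownNow() also
            // tries to drain remaining tasks through the queue's iterator,
            // which NuxeoBlockingQueue does not implement, so in practice
            // that drain is a no-op. The interrupt itself is only observed
            // by work that is waiting or blocking.
            executor.shutdownNow();
        }
    }
}
{code}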

This shutdown process generates multiple errors in the logs when there is activity on other nodes.

      When starting a node:

• RedisWorkQueue moves the suspended list back to the scheduled queue, and executors process the jobs (requires NXP-19008)
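
A minimal sketch of that move using Jedis, assuming hypothetical key names (suspendedKey, scheduledKey); the actual RedisWorkQueue key layout and implementation differ.

{code:java}
import redis.clients.jedis.Jedis;

public class ResumeSketch {

    // Move every suspended work id back onto the scheduled queue,
    // one element at a time, so executors can pick them up again.
    static void resume(Jedis jedis, String suspendedKey, String scheduledKey) {
        while (jedis.rpoplpush(suspendedKey, scheduledKey) != null) {
            // keep popping until the suspended list is empty
        }
    }
}
{code}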

We can summarize as follows:

• On a local node, with in-memory queuing, works are canceled and lost because there is no persistence.
• On a cluster node, with Redis queuing, all works are suspended and re-activated only at the next node startup. That's wrong, because these works do not belong to the node being stopped.

So we should not suspend scheduled jobs on shutdown; this state is not useful here.
The rejection policy is enough to stop adding jobs during shutdown, and running jobs should be rolled back and rescheduled.
There is no need to wait for termination with a timeout, since we guarantee that running jobs will never commit.
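
Under this proposal the shutdown path could reduce to something like the sketch below (illustrative names; the rollback-and-reschedule itself is assumed to happen in the work's own transaction handling, not shown here):

{code:java}
import java.util.concurrent.ThreadPoolExecutor;

public class ProposedShutdownSketch {

    static void shutdown(ThreadPoolExecutor executor) {
        // New submissions are rejected from now on; nothing is suspended.
        executor.setRejectedExecutionHandler(new ThreadPoolExecutor.DiscardPolicy());

        // Interrupt workers immediately. Since running jobs never commit,
        // their transactions roll back and the works get rescheduled, so
        // there is no need to await termination with a timeout.
        executor.shutdownNow();
    }
}
{code}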
