What is happening on node shutdown:
- Set up a rejection policy so that newly submitted jobs are canceled
- Kill idle workers (running workers continue to run)
- Move the scheduled queue content to a suspended list -> this is wrong in cluster mode
- Suspend all running jobs -> no effect on jobs that are already running
- Wait for running job completion with a timeout of 5s
- If the timeout expires -> drain the scheduled queue content -> this is wrong because we lose jobs that other nodes have scheduled in the meantime, but it has no effect in practice because of the unimplemented "iterator" method of NuxeoBlockingQueue
- Then interrupt the worker threads -> this does not work if the work is not waiting or blocking
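The interrupt limitation can be illustrated with a minimal JDK-only sketch (not Nuxeo code): `Thread.interrupt()` only sets a status flag, so a worker that is busy computing, rather than waiting or blocking, never notices it and keeps running until it checks some cooperative flag.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class InterruptSketch {
    // Returns true when interrupt() fails to stop a busy (non-blocking) worker.
    static boolean interruptIsIgnored() throws InterruptedException {
        AtomicBoolean stop = new AtomicBoolean(false);
        Thread worker = new Thread(() -> {
            // Busy work: never calls a blocking/waiting method,
            // never checks Thread.interrupted().
            while (!stop.get()) { }
        });
        worker.start();
        worker.interrupt();       // only sets the interrupt status flag
        worker.join(200);         // the worker is still alive after 200ms...
        boolean ignored = worker.isAlive();
        stop.set(true);           // ...until it observes a cooperative flag
        worker.join();
        return ignored;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("interrupt ignored=" + interruptIsIgnored());
    }
}
```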
This shutdown process generates multiple errors in the logs when there is activity on other nodes.
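The reason the drain step above is a no-op can be sketched as follows: a queue whose `iterator()` always returns nothing (the hypothesized NuxeoBlockingQueue behavior, assumed here, not the actual implementation) makes any iterator-based draining see an empty collection even though jobs are queued.

```java
import java.util.*;

public class EmptyIteratorQueue<E> extends AbstractQueue<E> {
    private final Deque<E> delegate = new ArrayDeque<>();

    @Override public boolean offer(E e) { return delegate.offer(e); }
    @Override public E poll() { return delegate.poll(); }
    @Override public E peek() { return delegate.peek(); }
    @Override public int size() { return delegate.size(); }

    // Hypothesized "unimplemented" iterator: always empty.
    @Override public Iterator<E> iterator() { return Collections.emptyIterator(); }

    public static void main(String[] args) {
        EmptyIteratorQueue<String> q = new EmptyIteratorQueue<>();
        q.offer("job-1");
        q.offer("job-2");
        // Iterator-based draining (e.g. copying into a list) sees nothing,
        // even though size() reports two queued jobs:
        List<String> drained = new ArrayList<>(q);
        System.out.println("queued=" + q.size() + " drained=" + drained.size());
    }
}
```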
When starting a node:
- RedisWorkQueue moves the suspended list back to the scheduled queue, and executors process the jobs (requires
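That startup step can be sketched with plain in-memory deques standing in for the Redis lists (the method name `resumeSuspended` is an assumption for illustration, not the RedisWorkQueue API):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

public class StartupMoveSketch {
    // Emulates the startup step: move every element of the node's suspended
    // list back onto the shared scheduled queue. (With Redis this would be
    // repeated RPOPLPUSH/LMOVE calls -- an assumption, not the actual code.)
    static void resumeSuspended(Deque<String> suspended, Deque<String> scheduled) {
        String job;
        while ((job = suspended.pollFirst()) != null) {
            scheduled.offerLast(job);
        }
    }

    public static void main(String[] args) {
        Deque<String> suspended = new ArrayDeque<>(List.of("job-A", "job-B"));
        Deque<String> scheduled = new ArrayDeque<>();
        resumeSuspended(suspended, scheduled);
        System.out.println("scheduled=" + scheduled.size()
                + " suspended=" + suspended.size());
    }
}
```

Note that whichever node starts next resumes the whole suspended list, including works that were never its own.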
We could summarize as follows:
- On a local node, with in-memory queuing, works are canceled and lost because there is no persistence.
- On a cluster node, with Redis queuing, all works are suspended and re-activated only at the next node startup. That is wrong, because these works do not belong to the node that is being stopped.
So we should not suspend scheduled jobs on shutdown; this suspended state is not useful here.
The rejection policy is enough to stop adding jobs during shutdown, and running jobs should be rolled back and rescheduled.
There is no need to wait for termination with a timeout, since we guarantee that running jobs will never commit.
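The proposed shutdown could then reduce to installing the rejection policy and returning. A minimal sketch with a plain `ThreadPoolExecutor`, where the handler pushes rejected jobs back onto the shared queue so another node can pick them up (the re-queueing handler expresses the desired behavior and is an assumption, not existing Nuxeo code):

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ProposedShutdownSketch {
    // Returns how many late jobs ended up back on the shared queue.
    static int rejectToSharedQueue() {
        // Stand-in for the shared (Redis) scheduled queue.
        Queue<Runnable> scheduled = new ConcurrentLinkedQueue<>();
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());

        // Proposed shutdown: reject new jobs by re-queueing them for other
        // nodes; no suspended list, no blocking wait for termination.
        pool.setRejectedExecutionHandler((r, p) -> scheduled.offer(r));
        pool.shutdown();

        pool.execute(() -> { }); // submitted after shutdown -> rescheduled
        return scheduled.size();
    }

    public static void main(String[] args) {
        System.out.println("rescheduled=" + rejectToSharedQueue());
    }
}
```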