- Type: Sub-task
- Status: Resolved
- Priority: Critical
- Resolution: Fixed
- Affects Version/s: None
- Component/s: Events / Works
What is happening on node shutdown:
- Set up a rejection policy so that new jobs are canceled
- Kill idle workers (running workers continue to run)
- Move the scheduled queue content to a suspended list -> this is wrong in cluster mode
- Suspend all running jobs -> no effect on a job that is already running
- Wait for running jobs to complete, with a timeout of 5s
- If timeout -> drain the scheduled queue content -> this is wrong because we lose jobs if other nodes have scheduled new ones, but it has no effect in practice because the "iter" method of NuxeoBlockingQueue is unimplemented
- Then interrupt the thread -> this does not work if the work is neither waiting nor blocking
This shutdown process generates multiple errors in the logs when there is activity on other nodes; a sketch of the sequence follows.
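As a rough illustration only, here is what the sequence above looks like against a plain java.util.concurrent.ThreadPoolExecutor. The method name, the SUSPENDED list, and the mapping of steps onto standard executor calls are assumptions for clarity, not Nuxeo's actual WorkManager code:

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ShutdownSequenceSketch {

    // Hypothetical holder for the "suspended" works (step 3 above).
    static final List<Runnable> SUSPENDED = new ArrayList<>();

    static void shutdownWorkManager(ThreadPoolExecutor executor) throws InterruptedException {
        // Step 1: reject (cancel) any job submitted from now on.
        executor.setRejectedExecutionHandler((work, exec) -> {
            // cancel the work instead of queuing it
        });
        // Step 2: let idle workers die (assumes a non-zero keep-alive);
        // running workers keep going.
        executor.allowCoreThreadTimeOut(true);
        // Step 3: move the scheduled queue content to a suspended list.
        // Wrong in cluster mode: the queue is shared with other nodes.
        executor.getQueue().drainTo(SUSPENDED);
        // Steps 4-5: stop accepting work, wait up to 5s for running jobs.
        executor.shutdown();
        if (!executor.awaitTermination(5, TimeUnit.SECONDS)) {
            // Steps 6-7: drain whatever is left and interrupt the worker
            // threads; the interrupt is ignored by work that is neither
            // waiting nor blocking, as noted above.
            executor.shutdownNow();
        }
    }
}
{code}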
When starting a node:
- RedisWorkQueue moves the suspended list back to the scheduled queue and executors process the jobs (requires NXP-19008); a sketch of this move follows.
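At its core this re-activation is moving entries from one Redis list to another. A minimal sketch with the Jedis client, not Nuxeo's actual RedisWorkQueue implementation; the key names are hypothetical:

{code:java}
import redis.clients.jedis.Jedis;

public class StartupRequeueSketch {

    // Hypothetical key names; the real ones are managed by Nuxeo's
    // Redis work-queuing layer.
    static final String SUSPENDED = "nuxeo:work:default:suspended";
    static final String SCHEDULED = "nuxeo:work:default:scheduled";

    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // Atomically pop one job id from the suspended list and push
            // it onto the scheduled queue, until the list is empty.
            while (jedis.rpoplpush(SUSPENDED, SCHEDULED) != null) {
                // each iteration moves one job back to the scheduled queue
            }
        }
    }
}
{code}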
We could summarize as follows:
- On a local node, with in-memory queuing, works are canceled and lost because there is no persistence.
- On a cluster node, with Redis queuing, all works are suspended and re-activated only at the next node startup. That's wrong, because these works do not belong to the node being stopped.
So we should not suspend scheduled jobs on shutdown; this state is not useful here.
The rejection policy is enough to stop adding jobs during shutdown, and running jobs should be rolled back and rescheduled.
There is no need to wait for termination or to implement a timeout, as we guarantee that the running jobs will never commit.
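As a rough illustration, a handler in the spirit of that rejection policy. The class name and the fallback behavior are assumptions for the sketch, not Nuxeo's actual code:

{code:java}
import java.util.concurrent.RejectedExecutionHandler;
import java.util.concurrent.ThreadPoolExecutor;

// Sketch: during shutdown, reject and drop new jobs instead of queuing
// or suspending them. Since a dropped job never starts its transaction,
// nothing can commit after shutdown began, so there is no need to wait
// for termination with a timeout.
public class CancelOnShutdownPolicy implements RejectedExecutionHandler {

    @Override
    public void rejectedExecution(Runnable work, ThreadPoolExecutor executor) {
        if (executor.isShutdown()) {
            // Drop the work; it will be rescheduled by the node that
            // rolls back its transaction.
            return;
        }
        // Not shutting down: fall back to running in the caller's thread.
        work.run();
    }
}
{code}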
Is required by:
- NXP-19042 WorkManager in cluster mode should not leave job in running state (Open)
- NXP-19618 Fix PublisherServiceImpl startup not to start a transaction when one is already active (Resolved)
- NXP-20305 Fix functional tests on 6.0 on mssql (Resolved)
- NXP-20306 Fix functional tests on 6.0 on postgresql (Resolved)