[NXP-19049] work manager shutdown should not suspend works - Nuxeo Issue Tracker

Details

Type: Sub-task
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 6.0-HF28, 7.10-HF07, 8.2
Component/s: Events / Works

Description

What is happening on node shutdown:

Set up a rejection policy so that new jobs are canceled
Kill idle workers (running workers continue to run)
Move the scheduled queue content to a suspended list -> this is wrong in cluster mode
Suspend all running jobs -> no effect if they are already running
Wait for running jobs to complete, with a timeout of 5s
On timeout, drain the scheduled queue content -> this is wrong because we lose jobs if other nodes have scheduled new ones, but it has no effect in practice because the "iter" method of NuxeoBlockingQueue is not implemented
Then interrupt the threads -> this does not work if the work is not waiting or blocking
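
The last step can be illustrated outside of Nuxeo. The following is a minimal sketch in plain Java, not WorkManager code; the class and method names are hypothetical. It shows that interrupting a thread only sets a flag, so a work that never waits, blocks, or checks that flag keeps running:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical illustration, unrelated to the actual Nuxeo classes.
class InterruptSketch {

    /** Returns true if a busy worker is still alive after being interrupted. */
    static boolean busyWorkerSurvivesInterrupt() {
        AtomicBoolean stop = new AtomicBoolean(false);
        Thread worker = new Thread(() -> {
            // CPU-bound loop: never waits, never blocks, never checks the interrupt flag
            while (!stop.get()) {
                Thread.yield();
            }
        });
        worker.start();
        worker.interrupt(); // only sets the interrupt flag on the worker
        boolean alive;
        try {
            worker.join(200); // give the worker a chance to stop (it will not)
            alive = worker.isAlive();
            stop.set(true); // cooperative stop, the only thing the loop reacts to
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return alive;
    }
}
```

Calling `InterruptSketch.busyWorkerSurvivesInterrupt()` returns `true`: the worker only ends once the cooperative `stop` flag is set.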

This shutdown process generates multiple errors in the logs when there is activity on other nodes.

When starting a node:

RedisWorkQueue moves the suspended list to the scheduled queue, and the executors process the jobs (requires ~~NXP-19008~~)

We could summarize as follows:

On a local node, with in-memory queuing, works are canceled and lost because there is no persistence.
On a cluster node, with Redis queuing, all works are suspended and re-activated only at the next startup of that node. That's wrong, because these works do not belong to the node which is stopped.
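
The cluster case can be sketched as follows. This is a simplified model, assuming two in-memory deques stand in for the Redis-backed scheduled and suspended lists shared by all nodes; every name is hypothetical and none of it is the actual RedisWorkQueue API:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical model of the shared queue state; not Nuxeo code.
class SuspendSketch {
    // Stand-ins for the lists shared by every node of the cluster
    final Deque<String> scheduled = new ArrayDeque<>();
    final Deque<String> suspended = new ArrayDeque<>();

    /** Shutdown behavior described above: the whole scheduled queue is suspended. */
    void shutdownNode() {
        while (!scheduled.isEmpty()) {
            suspended.add(scheduled.poll());
        }
    }

    /** Startup behavior described above: the suspended list is moved back. */
    void startNode() {
        while (!suspended.isEmpty()) {
            scheduled.add(suspended.poll());
        }
    }

    /** Records what another, still running node gets from the shared queue. */
    static List<String> demo() {
        SuspendSketch shared = new SuspendSketch();
        shared.scheduled.add("job-from-node-B");
        List<String> seenByNodeB = new ArrayList<>();
        shared.shutdownNode(); // node A stops and suspends work it does not own
        seenByNodeB.add(String.valueOf(shared.scheduled.poll())); // nothing left to process
        shared.startNode(); // the job only reappears when node A restarts
        seenByNodeB.add(String.valueOf(shared.scheduled.poll()));
        return seenByNodeB;
    }
}
```

In this model `SuspendSketch.demo()` returns `[null, job-from-node-B]`: the other node cannot see its own job until the stopped node comes back.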

So we should not suspend scheduled jobs on shutdown; this state is not useful here.
The rejection policy is enough to stop new jobs from being added on shutdown, and running jobs should be rolled back and rescheduled.
There is no need to wait for termination or to implement a timeout, since we guarantee that the running jobs will never commit.
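
One possible shape for the proposed shutdown, as a sketch only: it uses a plain `java.util.concurrent.ThreadPoolExecutor` and an in-memory queue as a stand-in for the shared scheduled queue, and it assumes the work is blocking so that the interrupt is noticed. The class, the `work` helper and the job names are hypothetical; the actual fix may be structured differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration of the proposal; not the Nuxeo implementation.
class ShutdownSketch {

    /** A work that is rolled back and rescheduled, never committed, when interrupted. */
    static Runnable work(String id, Queue<String> shared,
                         CountDownLatch started, CountDownLatch done) {
        return () -> {
            started.countDown();
            try {
                Thread.sleep(60_000); // simulates long-running, blocking work
                // a commit would happen here; it is never reached on shutdown
            } catch (InterruptedException e) {
                // rollback (nothing to undo in this sketch), then hand the work back
                shared.add(id);
                Thread.currentThread().interrupt();
            } finally {
                done.countDown();
            }
        };
    }

    /** Returns the content of the shared scheduled queue after the node has stopped. */
    static List<String> demo() {
        Queue<String> shared = new ConcurrentLinkedQueue<>();
        shared.add("scheduled-by-other-node"); // must be left untouched by the shutdown
        ThreadPoolExecutor executor = new ThreadPoolExecutor(1, 1, 0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<>(), new ThreadPoolExecutor.DiscardPolicy());
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch done = new CountDownLatch(1);
        executor.execute(work("running-1", shared, started, done));
        try {
            started.await();
            executor.shutdownNow(); // interrupts the running work, no suspended state
            executor.execute(() -> shared.add("late")); // dropped by the rejection policy
            done.await(); // only waits for the rollback, not for completion
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new ArrayList<>(shared);
    }
}
```

Here `ShutdownSketch.demo()` returns `[scheduled-by-other-node, running-1]`: the job scheduled by another node stays available, the running job is back in the queue, and the late submission is discarded.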

Attachments

server-node-stop.log.gz
127 kB
2016-02-18 10:01

Issue Links

is required by

NXP-19042 WorkManager in cluster mode should not leave job in running state

Open

NXP-19618 Fix PublisherServiceImpl startup not to start a transaction when one is already active

Resolved

NXP-20305 Fix functional tests on 6.0 on mssql

Resolved

NXP-20306 Fix functional tests on 6.0 on postgresql

Resolved

People

Assignee:

Stéphane Lacoin

Reporter:

Stéphane Lacoin

Participants:

Jenkins, Stéphane Lacoin

Votes:

0

Watchers:

3

Dates

Created:

2016-02-18 09:18

Updated:

2016-08-19 14:24

Resolved:

2016-03-03 10:34