[NXP-28558] SchedulerService may fail when starting multiple Nuxeo nodes - Nuxeo Issue Tracker

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 9.10-HF42, 10.10-HF23, 11.1, 2021.0
Component/s: Core MongoDB, Scheduler

Release Notes Summary:
The scheduler service handles startup with multiple Nuxeo nodes.
Tags:
- nxfg
Impact type:

Configuration Change
Upgrade notes:
In cluster mode, the scheduler service is initialized non-concurrently in a cluster-wide critical section.

When a cluster node attempts to initialize the scheduler service while another node is already doing so, it waits up to 1 minute for the cluster-wide lock to be released, then performs its own initialization. If this timeout expires, initialization fails with an exception.

The following nuxeo.conf property can be used to change this timeout:

org.nuxeo.scheduler.cluster.start.duration=1m

If a node crashes during startup while holding the lock, it may be necessary to manually clean up the stale lock in the key/value store. The key corresponding to the lock is nuxeo:cluster:start-scheduler. For a MongoDB key/value store, the key is stored in the collection kv.cluster.
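
The behavior described in these notes can be illustrated with a minimal sketch. It is not the Nuxeo implementation: the class ClusterStartLockSketch, the method runExclusively, the in-memory map standing in for the shared key/value store, and the 10 ms polling interval are all assumptions made for illustration. Only the lock key name and the 1 minute default come from the notes above.

```java
import java.time.Duration;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.TimeoutException;

public class ClusterStartLockSketch {

    // Hypothetical stand-in for the shared key/value store; in a real cluster
    // this store would be visible to every node.
    static final ConcurrentMap<String, String> STORE = new ConcurrentHashMap<>();

    // Lock key name taken from the upgrade notes.
    static final String LOCK_KEY = "nuxeo:cluster:start-scheduler";

    /** Runs init while holding the lock, waiting at most timeout to acquire it. */
    static void runExclusively(String nodeId, Duration timeout, Runnable init)
            throws InterruptedException, TimeoutException {
        long deadline = System.nanoTime() + timeout.toNanos();
        // putIfAbsent is atomic: only one caller sees null and becomes the holder.
        while (STORE.putIfAbsent(LOCK_KEY, nodeId) != null) {
            if (System.nanoTime() - deadline >= 0) {
                throw new TimeoutException("Timed out waiting for " + LOCK_KEY);
            }
            Thread.sleep(10); // assumed polling interval
        }
        try {
            init.run();
        } finally {
            // Release only our own lock. A node that crashes before reaching
            // this point leaves the key behind, hence the manual cleanup.
            STORE.remove(LOCK_KEY, nodeId);
        }
    }

    public static void main(String[] args) throws Exception {
        runExclusively("nodeA", Duration.ofMinutes(1),
                () -> System.out.println("nodeA initialized the scheduler"));

        // Simulate a node that crashed while holding the lock.
        STORE.put(LOCK_KEY, "crashedNode");
        try {
            runExclusively("nodeB", Duration.ofMillis(50), () -> { });
        } catch (TimeoutException e) {
            System.out.println("nodeB failed: " + e.getMessage());
        }

        // Manual cleanup of the stale key lets the next attempt succeed.
        STORE.remove(LOCK_KEY);
        runExclusively("nodeB", Duration.ofMillis(50),
                () -> System.out.println("nodeB initialized the scheduler"));
    }
}
```

The sketch also shows why a stale key blocks every later startup until it is removed: the lock has no owner left to release it.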
Team:
FG
Sprint:
nxFG 11.1.12

Description

Starting at least two Nuxeo nodes at the same time can lead to the error below on one of the nodes. Restarting the failing node fixes the issue.

2020-01-21T11:42:04,167 ERROR [main] [org.nuxeo.runtime.model.ComponentManager] Component service:org.nuxeo.ecm.core.scheduler.SchedulerService notification of application started failed: null
java.lang.NullPointerException: null
        at com.novemberain.quartz.mongodb.dao.JobDao.storeJobInMongo(JobDao.java:134) ~[quartz-mongodb-2.0.0-NX3.jar:?]
        at com.novemberain.quartz.mongodb.TriggerAndJobPersister.storeJobAndTrigger(TriggerAndJobPersister.java:106) ~[quartz-mongodb-2.0.0-NX3.jar:?]
        at com.novemberain.quartz.mongodb.MongoDBJobStore.storeJobAndTrigger(MongoDBJobStore.java:189) ~[quartz-mongodb-2.0.0-NX3.jar:?]
        at org.quartz.core.QuartzScheduler.scheduleJob(QuartzScheduler.java:855) ~[quartz-2.3.0.jar:?]
        at org.quartz.impl.StdScheduler.scheduleJob(StdScheduler.java:249) ~[quartz-2.3.0.jar:?]
        at org.nuxeo.ecm.core.scheduler.SchedulerServiceImpl.schedule(SchedulerServiceImpl.java:240) ~[nuxeo-core-event-10.10-HF18.jar:?]
        at org.nuxeo.ecm.core.scheduler.SchedulerServiceImpl.schedule(SchedulerServiceImpl.java:213) ~[nuxeo-core-event-10.10-HF18.jar:?]
        at org.nuxeo.ecm.core.scheduler.SchedulerServiceImpl.registerSchedule(SchedulerServiceImpl.java:195) ~[nuxeo-core-event-10.10-HF18.jar:?]
        at org.nuxeo.ecm.core.scheduler.SchedulerServiceImpl.registerSchedule(SchedulerServiceImpl.java:184) ~[nuxeo-core-event-10.10-HF18.jar:?]
        at org.nuxeo.ecm.core.scheduler.SchedulerServiceImpl.setupScheduler(SchedulerServiceImpl.java:113) ~[nuxeo-core-event-10.10-HF18.jar:?]
        at org.nuxeo.ecm.core.scheduler.SchedulerServiceImpl.start(SchedulerServiceImpl.java:143) ~[nuxeo-core-event-10.10-HF18.jar:?]
        at org.nuxeo.runtime.model.impl.RegistrationInfoImpl.start(RegistrationInfoImpl.java:381) [nuxeo-runtime-10.10-HF10.jar:?]

Looking at the library code, this appears to be a concurrency issue: the job is created by nodeA, nodeB fails to create the same job, and nodeB then fails to retrieve the job (probably deleted?).
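
The suspected interleaving can be reproduced in a small simulation. This is only a sketch of the hypothesis above, not the quartz-mongodb code: the class JobStoreRaceSketch, the method storeJob, the job name, and the in-memory map standing in for the jobs collection are all hypothetical, and the deletion between insert and fetch is forced through a callback.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class JobStoreRaceSketch {

    // Hypothetical stand-in for the shared jobs collection, keyed by job name.
    static final Map<String, String> JOBS = new ConcurrentHashMap<>();

    /**
     * Insert-or-fetch without any cluster-wide lock. The callback marks the
     * window between the failed insert and the fetch in which another node can act.
     */
    static String storeJob(String name, String id, Runnable betweenInsertAndFetch) {
        String existing = JOBS.putIfAbsent(name, id);
        if (existing == null) {
            return id; // our insert won
        }
        // Duplicate: another node already stored the job, so fetch it instead.
        betweenInsertAndFetch.run();
        String stored = JOBS.get(name);
        // Dereferencing the result throws NullPointerException if the job
        // disappeared during the window.
        return stored.toString();
    }

    public static void main(String[] args) {
        // nodeA creates the job first.
        storeJob("someJob", "id-A", () -> { });
        try {
            // nodeB fails to create the same job; the job is deleted before the fetch.
            storeJob("someJob", "id-B", () -> JOBS.remove("someJob"));
        } catch (NullPointerException e) {
            System.out.println("nodeB failed with NullPointerException");
        }
    }
}
```

Serializing initialization across nodes closes this window, since no other node can modify the job between the insert attempt and the fetch.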

We want to have a look at this and fix it at least in our fork.

Issue Links

is related to

NXP-28661 Allow concurrent startup of Nuxeo instances

Resolved

NXP-26621 SchedulerService may fail to start when launching multiple nodes simultaneously

Resolved

Is referenced in

PR for master: #4165

PR for master: #4294

People

Assignee:

Florent Guillaume

Reporter:

Kevin Leturc

Participants:

Florent Guillaume, Jenkins, Kevin Leturc, Support Tech User

Dates

Created:

2020-01-22 16:44

Updated:

2020-12-17 16:37

Resolved:

2020-02-26 11:12
