  Nuxeo Platform / NXP-28558

SchedulerService may fail when starting multiple Nuxeo nodes


    Details

    • Release Notes Summary:
      The scheduler service now handles concurrent startup of multiple Nuxeo nodes.
    • Tags:
    • Impact type:
      Configuration Change
    • Upgrade notes:

      In cluster mode, the scheduler service is initialized by one node at a time, inside a cluster-wide critical section.

      When a cluster node attempts to initialize the scheduler service while another node is already doing so, it waits up to 1 minute for the cluster-wide lock to be released before doing its own initialization. If this timeout expires, initialization fails with an exception.

      The following nuxeo.conf property can be used to change this timeout:

      org.nuxeo.scheduler.cluster.start.duration=1m
      

      If a node crashes during startup while holding the lock, it may be necessary to manually remove the lock from the key/value store. The key corresponding to the lock is nuxeo:cluster:start-scheduler. For a MongoDB key/value store, the key is stored in the kv.cluster collection.

    • Team:
      FG
    • Sprint:
      nxFG 11.1.12
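The cluster-wide critical section described in the upgrade notes can be illustrated with a minimal Java sketch. This is not Nuxeo's implementation: the in-memory ConcurrentHashMap is a hypothetical stand-in for the shared key/value store, and the class and method names (ClusterCriticalSection, tryAcquire, release, runExclusively) are invented for illustration. A node that cannot obtain the lock before the timeout (1 minute by default, per org.nuxeo.scheduler.cluster.start.duration) fails with an exception, as described above. The sketch also shows why a crash matters: a lock entry that is never released blocks every other node until it is removed, which is the situation the manual cleanup of nuxeo:cluster:start-scheduler addresses.

```java
import java.time.Duration;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/**
 * Sketch of a cluster-wide critical section guarded by a lock entry in a shared
 * key/value store. The in-memory map is a hypothetical stand-in for the real
 * cluster store; all names here are illustrative only.
 */
final class ClusterCriticalSection {

    // Hypothetical shared store: key -> id of the node holding the lock.
    private final ConcurrentMap<String, String> store = new ConcurrentHashMap<>();

    /** Tries to take the lock, polling until it is free or the timeout expires. */
    boolean tryAcquire(String key, String nodeId, Duration timeout) throws InterruptedException {
        long deadline = System.nanoTime() + timeout.toNanos();
        // putIfAbsent is atomic: exactly one caller can create the lock entry.
        while (store.putIfAbsent(key, nodeId) != null) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // another node still holds the lock
            }
            Thread.sleep(20); // wait a little before polling again
        }
        return true;
    }

    /** Releases the lock only if it is held by this node; otherwise does nothing. */
    void release(String key, String nodeId) {
        store.remove(key, nodeId);
    }

    /** Runs the initialization inside the critical section, failing on timeout. */
    void runExclusively(String key, String nodeId, Duration timeout, Runnable init)
            throws InterruptedException {
        if (!tryAcquire(key, nodeId, timeout)) {
            throw new IllegalStateException("Timed out waiting for lock " + key);
        }
        try {
            init.run();
        } finally {
            // If the node crashed before reaching this point, the entry would stay
            // in the store and would have to be removed manually.
            release(key, nodeId);
        }
    }
}
```

Polling on an atomic insert keeps the sketch short; a real shared store needs an equivalent atomic conditional write (insert-if-absent or compare-and-set) to give the same mutual-exclusion guarantee across nodes.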

      Description

      Starting two or more Nuxeo nodes at the same time can make one of them fail with the error below. Restarting the failing node fixes the issue.

      2020-01-21T11:42:04,167 ERROR [main] [org.nuxeo.runtime.model.ComponentManager] Component service:org.nuxeo.ecm.core.scheduler.SchedulerService notification of application started failed: null
      java.lang.NullPointerException: null
              at com.novemberain.quartz.mongodb.dao.JobDao.storeJobInMongo(JobDao.java:134) ~[quartz-mongodb-2.0.0-NX3.jar:?]
              at com.novemberain.quartz.mongodb.TriggerAndJobPersister.storeJobAndTrigger(TriggerAndJobPersister.java:106) ~[quartz-mongodb-2.0.0-NX3.jar:?]
              at com.novemberain.quartz.mongodb.MongoDBJobStore.storeJobAndTrigger(MongoDBJobStore.java:189) ~[quartz-mongodb-2.0.0-NX3.jar:?]
              at org.quartz.core.QuartzScheduler.scheduleJob(QuartzScheduler.java:855) ~[quartz-2.3.0.jar:?]
              at org.quartz.impl.StdScheduler.scheduleJob(StdScheduler.java:249) ~[quartz-2.3.0.jar:?]
              at org.nuxeo.ecm.core.scheduler.SchedulerServiceImpl.schedule(SchedulerServiceImpl.java:240) ~[nuxeo-core-event-10.10-HF18.jar:?]
              at org.nuxeo.ecm.core.scheduler.SchedulerServiceImpl.schedule(SchedulerServiceImpl.java:213) ~[nuxeo-core-event-10.10-HF18.jar:?]
              at org.nuxeo.ecm.core.scheduler.SchedulerServiceImpl.registerSchedule(SchedulerServiceImpl.java:195) ~[nuxeo-core-event-10.10-HF18.jar:?]
              at org.nuxeo.ecm.core.scheduler.SchedulerServiceImpl.registerSchedule(SchedulerServiceImpl.java:184) ~[nuxeo-core-event-10.10-HF18.jar:?]
              at org.nuxeo.ecm.core.scheduler.SchedulerServiceImpl.setupScheduler(SchedulerServiceImpl.java:113) ~[nuxeo-core-event-10.10-HF18.jar:?]
              at org.nuxeo.ecm.core.scheduler.SchedulerServiceImpl.start(SchedulerServiceImpl.java:143) ~[nuxeo-core-event-10.10-HF18.jar:?]
              at org.nuxeo.runtime.model.impl.RegistrationInfoImpl.start(RegistrationInfoImpl.java:381) [nuxeo-runtime-10.10-HF10.jar:?]
      

      Looking at the library code, this appears to be a concurrency issue: nodeA creates the job, nodeB fails to create the same job, and nodeB then fails to retrieve it (possibly because it was deleted in the meantime?).
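The race above (insert fails because the job exists, then the lookup finds nothing) can be illustrated with a small Java sketch of one defensive pattern: treat a duplicate as success, and retry when the job disappears between the failed insert and the lookup instead of dereferencing a missing result. This is not the quartz-mongodb code or the fix applied in the fork; SharedJobStore, JobRegistration and createOrGet are hypothetical names, and the in-memory map only models a store where insert and lookup are two separate operations.

```java
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical shared job store: insert and lookup are separate calls, as with a
// database-backed store, so another node can act between them.
class SharedJobStore {
    protected final ConcurrentMap<String, String> jobs = new ConcurrentHashMap<>();

    /** Inserts the job; returns false if a job with this name already exists. */
    boolean insert(String name, String definition) {
        return jobs.putIfAbsent(name, definition) == null;
    }

    /** Looks the job up; empty if it does not exist (for example, it was just deleted). */
    Optional<String> find(String name) {
        return Optional.ofNullable(jobs.get(name));
    }

    void delete(String name) {
        jobs.remove(name);
    }
}

final class JobRegistration {
    private JobRegistration() {}

    /**
     * Registers a job idempotently: a duplicate is not an error, and a job that
     * vanishes between the failed insert and the lookup triggers a retry.
     */
    static String createOrGet(SharedJobStore store, String name, String definition, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (store.insert(name, definition)) {
                return definition; // this node created the job
            }
            Optional<String> existing = store.find(name);
            if (existing.isPresent()) {
                return existing.get(); // another node created it first
            }
            // The job was deleted after our insert failed: try to insert it again.
        }
        throw new IllegalStateException("Could not register job " + name);
    }
}
```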

      We want to investigate this and fix it, at least in our fork.


              People

              • Votes: 0
                Watchers: 3


                  Time Tracking

                  Estimated: Not Specified
                  Remaining: 0m
                  Logged: 2h