Nuxeo Platform / NXP-28931

WorkTest stuck against PostgreSQL / StreamWorkManager / Kafka


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 10.10-HF26, 11.1, 2021.0
    • Component/s: CI/CD

      Description

      When run against PostgreSQL with the StreamWorkManager backed by Kafka, the WorkTest gets stuck with the logs below; see the attached jstack dump and PostgreSQL report.

      [INFO] -------------------------------------------------------
      [INFO]  T E S T S
      [INFO] -------------------------------------------------------
      RuntimeStreamFeature: Deploying Nuxeo Stream using Kafka
      WorkManagerFeature: Deploying WorkManager using stream implementation
      StorageConfiguration: Deploying JDBC using DatabasePostgreSQL
      StorageConfiguration: Deploying a VCS repository
      2020-04-15 09:16:45,718 [main] WARN  [TransientStorageComponent] Missing configuration for default transient store, using in-memory
      2020-04-15 09:16:45,884 [main] ERROR [OSGiRuntimeService] Nuxeo Platform Started
      ======================================================================
      = Component Loading Warnings:
        * Missing configuration for default transient store, using in-memory
      ======================================================================
      = Component Loading Status: Pending: 0 / Missing: 2 / Unstarted: 0 / Total: 114
        * service:org.nuxeo.ecm.core.coldstorage.contrib references missing [target=org.nuxeo.ecm.platform.ec.notification.service.NotificationService;point=notifications, target=org.nuxeo.ecm.platform.ec.notification.service.NotificationService;point=templates]
        * service:org.nuxeo.runtime.stream.service.managment.contrib references missing [target=org.nuxeo.ecm.core.management.CoreManagementComponent;point=probes, target=org.nuxeo.ecm.core.management.CoreManagementComponent;point=healthCheck]
      ======================================================================
      [INFO] Running org.nuxeo.ecm.core.event.test.WorkTest
      2020-04-15 09:17:07,244 [kafka-coordinator-heartbeat-thread | nuxeo-test-1586935003463-default] WARN  [AbstractCoordinator$HeartbeatThread] [Consumer clientId=default-2, groupId=nuxeo-test-1586935003463-default] This member will leave the group because consumer poll timeout has expired. This means the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time processing messages. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
      2020-04-15 09:17:27,612 [kafka-coordinator-heartbeat-thread | nuxeo-test-1586935003463-default] WARN  [AbstractCoordinator$HeartbeatThread] [Consumer clientId=default-3, groupId=nuxeo-test-1586935003463-default] This member will leave the group because consumer poll timeout has expired. This means the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time processing messages. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
      2020-04-15 09:17:47,822 [kafka-coordinator-heartbeat-thread | nuxeo-test-1586935003463-default] WARN  [AbstractCoordinator$HeartbeatThread] [Consumer clientId=default-1, groupId=nuxeo-test-1586935003463-default] This member will leave the group because consumer poll timeout has expired. This means the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time processing messages. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
      2020-04-15 09:18:08,532 [kafka-coordinator-heartbeat-thread | nuxeo-test-1586935003463-default] WARN  [AbstractCoordinator$HeartbeatThread] [Consumer clientId=default-4, groupId=nuxeo-test-1586935003463-default] This member will leave the group because consumer poll timeout has expired. This means the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time processing messages. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
      

      The problem is a deadlock caused by the following SQL statement, executed 4 times concurrently:

      DELETE FROM "hierarchy" WHERE "id" = $1
      

      As shown in the PostgreSQL report:

      blocked_pid | blocked_user |           blocking_statement            | blocking_duration | blocking_pid | blocking_user |            blocked_statement            | blocked_duration 
      -------------+--------------+-----------------------------------------+-------------------+--------------+---------------+-----------------------------------------+------------------
             15323 | nuxeo        | DELETE FROM "hierarchy" WHERE "id" = $1 | 00:01:08.874771   |        15324 | nuxeo         | DELETE FROM "hierarchy" WHERE "id" = $1 | 00:01:08.87208
      
      datid |     datname     |  pid  | usesysid | usename  |    application_name    | client_addr | client_hostname | client_port |         backend_start         |          xact_start           |          query_start          |         state_change          | wait_event_type |     wait_event      |        state        | backend_xid | backend_xmin |                      query                      |    backend_type     
      -------+-----------------+-------+----------+----------+------------------------+-------------+-----------------+-------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-----------------+---------------------+---------------------+-------------+--------------+-------------------------------------------------+---------------------
             |                 | 27510 |          |          |                        |             |                 |             | 2020-04-14 19:36:13.534513+02 |                               |                               |                               | Activity        | AutoVacuumMain      |                     |             |              |                                                 | autovacuum launcher
             |                 | 27512 |       10 | postgres |                        |             |                 |             | 2020-04-14 19:36:13.537269+02 |                               |                               |                               | Activity        | LogicalLauncherMain |                     |             |              |                                                 | background worker
       37175 | nuxeojunittests | 15323 |    16384 | nuxeo    | PostgreSQL JDBC Driver | 127.0.0.1   |                 |       41792 | 2020-04-15 09:16:44.254229+02 | 2020-04-15 09:17:07.672647+02 | 2020-04-15 09:17:07.693248+02 | 2020-04-15 09:17:07.693251+02 | Lock            | transactionid       | active              |       11084 |        11083 | DELETE FROM "hierarchy" WHERE "id" = $1         | client backend
       37175 | nuxeojunittests | 15324 |    16384 | nuxeo    | PostgreSQL JDBC Driver | 127.0.0.1   |                 |       41794 | 2020-04-15 09:16:44.281756+02 | 2020-04-15 09:17:07.672647+02 | 2020-04-15 09:17:07.690557+02 | 2020-04-15 09:17:07.694158+02 | Client          | ClientRead          | idle in transaction |       11083 |              | DELETE FROM "hierarchy" WHERE "id" = $1         | client backend
       37175 | nuxeojunittests | 15325 |    16384 | nuxeo    | PostgreSQL JDBC Driver | 127.0.0.1   |                 |       41796 | 2020-04-15 09:16:44.298111+02 | 2020-04-15 09:17:48.480821+02 | 2020-04-15 09:17:48.507306+02 | 2020-04-15 09:17:48.507308+02 | Lock            | tuple               | active              |       11090 |        11083 | DELETE FROM "hierarchy" WHERE "id" = $1         | client backend
       37175 | nuxeojunittests | 15326 |    16384 | nuxeo    | PostgreSQL JDBC Driver | 127.0.0.1   |                 |       41798 | 2020-04-15 09:16:44.309807+02 |                               | 2020-04-15 09:16:44.320103+02 | 2020-04-15 09:16:44.320119+02 | Client          | ClientRead          | idle                |             |              | SET application_name = 'PostgreSQL JDBC Driver' | client backend
       37175 | nuxeojunittests | 15327 |    16384 | nuxeo    | PostgreSQL JDBC Driver | 127.0.0.1   |                 |       41800 | 2020-04-15 09:16:44.321193+02 | 2020-04-15 09:17:27.879553+02 | 2020-04-15 09:17:27.900379+02 | 2020-04-15 09:17:27.90038+02  | Lock            | tuple               | active              |       11085 |        11083 | DELETE FROM "hierarchy" WHERE "id" = $1         | client backend
      

      We can see that two transactions started at the same time (xact_start = 2020-04-15 09:17:07.672647+02): pid 15323 is blocked on a lock (wait_event = transactionid) held by pid 15324, which is itself idle in transaction, waiting on its client.
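
      For illustration, here is a minimal sketch of this kind of contention, using a hypothetical document id and two manual psql sessions (in the test the statements come from concurrent stream work manager workers, not from manual sessions):

      -- Session 1: delete a hierarchy row and keep the transaction open
      BEGIN;
      DELETE FROM "hierarchy" WHERE "id" = 'doc-1';  -- takes a row lock
      -- ... transaction stays open while the client is busy elsewhere ...

      -- Session 2, started concurrently: the same delete
      BEGIN;
      DELETE FROM "hierarchy" WHERE "id" = 'doc-1';  -- blocks until session 1 commits or rolls back

      While a worker thread is blocked on such a statement it no longer polls its Kafka consumer, which is consistent with the poll timeout warnings above.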

      To reproduce the issue:

      Have PostgreSQL running with a nuxeojunittests database

      sudo su - postgres
      psql
      create role nuxeo with password 'nuxeo' login;
      create database nuxeojunittests encoding 'utf8' owner nuxeo;
      \q
      exit
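
      Optionally, a quick sanity check (standard catalog queries, not part of the original steps) from psql that the role and database exist:

      -- each query should return exactly one row
      SELECT rolname FROM pg_roles WHERE rolname = 'nuxeo';
      SELECT datname FROM pg_database WHERE datname = 'nuxeojunittests';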
      

      Have Kafka running

      git clone https://github.com/wurstmeister/kafka-docker.git
      cd kafka-docker/
      

      Update the following section in docker-compose.yml:

      services:
        kafka:
          ports:
            - "9092:9092"
          environment:
            KAFKA_ADVERTISED_LISTENERS: INSIDE://kafka:9093,OUTSIDE://localhost:9092
            KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: INSIDE:PLAINTEXT,OUTSIDE:PLAINTEXT
            KAFKA_LISTENERS: INSIDE://0.0.0.0:9093,OUTSIDE://0.0.0.0:9092
            KAFKA_INTER_BROKER_LISTENER_NAME: INSIDE
            KAFKA_ADVERTISED_HOST_NAME: 192.168.99.100
            KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      

      Start Kafka

      docker-compose up -d
      

      Run the test

      cd ~/sources/nuxeo
      mvn test -nsu \
        -pl modules/core/nuxeo-core-test \
        -Dtest=WorkTest \
        -Dnuxeo.test.core=vcs \
        -Dnuxeo.test.vcs.db=PostgreSQL \
        -Dnuxeo.test.workmanager=stream \
        -Dnuxeo.test.stream=kafka
      

      Wait for the first Kafka warnings

      2020-04-15 09:54:09,332 [kafka-coordinator-heartbeat-thread | nuxeo-test-1586937225214-default] WARN  [AbstractCoordinator$HeartbeatThread] [Consumer clientId=default-2, groupId=nuxeo-test-1586937225214-default] This member will leave the group because consumer poll timeout has expired. This means the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time processing messages. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
      

      Observe the problem

      jps
      14963 surefirebooter9268064286344412207.jar
      14860 Launcher
      4318 org.eclipse.equinox.launcher_1.5.500.v20190715-1310.jar
      15391 Jps
      
      jstack 14963 > jstack
      vi jstack
      

      Observe the PostgreSQL report

      sudo su - postgres
      wget --no-check-certificate https://gist.github.com/bdelbosc/5507796/raw/dump-nuxeo-postgres-config.sql
      psql nuxeojunittests -f dump-nuxeo-postgres-config.sql
      exit
      vi /tmp/pgconf.txt
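
      If the gist is not reachable, a query along these lines gives a similar blocked/blocking view (a sketch based on the standard pg_stat_activity view and pg_blocking_pids(), available since PostgreSQL 9.6; it is not the exact query from the dump script):

      -- For each blocked backend, list the backend(s) holding the lock it is waiting on.
      SELECT blocked.pid                 AS blocked_pid,
             blocked.query               AS blocked_statement,
             now() - blocked.query_start AS blocked_duration,
             blocking.pid                AS blocking_pid,
             blocking.state              AS blocking_state,
             blocking.query              AS blocking_statement
        FROM pg_stat_activity AS blocked
        JOIN LATERAL unnest(pg_blocking_pids(blocked.pid)) AS b(pid) ON true
        JOIN pg_stat_activity AS blocking ON blocking.pid = b.pid;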
      

      Stop Kafka

      docker-compose stop
      

        Attachments

        1. jstack (58 kB)
        2. pgconf.txt (70 kB)

              People

              • Votes: 0
              • Watchers: 4

                  Time Tracking

                  Estimated: Not Specified
                  Remaining: 0m
                  Logged: 5h