Nuxeo Platform / NXP-28931

WorkTest stuck against PostgreSQL / StreamWorkManager / Kafka


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 10.10-HF26, 11.1, 2021.0
    • Component/s: CI/CD

      Description

      When run against PostgreSQL with the StreamWorkManager and Kafka, WorkTest gets stuck with the logs below; see the attached jstack and PostgreSQL report.

      [INFO] -------------------------------------------------------
      [INFO]  T E S T S
      [INFO] -------------------------------------------------------
      RuntimeStreamFeature: Deploying Nuxeo Stream using Kafka
      WorkManagerFeature: Deploying WorkManager using stream implementation
      StorageConfiguration: Deploying JDBC using DatabasePostgreSQL
      StorageConfiguration: Deploying a VCS repository
      2020-04-15 09:16:45,718 [main] WARN  [TransientStorageComponent] Missing configuration for default transient store, using in-memory
      2020-04-15 09:16:45,884 [main] ERROR [OSGiRuntimeService] Nuxeo Platform Started
      ======================================================================
      = Component Loading Warnings:
        * Missing configuration for default transient store, using in-memory
      ======================================================================
      = Component Loading Status: Pending: 0 / Missing: 2 / Unstarted: 0 / Total: 114
        * service:org.nuxeo.ecm.core.coldstorage.contrib references missing [target=org.nuxeo.ecm.platform.ec.notification.service.NotificationService;point=notifications, target=org.nuxeo.ecm.platform.ec.notification.service.NotificationService;point=templates]
        * service:org.nuxeo.runtime.stream.service.managment.contrib references missing [target=org.nuxeo.ecm.core.management.CoreManagementComponent;point=probes, target=org.nuxeo.ecm.core.management.CoreManagementComponent;point=healthCheck]
      ======================================================================
      [INFO] Running org.nuxeo.ecm.core.event.test.WorkTest
      2020-04-15 09:17:07,244 [kafka-coordinator-heartbeat-thread | nuxeo-test-1586935003463-default] WARN  [AbstractCoordinator$HeartbeatThread] [Consumer clientId=default-2, groupId=nuxeo-test-1586935003463-default] This member will leave the group because consumer poll timeout has expired. This means the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time processing messages. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
      2020-04-15 09:17:27,612 [kafka-coordinator-heartbeat-thread | nuxeo-test-1586935003463-default] WARN  [AbstractCoordinator$HeartbeatThread] [Consumer clientId=default-3, groupId=nuxeo-test-1586935003463-default] This member will leave the group because consumer poll timeout has expired. This means the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time processing messages. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
      2020-04-15 09:17:47,822 [kafka-coordinator-heartbeat-thread | nuxeo-test-1586935003463-default] WARN  [AbstractCoordinator$HeartbeatThread] [Consumer clientId=default-1, groupId=nuxeo-test-1586935003463-default] This member will leave the group because consumer poll timeout has expired. This means the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time processing messages. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
      2020-04-15 09:18:08,532 [kafka-coordinator-heartbeat-thread | nuxeo-test-1586935003463-default] WARN  [AbstractCoordinator$HeartbeatThread] [Consumer clientId=default-4, groupId=nuxeo-test-1586935003463-default] This member will leave the group because consumer poll timeout has expired. This means the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time processing messages. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
      

      The problem is a deadlock caused by the following SQL statement, executed 4 times concurrently:

      DELETE FROM "hierarchy" WHERE "id" = $1
      

      As shown in the PostgreSQL report:

      blocked_pid | blocked_user |           blocking_statement            | blocking_duration | blocking_pid | blocking_user |            blocked_statement            | blocked_duration 
      -------------+--------------+-----------------------------------------+-------------------+--------------+---------------+-----------------------------------------+------------------
             15323 | nuxeo        | DELETE FROM "hierarchy" WHERE "id" = $1 | 00:01:08.874771   |        15324 | nuxeo         | DELETE FROM "hierarchy" WHERE "id" = $1 | 00:01:08.87208
      
      datid |     datname     |  pid  | usesysid | usename  |    application_name    | client_addr | client_hostname | client_port |         backend_start         |          xact_start           |          query_start          |         state_change          | wait_event_type |     wait_event      |        state        | backend_xid | backend_xmin |                      query                      |    backend_type     
      -------+-----------------+-------+----------+----------+------------------------+-------------+-----------------+-------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-----------------+---------------------+---------------------+-------------+--------------+-------------------------------------------------+---------------------
             |                 | 27510 |          |          |                        |             |                 |             | 2020-04-14 19:36:13.534513+02 |                               |                               |                               | Activity        | AutoVacuumMain      |                     |             |              |                                                 | autovacuum launcher
             |                 | 27512 |       10 | postgres |                        |             |                 |             | 2020-04-14 19:36:13.537269+02 |                               |                               |                               | Activity        | LogicalLauncherMain |                     |             |              |                                                 | background worker
       37175 | nuxeojunittests | 15323 |    16384 | nuxeo    | PostgreSQL JDBC Driver | 127.0.0.1   |                 |       41792 | 2020-04-15 09:16:44.254229+02 | 2020-04-15 09:17:07.672647+02 | 2020-04-15 09:17:07.693248+02 | 2020-04-15 09:17:07.693251+02 | Lock            | transactionid       | active              |       11084 |        11083 | DELETE FROM "hierarchy" WHERE "id" = $1         | client backend
       37175 | nuxeojunittests | 15324 |    16384 | nuxeo    | PostgreSQL JDBC Driver | 127.0.0.1   |                 |       41794 | 2020-04-15 09:16:44.281756+02 | 2020-04-15 09:17:07.672647+02 | 2020-04-15 09:17:07.690557+02 | 2020-04-15 09:17:07.694158+02 | Client          | ClientRead          | idle in transaction |       11083 |              | DELETE FROM "hierarchy" WHERE "id" = $1         | client backend
       37175 | nuxeojunittests | 15325 |    16384 | nuxeo    | PostgreSQL JDBC Driver | 127.0.0.1   |                 |       41796 | 2020-04-15 09:16:44.298111+02 | 2020-04-15 09:17:48.480821+02 | 2020-04-15 09:17:48.507306+02 | 2020-04-15 09:17:48.507308+02 | Lock            | tuple               | active              |       11090 |        11083 | DELETE FROM "hierarchy" WHERE "id" = $1         | client backend
       37175 | nuxeojunittests | 15326 |    16384 | nuxeo    | PostgreSQL JDBC Driver | 127.0.0.1   |                 |       41798 | 2020-04-15 09:16:44.309807+02 |                               | 2020-04-15 09:16:44.320103+02 | 2020-04-15 09:16:44.320119+02 | Client          | ClientRead          | idle                |             |              | SET application_name = 'PostgreSQL JDBC Driver' | client backend
       37175 | nuxeojunittests | 15327 |    16384 | nuxeo    | PostgreSQL JDBC Driver | 127.0.0.1   |                 |       41800 | 2020-04-15 09:16:44.321193+02 | 2020-04-15 09:17:27.879553+02 | 2020-04-15 09:17:27.900379+02 | 2020-04-15 09:17:27.90038+02  | Lock            | tuple               | active              |       11085 |        11083 | DELETE FROM "hierarchy" WHERE "id" = $1         | client backend
      

      We can see that two of the transactions started at exactly the same time (xact_start = 2020-04-15 09:17:07.672647+02): pid 15324 holds the lock but sits idle in transaction, pid 15323 is blocked waiting on its transaction id, and pids 15327 and 15325 queue up behind them on the tuple lock.
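
      A quick way to list the same blocked/blocking relationship directly, assuming PostgreSQL 9.6 or later (pg_blocking_pids is only available from 9.6):

      SELECT pid,
             pg_blocking_pids(pid) AS blocked_by,
             state,
             wait_event_type,
             query
        FROM pg_stat_activity
       WHERE cardinality(pg_blocking_pids(pid)) > 0;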

      To reproduce the issue:

      Have PostgreSQL running with a nuxeojunittests database

      sudo su - postgres
      psql
      create role nuxeo with password 'nuxeo' login;
      create database nuxeojunittests encoding 'utf8' owner nuxeo;
      \q
      exit
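
      Optionally, verify that the nuxeo role can reach the new database before running the test (this assumes the test suite connects to localhost on the default port):

      psql -h localhost -U nuxeo nuxeojunittests -c "SELECT 1;"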
      

      Have Kafka running

      git clone https://github.com/wurstmeister/kafka-docker.git
      cd kafka-docker/
      

      Update the following section in docker-compose.yml:

      services:
        kafka:
          ports:
            - "9092:9092"
          environment:
            KAFKA_ADVERTISED_LISTENERS: INSIDE://kafka:9093,OUTSIDE://localhost:9092
            KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: INSIDE:PLAINTEXT,OUTSIDE:PLAINTEXT
            KAFKA_LISTENERS: INSIDE://0.0.0.0:9093,OUTSIDE://0.0.0.0:9092
            KAFKA_INTER_BROKER_LISTENER_NAME: INSIDE
            KAFKA_ADVERTISED_HOST_NAME: 192.168.99.100
            KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      

      Start Kafka

      docker-compose up -d
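
      Optionally, check that the broker answers on port 9092 (this assumes the wurstmeister image, which ships the Kafka CLI tools on the container PATH):

      docker-compose exec kafka kafka-topics.sh --bootstrap-server localhost:9092 --list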
      

      Run the test

      cd ~/sources/nuxeo
      mvn test -nsu \
        -pl modules/core/nuxeo-core-test \
        -Dtest=WorkTest \
        -Dnuxeo.test.core=vcs \
        -Dnuxeo.test.vcs.db=PostgreSQL \
        -Dnuxeo.test.workmanager=stream \
        -Dnuxeo.test.stream=kafka
      

      Wait for the first Kafka warnings

      2020-04-15 09:54:09,332 [kafka-coordinator-heartbeat-thread | nuxeo-test-1586937225214-default] WARN  [AbstractCoordinator$HeartbeatThread] [Consumer clientId=default-2, groupId=nuxeo-test-1586937225214-default] This member will leave the group because consumer poll timeout has expired. This means the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time processing messages. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
      

      Observe the problem

      jps
      14963 surefirebooter9268064286344412207.jar
      14860 Launcher
      4318 org.eclipse.equinox.launcher_1.5.500.v20190715-1310.jar
      15391 Jps
      
      jstack 14963 > jstack
      vi jstack
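
      One way to spot the threads stuck on the database in the dump, assuming the stack traces go through the standard PostgreSQL JDBC driver packages:

      grep -n -B 5 -A 20 "org.postgresql" jstack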
      

      Observe the PostgreSQL report

      sudo su - postgres
      wget --no-check-certificate https://gist.github.com/bdelbosc/5507796/raw/dump-nuxeo-postgres-config.sql
      psql nuxeojunittests -f dump-nuxeo-postgres-config.sql
      exit
      vi /tmp/pgconf.txt
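
      If you need to release the hang manually, terminating the backend that is idle in transaction and holding the lock unblocks the waiting DELETEs (the pid below comes from the report above and will differ on every run; this is a hypothetical cleanup step, not part of the original reproduction):

      sudo su - postgres
      psql nuxeojunittests -c "SELECT pg_terminate_backend(15324);"
      exit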
      

      Stop Kafka

      docker-compose stop
      

        Attachments

        1. pgconf.txt
          70 kB
        2. jstack
          58 kB
