When run in such conditions, the test gets stuck with the logs below; see the attached jstack output and PostgreSQL report.
[INFO] -------------------------------------------------------
[INFO] T E S T S
[INFO] -------------------------------------------------------
RuntimeStreamFeature: Deploying Nuxeo Stream using Kafka
WorkManagerFeature: Deploying WorkManager using stream implementation
StorageConfiguration: Deploying JDBC using DatabasePostgreSQL
StorageConfiguration: Deploying a VCS repository
2020-04-15 09:16:45,718 [main] WARN [TransientStorageComponent] Missing configuration for default transient store, using in-memory
2020-04-15 09:16:45,884 [main] ERROR [OSGiRuntimeService] Nuxeo Platform Started
======================================================================
= Component Loading Warnings:
  * Missing configuration for default transient store, using in-memory
======================================================================
= Component Loading Status: Pending: 0 / Missing: 2 / Unstarted: 0 / Total: 114
  * service:org.nuxeo.ecm.core.coldstorage.contrib references missing [target=org.nuxeo.ecm.platform.ec.notification.service.NotificationService;point=notifications, target=org.nuxeo.ecm.platform.ec.notification.service.NotificationService;point=templates]
  * service:org.nuxeo.runtime.stream.service.managment.contrib references missing [target=org.nuxeo.ecm.core.management.CoreManagementComponent;point=probes, target=org.nuxeo.ecm.core.management.CoreManagementComponent;point=healthCheck]
======================================================================
[INFO] Running org.nuxeo.ecm.core.event.test.WorkTest
2020-04-15 09:17:07,244 [kafka-coordinator-heartbeat-thread | nuxeo-test-1586935003463-default] WARN [AbstractCoordinator$HeartbeatThread] [Consumer clientId=default-2, groupId=nuxeo-test-1586935003463-default] This member will leave the group because consumer poll timeout has expired. This means the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time processing messages. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
2020-04-15 09:17:27,612 [kafka-coordinator-heartbeat-thread | nuxeo-test-1586935003463-default] WARN [AbstractCoordinator$HeartbeatThread] [Consumer clientId=default-3, groupId=nuxeo-test-1586935003463-default] This member will leave the group because consumer poll timeout has expired. This means the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time processing messages. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
2020-04-15 09:17:47,822 [kafka-coordinator-heartbeat-thread | nuxeo-test-1586935003463-default] WARN [AbstractCoordinator$HeartbeatThread] [Consumer clientId=default-1, groupId=nuxeo-test-1586935003463-default] This member will leave the group because consumer poll timeout has expired. This means the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time processing messages. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
2020-04-15 09:18:08,532 [kafka-coordinator-heartbeat-thread | nuxeo-test-1586935003463-default] WARN [AbstractCoordinator$HeartbeatThread] [Consumer clientId=default-4, groupId=nuxeo-test-1586935003463-default] This member will leave the group because consumer poll timeout has expired. This means the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time processing messages. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
The problem is a deadlock coming from the following SQL statement executed 4 times:
DELETE FROM "hierarchy" WHERE "id" = $1
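The blocking pattern can be reproduced in isolation with two plain psql sessions (a minimal sketch, not the actual Nuxeo code path; the id value is a placeholder). The lock is only released when the first transaction ends, which in the stuck test never happens:

-- Session A: delete a row inside a transaction and leave it open;
-- the row stays locked until the transaction commits or rolls back
-- (this corresponds to the "idle in transaction" backend in the report below)
BEGIN;
DELETE FROM "hierarchy" WHERE "id" = 'some-doc-id';  -- 'some-doc-id' is a placeholder
-- no COMMIT

-- Session B: the same DELETE on the same row blocks on the lock held by
-- session A (wait_event "transactionid" / "tuple" in the report below)
BEGIN;
DELETE FROM "hierarchy" WHERE "id" = 'some-doc-id';
-- hangs here until session A ends its transaction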
As stated in the PostgreSQL report:
 blocked_pid | blocked_user |            blocking_statement            | blocking_duration | blocking_pid | blocking_user |            blocked_statement             | blocked_duration
-------------+--------------+------------------------------------------+-------------------+--------------+---------------+------------------------------------------+------------------
       15323 | nuxeo        | DELETE FROM "hierarchy" WHERE "id" = $1  | 00:01:08.874771   |        15324 | nuxeo         | DELETE FROM "hierarchy" WHERE "id" = $1  | 00:01:08.87208
 datid | datname         | pid   | usesysid | usename  | application_name       | client_addr | client_hostname | client_port | backend_start                 | xact_start                    | query_start                   | state_change                  | wait_event_type | wait_event          | state               | backend_xid | backend_xmin | query                                           | backend_type
-------+-----------------+-------+----------+----------+------------------------+-------------+-----------------+-------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-----------------+---------------------+---------------------+-------------+--------------+-------------------------------------------------+---------------------
       |                 | 27510 |          |          |                        |             |                 |             | 2020-04-14 19:36:13.534513+02 |                               |                               |                               | Activity        | AutoVacuumMain      |                     |             |              |                                                 | autovacuum launcher
       |                 | 27512 | 10       | postgres |                        |             |                 |             | 2020-04-14 19:36:13.537269+02 |                               |                               |                               | Activity        | LogicalLauncherMain |                     |             |              |                                                 | background worker
 37175 | nuxeojunittests | 15323 | 16384    | nuxeo    | PostgreSQL JDBC Driver | 127.0.0.1   |                 | 41792       | 2020-04-15 09:16:44.254229+02 | 2020-04-15 09:17:07.672647+02 | 2020-04-15 09:17:07.693248+02 | 2020-04-15 09:17:07.693251+02 | Lock            | transactionid       | active              | 11084       | 11083        | DELETE FROM "hierarchy" WHERE "id" = $1         | client backend
 37175 | nuxeojunittests | 15324 | 16384    | nuxeo    | PostgreSQL JDBC Driver | 127.0.0.1   |                 | 41794       | 2020-04-15 09:16:44.281756+02 | 2020-04-15 09:17:07.672647+02 | 2020-04-15 09:17:07.690557+02 | 2020-04-15 09:17:07.694158+02 | Client          | ClientRead          | idle in transaction | 11083       |              | DELETE FROM "hierarchy" WHERE "id" = $1         | client backend
 37175 | nuxeojunittests | 15325 | 16384    | nuxeo    | PostgreSQL JDBC Driver | 127.0.0.1   |                 | 41796       | 2020-04-15 09:16:44.298111+02 | 2020-04-15 09:17:48.480821+02 | 2020-04-15 09:17:48.507306+02 | 2020-04-15 09:17:48.507308+02 | Lock            | tuple               | active              | 11090       | 11083        | DELETE FROM "hierarchy" WHERE "id" = $1         | client backend
 37175 | nuxeojunittests | 15326 | 16384    | nuxeo    | PostgreSQL JDBC Driver | 127.0.0.1   |                 | 41798       | 2020-04-15 09:16:44.309807+02 |                               | 2020-04-15 09:16:44.320103+02 | 2020-04-15 09:16:44.320119+02 | Client          | ClientRead          | idle                |             |              | SET application_name = 'PostgreSQL JDBC Driver' | client backend
 37175 | nuxeojunittests | 15327 | 16384    | nuxeo    | PostgreSQL JDBC Driver | 127.0.0.1   |                 | 41800       | 2020-04-15 09:16:44.321193+02 | 2020-04-15 09:17:27.879553+02 | 2020-04-15 09:17:27.900379+02 | 2020-04-15 09:17:27.90038+02  | Lock            | tuple               | active              | 11085       | 11083        | DELETE FROM "hierarchy" WHERE "id" = $1         | client backend
We can see that two of the transactions started at exactly the same time (xact_start = 2020-04-15 09:17:07.672647+02): backend 15324 holds the lock but sits idle in transaction, while backends 15323, 15325 and 15327 are all blocked on the same DELETE. Because the blocking backend is idle (waiting on its client) rather than waiting on a lock itself, PostgreSQL's deadlock detector never fires and the test hangs indefinitely.
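For reference, a similar blocked/blocking view can be obtained directly from pg_stat_activity; the sketch below is an assumption (it is not the query used to produce the attached report) and requires PostgreSQL 9.6+ for pg_blocking_pids():

-- list each waiting backend together with the backend holding the lock it waits on
SELECT blocked.pid    AS blocked_pid,
       blocked.query  AS blocked_statement,
       blocking.pid   AS blocking_pid,
       blocking.state AS blocking_state,
       blocking.query AS blocking_statement
  FROM pg_stat_activity AS blocked
  JOIN pg_stat_activity AS blocking
    ON blocking.pid = ANY (pg_blocking_pids(blocked.pid));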
To reproduce the issue:
Have PostgreSQL running with a nuxeojunittests database
sudo su - postgres
psql
create role nuxeo with password 'nuxeo' login;
create database nuxeojunittests encoding 'utf8' owner nuxeo;
\q
exit
Have Kafka running
git clone https://github.com/wurstmeister/kafka-docker.git
cd kafka-docker/
Update the following section in docker-compose.yml:
services:
  kafka:
    ports:
      - "9092:9092"
    environment:
      KAFKA_ADVERTISED_LISTENERS: INSIDE://kafka:9093,OUTSIDE://localhost:9092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: INSIDE:PLAINTEXT,OUTSIDE:PLAINTEXT
      KAFKA_LISTENERS: INSIDE://0.0.0.0:9093,OUTSIDE://0.0.0.0:9092
      KAFKA_INTER_BROKER_LISTENER_NAME: INSIDE
      KAFKA_ADVERTISED_HOST_NAME: 192.168.99.100
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
Start Kafka
docker-compose up -d
Run the test
cd ~/sources/nuxeo
mvn test -nsu \
    -pl modules/core/nuxeo-core-test \
    -Dtest=WorkTest \
    -Dnuxeo.test.core=vcs \
    -Dnuxeo.test.vcs.db=PostgreSQL \
    -Dnuxeo.test.workmanager=stream \
    -Dnuxeo.test.stream=kafka
Wait for the first Kafka warnings
2020-04-15 09:54:09,332 [kafka-coordinator-heartbeat-thread | nuxeo-test-1586937225214-default] WARN [AbstractCoordinator$HeartbeatThread] [Consumer clientId=default-2, groupId=nuxeo-test-1586937225214-default] This member will leave the group because consumer poll timeout has expired. This means the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time processing messages. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
Observe the problem
jps
14963 surefirebooter9268064286344412207.jar
14860 Launcher
4318 org.eclipse.equinox.launcher_1.5.500.v20190715-1310.jar
15391 Jps

jstack 14963 > jstack
vi jstack
Observe the PostgreSQL report
sudo su - postgres
wget --no-check-certificate https://gist.github.com/bdelbosc/5507796/raw/dump-nuxeo-postgres-config.sql
psql nuxeojunittests -f dump-nuxeo-postgres-config.sql
exit
vi /tmp/pgconf.txt
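Optionally (not part of the original steps, suggested here as an assumption), the hung test can be released without killing the JVM by terminating the backend that holds the lock while sitting idle in transaction; the blocked DELETE statements then proceed once the lock is released:

SELECT pg_terminate_backend(pid)
  FROM pg_stat_activity
 WHERE datname = 'nuxeojunittests'
   AND state = 'idle in transaction';
-- terminates the idle-in-transaction backend (15324 in the report above),
-- rolling back its open transaction and releasing the row lock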
Stop Kafka
docker-compose stop
Is related to: NXP-28619 Run Nuxeo Platform unit tests against Kafka (Resolved)