On some 2023 PRs, various unit tests fail randomly with:
Unable to create topic nuxeo-test-1669127687373-testSimpleTopoFewRecordsOneThread-internal-metrics within the request timeout
stdout:
org.nuxeo.lib.stream.StreamRuntimeException: Unable to create topic nuxeo-test-1669127687373-testSimpleTopoFewRecordsOneThread-internal-metrics within the request timeout
    at org.nuxeo.lib.stream.log.kafka.KafkaUtils.createTopic(KafkaUtils.java:185)
    at org.nuxeo.lib.stream.log.kafka.KafkaLogManager.create(KafkaLogManager.java:103)
    at org.nuxeo.lib.stream.log.internals.AbstractLogManager.createIfNotExists(AbstractLogManager.java:75)
    at org.nuxeo.lib.stream.computation.log.LogStreamManager.initInternalStream(LogStreamManager.java:83)
    at org.nuxeo.lib.stream.computation.log.LogStreamManager.initInternalStreams(LogStreamManager.java:79)
    at org.nuxeo.lib.stream.computation.log.LogStreamManager.<init>(LogStreamManager.java:74)
    at org.nuxeo.lib.stream.tests.computation.TestStreamProcessor.testSimpleTopo(TestStreamProcessor.java:99)
    at org.nuxeo.lib.stream.tests.computation.TestStreamProcessor.testSimpleTopoFewRecordsOneThread(TestStreamProcessor.java:192)
    ...
Caused by: java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.TimeoutException: The request timed out.
    at java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:396)
    at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2096)
    at org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:180)
    at org.nuxeo.lib.stream.log.kafka.KafkaUtils.createTopic(KafkaUtils.java:177)
    ... 49 more
Caused by: org.apache.kafka.common.errors.TimeoutException: The request timed out.
or
Component service:org.nuxeo.runtime.stream.service notification of application started failed: java.util.concurrent.TimeoutException: Timeout while waiting for topic nuxeo-test-1669145630583-internal-metrics metadata propagation in the cluster
[2022-11-22T19:37:16.030Z] org.nuxeo.lib.stream.StreamRuntimeException: java.util.concurrent.TimeoutException: Timeout while waiting for topic nuxeo-test-1669145630583-internal-metrics metadata propagation in the cluster
...
Caused by: java.util.concurrent.TimeoutException: Timeout while waiting for topic nuxeo-test-1669145630583-internal-metrics metadata propagation in the cluster
with stdout:
NullPointerException: Cannot invoke "org.nuxeo.lib.stream.computation.StreamManager.getProcessor(String)" because "this.streamManager" is null
or (see the attachment platform-nuxeo-unit-tests-pr-939-14-mongodb_kafka-0.log for the full log)
Component service:org.nuxeo.runtime.stream.service notification of application started failed: java.util.concurrent.TimeoutException: Timeout while waiting for topic nuxeo-test-1675297548460-bulk-automationUi metadata propagation in the cluster
[2023-02-02T05:14:50.961Z] org.nuxeo.lib.stream.StreamRuntimeException: java.util.concurrent.TimeoutException: Timeout while waiting for topic nuxeo-test-1675297548460-bulk-automationUi metadata propagation in the cluster
...
Caused by: java.util.concurrent.TimeoutException: Timeout while waiting for topic nuxeo-test-1675297548460-bulk-automationUi metadata propagation in the cluster
The Kafka logs show that the topic is only created after the test has already failed, so we are probably facing a Kafka slowness issue rather than a functional one.
Note that the test fails after 30 s and not 4 min (the request.timeout.ms value, increased by NXP-31423) because this timeout parameter is not taken into account at this point in the test.
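To illustrate the 30 s vs 4 min gap, here is a minimal sketch of a wait loop whose deadline is derived from request.timeout.ms instead of a fixed value. This is not the Nuxeo code: the class TopicWait, the helpers timeoutFrom and waitFor, and the 30 s fallback are hypothetical names and values chosen for illustration, and only JDK classes are used.

```java
import java.time.Duration;
import java.util.Properties;
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

// Hypothetical helper, not part of nuxeo-stream.
class TopicWait {

    // Assumed fallback, mirroring the 30 s observed in the failing tests.
    static final Duration DEFAULT_TIMEOUT = Duration.ofSeconds(30);

    // Use request.timeout.ms when it is set, otherwise fall back to the default.
    static Duration timeoutFrom(Properties kafkaProps) {
        String ms = kafkaProps.getProperty("request.timeout.ms");
        return ms == null ? DEFAULT_TIMEOUT : Duration.ofMillis(Long.parseLong(ms.trim()));
    }

    // Poll the condition until it holds or the timeout elapses.
    static void waitFor(BooleanSupplier condition, Duration timeout, Duration pollInterval)
            throws InterruptedException, TimeoutException {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (!condition.getAsBoolean()) {
            // Subtraction keeps the comparison correct even if nanoTime wraps around.
            if (deadline - System.nanoTime() <= 0) {
                throw new TimeoutException("Condition not met within " + timeout);
            }
            Thread.sleep(pollInterval.toMillis());
        }
    }
}
```

With such a helper, a metadata propagation check would be called as waitFor(check, timeoutFrom(props), Duration.ofMillis(500)), where check stands for whatever condition the caller polls, so raising request.timeout.ms would also extend this wait.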
It is not clear whether this slowness is related to the Kafka upgrade to 3.3.1 done for 2023 (NXP-31410).
Let's try to bump the CPU resources allocated to the Kafka pod, which currently seem quite low:
resources:
  requests:
    cpu: "500m"
    memory: "1024Mi"
  limits:
    cpu: "1"
    memory: "1536Mi"