Nuxeo Platform / NXP-31444

Random Kafka timeout at topic creation or metadata propagation

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2023.x, 2021.30
    • Component/s: CI/CD

      Description

      Randomly on some 2023 PRs, various unit tests are failing with:

      Unable to create topic nuxeo-test-1669127687373-testSimpleTopoFewRecordsOneThread-internal-metrics within the request timeout
      

      stdout:

      org.nuxeo.lib.stream.StreamRuntimeException: Unable to create topic nuxeo-test-1669127687373-testSimpleTopoFewRecordsOneThread-internal-metrics within the request timeout
      	at org.nuxeo.lib.stream.log.kafka.KafkaUtils.createTopic(KafkaUtils.java:185)
      	at org.nuxeo.lib.stream.log.kafka.KafkaLogManager.create(KafkaLogManager.java:103)
      	at org.nuxeo.lib.stream.log.internals.AbstractLogManager.createIfNotExists(AbstractLogManager.java:75)
      	at org.nuxeo.lib.stream.computation.log.LogStreamManager.initInternalStream(LogStreamManager.java:83)
      	at org.nuxeo.lib.stream.computation.log.LogStreamManager.initInternalStreams(LogStreamManager.java:79)
      	at org.nuxeo.lib.stream.computation.log.LogStreamManager.<init>(LogStreamManager.java:74)
      	at org.nuxeo.lib.stream.tests.computation.TestStreamProcessor.testSimpleTopo(TestStreamProcessor.java:99)
      	at org.nuxeo.lib.stream.tests.computation.TestStreamProcessor.testSimpleTopoFewRecordsOneThread(TestStreamProcessor.java:192)
        ...
      Caused by: java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.TimeoutException: The request timed out.
      	at java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:396)
      	at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2096)
      	at org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:180)
      	at org.nuxeo.lib.stream.log.kafka.KafkaUtils.createTopic(KafkaUtils.java:177)
      	... 49 more
      Caused by: org.apache.kafka.common.errors.TimeoutException: The request timed out.
      

      or

      Component service:org.nuxeo.runtime.stream.service notification of application started failed: java.util.concurrent.TimeoutException: Timeout while waiting for topic nuxeo-test-1669145630583-internal-metrics metadata propagation in the cluster
      [2022-11-22T19:37:16.030Z] org.nuxeo.lib.stream.StreamRuntimeException: java.util.concurrent.TimeoutException: Timeout while waiting for topic nuxeo-test-1669145630583-internal-metrics metadata propagation in the cluster
      ...
      Caused by: java.util.concurrent.TimeoutException: Timeout while waiting for topic nuxeo-test-1669145630583-internal-metrics metadata propagation in the cluster
      

      with stdout:

      NullPointerException: Cannot invoke "org.nuxeo.lib.stream.computation.StreamManager.getProcessor(String)" because "this.streamManager" is null
      

      or (see attachment platform-nuxeo-unit-tests-pr-939-14-mongodb_kafka-0.log for full log)

      Component service:org.nuxeo.runtime.stream.service notification of application started failed: java.util.concurrent.TimeoutException: Timeout while waiting for topic nuxeo-test-1675297548460-bulk-automationUi metadata propagation in the cluster
      [2023-02-02T05:14:50.961Z] org.nuxeo.lib.stream.StreamRuntimeException: java.util.concurrent.TimeoutException: Timeout while waiting for topic nuxeo-test-1675297548460-bulk-automationUi metadata propagation in the cluster
      ...
      Caused by: java.util.concurrent.TimeoutException: Timeout while waiting for topic nuxeo-test-1675297548460-bulk-automationUi metadata propagation in the cluster
      

      The Kafka logs show that the topic is created only after the test has failed, so we are probably facing a Kafka slowness issue.
      Note that the test fails after 30 s rather than 4 min (the request.timeout.ms value raised by NXP-31423) because this timeout parameter is not taken into account at this point in the test.
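
To illustrate why a larger request.timeout.ms may have no effect on such a wait, here is a minimal sketch of a polling loop that enforces its own deadline. It is not the Nuxeo implementation: the class name PropagationWait, the method waitUntil and its parameters are hypothetical, and the condition stands in for "the topic metadata is visible in the cluster".

```java
import java.time.Duration;
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

// Hypothetical helper, for illustration only (not Nuxeo or Kafka code).
final class PropagationWait {

    private PropagationWait() {
    }

    /**
     * Polls the condition until it returns true or the deadline expires.
     * The deadline comes only from the timeout argument, so any client-side
     * request timeout configured elsewhere plays no role in this loop.
     */
    static void waitUntil(BooleanSupplier condition, Duration timeout, Duration pollInterval)
            throws TimeoutException, InterruptedException {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (!condition.getAsBoolean()) {
            // Overflow-safe comparison of monotonic nanoTime values.
            if (System.nanoTime() - deadline >= 0) {
                throw new TimeoutException("Timeout while waiting for condition after " + timeout);
            }
            Thread.sleep(pollInterval.toMillis());
        }
    }
}
```

With a loop of this shape, a caller passing Duration.ofSeconds(30) fails after about 30 s even if the client is configured with a 4 min request timeout, which matches the behavior described above under the assumption that the wait uses a separate, fixed deadline.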

      It is not clear whether this slowness is related to the Kafka upgrade to 3.3.1 done for 2023 (NXP-31410).

      Let's try to bump the CPU resources allocated to the Kafka pod; they currently seem quite low:

      resources:
        requests:
          cpu: "500m"
          memory: "1024Mi"
        limits:
          cpu: "1"
          memory: "1536Mi"
      
