For the Kafka implementation of mqueue, some tests fail randomly:
org.nuxeo.ecm.platform.importer.mqueues.tests.computation.TestMQComputationManagerKafka.testStopAndResume
org.nuxeo.ecm.platform.importer.mqueues.tests.computation.TestMQComputationManagerKafka.testComplexTopoManyRecords
org.nuxeo.ecm.platform.importer.mqueues.tests.TestAutomationKafka.testBlobAndDocumentImport
These failures were caused by a long delay before the first partition assignment: consumers then wrongly concluded that there were no more messages to read.
We now wait for partition assignment before taking the read timeout into account.
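The idea of waiting for partition assignment before applying the read timeout can be sketched as follows. This is only an illustration, not the actual mqueue code: AssignmentGate, GatedReader and Poller are hypothetical names, and throwing an exception when no assignment arrives is an assumption. With a real Kafka consumer, onPartitionsAssigned() would typically be called from ConsumerRebalanceListener.onPartitionsAssigned.

```java
import java.time.Duration;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: records that the first partition assignment happened. */
class AssignmentGate {
    private final CountDownLatch assigned = new CountDownLatch(1);

    /** To be called from the rebalance callback once partitions are assigned. */
    void onPartitionsAssigned() {
        assigned.countDown();
    }

    /** Returns true if an assignment arrived within maxWait, false otherwise. */
    boolean awaitAssignment(Duration maxWait) throws InterruptedException {
        return assigned.await(maxWait.toMillis(), TimeUnit.MILLISECONDS);
    }
}

/** Hypothetical reader: the read timeout only starts once partitions are assigned. */
class GatedReader {
    /** Stand-in for the real consumer poll; returns null when nothing was read. */
    interface Poller {
        String poll(Duration timeout) throws InterruptedException;
    }

    private final AssignmentGate gate;
    private final Poller poller;
    private final Duration assignmentWait;

    GatedReader(AssignmentGate gate, Poller poller, Duration assignmentWait) {
        this.gate = gate;
        this.poller = poller;
        this.assignmentWait = assignmentWait;
    }

    /** Returns the next record, or null if readTimeout expires after assignment. */
    String read(Duration readTimeout) throws InterruptedException {
        // Assumption: a missing assignment is reported as an error instead of
        // being mistaken for "no more messages".
        if (!gate.awaitAssignment(assignmentWait)) {
            throw new IllegalStateException("No partition assigned after " + assignmentWait);
        }
        return poller.poll(readTimeout);
    }
}
```

With this split, a null result from read() can only mean that the timeout expired on an assigned partition, so a slow first assignment is no longer confused with an empty queue.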
Also, on the test infrastructure Kafka has a 24h retention policy, but this does not remove old topics. Kafka creates a folder per partition, and a full unit test run creates around 700 partitions. After a dozen executions, topic creation becomes very slow and buggy: even if we wait for the topic to be available in ZooKeeper,
there is a lag on the broker side and the producer may simply believe that the new partition does not exist:
12:43:28 14:43:28,709 [kafka-producer-network-thread | producer-27] WARN [NetworkClient$DefaultMetadataUpdater] Error while fetching metadata with correlation id 1 : {nuxeo-test-1499172208473-queueName=UNKNOWN_TOPIC_OR_PARTITION}
As a workaround, a CI cleaning job has been set up at https://qa.nuxeo.org/jenkins/job/System/job/cleanup-kafka/ to reset the ZK and Kafka data.