Affects Version/s: 10.10
This problem can occur since Nuxeo 10.3, where NXP-25600 upgraded the Kafka client library to 2.x.
Kafka 2.0 introduces a new consumer.poll(Duration) method that no longer guarantees that partition assignment is complete when it returns during a rebalance. This means that when the method returns because the timeout has elapsed, a rebalance may still be in progress. This was not the case before Kafka 2.0 with the now deprecated consumer.poll(long) method.
When this happens, the tailer returns no new records. If the position then needs to be committed, a CommitFailedException is raised:
The exception message is misleading in this case: the failure is not caused by exceeding the poll interval, so increasing max.poll.interval.ms will not help.
This problem has been handled in Kafka 2.5.0 where a more specific exception has been introduced: RebalanceInProgressException.
The consequence for Nuxeo is that during tail processing where consumers are waiting for new records, if there is rebalancing and some position needs to be checkpointed, the commit exception will terminate the consumer thread, resulting in failure and duplicate processing.
This problem has been seen with the Nuxeo-stream-importer: the consumer pool terminates when there are no new records to read, which triggers a rebalance, and the result is that duplicate documents are created.
The fix is to listen to partition revocation and trigger a Nuxeo rebalance exception.
Also, upgrading to Kafka 2.5.0 clients will help to get a better error message.
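The revocation-listening fix can be sketched as follows. This is a minimal, dependency-free illustration, not the actual Nuxeo code: RevocationGuard and RebalanceDetectedException are hypothetical names, and partitions are represented as plain strings. In a real tailer the two callbacks would presumably be the methods of Kafka's ConsumerRebalanceListener, which receive a Collection<TopicPartition>, and the listener would be registered when subscribing the consumer.

```java
import java.util.Collection;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical exception standing in for the Nuxeo rebalance exception.
class RebalanceDetectedException extends RuntimeException {
    RebalanceDetectedException(String message) {
        super(message);
    }
}

// Hypothetical guard that remembers a partition revocation so the tailer
// can raise a dedicated exception instead of attempting a doomed commit.
class RevocationGuard {
    private final AtomicBoolean revoked = new AtomicBoolean(false);

    // Mirrors the shape of ConsumerRebalanceListener.onPartitionsRevoked.
    void onPartitionsRevoked(Collection<String> partitions) {
        // An empty collection means nothing was actually taken away.
        if (!partitions.isEmpty()) {
            revoked.set(true);
        }
    }

    // Mirrors the shape of ConsumerRebalanceListener.onPartitionsAssigned.
    void onPartitionsAssigned(Collection<String> partitions) {
        // Nothing to do here: the flag is cleared only by checkBeforeCommit().
    }

    // To be called by the tailer after each poll and before each commit.
    void checkBeforeCommit() {
        // Read and reset the flag atomically so one revocation raises once.
        if (revoked.getAndSet(false)) {
            throw new RebalanceDetectedException(
                "Partitions were revoked since the last check, skipping commit");
        }
    }
}
```

With this approach the caller can catch the dedicated exception, drop its in-memory positions and resume from the last committed offsets, rather than letting a CommitFailedException terminate the consumer thread.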