[NXP-31857] Fix kafka unable to communicate with its zookeeper when zookeeper is re-deployed - Nuxeo Issue Tracker

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2023.0, 2021.38
Component/s: CI/CD

Tags: noRNS, nxplatform
Team: PLATFORM
Sprint: nxplatform #87
Story Points: 1

Description

Sometimes, in the Platform CI, the following logs appear in the Kafka output:

[2023-05-04 13:03:05,301] INFO Opening socket connection to server kafka-zookeeper/10.63.244.84:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
[2023-05-04 13:03:05,302] INFO Socket connection established, initiating session, client: /10.60.250.8:58242, server: kafka-zookeeper/10.63.244.84:2181 (org.apache.zookeeper.ClientCnxn)
[2023-05-04 13:03:05,303] INFO Unable to read additional data from server sessionid 0x10000225a140000, likely server has closed socket, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)

We observed on one build that this can happen when the ZooKeeper pod is re-deployed, for example when the underlying node is removed.

After some research online, the likely culprit is the ZooKeeper data that is lost between the two pods.
Persistence is enabled by default in the Kafka chart, whereas we disable it for unit tests.

This occurs on node pool scale-down, which makes sense since our k8s cluster was upgraded recently and is now more aggressive about scaling nodes down. We will prevent this behavior by setting a PodDisruptionBudget (PDB) on each of the services needed for tests.
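As an illustration of the PDB approach, here is a minimal sketch of a PodDisruptionBudget for a single-replica ZooKeeper pod. The resource name, namespace, and selector labels are assumptions made for the example and are not taken from the actual CI configuration; the labels must match whatever the deployed chart sets on the pod.

```yaml
# Hypothetical manifest: name, namespace, and labels are illustrative only.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: kafka-zookeeper-pdb        # assumed name
  namespace: platform-ci-tests     # assumed namespace
spec:
  # With a single replica, allowing zero unavailable pods blocks
  # voluntary evictions such as a node drain during scale-down.
  maxUnavailable: 0
  selector:
    matchLabels:
      app.kubernetes.io/name: zookeeper   # assumed pod label
```

A PDB only guards against voluntary disruptions (node drains, autoscaler scale-down); it does not protect the pod from a node crash. With `maxUnavailable: 0`, the node hosting the pod cannot be drained until the pod or the PDB is removed, which seems acceptable for short-lived test deployments.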

Issue Links

is related to: NXP-31862 Remove Zookeeper in CI tests (Open)

People

Assignee: Kevin Leturc
Reporter: Kevin Leturc
Participants: Antoine Taillefer, Benoit Delbosc, Jenkins, Kevin Leturc

Dates

Created: 2023-05-04 13:10
Updated: 2023-07-12 08:56
Resolved: 2023-05-16 08:11