Fix Kafka unable to communicate with its ZooKeeper when ZooKeeper is re-deployed

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2023.0, 2021.38
    • Component/s: CI/CD
    • Team: PLATFORM
    • Sprint: nxplatform #87
    • Story Points: 1

      Description

      Sometimes in the Platform CI we see the following logs in the Kafka output:

      [2023-05-04 13:03:05,301] INFO Opening socket connection to server kafka-zookeeper/10.63.244.84:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
      [2023-05-04 13:03:05,302] INFO Socket connection established, initiating session, client: /10.60.250.8:58242, server: kafka-zookeeper/10.63.244.84:2181 (org.apache.zookeeper.ClientCnxn)
      [2023-05-04 13:03:05,303] INFO Unable to read additional data from server sessionid 0x10000225a140000, likely server has closed socket, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
      

      We observed on a build that this can happen when the ZooKeeper pod is re-deployed, for example when the kubelet decides to remove the underlying node.

      After some research online, the likely culprit is the ZooKeeper data lost between the two pods.
      Persistence is enabled by default in the Kafka chart, whereas we disable it for unit tests.

      This occurs on node pool scale-down, which makes sense since our k8s cluster was upgraded recently and is now more aggressive about scaling nodes down. We will prevent this behavior by setting a PodDisruptionBudget (PDB) on each of the services needed for tests.
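
A minimal sketch of what such a PDB could look like, generated as a JSON manifest that `kubectl apply -f -` accepts. This is not the actual change made for this ticket. The service list, the `platform-ci` namespace, the `-pdb` name suffix, and the `app.kubernetes.io/name` selector label are all assumptions for illustration; only the `kafka-zookeeper` name comes from the logs above. Setting `maxUnavailable` to 0 asks Kubernetes to refuse voluntary evictions of the matching pods, such as a node drain during scale-down.

```python
import json


def build_pdb(service: str, namespace: str) -> dict:
    """Build a policy/v1 PodDisruptionBudget that forbids voluntary evictions.

    The selector label key is an assumption; it must match the labels
    actually set on the pods by the chart in use.
    """
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": f"{service}-pdb", "namespace": namespace},
        "spec": {
            # 0 means no pod matching the selector may be evicted voluntarily.
            "maxUnavailable": 0,
            "selector": {"matchLabels": {"app.kubernetes.io/name": service}},
        },
    }


def build_pdb_list(services: list[str], namespace: str) -> dict:
    """Wrap one PDB per service in a v1 List so it can be applied in one call."""
    return {
        "apiVersion": "v1",
        "kind": "List",
        "items": [build_pdb(name, namespace) for name in services],
    }


# Hypothetical list of services needed for tests.
SERVICES = ["kafka", "kafka-zookeeper"]

if __name__ == "__main__":
    print(json.dumps(build_pdb_list(SERVICES, "platform-ci"), indent=2))
```

A PDB only guards against voluntary disruptions; it does not help if the node itself fails, so lost ZooKeeper data remains possible in that case while persistence is disabled.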
