  Nuxeo ECM Build/Test Environment / NXBT-3722

Keep internal Nexus pod alive when K8s scales down


    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None

      Description

      Since the Kubernetes cluster upgrade to 1.22, "ScaleDown" events delete the Nexus pod, which then takes a long time to restart because of "Unable to attach or mount volumes" errors.
      This blocks the CI jobs with errors such as:

      Connect to nexus:80 [nexus/10.63.255.85] failed: Connection refused (Connection refused)
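
      The slow restart shows up as "FailedMount" warnings on the freshly scheduled Nexus pod. A quick way to confirm this, assuming the Nexus pods carry an app=nexus label (adjust the selector to the actual deployment):

      kubectl get events --field-selector reason=FailedMount
      kubectl describe pod -l app=nexus | grep "Unable to attach or mount volumes"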
      

      ScaleDown events:

      k get events --sort-by='lastTimestamp' | grep "deleting pod"
      7m17s       Normal    ScaleDown                pod/nexus-84f644b8f8-lnnzs                                      deleting pod for node scale down
      ...
      

      See https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-autoscaler

      In GKE clusters with control plane version 1.22 or later, Pods with local storage no longer block scaling down.

      Possible fix: add the following annotation to the Nexus pod:

      "cluster-autoscaler.kubernetes.io/safe-to-evict": "false"
      

      See https://cloud.google.com/kubernetes-engine/docs/how-to/cluster-autoscaler-visibility.

      Also see NXBT-3604 and a possible fix: https://github.com/jenkinsci/helm-charts/blob/main/charts/jenkins/README.md#long-volume-attachmount-times

              People

               • Assignee: Antoine Taillefer (ataillefer)
               • Reporter: Antoine Taillefer (ataillefer)
               • Votes: 0
               • Watchers: 1
