  1. Nuxeo ECM Build/Test Environment
  2. NXBT-3601

[Kubernetes CI] Allow pods to be scheduled in the default node pool

    Details

    • Type: Task
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Continuous Integration

      Description

      The Platform, UI, and AI teams regularly run into pods that are never scheduled.
      ----------------------------------------------------
      Because of the nodes-startup DaemonSet installed in the admin-nodes-startup namespace, the nodes of the default pool "pool-2" are tainted with:

       taints:
       - effect: NoSchedule
         key: dedicated
         value: nodes-startup
      

      In theory, this taint should disappear at some point, but in practice it does not seem to.
      Consequently, no pod can be scheduled in this node pool unless it has the "dedicated: nodes-startup" toleration.
      As a result, a pod with no toleration for one of the team-dedicated node pools will never be scheduled, since every other node carries a team taint.
      Typical symptom:

      Normal   NotTriggerScaleUp  27s                cluster-autoscaler  pod didn't trigger scale-up: 1 node(s) had taint {team: nos}, that the pod didn't tolerate, 1 node(s) had taint {team: platform}, that the pod didn't tolerate, 1 node(s) had taint {team: ui}, that the pod didn't tolerate, 1 node(s) had taint {team: napps}, that the pod didn't tolerate, 1 node(s) had taint {dedicated: nodes-startup}, that the pod didn't tolerate, 1 max node group size reached
        Warning  FailedScheduling   16s (x4 over 29s)  default-scheduler   0/28 nodes are available: 1 node(s) had taint {team: ui}, that the pod didn't tolerate, 10 node(s) had taint {team: ai}, that the pod didn't tolerate, 5 node(s) had taint {team: napps}, that the pod didn't tolerate, 6 node(s) had taint {dedicated: nodes-startup}, that the pod didn't tolerate, 6 node(s) had taint {team: platform}, that the pod didn't tolerate.
      

      List of possible tolerations for the team-dedicated node pools:

      • team: platform
      • team: napps
      • team: ui
      • team: nos
      • team: ai

      Although the recommendation is to always set tolerations and a nodeSelector on Pods, Deployments, StatefulSets, etc. so that they are attached to a team-dedicated node pool, we should still allow pods to be scheduled in the default node pool "pool-2".
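
      As an illustration of the recommendation above, here is a minimal sketch of a Pod that tolerates the "team: platform" taint. The pod name, container name, image, and command are hypothetical placeholders. The toleration's effect is left out on purpose: the effect of the team taints is not stated in this issue, and a toleration without an effect matches every effect for that key and value.

      ```yaml
      apiVersion: v1
      kind: Pod
      metadata:
        name: example-build-pod        # hypothetical name, for illustration only
      spec:
        tolerations:
          # Tolerate the taint carried by the platform team's node pool.
          # No "effect" is given, so this matches the taint whatever its effect is.
          - key: team
            operator: Equal
            value: platform
        containers:
          - name: main                 # hypothetical container
            image: busybox:1.36        # placeholder image
            command: ["sleep", "3600"]
      ```

      A toleration only permits scheduling on the tainted nodes; pinning the pod to the team pool additionally requires a nodeSelector. The node label to select on is not given in this issue, so it is not shown here.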

      Simple solution: remove the "nodes-startup" DaemonSet. It should not be needed.
      If removing it causes problems, we can recreate it.

              People

              • Assignee:
                ataillefer Antoine Taillefer
                Reporter:
                ataillefer Antoine Taillefer