- Type: Task
- Status: Resolved
- Priority: Minor
- Resolution: Fixed
- Affects Version/s: None
- Fix Version/s: None
- Component/s: Continuous Integration
- Tags:
- Sprint: nxplatform #57
- Story Points: 1
The Platform, UI, and AI teams regularly end up with pods that are never scheduled.
----------------------------------------------------
Because of the nodes-startup DaemonSet installed in the admin-nodes-startup namespace, the nodes of the default pool "pool-2" are tainted with:
taints:
- effect: NoSchedule
  key: dedicated
  value: nodes-startup
In theory, this taint should be removed at some point, but in practice it seems to stay.
Consequently, no pod can be scheduled in this node pool unless it has the "dedicated: nodes-startup" toleration.
Thus, a pod that has no toleration for one of the team-dedicated node pools will never be scheduled anywhere.
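The reasoning above can be sketched as a small model of taint/toleration matching. This is a simplified illustration, not the scheduler's actual code: it only handles the "Equal" and "Exists" operators, the team pool names (`team-platform`, ...) are hypothetical, and the `NoSchedule` effect on the team taints is an assumption, since the ticket only shows the effect of the `dedicated` taint.

```python
from dataclasses import dataclass
from typing import List, Optional

# Effects that prevent a new pod from being placed on a node.
BLOCKING_EFFECTS = ("NoSchedule", "NoExecute")


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str = "NoSchedule"  # assumed for the team taints


@dataclass(frozen=True)
class Toleration:
    key: str
    value: Optional[str] = None
    operator: str = "Equal"       # "Equal" or "Exists"
    effect: Optional[str] = None  # None means "any effect"


def tolerates(tol: Toleration, taint: Taint) -> bool:
    """Return True if this toleration matches this taint (simplified rules)."""
    if tol.effect is not None and tol.effect != taint.effect:
        return False
    if tol.operator == "Exists":
        return tol.key == taint.key
    return tol.key == taint.key and tol.value == taint.value


def schedulable(node_taints: List[Taint], tolerations: List[Toleration]) -> bool:
    """A pod fits only if every blocking taint is matched by some toleration."""
    return all(
        any(tolerates(tol, taint) for tol in tolerations)
        for taint in node_taints
        if taint.effect in BLOCKING_EFFECTS
    )


TEAMS = ("platform", "napps", "ui", "nos", "ai")

# "pool-2" is named in the ticket; the "team-*" pool names are hypothetical.
POOLS = {
    "pool-2": [Taint("dedicated", "nodes-startup")],
    **{f"team-{team}": [Taint("team", team)] for team in TEAMS},
}


def candidate_pools(tolerations: List[Toleration]) -> List[str]:
    """List the pools whose taints the given tolerations fully cover."""
    return [name for name, taints in POOLS.items() if schedulable(taints, tolerations)]


print(candidate_pools([]))                                # []
print(candidate_pools([Toleration("team", "platform")]))  # ['team-platform']
```

With every pool tainted, a pod without tolerations has no candidate pool at all. If the `dedicated` taint were gone from "pool-2", the same pod would fit there.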
Typical symptom:
Normal NotTriggerScaleUp 27s cluster-autoscaler pod didn't trigger scale-up: 1 node(s) had taint {team: nos}, that the pod didn't tolerate, 1 node(s) had taint {team: platform}, that the pod didn't tolerate, 1 node(s) had taint {team: ui}, that the pod didn't tolerate, 1 node(s) had taint {team: napps}, that the pod didn't tolerate, 1 node(s) had taint {dedicated: nodes-startup}, that the pod didn't tolerate, 1 max node group size reached
Warning FailedScheduling 16s (x4 over 29s) default-scheduler 0/28 nodes are available: 1 node(s) had taint {team: ui}, that the pod didn't tolerate, 10 node(s) had taint {team: ai}, that the pod didn't tolerate, 5 node(s) had taint {team: napps}, that the pod didn't tolerate, 6 node(s) had taint {dedicated: nodes-startup}, that the pod didn't tolerate, 6 node(s) had taint {team: platform}, that the pod didn't tolerate.
List of possible tolerations for the team dedicated node pools:
- team: platform
- team: napps
- team: ui
- team: nos
- team: ai
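As an illustration of how one of the tolerations listed above could be attached to a workload, here is a minimal sketch that builds the `tolerations` stanza for a team and wraps it in the shape of a Deployment patch. The helper names are hypothetical, and the `NoSchedule` effect is an assumption, since the ticket does not state the effect of the team taints. No `nodeSelector` is generated because the ticket does not give the node labels of the team pools.

```python
import json

# Team values taken from the list of possible tolerations in this ticket.
TEAMS = ("platform", "napps", "ui", "nos", "ai")


def team_toleration(team: str, effect: str = "NoSchedule") -> dict:
    """Build one toleration entry for a team pool (effect is an assumption)."""
    if team not in TEAMS:
        raise ValueError(f"unknown team: {team!r}")
    return {"key": "team", "operator": "Equal", "value": team, "effect": effect}


def deployment_patch(team: str) -> dict:
    """Wrap the toleration in the pod template path of a Deployment."""
    return {"spec": {"template": {"spec": {"tolerations": [team_toleration(team)]}}}}


# Print the patch as JSON for the "platform" team.
print(json.dumps(deployment_patch("platform"), indent=2))
```

A toleration only permits scheduling on the tainted pool; it does not force the pod onto it, which is why the recommendation pairs tolerations with a nodeSelector.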
Although the recommendation is to always set tolerations/nodeSelector on Pods/Deployments/StatefulSets/... so that they are attached to a team-dedicated node pool, we should still allow pods to be scheduled in the default "pool-2" node pool.
Simple solution: remove the "nodes-startup" DaemonSet. It should not be needed.
If removing it causes problems, we can recreate it.
Is related to:
- NXBT-3603 [Platform CI] Fix CI deployment stuck build - Resolved
- NXBT-3602 [Platform CI] Attach AWS credentials rotation pod to the dedicated node pool - Resolved
- NXBT-3624 [Kubernetes CI] Network issues blocking Platform, AI and Web UI CI - Resolved
- NXBT-3615 [Platform CI] Clean up Jenkins X related stuff - Resolved