[NXBT-3601] [Kubernetes CI] Allow pods to be scheduled in the default node pool - Nuxeo Issue Tracker

Details

Type: Task
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: None
Fix Version/s: None
Component/s: Continuous Integration

Tags:
- ci/cd
- nxplatform
Sprint: nxplatform #57
Story Points: 1

Description

The Platform, UI and AI teams regularly run into pods that never get scheduled.
----------------------------------------------------
Because of the nodes-startup Daemonset installed in the admin-nodes-startup namespace, the nodes of the default pool "pool-2" are tainted with:

 taints:
 - effect: NoSchedule
   key: dedicated
   value: nodes-startup

In theory, this taint should disappear at some point, but in practice it does not.
Consequently, no pod can be scheduled in this node pool unless it has the "dedicated: nodes-startup" toleration.
As a result, a pod that also lacks a toleration for one of the team-dedicated node pools will never be scheduled at all.
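
For illustration, a toleration matching the taint above would look roughly like this in a pod spec. It is a sketch of the standard Kubernetes toleration syntax, not a recommended fix: it only shows what a pod would need in order to land on "pool-2" while the taint is present.

```yaml
# Fragment of a pod spec (goes under spec:).
# Tolerates the taint currently set on the "pool-2" nodes.
tolerations:
  - key: dedicated
    operator: Equal
    value: nodes-startup
    effect: NoSchedule
```
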
Typical symptom:

  Normal   NotTriggerScaleUp  27s                cluster-autoscaler  pod didn't trigger scale-up: 1 node(s) had taint {team: nos}, that the pod didn't tolerate, 1 node(s) had taint {team: platform}, that the pod didn't tolerate, 1 node(s) had taint {team: ui}, that the pod didn't tolerate, 1 node(s) had taint {team: napps}, that the pod didn't tolerate, 1 node(s) had taint {dedicated: nodes-startup}, that the pod didn't tolerate, 1 max node group size reached
  Warning  FailedScheduling   16s (x4 over 29s)  default-scheduler   0/28 nodes are available: 1 node(s) had taint {team: ui}, that the pod didn't tolerate, 10 node(s) had taint {team: ai}, that the pod didn't tolerate, 5 node(s) had taint {team: napps}, that the pod didn't tolerate, 6 node(s) had taint {dedicated: nodes-startup}, that the pod didn't tolerate, 6 node(s) had taint {team: platform}, that the pod didn't tolerate.

List of possible tolerations for the team dedicated node pools:

team: platform
team: napps
team: ui
team: nos
team: ai

Though the recommendation is to always set tolerations/nodeSelector on the Pods/Deployments/Statefulsets/... to get attached to a team dedicated node pool, we should allow pods to be scheduled in this "pool-2" default node pool.
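
As a sketch of that recommendation, a pod targeting the Platform pool could declare a toleration for the team taint together with a nodeSelector. The pod name, the image and the "team: platform" node label used in nodeSelector are assumptions made for the example; the actual node labels are not given in this issue. The effect of the team taints is not stated either, so the toleration omits "effect", which makes it match any effect for that key and value.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-platform-pod   # hypothetical name
spec:
  # Assumption: nodes of the Platform pool carry a "team: platform" label.
  nodeSelector:
    team: platform
  tolerations:
    # Matches the {team: platform} taint seen in the scheduler events.
    # No "effect" field: an empty effect matches all effects.
    - key: team
      operator: Equal
      value: platform
  containers:
    - name: main
      image: busybox:1.36      # placeholder image
      command: ["sleep", "3600"]
```

The same pattern would apply to the other pools by swapping the value for napps, ui, nos or ai.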

Simple solution: remove the "nodes-startup" Daemonset. It should not be needed.
If removing it causes problems, we can recreate it.

Issue Links

is related to
- NXBT-3603 [Platform CI] Fix CI deployment stuck build (Resolved)
- NXBT-3602 [Platform CI] Attach AWS credentials rotation pod to the dedicated node pool (Resolved)
- NXBT-3624 [Kubernetes CI] Network issues blocking Platform, AI and Web UI CI (Resolved)
- NXBT-3615 [Platform CI] Clean up Jenkins X related stuff (Resolved)

People

Assignee: Antoine Taillefer
Reporter: Antoine Taillefer
Participants: Antoine Taillefer

Dates

Created: 2022-03-02 07:13
Updated: 2022-03-31 15:21
Resolved: 2022-03-21 15:11