  Nuxeo ECM Build/Test Environment / NXBT-3722

Keep internal Nexus pod alive when K8s scales down


    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None

      Description

      Since the Kubernetes cluster upgrade to 1.22, "ScaleDown" events delete the Nexus pod, which then takes a long time to restart because of "Unable to attach or mount volumes" errors.
      This blocks the CI jobs with errors such as:

      Connect to nexus:80 [nexus/10.63.255.85] failed: Connection refused (Connection refused)
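
      The slow restart shows up as "FailedMount" warnings on the freshly scheduled Nexus pod. A quick way to confirm this, assuming the Nexus pods carry an app=nexus label (adjust the selector to the actual deployment):

      kubectl get events --field-selector reason=FailedMount
      kubectl describe pod -l app=nexus | grep "Unable to attach or mount volumes"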
      

      ScaleDown events:

      k get events --sort-by='lastTimestamp' | grep "deleting pod"
      7m17s       Normal    ScaleDown                pod/nexus-84f644b8f8-lnnzs                                      deleting pod for node scale down
      ...
      

      See https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-autoscaler

      In GKE clusters with control plane version 1.22 or later, Pods with local storage no longer block scaling down.

      Possible fix: add the following annotation to the Nexus pod:

      "cluster-autoscaler.kubernetes.io/safe-to-evict": "false"
      

      See https://cloud.google.com/kubernetes-engine/docs/how-to/cluster-autoscaler-visibility.

      Also see NXBT-3604 and a possible fix: https://github.com/jenkinsci/helm-charts/blob/main/charts/jenkins/README.md#long-volume-attachmount-times

              People

               • Assignee: Antoine Taillefer (ataillefer)
               • Reporter: Antoine Taillefer (ataillefer)
               • Votes: 0
               • Watchers: 1
