Type: Improvement
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: Docker Image, Monitoring
Goal
Nuxeo exposes metrics via StackDriver.
Inside K8S, we can define a HorizontalPodAutoscaler (HPA) that adds or removes pods from a deployment depending on metrics.
K8S supports "custom metrics", and we would like K8S deployments to scale based on Nuxeo metrics:
- scale-out Nuxeo API nodes if requests throughput becomes too big
- scale-out worker nodes if the lag becomes too big
- ...
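To make the two scaling goals above concrete, here is a minimal sketch of the replica computation documented for the HPA controller (desired = ceil(current replicas x current value / target value), with a default tolerance of 0.1). The function name and the sample values are hypothetical; only the formula comes from the Kubernetes documentation.

```python
import math


def desired_replicas(current_replicas: int, current_value: float, target_value: float,
                     min_replicas: int, max_replicas: int, tolerance: float = 0.1) -> int:
    """Replica count per the documented HPA formula, clamped to [min, max]."""
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        # Within tolerance of the target: the controller leaves the scale unchanged.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # Hypothetical case: 1 worker pod, average lag of 250 for a target of 100, 1 to 3 pods.
    print(desired_replicas(1, 250, 100, min_replicas=1, max_replicas=3))
```

With a target of 1, as in the HPAs created later in this ticket, any metric value well above 1 would push the deployment to its maximum replica count.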
Tests and experimentations done so far
Activated StackDriver integration at the Nuxeo level
metrics.stackdriver.enabled=true
metrics.stackdriver.gcpProjectId=XXX
metrics.enabled=true
Metrics explorer
I am then able to read the metrics from Metrics Explorer:
fetch k8s_container
| metric 'custom.googleapis.com/nuxeo/dropwizard5_nuxeo.works.global.queue.completed_gauge'
| filter (resource.cluster_name == 'multitenants') && (metric.queue == 'default')
| group_by 1m, [value_completed_gauge_mean: mean(value.completed_gauge)]
| every 1m
| group_by [], [value_completed_gauge_mean_aggregate: aggregate(value_completed_gauge_mean)]
Deploy the StackDriver Adapter
Following the doc, I deployed the custom-metrics-stackdriver-adapter:
kubectl create clusterrolebinding cluster-admin-binding \
    --clusterrole cluster-admin --user "$(gcloud config get-value account)"
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/k8s-stackdriver/master/custom-metrics-stackdriver-adapter/deploy/production/adapter_new_resource_model.yaml
The underlying pod seems to start without issue:
kubectl get pods -n custom-metrics
NAME                                                  READY   STATUS    RESTARTS   AGE
custom-metrics-stackdriver-adapter-7454cc95c5-vpzc5   1/1     Running   0          67m
API and CRD seem to be ok
kubectl get apiservices |grep metrics
v1beta1.custom.metrics.k8s.io     custom-metrics/custom-metrics-stackdriver-adapter   True   69m
v1beta1.external.metrics.k8s.io   custom-metrics/custom-metrics-stackdriver-adapter   True   69m
v1beta1.metrics.k8s.io            kube-system/metrics-server                          True   33d
v1beta2.custom.metrics.k8s.io     custom-metrics/custom-metrics-stackdriver-adapter   True   69m
Create the HPA
I created a HorizontalPodAutoscaler from the GKE UI (workload => deployment => Autoscaling):
kubectl get hpa -n tenant1
NAME                     REFERENCE                           TARGETS       MINPODS   MAXPODS   REPLICAS   AGE
nuxeo-company-a-api      Deployment/nuxeo-company-a-api      <unknown>/1   1         4         2          3h56m
nuxeo-company-a-worker   Deployment/nuxeo-company-a-worker   <unknown>/1   1         3         1          48m
kubectl get hpa -n tenant1 nuxeo-company-a-worker -o yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  annotations:
    autoscaling.alpha.kubernetes.io/conditions: '[{"type":"AbleToScale","status":"True","lastTransitionTime":"2020-11-25T19:56:52Z","reason":"SucceededGetScale","message":"the HPA controller was able to get the target''s current scale"},{"type":"ScalingActive","status":"False","lastTransitionTime":"2020-11-25T19:56:52Z","reason":"FailedGetPodsMetric","message":"the HPA was unable to compute the replica count: unable to get metric custom.googleapis.com|nuxeo|dropwizard5_nuxeo.works.global.queue.completed_gauge: no metrics returned from custom metrics API"}]'
    autoscaling.alpha.kubernetes.io/metrics: '[{"type":"Pods","pods":{"metricName":"custom.googleapis.com|nuxeo|dropwizard5_nuxeo.works.global.queue.completed_gauge","targetAverageValue":"1","selector":{}}}]'
  creationTimestamp: "2020-11-25T19:56:30Z"
  name: nuxeo-company-a-worker
  namespace: tenant1
  resourceVersion: "13343121"
  selfLink: /apis/autoscaling/v1/namespaces/tenant1/horizontalpodautoscalers/nuxeo-company-a-worker
  uid: 9cac3b05-61eb-447f-bb6d-2580b9c029e9
spec:
  maxReplicas: 3
  minReplicas: 1
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nuxeo-company-a-worker
status:
  currentReplicas: 1
  desiredReplicas: 0
The metric chosen here does not make much sense for scaling, but I wanted to be sure to use a metric with a non-null value.
Still, I get the error:
no metrics returned from custom metrics API
It is worth mentioning that if I use a Metric that does not exist the error is different:
the server could not find the descriptor for metric custom.googleapis.com/nuxeo/dropwizard5_nuxeo.works.global.queue.completed_gauge_idonotexist: googleapi: Error 404: Could not find descriptor for metric ''custom.googleapis.com/nuxeo/dropwizard5_nuxeo.works.global.queue.completed_gauge_idonotexist''., notFound"}
Filtering on wrong label also gives a different error:
unable to get metric custom.googleapis.com|nuxeo|dropwizard5_nuxeo.works.global.queue.completed_gauge:
unable to fetch metrics from custom metrics API: Metric label: \"queue\" is
not allowed"}]'
Looking at the StackDriver adapter logs, we can see errors:
kubectl logs -f -n custom-metrics custom-metrics-stackdriver-adapter-7454cc95c5-vpzc5
E1125 20:54:56.702406 1 writers.go:172] apiserver was unable to write a JSON response: http2: stream closed
E1125 20:54:56.702423 1 writers.go:172] apiserver was unable to write a JSON response: http2: stream closed
E1125 20:54:56.702516 1 writers.go:172] apiserver was unable to write a JSON response: http2: stream closed
E1125 20:54:56.702567 1 writers.go:172] apiserver was unable to write a JSON response: http2: stream closed
E1125 20:54:56.792694 1 writers.go:172] apiserver was unable to write a JSON response: http2: stream closed
E1125 20:54:56.793280 1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http2: stream closed"}
E1125 20:54:56.794387 1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http2: stream closed"}
E1125 20:54:56.795492 1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http2: stream closed"}
E1125 20:54:56.796566 1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http2: stream closed"}
E1125 20:54:56.797690 1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http2: stream closed"}
On the Nuxeo server side, I can see a strange WARN:
kubectl logs -f -n tenant1 nuxeo-company-a-worker-54dd9c8497-9cnsk
2020-11-25T20:03:54,372 WARN [CreateTimeSeriesExporter] ApiException thrown when exporting TimeSeries.
com.google.api.gax.rpc.UnavailableException: io.grpc.StatusRuntimeException: UNAVAILABLE: HTTP/2 error code: NO_ERROR Received Goaway max_age
    at com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:69) ~[gax-1.47.1.jar:1.47.1]
    at com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:72) ~[gax-grpc-1.47.1.jar:1.47.1]
    at com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:60) ~[gax-grpc-1.47.1.jar:1.47.1]
    at com.google.api.gax.grpc.GrpcExceptionCallable$ExceptionTransformingFuture.onFailure(GrpcExceptionCallable.java:97) ~[gax-grpc-1.47.1.jar:1.47.1]
    at com.google.api.core.ApiFutures$1.onFailure(ApiFutures.java:68) ~[api-common-1.8.1.jar:?]
    at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1074) ~[guava-30.0-jre.jar:?]
    at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30) ~[guava-30.0-jre.jar:?]
    at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1213) ~[guava-30.0-jre.jar:?]
    at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:983) ~[guava-30.0-jre.jar:?]
    at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:771) ~[guava-30.0-jre.jar:?]
    at io.grpc.stub.ClientCalls$GrpcFuture.setException(ClientCalls.java:522) ~[grpc-stub-1.27.2.jar:1.27.2]
    at io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose(ClientCalls.java:497) ~[grpc-stub-1.27.2.jar:1.27.2]
    at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:426) ~[grpc-core-1.27.2.jar:1.27.2]
    at io.grpc.internal.ClientCallImpl.access$500(ClientCallImpl.java:66) ~[grpc-core-1.27.2.jar:1.27.2]
    at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:689) ~[grpc-core-1.27.2.jar:1.27.2]
    at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$900(ClientCallImpl.java:577) ~[grpc-core-1.27.2.jar:1.27.2]
    at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:751) ~[grpc-core-1.27.2.jar:1.27.2]
    at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:740) ~[grpc-core-1.27.2.jar:1.27.2]
    at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) ~[grpc-core-1.27.2.jar:1.27.2]
    at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123) ~[grpc-core-1.27.2.jar:1.27.2]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
    at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
    at java.lang.Thread.run(Thread.java:834) [?:?]
    Suppressed: com.google.api.gax.rpc.AsyncTaskException: Asynchronous task failed
        at com.google.api.gax.rpc.ApiExceptions.callAndTranslateApiException(ApiExceptions.java:57) ~[gax-1.47.1.jar:1.47.1]
        at com.google.api.gax.rpc.UnaryCallable.call(UnaryCallable.java:112) ~[gax-1.47.1.jar:1.47.1]
        at com.google.cloud.monitoring.v3.MetricServiceClient.createTimeSeries(MetricServiceClient.java:1156) ~[google-cloud-monitoring-1.82.0.jar:1.82.0]
        at io.opencensus.exporter.stats.stackdriver.CreateTimeSeriesExporter.export(CreateTimeSeriesExporter.java:85) [opencensus-exporter-stats-stackdriver-0.27.1.jar:0.27.1]
        at io.opencensus.exporter.stats.stackdriver.CreateMetricDescriptorExporter.export(CreateMetricDescriptorExporter.java:157) [opencensus-exporter-stats-stackdriver-0.27.1.jar:0.27.1]
        at io.opencensus.exporter.metrics.util.MetricReader.readAndExport(MetricReader.java:167) [opencensus-exporter-metrics-util-0.27.1.jar:0.27.1]
        at io.opencensus.exporter.metrics.util.IntervalMetricReader$Worker.readAndExport(IntervalMetricReader.java:177) [opencensus-exporter-metrics-util-0.27.1.jar:0.27.1]
        at io.opencensus.exporter.metrics.util.IntervalMetricReader$Worker.run(IntervalMetricReader.java:170) [opencensus-exporter-metrics-util-0.27.1.jar:0.27.1]
        at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: io.grpc.StatusRuntimeException: UNAVAILABLE: HTTP/2 error code: NO_ERROR Received Goaway max_age
    at io.grpc.Status.asRuntimeException(Status.java:533) ~[grpc-api-1.27.2.jar:1.27.2]
    ... 15 more
However, I am not sure that there is a direct link between the errors on the StackDriver adapter side and the warning on the Nuxeo side:
- the metrics are available in "Google Metrics"
- see https://github.com/census-instrumentation/opencensus-java/issues/869
I tried to use a metric with no other dimension, but got the same result:
HPA was unable to compute the replica count: unable to get metric custom.googleapis.com|nuxeo|dropwizard5_nuxeo.repositories.repository.documents.create_counter: no metrics returned from custom metrics API"
The custom metrics API itself seems to work:
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq | grep -B 1 -C 7 queue.completed_gauge
{
  "name": "*/custom.googleapis.com|nuxeo|dropwizard5_nuxeo.works.global.queue.completed_gauge",
  "singularName": "",
  "namespaced": true,
  "kind": "MetricValueList",
  "verbs": [
    "get"
  ]
},
Switching to the v1beta2 API or adding the tenant namespace has no effect:
kubectl get -n tenant1 --raw /apis/custom.metrics.k8s.io/v1beta2 | jq | grep -B 1 -C 7 queue.completed_gauge
{
  "name": "*/custom.googleapis.com|nuxeo|dropwizard5_nuxeo.works.global.queue.completed_gauge",
  "singularName": "",
  "namespaced": true,
  "kind": "MetricValueList",
  "verbs": [
    "get"
  ]
},
So it seems that the API exists and works, but returns no data:
kubectl get -n tenant1 --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/tenant1/pods/*/custom.googleapis.com|nuxeo|dropwizard5_nuxeo.works.global.queue.completed_gauge"
{"kind":"MetricValueList","apiVersion":"custom.metrics.k8s.io/v1beta1","metadata":{"selfLink":"/apis/custom.metrics.k8s.io/v1beta1/namespaces/tenant1/pods/%2A/custom.googleapis.com%7Cnuxeo%7Cdropwizard5_nuxeo.works.global.queue.completed_gauge"},"items":[]}
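As a minimal sketch of the check above, the following builds the raw custom metrics API path the way it appears in the selfLink (the "/" of the StackDriver metric type replaced by "|", then percent-encoded) and tests whether a response carries any sample. The function names are hypothetical helpers; the path layout and the empty response are taken from the output in this ticket.

```python
import json
from urllib.parse import quote


def adapter_metric_name(stackdriver_type: str) -> str:
    """The adapter exposes 'a/b/c' metric types as 'a|b|c'."""
    return stackdriver_type.replace("/", "|")


def pods_metric_path(namespace: str, stackdriver_type: str,
                     pod: str = "*", api_version: str = "v1beta1") -> str:
    """Percent-encoded path for a per-pod metric on the custom metrics API."""
    name = adapter_metric_name(stackdriver_type)
    return (f"/apis/custom.metrics.k8s.io/{api_version}"
            f"/namespaces/{quote(namespace, safe='')}"
            f"/pods/{quote(pod, safe='')}/{quote(name, safe='')}")


def has_samples(metric_value_list_json: str) -> bool:
    """True when a MetricValueList response contains at least one item."""
    return bool(json.loads(metric_value_list_json).get("items"))


if __name__ == "__main__":
    metric = "custom.googleapis.com/nuxeo/dropwizard5_nuxeo.works.global.queue.completed_gauge"
    print(pods_metric_path("tenant1", metric))
    print(has_samples('{"kind":"MetricValueList","items":[]}'))
```

An empty "items" list is exactly the situation that the HPA reports as "no metrics returned from custom metrics API".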
Is it all about "labels"?
Looking at the README of the Custom Metrics StackDriver Adapter, I see:
"Metrics attached to Kubernetes objects, such as Pod or Node, can be retrieved via Custom Metrics API."
=> this may mean that our metrics can only be retrieved via External Metrics?
"your metric descriptor needs to meet following requirements:
- metricType = DOUBLE or INT64
- resource_type set to one of k8s_pod, k8s_node for new resource model
- all resource labels for your monitored resource set to correct values, in particular pod_id, pod_name, namespace_name"
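On the External Metrics question raised above, here is a minimal, untested sketch of what an HPA using the External metric type could look like, built as a plain dictionary and printed as JSON (which kubectl apply -f accepts). The HPA name and the target value are hypothetical. The "metric.labels.<label>" selector key format is my reading of the adapter documentation and should be checked against it. autoscaling/v2beta2 matches the clusters available at the time of this ticket; recent clusters use autoscaling/v2.

```python
import json


def external_metric_hpa(name, namespace, deployment, metric_name, metric_labels,
                        target_average, min_replicas, max_replicas):
    """Build an HPA manifest (as a dict) that scales on one External metric."""
    # Assumption: the adapter expects metric labels prefixed with 'metric.labels.'.
    selector = {f"metric.labels.{key}": value for key, value in metric_labels.items()}
    return {
        "apiVersion": "autoscaling/v2beta2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "scaleTargetRef": {"apiVersion": "apps/v1", "kind": "Deployment", "name": deployment},
            "minReplicas": min_replicas,
            "maxReplicas": max_replicas,
            "metrics": [{
                "type": "External",
                "external": {
                    "metric": {"name": metric_name, "selector": {"matchLabels": selector}},
                    "target": {"type": "AverageValue", "averageValue": str(target_average)},
                },
            }],
        },
    }


if __name__ == "__main__":
    manifest = external_metric_hpa(
        name="nuxeo-company-a-worker-external",  # hypothetical HPA name
        namespace="tenant1",
        deployment="nuxeo-company-a-worker",
        metric_name="custom.googleapis.com|nuxeo|dropwizard5_nuxeo.works.global.queue.completed_gauge",
        metric_labels={"queue": "default"},
        target_average=1,
        min_replicas=1,
        max_replicas=3,
    )
    print(json.dumps(manifest, indent=2))
```

External metrics are not tied to a pod, so this would sidestep the resource label requirements quoted above, at the cost of having to filter per tenant through the selector.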
Looking at the Go sample code, it seems clear that the resource labels are expected to contain:
- "project_id": projectId,
- "location": location,
- "cluster_name": clusterName,
- "namespace_name": namespace,
- "pod_name": name,
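The label list above can be turned into a small pre-flight check. This is a minimal sketch that assembles the five labels from environment variables and reports the ones that are missing. The environment variable names are hypothetical (for instance values injected through the Kubernetes Downward API or the deployment template); only the label names come from the Go sample.

```python
import os

# Resource labels listed in the adapter's Go sample for the new resource model.
REQUIRED_K8S_POD_LABELS = ("project_id", "location", "cluster_name",
                           "namespace_name", "pod_name")

# Hypothetical environment variable names: adjust to whatever the deployment injects.
ENV_FOR_LABEL = {
    "project_id": "GCP_PROJECT_ID",
    "location": "GCP_LOCATION",
    "cluster_name": "CLUSTER_NAME",
    "namespace_name": "POD_NAMESPACE",
    "pod_name": "POD_NAME",
}


def labels_from_env(env=None):
    """Collect the resource labels from the environment (empty string when unset)."""
    env = os.environ if env is None else env
    return {label: env.get(var, "") for label, var in ENV_FOR_LABEL.items()}


def missing_labels(labels):
    """Names of required labels that are absent or empty."""
    return [name for name in REQUIRED_K8S_POD_LABELS if not labels.get(name)]


if __name__ == "__main__":
    print(missing_labels(labels_from_env()))
```

A non-empty result would indicate that the exported time series cannot be matched to a pod by the adapter, under the assumption that all five labels are indeed mandatory.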
My current understanding is that we should be calling StackdriverStatsConfiguration.setConstantLabels() to set the proper labels and enable usage from within HPA.
At this point, I do not see how to change that from the outside.
is required by: NXP-29707 Support GCP Stackdriver monitoring for metrics and traces (Resolved)