Nuxeo Platform / NXP-29902

Allow leveraging Nuxeo Stackdriver metrics inside a K8s HPA


    Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Docker Image, Monitoring

      Description

      Goal

      Nuxeo exposes metrics via Stackdriver.
      Inside K8s, we can define a HorizontalPodAutoscaler (HPA) that adds or removes pods in a deployment depending on metrics.
      K8s also supports "custom metrics", and we would like to make K8s deployments scale based on Nuxeo metrics (see the sketch after this list):

      • scale out Nuxeo API nodes if the request throughput becomes too high
      • scale out worker nodes if the processing lag becomes too big
      • ...
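
      For illustration, a minimal sketch of what such an HPA could look like, assuming the autoscaling/v2beta2 API and the Stackdriver custom-metrics adapter; the deployment name and target value are placeholders, only the metric name comes from this ticket:

      apiVersion: autoscaling/v2beta2
      kind: HorizontalPodAutoscaler
      metadata:
        name: nuxeo-worker
      spec:
        scaleTargetRef:
          apiVersion: apps/v1
          kind: Deployment
          name: nuxeo-worker
        minReplicas: 1
        maxReplicas: 4
        metrics:
          - type: Pods
            pods:
              metric:
                # Stackdriver custom metric as exposed by the adapter ("/" replaced by "|")
                name: custom.googleapis.com|nuxeo|dropwizard5_nuxeo.works.global.queue.completed_gauge
              target:
                type: AverageValue
                averageValue: "100"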

      Tests and experiments done so far

      Activated the Stackdriver integration at the Nuxeo level:

          metrics.stackdriver.enabled=true
          metrics.stackdriver.gcpProjectId=XXX
          metrics.enabled=true
      

      Metrics Explorer

      I am then able to read metrics from the Metrics Explorer:

      fetch k8s_container
      | metric
          'custom.googleapis.com/nuxeo/dropwizard5_nuxeo.works.global.queue.completed_gauge'
      | filter
          (resource.cluster_name == 'multitenants') && (metric.queue == 'default')
      | group_by 1m, [value_completed_gauge_mean: mean(value.completed_gauge)]
      | every 1m
      | group_by [],
          [value_completed_gauge_mean_aggregate:
             aggregate(value_completed_gauge_mean)]
      

      Deploy the Stackdriver Adapter

      Following the documentation, I deployed the custom-metrics-stackdriver-adapter:

      
      kubectl create clusterrolebinding cluster-admin-binding \
          --clusterrole cluster-admin --user "$(gcloud config get-value account)"
      
      kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/k8s-stackdriver/master/custom-metrics-stackdriver-adapter/deploy/production/adapter_new_resource_model.yaml
      
      

      The underlying pod seems to start without issue:

      kubectl get pods -n custom-metrics
      NAME                                                  READY   STATUS    RESTARTS   AGE
      custom-metrics-stackdriver-adapter-7454cc95c5-vpzc5   1/1     Running   0          67m
      

      The APIService registrations seem to be OK:

      kubectl get apiservices |grep metrics
      v1beta1.custom.metrics.k8s.io          custom-metrics/custom-metrics-stackdriver-adapter   True        69m
      v1beta1.external.metrics.k8s.io        custom-metrics/custom-metrics-stackdriver-adapter   True        69m
      v1beta1.metrics.k8s.io                 kube-system/metrics-server                          True        33d
      v1beta2.custom.metrics.k8s.io          custom-metrics/custom-metrics-stackdriver-adapter   True        69m
      

      Create the HPA

      I created a HorizontalPodAutoscaler from the GKE UI (workload => deployment => Autoscaling):

      kubectl get hpa -n tenant1
      NAME                     REFERENCE                           TARGETS       MINPODS   MAXPODS   REPLICAS   AGE
      nuxeo-company-a-api      Deployment/nuxeo-company-a-api      <unknown>/1   1         4         2          3h56m
      nuxeo-company-a-worker   Deployment/nuxeo-company-a-worker   <unknown>/1   1         3         1          48m
      
      kubectl get hpa -n tenant1 nuxeo-company-a-worker -o yaml
      apiVersion: autoscaling/v1
      kind: HorizontalPodAutoscaler
      metadata:
        annotations:
          autoscaling.alpha.kubernetes.io/conditions: '[{"type":"AbleToScale","status":"True","lastTransitionTime":"2020-11-25T19:56:52Z","reason":"SucceededGetScale","message":"the
            HPA controller was able to get the target''s current scale"},{"type":"ScalingActive","status":"False","lastTransitionTime":"2020-11-25T19:56:52Z","reason":"FailedGetPodsMetric","message":"the
            HPA was unable to compute the replica count: unable to get metric custom.googleapis.com|nuxeo|dropwizard5_nuxeo.works.global.queue.completed_gauge:
            no metrics returned from custom metrics API"}]'
          autoscaling.alpha.kubernetes.io/metrics: '[{"type":"Pods","pods":{"metricName":"custom.googleapis.com|nuxeo|dropwizard5_nuxeo.works.global.queue.completed_gauge","targetAverageValue":"1","selector":{}}}]'
        creationTimestamp: "2020-11-25T19:56:30Z"
        name: nuxeo-company-a-worker
        namespace: tenant1
        resourceVersion: "13343121"
        selfLink: /apis/autoscaling/v1/namespaces/tenant1/horizontalpodautoscalers/nuxeo-company-a-worker
        uid: 9cac3b05-61eb-447f-bb6d-2580b9c029e9
      spec:
        maxReplicas: 3
        minReplicas: 1
        scaleTargetRef:
          apiVersion: apps/v1
          kind: Deployment
          name: nuxeo-company-a-worker
      status:
        currentReplicas: 1
        desiredReplicas: 0
      

      Here the chosen metric does not make much sense for autoscaling, but I wanted to be sure to use a metric with a non-null value.

      Still, I get the error:

      no metrics returned from custom metrics API
      

      It is worth mentioning that if I use a metric that does not exist, the error is different:

      the server could not find the
            descriptor for metric custom.googleapis.com/nuxeo/dropwizard5_nuxeo.works.global.queue.completed_gauge_idonotexist:
            googleapi: Error 404: Could not find descriptor for metric ''custom.googleapis.com/nuxeo/dropwizard5_nuxeo.works.global.queue.completed_gauge_idonotexist''.,
            notFound"}
      

      Filtering on a wrong label also gives a different error:

       unable to get metric custom.googleapis.com|nuxeo|dropwizard5_nuxeo.works.global.queue.completed_gauge:
            unable to fetch metrics from custom metrics API: Metric label: \"queue\" is
            not allowed"}]'
      

      Looking at the Stackdriver adapter logs, we can see errors:

      kubectl logs -f  -n custom-metrics custom-metrics-stackdriver-adapter-7454cc95c5-vpzc5
      E1125 20:54:56.702406       1 writers.go:172] apiserver was unable to write a JSON response: http2: stream closed
      E1125 20:54:56.702423       1 writers.go:172] apiserver was unable to write a JSON response: http2: stream closed
      E1125 20:54:56.702516       1 writers.go:172] apiserver was unable to write a JSON response: http2: stream closed
      E1125 20:54:56.702567       1 writers.go:172] apiserver was unable to write a JSON response: http2: stream closed
      E1125 20:54:56.792694       1 writers.go:172] apiserver was unable to write a JSON response: http2: stream closed
      E1125 20:54:56.793280       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http2: stream closed"}
      E1125 20:54:56.794387       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http2: stream closed"}
      E1125 20:54:56.795492       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http2: stream closed"}
      E1125 20:54:56.796566       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http2: stream closed"}
      E1125 20:54:56.797690       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http2: stream closed"}
      

      Looking at the Nuxeo server logs, I can find a strange WARN:

      kubectl logs -f  -n tenant1 nuxeo-company-a-worker-54dd9c8497-9cnsk
      
      2020-11-25T20:03:54,372 WARN  [CreateTimeSeriesExporter] ApiException thrown when exporting TimeSeries.
      com.google.api.gax.rpc.UnavailableException: io.grpc.StatusRuntimeException: UNAVAILABLE: HTTP/2 error code: NO_ERROR
      Received Goaway
      max_age
      	at com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:69) ~[gax-1.47.1.jar:1.47.1]
      	at com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:72) ~[gax-grpc-1.47.1.jar:1.47.1]
      	at com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:60) ~[gax-grpc-1.47.1.jar:1.47.1]
      	at com.google.api.gax.grpc.GrpcExceptionCallable$ExceptionTransformingFuture.onFailure(GrpcExceptionCallable.java:97) ~[gax-grpc-1.47.1.jar:1.47.1]
      	at com.google.api.core.ApiFutures$1.onFailure(ApiFutures.java:68) ~[api-common-1.8.1.jar:?]
      	at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1074) ~[guava-30.0-jre.jar:?]
      	at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30) ~[guava-30.0-jre.jar:?]
      	at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1213) ~[guava-30.0-jre.jar:?]
      	at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:983) ~[guava-30.0-jre.jar:?]
      	at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:771) ~[guava-30.0-jre.jar:?]
      	at io.grpc.stub.ClientCalls$GrpcFuture.setException(ClientCalls.java:522) ~[grpc-stub-1.27.2.jar:1.27.2]
      	at io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose(ClientCalls.java:497) ~[grpc-stub-1.27.2.jar:1.27.2]
      	at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:426) ~[grpc-core-1.27.2.jar:1.27.2]
      	at io.grpc.internal.ClientCallImpl.access$500(ClientCallImpl.java:66) ~[grpc-core-1.27.2.jar:1.27.2]
      	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:689) ~[grpc-core-1.27.2.jar:1.27.2]
      	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$900(ClientCallImpl.java:577) ~[grpc-core-1.27.2.jar:1.27.2]
      	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:751) ~[grpc-core-1.27.2.jar:1.27.2]
      	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:740) ~[grpc-core-1.27.2.jar:1.27.2]
      	at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) ~[grpc-core-1.27.2.jar:1.27.2]
      	at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123) ~[grpc-core-1.27.2.jar:1.27.2]
      	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
      	at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
      	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
      	at java.lang.Thread.run(Thread.java:834) [?:?]
      	Suppressed: com.google.api.gax.rpc.AsyncTaskException: Asynchronous task failed
      		at com.google.api.gax.rpc.ApiExceptions.callAndTranslateApiException(ApiExceptions.java:57) ~[gax-1.47.1.jar:1.47.1]
      		at com.google.api.gax.rpc.UnaryCallable.call(UnaryCallable.java:112) ~[gax-1.47.1.jar:1.47.1]
      		at com.google.cloud.monitoring.v3.MetricServiceClient.createTimeSeries(MetricServiceClient.java:1156) ~[google-cloud-monitoring-1.82.0.jar:1.82.0]
      		at io.opencensus.exporter.stats.stackdriver.CreateTimeSeriesExporter.export(CreateTimeSeriesExporter.java:85) [opencensus-exporter-stats-stackdriver-0.27.1.jar:0.27.1]
      		at io.opencensus.exporter.stats.stackdriver.CreateMetricDescriptorExporter.export(CreateMetricDescriptorExporter.java:157) [opencensus-exporter-stats-stackdriver-0.27.1.jar:0.27.1]
      		at io.opencensus.exporter.metrics.util.MetricReader.readAndExport(MetricReader.java:167) [opencensus-exporter-metrics-util-0.27.1.jar:0.27.1]
      		at io.opencensus.exporter.metrics.util.IntervalMetricReader$Worker.readAndExport(IntervalMetricReader.java:177) [opencensus-exporter-metrics-util-0.27.1.jar:0.27.1]
      		at io.opencensus.exporter.metrics.util.IntervalMetricReader$Worker.run(IntervalMetricReader.java:170) [opencensus-exporter-metrics-util-0.27.1.jar:0.27.1]
      		at java.lang.Thread.run(Thread.java:834) [?:?]
      Caused by: io.grpc.StatusRuntimeException: UNAVAILABLE: HTTP/2 error code: NO_ERROR
      Received Goaway
      max_age
      	at io.grpc.Status.asRuntimeException(Status.java:533) ~[grpc-api-1.27.2.jar:1.27.2]
      	... 15 more
      
      

      However, I am not sure there is a direct correspondence between the error on the Stackdriver adapter side and the error on the Nuxeo side.

      I tried to use a metric with no other dimension, but got the same result:

      HPA was unable to compute the replica count: unable to get metric custom.googleapis.com|nuxeo|dropwizard5_nuxeo.repositories.repository.documents.create_counter:
            no metrics returned from custom metrics API"
      

      The custom.metrics.k8s.io API seems to work OK:

      kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq | grep -B 1 -C 7 queue.completed_gauge
          {
            "name": "*/custom.googleapis.com|nuxeo|dropwizard5_nuxeo.works.global.queue.completed_gauge",
            "singularName": "",
            "namespaced": true,
            "kind": "MetricValueList",
            "verbs": [
              "get"
            ]
          },
      

      Switching to the v1beta2 API or adding the tenant namespace has no effect:

      kubectl get -n tenant1 --raw /apis/custom.metrics.k8s.io/v1beta2 | jq | grep -B 1 -C 7 queue.completed_gauge
          {
            "name": "*/custom.googleapis.com|nuxeo|dropwizard5_nuxeo.works.global.queue.completed_gauge",
            "singularName": "",
            "namespaced": true,
            "kind": "MetricValueList",
            "verbs": [
              "get"
            ]
          },
      

      So it seems that the API exists and works, but returns no data:

      kubectl get -n tenant1 --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/tenant1/pods/*/custom.googleapis.com|nuxeo|dropwizard5_nuxeo.works.global.queue.completed_gauge"
      
      {"kind":"MetricValueList","apiVersion":"custom.metrics.k8s.io/v1beta1","metadata":{"selfLink":"/apis/custom.metrics.k8s.io/v1beta1/namespaces/tenant1/pods/%2A/custom.googleapis.com%7Cnuxeo%7Cdropwizard5_nuxeo.works.global.queue.completed_gauge"},"items":[]}
      
      

      Is it all about "labels"?

      Looking at the README of the Custom Metrics Stackdriver Adapter, I see:

      "Metrics attached to Kubernetes objects, such as Pod or Node, can be retrieved via Custom Metrics API."

      https://github.com/GoogleCloudPlatform/k8s-stackdriver/blob/master/custom-metrics-stackdriver-adapter/README.md#metrics-available-from-stackdriver

      => this may mean that our metrics can only be retrieved via the External Metrics API?
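
      If so, here is a hedged sketch of what an HPA based on an External metric could look like; it assumes (untested) that the adapter exposes the same pipe-separated metric name through external.metrics.k8s.io and accepts metric.labels.* selector keys, and the deployment name and target value are placeholders:

      apiVersion: autoscaling/v2beta2
      kind: HorizontalPodAutoscaler
      metadata:
        name: nuxeo-worker
      spec:
        scaleTargetRef:
          apiVersion: apps/v1
          kind: Deployment
          name: nuxeo-worker
        minReplicas: 1
        maxReplicas: 3
        metrics:
          - type: External
            external:
              metric:
                name: custom.googleapis.com|nuxeo|dropwizard5_nuxeo.works.global.queue.completed_gauge
                selector:
                  matchLabels:
                    # filter on the metric label used in the MQL query above
                    metric.labels.queue: default
              target:
                type: AverageValue
                averageValue: "100"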

      "your metric descriptor needs to meet following requirements: metricType = DOUBLE or INT64"

      "resource_type set to one of k8s_pod, k8s_node for the new resource model"

      "All resource labels for your monitored resource set to correct values, in particular pod_id, pod_name, namespace_name"

      https://github.com/GoogleCloudPlatform/k8s-stackdriver/blob/master/custom-metrics-stackdriver-adapter/README.md#export-custom-metrics-to-stackdriver

      Looking at the Go sample code, it seems clear that the labels are expected to contain:

      • "project_id": projectId,
      • "location": location,
      • "cluster_name": clusterName,
      • "namespace_name": namespace,
      • "pod_name": name,

      My current understanding is that we should be calling StackdriverStatsConfiguration.setConstantLabels() to set the proper labels and enable usage from within the HPA.

      At this point, I do not see how to change that from the outside.
