Nuxeo Platform / NXP-29902

Allow leveraging Nuxeo Stackdriver metrics inside a K8s HPA


    Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Docker Image, Monitoring

      Description

      Goal

      Nuxeo exposes metrics via Stackdriver.
      Inside K8s, we can define a HorizontalPodAutoscaler (HPA) that adds or removes pods in a deployment depending on metrics.
      K8s also supports "custom metrics", and we would like to make K8s deployments scale based on Nuxeo metrics (see the sketch after this list):

      • scale out Nuxeo API nodes if the request throughput becomes too high
      • scale out worker nodes if the processing lag becomes too big
      • ...
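
      For illustration, a minimal sketch of what such an HPA could look like, assuming the autoscaling/v2beta2 API and the Stackdriver custom-metrics adapter; the deployment name and target value are placeholders, only the metric name comes from this ticket:

      apiVersion: autoscaling/v2beta2
      kind: HorizontalPodAutoscaler
      metadata:
        name: nuxeo-worker
      spec:
        scaleTargetRef:
          apiVersion: apps/v1
          kind: Deployment
          name: nuxeo-worker
        minReplicas: 1
        maxReplicas: 4
        metrics:
          - type: Pods
            pods:
              metric:
                # Stackdriver custom metric as exposed by the adapter ("/" replaced by "|")
                name: custom.googleapis.com|nuxeo|dropwizard5_nuxeo.works.global.queue.completed_gauge
              target:
                type: AverageValue
                averageValue: "100"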

      Tests and experiments done so far

      Activated the Stackdriver integration at the Nuxeo level:

          metrics.stackdriver.enabled=true
          metrics.stackdriver.gcpProjectId=XXX
          metrics.enabled=true
      

      Metrics Explorer

      I am then able to read metrics from the Metrics Explorer:

      fetch k8s_container
      | metric
          'custom.googleapis.com/nuxeo/dropwizard5_nuxeo.works.global.queue.completed_gauge'
      | filter
          (resource.cluster_name == 'multitenants') && (metric.queue == 'default')
      | group_by 1m, [value_completed_gauge_mean: mean(value.completed_gauge)]
      | every 1m
      | group_by [],
          [value_completed_gauge_mean_aggregate:
             aggregate(value_completed_gauge_mean)]
      

      Deploy the Stackdriver Adapter

      Following the documentation, I deployed the custom-metrics-stackdriver-adapter:

      
      kubectl create clusterrolebinding cluster-admin-binding \
          --clusterrole cluster-admin --user "$(gcloud config get-value account)"
      
      kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/k8s-stackdriver/master/custom-metrics-stackdriver-adapter/deploy/production/adapter_new_resource_model.yaml
      
      

      The underlying pod seems to start without issue:

      kubectl get pods -n custom-metrics
      NAME                                                  READY   STATUS    RESTARTS   AGE
      custom-metrics-stackdriver-adapter-7454cc95c5-vpzc5   1/1     Running   0          67m
      

      The APIService registrations seem to be OK:

      kubectl get apiservices |grep metrics
      v1beta1.custom.metrics.k8s.io          custom-metrics/custom-metrics-stackdriver-adapter   True        69m
      v1beta1.external.metrics.k8s.io        custom-metrics/custom-metrics-stackdriver-adapter   True        69m
      v1beta1.metrics.k8s.io                 kube-system/metrics-server                          True        33d
      v1beta2.custom.metrics.k8s.io          custom-metrics/custom-metrics-stackdriver-adapter   True        69m
      

      Create the HPA

      I created a HorizontalPodAutoscaler from the GKE UI (workload => deployment => Autoscaling):

      kubectl get hpa -n tenant1
      NAME                     REFERENCE                           TARGETS       MINPODS   MAXPODS   REPLICAS   AGE
      nuxeo-company-a-api      Deployment/nuxeo-company-a-api      <unknown>/1   1         4         2          3h56m
      nuxeo-company-a-worker   Deployment/nuxeo-company-a-worker   <unknown>/1   1         3         1          48m
      
      kubectl get hpa -n tenant1 nuxeo-company-a-worker -o yaml
      apiVersion: autoscaling/v1
      kind: HorizontalPodAutoscaler
      metadata:
        annotations:
          autoscaling.alpha.kubernetes.io/conditions: '[{"type":"AbleToScale","status":"True","lastTransitionTime":"2020-11-25T19:56:52Z","reason":"SucceededGetScale","message":"the
            HPA controller was able to get the target''s current scale"},{"type":"ScalingActive","status":"False","lastTransitionTime":"2020-11-25T19:56:52Z","reason":"FailedGetPodsMetric","message":"the
            HPA was unable to compute the replica count: unable to get metric custom.googleapis.com|nuxeo|dropwizard5_nuxeo.works.global.queue.completed_gauge:
            no metrics returned from custom metrics API"}]'
          autoscaling.alpha.kubernetes.io/metrics: '[{"type":"Pods","pods":{"metricName":"custom.googleapis.com|nuxeo|dropwizard5_nuxeo.works.global.queue.completed_gauge","targetAverageValue":"1","selector":{}}}]'
        creationTimestamp: "2020-11-25T19:56:30Z"
        name: nuxeo-company-a-worker
        namespace: tenant1
        resourceVersion: "13343121"
        selfLink: /apis/autoscaling/v1/namespaces/tenant1/horizontalpodautoscalers/nuxeo-company-a-worker
        uid: 9cac3b05-61eb-447f-bb6d-2580b9c029e9
      spec:
        maxReplicas: 3
        minReplicas: 1
        scaleTargetRef:
          apiVersion: apps/v1
          kind: Deployment
          name: nuxeo-company-a-worker
      status:
        currentReplicas: 1
        desiredReplicas: 0
      

      Here the chosen metric does not make much sense for autoscaling, but I wanted to be sure to use a metric with a non-null value.

      Still, I get the error:

      no metrics returned from custom metrics API
      

      It is worth mentioning that if I use a metric that does not exist, the error is different:

      the server could not find the
            descriptor for metric custom.googleapis.com/nuxeo/dropwizard5_nuxeo.works.global.queue.completed_gauge_idonotexist:
            googleapi: Error 404: Could not find descriptor for metric ''custom.googleapis.com/nuxeo/dropwizard5_nuxeo.works.global.queue.completed_gauge_idonotexist''.,
            notFound"}
      

      Filtering on a wrong label also gives a different error:

       unable to get metric custom.googleapis.com|nuxeo|dropwizard5_nuxeo.works.global.queue.completed_gauge:
            unable to fetch metrics from custom metrics API: Metric label: \"queue\" is
            not allowed"}]'
      

      Looking at the Stackdriver adapter logs, we can see errors:

      kubectl logs -f  -n custom-metrics custom-metrics-stackdriver-adapter-7454cc95c5-vpzc5
      E1125 20:54:56.702406       1 writers.go:172] apiserver was unable to write a JSON response: http2: stream closed
      E1125 20:54:56.702423       1 writers.go:172] apiserver was unable to write a JSON response: http2: stream closed
      E1125 20:54:56.702516       1 writers.go:172] apiserver was unable to write a JSON response: http2: stream closed
      E1125 20:54:56.702567       1 writers.go:172] apiserver was unable to write a JSON response: http2: stream closed
      E1125 20:54:56.792694       1 writers.go:172] apiserver was unable to write a JSON response: http2: stream closed
      E1125 20:54:56.793280       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http2: stream closed"}
      E1125 20:54:56.794387       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http2: stream closed"}
      E1125 20:54:56.795492       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http2: stream closed"}
      E1125 20:54:56.796566       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http2: stream closed"}
      E1125 20:54:56.797690       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http2: stream closed"}
      

      Looking at the Nuxeo server logs, I can find a strange WARN:

      kubectl logs -f  -n tenant1 nuxeo-company-a-worker-54dd9c8497-9cnsk
      
      2020-11-25T20:03:54,372 WARN  [CreateTimeSeriesExporter] ApiException thrown when exporting TimeSeries.
      com.google.api.gax.rpc.UnavailableException: io.grpc.StatusRuntimeException: UNAVAILABLE: HTTP/2 error code: NO_ERROR
      Received Goaway
      max_age
      	at com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:69) ~[gax-1.47.1.jar:1.47.1]
      	at com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:72) ~[gax-grpc-1.47.1.jar:1.47.1]
      	at com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:60) ~[gax-grpc-1.47.1.jar:1.47.1]
      	at com.google.api.gax.grpc.GrpcExceptionCallable$ExceptionTransformingFuture.onFailure(GrpcExceptionCallable.java:97) ~[gax-grpc-1.47.1.jar:1.47.1]
      	at com.google.api.core.ApiFutures$1.onFailure(ApiFutures.java:68) ~[api-common-1.8.1.jar:?]
      	at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1074) ~[guava-30.0-jre.jar:?]
      	at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30) ~[guava-30.0-jre.jar:?]
      	at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1213) ~[guava-30.0-jre.jar:?]
      	at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:983) ~[guava-30.0-jre.jar:?]
      	at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:771) ~[guava-30.0-jre.jar:?]
      	at io.grpc.stub.ClientCalls$GrpcFuture.setException(ClientCalls.java:522) ~[grpc-stub-1.27.2.jar:1.27.2]
      	at io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose(ClientCalls.java:497) ~[grpc-stub-1.27.2.jar:1.27.2]
      	at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:426) ~[grpc-core-1.27.2.jar:1.27.2]
      	at io.grpc.internal.ClientCallImpl.access$500(ClientCallImpl.java:66) ~[grpc-core-1.27.2.jar:1.27.2]
      	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:689) ~[grpc-core-1.27.2.jar:1.27.2]
      	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$900(ClientCallImpl.java:577) ~[grpc-core-1.27.2.jar:1.27.2]
      	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:751) ~[grpc-core-1.27.2.jar:1.27.2]
      	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:740) ~[grpc-core-1.27.2.jar:1.27.2]
      	at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) ~[grpc-core-1.27.2.jar:1.27.2]
      	at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123) ~[grpc-core-1.27.2.jar:1.27.2]
      	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
      	at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
      	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
      	at java.lang.Thread.run(Thread.java:834) [?:?]
      	Suppressed: com.google.api.gax.rpc.AsyncTaskException: Asynchronous task failed
      		at com.google.api.gax.rpc.ApiExceptions.callAndTranslateApiException(ApiExceptions.java:57) ~[gax-1.47.1.jar:1.47.1]
      		at com.google.api.gax.rpc.UnaryCallable.call(UnaryCallable.java:112) ~[gax-1.47.1.jar:1.47.1]
      		at com.google.cloud.monitoring.v3.MetricServiceClient.createTimeSeries(MetricServiceClient.java:1156) ~[google-cloud-monitoring-1.82.0.jar:1.82.0]
      		at io.opencensus.exporter.stats.stackdriver.CreateTimeSeriesExporter.export(CreateTimeSeriesExporter.java:85) [opencensus-exporter-stats-stackdriver-0.27.1.jar:0.27.1]
      		at io.opencensus.exporter.stats.stackdriver.CreateMetricDescriptorExporter.export(CreateMetricDescriptorExporter.java:157) [opencensus-exporter-stats-stackdriver-0.27.1.jar:0.27.1]
      		at io.opencensus.exporter.metrics.util.MetricReader.readAndExport(MetricReader.java:167) [opencensus-exporter-metrics-util-0.27.1.jar:0.27.1]
      		at io.opencensus.exporter.metrics.util.IntervalMetricReader$Worker.readAndExport(IntervalMetricReader.java:177) [opencensus-exporter-metrics-util-0.27.1.jar:0.27.1]
      		at io.opencensus.exporter.metrics.util.IntervalMetricReader$Worker.run(IntervalMetricReader.java:170) [opencensus-exporter-metrics-util-0.27.1.jar:0.27.1]
      		at java.lang.Thread.run(Thread.java:834) [?:?]
      Caused by: io.grpc.StatusRuntimeException: UNAVAILABLE: HTTP/2 error code: NO_ERROR
      Received Goaway
      max_age
      	at io.grpc.Status.asRuntimeException(Status.java:533) ~[grpc-api-1.27.2.jar:1.27.2]
      	... 15 more
      
      

      However, I am not sure there is a direct correspondence between the error on the Stackdriver adapter side and the error on the Nuxeo side.

      I tried to use a metric with no other dimension, but got the same result:

      HPA was unable to compute the replica count: unable to get metric custom.googleapis.com|nuxeo|dropwizard5_nuxeo.repositories.repository.documents.create_counter:
            no metrics returned from custom metrics API"
      

      The custom.metrics.k8s.io API seems to work OK:

      kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq | grep -B 1 -C 7 queue.completed_gauge
          {
            "name": "*/custom.googleapis.com|nuxeo|dropwizard5_nuxeo.works.global.queue.completed_gauge",
            "singularName": "",
            "namespaced": true,
            "kind": "MetricValueList",
            "verbs": [
              "get"
            ]
          },
      

      Switching to the v1beta2 API or adding the tenant namespace has no effect:

      kubectl get -n tenant1 --raw /apis/custom.metrics.k8s.io/v1beta2 | jq | grep -B 1 -C 7 queue.completed_gauge
          {
            "name": "*/custom.googleapis.com|nuxeo|dropwizard5_nuxeo.works.global.queue.completed_gauge",
            "singularName": "",
            "namespaced": true,
            "kind": "MetricValueList",
            "verbs": [
              "get"
            ]
          },
      

      So it seems that the API exists and works, but returns no data:

      kubectl get -n tenant1 --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/tenant1/pods/*/custom.googleapis.com|nuxeo|dropwizard5_nuxeo.works.global.queue.completed_gauge"
      
      {"kind":"MetricValueList","apiVersion":"custom.metrics.k8s.io/v1beta1","metadata":{"selfLink":"/apis/custom.metrics.k8s.io/v1beta1/namespaces/tenant1/pods/%2A/custom.googleapis.com%7Cnuxeo%7Cdropwizard5_nuxeo.works.global.queue.completed_gauge"},"items":[]}
      
      

      Is it all about "labels"?

      Looking at the README of the Custom Metrics Stackdriver Adapter, I see:

      "Metrics attached to Kubernetes objects, such as Pod or Node, can be retrieved via Custom Metrics API."

      https://github.com/GoogleCloudPlatform/k8s-stackdriver/blob/master/custom-metrics-stackdriver-adapter/README.md#metrics-available-from-stackdriver

      => this may mean that our metrics can only be retrieved via the External Metrics API?
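
      If so, here is a hedged sketch of what an HPA based on an External metric could look like; it assumes (untested) that the adapter exposes the same pipe-separated metric name through external.metrics.k8s.io and accepts metric.labels.* selector keys, and the deployment name and target value are placeholders:

      apiVersion: autoscaling/v2beta2
      kind: HorizontalPodAutoscaler
      metadata:
        name: nuxeo-worker
      spec:
        scaleTargetRef:
          apiVersion: apps/v1
          kind: Deployment
          name: nuxeo-worker
        minReplicas: 1
        maxReplicas: 3
        metrics:
          - type: External
            external:
              metric:
                name: custom.googleapis.com|nuxeo|dropwizard5_nuxeo.works.global.queue.completed_gauge
                selector:
                  matchLabels:
                    # filter on the metric label used in the MQL query above
                    metric.labels.queue: default
              target:
                type: AverageValue
                averageValue: "100"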

      "your metric descriptor needs to meet following requirements: metricType = DOUBLE or INT64"

      "resource_type set to one of k8s_pod, k8s_node for the new resource model"

      "All resource labels for your monitored resource set to correct values, in particular pod_id, pod_name, namespace_name"

      https://github.com/GoogleCloudPlatform/k8s-stackdriver/blob/master/custom-metrics-stackdriver-adapter/README.md#export-custom-metrics-to-stackdriver

      Looking at the Go sample code, it seems clear that the labels are expected to contain:

      • "project_id": projectId,
      • "location": location,
      • "cluster_name": clusterName,
      • "namespace_name": namespace,
      • "pod_name": name,

      My current understanding is that we should be calling StackdriverStatsConfiguration.setConstantLabels() to set the proper labels and enable usage from within the HPA.

      At this point, I do not see how to change that from the outside.
