  1. Nuxeo Platform
  2. NXP-31997

Avoid stream failure on stream/metrics computation during MSK rolling upgrade

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2023.1, 2021.42
    • Component/s: Streams

      Description

      The single StreamMetricsComputation (stream/metrics) instance in the cluster, which is elected to report all consumer lags and latencies as Dropwizard metrics, regularly encounters stream failures.

      This always happens during an MSK rolling upgrade. Even if the upgrade takes only ~5 min per broker, the stream/metrics computation lists all topics and all consumers, which requires every broker to respond. In a large MSK cluster the upgrade can take up to 30 min to complete, so the current retry policy (5 retries with a 1 s backoff) is not enough to avoid the failure.

      The retry policy should keep retrying for 30min.
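
A minimal sketch of a time-bounded retry in plain Java, to illustrate the idea of retrying for a duration rather than for a fixed number of attempts. This is not the Nuxeo retry policy code: the class `TimeBoundedRetry`, the method `callWithRetry`, and its parameters are hypothetical names chosen for illustration only. The action is retried with a fixed delay until it succeeds or a total time budget (for example 30 min) has been spent.

```java
import java.time.Duration;
import java.util.concurrent.Callable;

// Hypothetical helper, not part of Nuxeo or Kafka.
public class TimeBoundedRetry {

    /**
     * Calls the action until it succeeds or maxDuration has elapsed,
     * sleeping for delay between failed attempts.
     * The last failure is rethrown once the time budget is exhausted.
     */
    public static <T> T callWithRetry(Callable<T> action, Duration delay, Duration maxDuration)
            throws Exception {
        long deadline = System.nanoTime() + maxDuration.toNanos();
        while (true) {
            try {
                return action.call();
            } catch (Exception e) {
                long remainingNanos = deadline - System.nanoTime();
                if (remainingNanos <= 0) {
                    // Time budget spent: give up and surface the last error.
                    throw e;
                }
                // Never sleep past the deadline, and sleep at least 1 ms.
                long remainingMs = Math.max(1, remainingNanos / 1_000_000);
                Thread.sleep(Math.min(delay.toMillis(), remainingMs));
            }
        }
    }
}
```

With such a helper, a call like `TimeBoundedRetry.callWithRetry(task, Duration.ofSeconds(1), Duration.ofMinutes(30))` would keep the 1 s backoff from the current policy but tolerate brokers being unavailable for the whole rolling upgrade, assuming `task` is the operation that lists topics and consumers.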

      stream/metricsPool-00,in:0,inCheckpoint:0,out:0,lastRead:1689297154153,lastTimer:1689336796434,wm:0,loop:9103844,timer
      
      Computation: stream/metrics fails last record: null, after retries.
      
      org.apache.kafka.common.errors.TimeoutException: Failed to get offsets by times in 30000ms
      
