- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Affects Version/s: None
- Component/s: Streams
- Tags:
- Sprint: nxplatform #93
- Story Points: 3
The single StreamMetricsComputation (stream/metrics) instance in the cluster, which is elected to report all consumer lags and latencies as Dropwizard metrics, regularly encounters stream failures.
This always happens during an MSK rolling upgrade. Even though the upgrade takes only ~5 min per broker, the stream/metrics computation lists all topics and all consumers and requires every broker to respond. On a large MSK cluster the rolling upgrade as a whole can take up to 30 min to complete, so the current retry policy (5 retries, 1 s backoff) is not enough to avoid the failure.
The retry policy should keep retrying for 30 min.
stream/metricsPool-00,in:0,inCheckpoint:0,out:0,lastRead:1689297154153,lastTimer:1689336796434,wm:0,loop:9103844,timer
Computation: stream/metrics fails last record: null, after retries.
org.apache.kafka.common.errors.TimeoutException: Failed to get offsets by times in 30000ms
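The time-bounded retry requested above could look roughly like the following. This is a minimal, library-free sketch and not the actual Nuxeo fix: the class `TimeBoundedRetry`, the method `retryFor` and its parameters are hypothetical names introduced for illustration. The point is that the stop condition is an elapsed-time budget (for example 30 min) rather than a fixed attempt count.

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper, not part of Nuxeo or Kafka: retries a task until it
// succeeds or until a total time budget has been spent.
final class TimeBoundedRetry {

    private TimeBoundedRetry() {
    }

    /**
     * Calls the task until it returns normally or maxDuration has elapsed,
     * sleeping for 'delay' between attempts. The last failure is rethrown
     * once the budget is exhausted.
     */
    static <T> T retryFor(Callable<T> task, Duration maxDuration, Duration delay) throws Exception {
        long deadline = System.nanoTime() + maxDuration.toNanos();
        while (true) {
            try {
                return task.call();
            } catch (Exception e) {
                long remainingNanos = deadline - System.nanoTime();
                if (remainingNanos <= 0) {
                    throw e; // time budget exhausted, give up with the last error
                }
                // Never sleep past the deadline; sleep at least 1 ms between attempts.
                long remainingMillis = Math.max(1, remainingNanos / 1_000_000);
                Thread.sleep(Math.min(delay.toMillis(), remainingMillis));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Simulated call that fails twice before succeeding (stand-in for a
        // metrics collection that times out while a broker restarts).
        AtomicInteger attempts = new AtomicInteger();
        String result = retryFor(() -> {
            if (attempts.incrementAndGet() < 3) {
                throw new IllegalStateException("broker unavailable");
            }
            return "ok";
        }, Duration.ofSeconds(5), Duration.ofMillis(10));
        System.out.println(result + " after " + attempts.get() + " attempts");
    }
}
```

For the case in this ticket the call would use something like `Duration.ofMinutes(30)` as the budget and a delay of a few seconds. The durations in `main` are shortened only so the demonstration finishes quickly.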
Is related to:
- NXP-32006 On partition rebalance revocation, computation context should be Spare (Resolved)