The goal is to have an overview of the stream processing at the cluster level in order to:
- quickly understand where there is a bottleneck or problem without having to SSH into each node
- decide how to scale out, scale down, or tune the existing configuration
- build a dedicated scaling metric that can be used by a horizontal auto scaler (HPA).
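As an illustration of what such a scaling metric could look like, here is a minimal sketch that turns consumer-group lag into a desired node count. Everything in it is an assumption made for the example: the `GroupLag` record, the `desiredNodes` method, the `targetLagPerNode` threshold and the "one consumer thread per node" simplification are hypothetical and are not part of Nuxeo or of the Kubernetes HPA API. The only property it relies on is that a consumer group cannot use more concurrent consumers than the stream has partitions.

```java
import java.util.List;

/** Hypothetical sketch: derive a desired node count from consumer-group lag. */
final class ScalingMetricSketch {

    /** Lag of one consumer group on one stream (illustrative type, not a Nuxeo class). */
    record GroupLag(String group, long lag, int partitions) {}

    /**
     * Returns the node count needed so that no node has to absorb more than
     * targetLagPerNode pending records for any group. Assumes one consumer
     * thread per node, so the result is capped by the partition count of the
     * group (extra consumers would stay idle) and is never lower than 1.
     */
    static int desiredNodes(List<GroupLag> groups, long targetLagPerNode) {
        if (targetLagPerNode <= 0) {
            throw new IllegalArgumentException("targetLagPerNode must be positive");
        }
        int desired = 1;
        for (GroupLag g : groups) {
            // Integer ceiling of lag / targetLagPerNode.
            long needed = (g.lag() + targetLagPerNode - 1) / targetLagPerNode;
            long capped = Math.max(1, Math.min(needed, g.partitions()));
            desired = Math.max(desired, (int) capped);
        }
        return desired;
    }

    public static void main(String[] args) {
        List<GroupLag> groups = List.of(
                new GroupLag("bulk/command", 2500, 4),
                new GroupLag("work/default", 10000, 2));
        System.out.println(desiredNodes(groups, 1000));
    }
}
```

In this example the first group needs 3 nodes (2500 pending records, 1000 per node) and the second is capped at 2 by its partition count, so the overall value is 3. A real metric would probably also take latency and the number of threads per node into account.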
Because Nuxeo Stream is used at a low level, this overview will cover all async processing: async listeners, WorkManager, Bulk Service, and of course Nuxeo Stream (when using Kafka).
We want a representation at the cluster level that includes:
- all streams used with their number of partitions
- all Nuxeo nodes that participate in the async processing, with the number of threads for each computation
- the lag and latency for each consumer group
- computation failures
- optionally, for each node: CPU usage and JVM memory pressure
The idea is to report all processor topologies on node start (NXP-29934) and create a specific stream metrics reporter (NXP-29933) that reports on activity. A computation will aggregate both streams and build a representation that will be exposed through REST (NXP-29935).
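The aggregation step can be pictured with the following minimal sketch. It is not the implementation planned in the tickets above: the `TopologyReport` and `MetricsReport` records, their fields, and the `ClusterViewSketch` class are hypothetical names chosen for the example, and plain method calls stand in for records read from the two streams.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: merge topology and metrics reports into one cluster view. */
final class ClusterViewSketch {

    /** Sent once by a node at startup: number of threads per computation. */
    record TopologyReport(String nodeId, Map<String, Integer> threadsByComputation) {}

    /** Sent periodically: lag and latency observed for one consumer group. */
    record MetricsReport(String group, long lag, long latencyMs) {}

    private final Map<String, Map<String, Integer>> threadsByNode = new TreeMap<>();
    private final Map<String, MetricsReport> latestByGroup = new TreeMap<>();

    /** A newer topology report from the same node replaces the previous one. */
    void onTopology(TopologyReport report) {
        threadsByNode.put(report.nodeId(), Map.copyOf(report.threadsByComputation()));
    }

    /** Only the most recent metrics report is kept for each consumer group. */
    void onMetrics(MetricsReport report) {
        latestByGroup.put(report.group(), report);
    }

    /** Total threads running a computation across all known nodes. */
    int totalThreads(String computation) {
        int total = 0;
        for (Map<String, Integer> threads : threadsByNode.values()) {
            total += threads.getOrDefault(computation, 0);
        }
        return total;
    }

    /** Latest known lag for a group, or -1 when nothing has been reported yet. */
    long lag(String group) {
        MetricsReport report = latestByGroup.get(group);
        return report == null ? -1 : report.lag();
    }

    public static void main(String[] args) {
        ClusterViewSketch view = new ClusterViewSketch();
        view.onTopology(new TopologyReport("node-1", Map.of("bulk/scroller", 2)));
        view.onTopology(new TopologyReport("node-2", Map.of("bulk/scroller", 4)));
        view.onMetrics(new MetricsReport("bulk/scroller", 120, 3500));
        System.out.println(view.totalThreads("bulk/scroller") + " threads, lag " + view.lag("bulk/scroller"));
    }
}
```

Keeping only the latest report per node and per group keeps the view small and makes the aggregation idempotent if a report is replayed. It does not handle nodes that leave the cluster, which a real implementation would need to address.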