- Type: Bug
- Status: Resolved
- Priority: Minor
- Resolution: Fixed
- Affects Version/s: 9.10, 10.10
- Fix Version/s: 10.10-HF17, 11.1, 2021.0
- Component/s: Streams
- Release Notes Summary: The processor recovery after the stream retention period is exhausted has been improved.
- Epic Link:
- Backlog priority: 900
- Team: PLATFORM
- Sprint: nxplatform 11.1.20
- Story Points: 3
The recovery procedure in the following case needs to be improved:
- on day D, a processor fails and stops
- producers continue to append records
Despite the errors visible in logs, metrics (NXP-27471), or probes (NXP-27164), nothing is done during the stream retention period, which defaults to 4 days for CQ (Chronicle Queue) and 7 days for Kafka; these defaults map to the settings below.
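For reference, the retention defaults correspond to the following settings; the CQ duration is a nuxeo.conf property, the Kafka one is the standard broker default in server.properties (values shown are the defaults, adjust per deployment):

# nuxeo.conf: Chronicle Queue retention, default 4 days
nuxeo.stream.chronicle.retention.duration=4d

# Kafka broker server.properties: log retention, default 7 days (168 hours)
log.retention.hours=168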
At this point we start losing data. When we try to recover from this situation, we first fix the cause of the failure (disk full, service down, ...), then we do a rolling restart of the Nuxeo instance to restart the processor.
But because the retention period is exhausted, the last persisted position (the committed offset) is no longer valid: the records for day D have been deleted by the retention policy.
On CQ this raises an error because it is impossible to move to the last committed offset, preventing Nuxeo from starting properly (NXP-28020):
ERROR [main] [org.nuxeo.osgi.OSGiAdapter] Error during Framework Listener execution : class org.nuxeo.runtime.osgi.OSGiRuntimeService
java.lang.IllegalStateException: Unable to move to the last committed offset, ChronicleLogTailer{basePath='/opt/nuxeo-server-10.10-tomcat/nxserver/data/stream/audit/audit', id=AuditLogWriter:audit-00, closed=false, codec=org.nuxeo.lib.stream.codec.NoCodec@43165282} offset: 77584289235576
    at org.nuxeo.lib.stream.log.chronicle.ChronicleLogTailer.toLastCommitted(ChronicleLogTailer.java:175) ~[nuxeo-stream-10.10-HF06.jar:?]
    at org.nuxeo.lib.stream.log.chronicle.ChronicleLogTailer.<init>(ChronicleLogTailer.java:82) ~[nuxeo-stream-10.10-HF06.jar:?]
    at org.nuxeo.lib.stream.log.chronicle.ChronicleLogAppender.createTailer(ChronicleLogAppender.java:315) ~[nuxeo-stream-10.10-HF06.jar:?]
    at org.nuxeo.lib.stream.log.chronicle.ChronicleLogManager.lambda$doCreateTailer$3(ChronicleLogManager.java:215) ~[nuxeo-stream-10.10-HF06.jar:?]
    at java.util.ArrayList.forEach(ArrayList.java:1257) ~[?:1.8.0_201]
    at org.nuxeo.lib.stream.log.chronicle.ChronicleLogManager.doCreateTailer(ChronicleLogManager.java:214) ~[nuxeo-stream-10.10-HF06.jar:?]
    at org.nuxeo.lib.stream.log.internals.AbstractLogManager.createTailer(AbstractLogManager.java:96) ~[nuxeo-stream-10.10-HF06.jar:?]
    at org.nuxeo.lib.stream.computation.log.LogStreamManager.createTailer(LogStreamManager.java:117) ~[nuxeo-stream-10.10-HF06.jar:?]
    at org.nuxeo.lib.stream.computation.log.ComputationRunner.<init>(ComputationRunner.java:113) ~[nuxeo-stream-10.10-HF06.jar:?]
    at org.nuxeo.lib.stream.computation.log.ComputationPool.lambda$start$0(ComputationPool.java:88) ~[nuxeo-stream-10.10-HF06.jar:?]
    at java.util.ArrayList.forEach(ArrayList.java:1257) ~[?:1.8.0_201]
    at org.nuxeo.lib.stream.computation.log.ComputationPool.start(ComputationPool.java:87) ~[nuxeo-stream-10.10-HF06.jar:?]
    at java.util.ArrayList.forEach(ArrayList.java:1257) ~[?:1.8.0_201]
    at org.nuxeo.lib.stream.computation.log.LogStreamProcessor.start(LogStreamProcessor.java:97) ~[nuxeo-stream-10.10-HF06.jar:?]
On Kafka, the consumer option auto.offset.reset is always set to earliest, so a consumer starts from the beginning when the committed position points to a deleted record.
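For comparison, this is how a plain Kafka consumer behaves with that option; the broker address, topic, and group id below are illustrative, not Nuxeo's actual configuration:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class EarliestResetExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-processor");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
        // When the committed offset points to a record already deleted by the
        // retention policy, the consumer rewinds to the earliest available
        // offset instead of failing.
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-stream"));
            ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofSeconds(1));
            // process records...
        }
    }
}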
On CQ, the consumer position needs to be reset manually, either with the stream.sh position command or by removing the CQ offset files on disk; a programmatic equivalent is sketched below.
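A minimal sketch of such a manual reset, assuming the nuxeo-stream 10.10-era API (ChronicleLogManager, LogTailer.reset()); exact signatures vary across versions, and the path, group, and log names are taken from the stack trace above for illustration only:

import java.nio.file.Paths;

import org.nuxeo.lib.stream.computation.Record;
import org.nuxeo.lib.stream.log.LogManager;
import org.nuxeo.lib.stream.log.LogTailer;
import org.nuxeo.lib.stream.log.chronicle.ChronicleLogManager;

public class ResetCQPosition {
    public static void main(String[] args) throws Exception {
        // Nuxeo must be stopped; the path points to the CQ storage of the
        // affected log (illustrative, matching the stack trace above).
        try (LogManager manager = new ChronicleLogManager(
                Paths.get("/opt/nuxeo-server-10.10-tomcat/nxserver/data/stream/audit"))) {
            try (LogTailer<Record> tailer = manager.createTailer("AuditLogWriter", "audit")) {
                // Forget the committed offsets for this group so the processor
                // restarts from the beginning of the retained records.
                tailer.reset();
            }
        }
    }
}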
This should be improved so that no further intervention is needed: an error should be logged and the consumer should start from the beginning, as Kafka does.
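A minimal sketch of that intended behavior, assuming the tailer API visible in the stack trace (toLastCommitted()) plus LogTailer.toStart(); the actual fix in nuxeo-stream may be implemented differently:

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.nuxeo.lib.stream.computation.Record;
import org.nuxeo.lib.stream.log.LogTailer;

public class TailerRecovery {
    private static final Log log = LogFactory.getLog(TailerRecovery.class);

    // Hypothetical helper, not the actual patch: when the committed offset is
    // no longer readable because retention deleted it, log the data loss and
    // rewind to the oldest retained record, like Kafka does with
    // auto.offset.reset=earliest.
    public static void toLastCommittedOrStart(LogTailer<Record> tailer) {
        try {
            tailer.toLastCommitted();
        } catch (IllegalStateException e) {
            log.error("Committed offset for " + tailer + " is beyond the retention"
                    + " period, records have been lost; restarting from the beginning", e);
            tailer.toStart();
        }
    }
}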
- Is related to: NXP-28020 Nuxeo still starts when ChronicleLogTailer fails to start (Resolved)