- Type: Bug
- Status: Resolved
- Priority: Minor
- Resolution: Fixed
- Affects Version/s: 9.10, 10.10
- Fix Version/s: 10.10-HF17, 11.1, 2021.0
- Component/s: Streams
- Release Notes Summary: The processor recovery after the stream retention period is exhausted has been improved.
- Epic Link:
- Backlog priority: 900
- Team: PLATFORM
- Sprint: nxplatform 11.1.20
- Story Points: 3
The recovery procedure in the following case needs to be improved:
- on day D, a processor fails and stops
- producers continue to append records
Despite the errors visible in logs, metrics (NXP-27471), or probes (NXP-27164), nothing is done during the stream retention period, which defaults to 4 days for CQ (Chronicle Queue) and 7 days for Kafka; these defaults map to the settings below.
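For reference, the retention defaults correspond to the following settings; the CQ duration is a nuxeo.conf property, the Kafka one is the standard broker default in server.properties (values shown are the defaults, adjust per deployment):

# nuxeo.conf: Chronicle Queue retention, default 4 days
nuxeo.stream.chronicle.retention.duration=4d

# Kafka broker server.properties: log retention, default 7 days (168 hours)
log.retention.hours=168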
At this point we start losing data. When we try to recover from this situation, we first fix the cause of the failure (disk full, service down, ...), then we do a rolling restart of the Nuxeo instance to restart the processor.
But because the retention period is exhausted, the last persisted position (the committed offset) is no longer valid: the records for day D have been deleted by the retention policy.
On CQ this raises an error because it is impossible to move to the last committed offset, preventing Nuxeo from starting properly (NXP-28020):
ERROR [main] [org.nuxeo.osgi.OSGiAdapter] Error during Framework Listener execution : class org.nuxeo.runtime.osgi.OSGiRuntimeService
java.lang.IllegalStateException: Unable to move to the last committed offset, ChronicleLogTailer{basePath='/opt/nuxeo-server-10.10-tomcat/nxserver/data/stream/audit/audit', id=AuditLogWriter:audit-00, closed=false, codec=org.nuxeo.lib.stream.codec.NoCodec@43165282} offset: 77584289235576
    at org.nuxeo.lib.stream.log.chronicle.ChronicleLogTailer.toLastCommitted(ChronicleLogTailer.java:175) ~[nuxeo-stream-10.10-HF06.jar:?]
    at org.nuxeo.lib.stream.log.chronicle.ChronicleLogTailer.<init>(ChronicleLogTailer.java:82) ~[nuxeo-stream-10.10-HF06.jar:?]
    at org.nuxeo.lib.stream.log.chronicle.ChronicleLogAppender.createTailer(ChronicleLogAppender.java:315) ~[nuxeo-stream-10.10-HF06.jar:?]
    at org.nuxeo.lib.stream.log.chronicle.ChronicleLogManager.lambda$doCreateTailer$3(ChronicleLogManager.java:215) ~[nuxeo-stream-10.10-HF06.jar:?]
    at java.util.ArrayList.forEach(ArrayList.java:1257) ~[?:1.8.0_201]
    at org.nuxeo.lib.stream.log.chronicle.ChronicleLogManager.doCreateTailer(ChronicleLogManager.java:214) ~[nuxeo-stream-10.10-HF06.jar:?]
    at org.nuxeo.lib.stream.log.internals.AbstractLogManager.createTailer(AbstractLogManager.java:96) ~[nuxeo-stream-10.10-HF06.jar:?]
    at org.nuxeo.lib.stream.computation.log.LogStreamManager.createTailer(LogStreamManager.java:117) ~[nuxeo-stream-10.10-HF06.jar:?]
    at org.nuxeo.lib.stream.computation.log.ComputationRunner.<init>(ComputationRunner.java:113) ~[nuxeo-stream-10.10-HF06.jar:?]
    at org.nuxeo.lib.stream.computation.log.ComputationPool.lambda$start$0(ComputationPool.java:88) ~[nuxeo-stream-10.10-HF06.jar:?]
    at java.util.ArrayList.forEach(ArrayList.java:1257) ~[?:1.8.0_201]
    at org.nuxeo.lib.stream.computation.log.ComputationPool.start(ComputationPool.java:87) ~[nuxeo-stream-10.10-HF06.jar:?]
    at java.util.ArrayList.forEach(ArrayList.java:1257) ~[?:1.8.0_201]
    at org.nuxeo.lib.stream.computation.log.LogStreamProcessor.start(LogStreamProcessor.java:97) ~[nuxeo-stream-10.10-HF06.jar:?]
On Kafka, the consumer option auto.offset.reset is always set to earliest, so a consumer starts from the beginning when the committed position points to a deleted record.
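For comparison, this is how a plain Kafka consumer behaves with that option; the broker address, topic, and group id below are illustrative, not Nuxeo's actual configuration:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class EarliestResetExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-processor");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
        // When the committed offset points to a record already deleted by the
        // retention policy, the consumer rewinds to the earliest available
        // offset instead of failing.
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-stream"));
            ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofSeconds(1));
            // process records...
        }
    }
}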
On CQ, the consumer position needs to be reset manually, either with the stream.sh position command or by removing the CQ offset files on disk; a programmatic equivalent is sketched below.
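A minimal sketch of such a manual reset, assuming the nuxeo-stream 10.10-era API (ChronicleLogManager, LogTailer.reset()); exact signatures vary across versions, and the path, group, and log names are taken from the stack trace above for illustration only:

import java.nio.file.Paths;

import org.nuxeo.lib.stream.computation.Record;
import org.nuxeo.lib.stream.log.LogManager;
import org.nuxeo.lib.stream.log.LogTailer;
import org.nuxeo.lib.stream.log.chronicle.ChronicleLogManager;

public class ResetCQPosition {
    public static void main(String[] args) throws Exception {
        // Nuxeo must be stopped; the path points to the CQ storage of the
        // affected log (illustrative, matching the stack trace above).
        try (LogManager manager = new ChronicleLogManager(
                Paths.get("/opt/nuxeo-server-10.10-tomcat/nxserver/data/stream/audit"))) {
            try (LogTailer<Record> tailer = manager.createTailer("AuditLogWriter", "audit")) {
                // Forget the committed offsets for this group so the processor
                // restarts from the beginning of the retained records.
                tailer.reset();
            }
        }
    }
}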
This should be improved so that no further intervention is needed: an error should be logged and the consumer should start from the beginning, as Kafka does.
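A minimal sketch of that intended behavior, assuming the tailer API visible in the stack trace (toLastCommitted()) plus LogTailer.toStart(); the actual fix in nuxeo-stream may be implemented differently:

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.nuxeo.lib.stream.computation.Record;
import org.nuxeo.lib.stream.log.LogTailer;

public class TailerRecovery {
    private static final Log log = LogFactory.getLog(TailerRecovery.class);

    // Hypothetical helper, not the actual patch: when the committed offset is
    // no longer readable because retention deleted it, log the data loss and
    // rewind to the oldest retained record, like Kafka does with
    // auto.offset.reset=earliest.
    public static void toLastCommittedOrStart(LogTailer<Record> tailer) {
        try {
            tailer.toLastCommitted();
        } catch (IllegalStateException e) {
            log.error("Committed offset for " + tailer + " is beyond the retention"
                    + " period, records have been lost; restarting from the beginning", e);
            tailer.toStart();
        }
    }
}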
- Is related to: NXP-28020 Nuxeo still starts when ChronicleLogTailer fails to start (Resolved)