  1. Nuxeo Platform
  2. NXP-28077

CQ Ease processor recovery after stream retention period is exhausted



    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 9.10, 10.10
    • Fix Version/s: 10.10-HF17, 11.1, 2021.0
    • Component/s: Streams


      The recovery procedure needs to be improved in the following case:

      • On day D, a processor fails and stops.
      • Producers continue to append records.

      Despite the errors in logs, metrics (NXP-27471), or probes (NXP-27164), nothing is done during the stream retention period, which by default is 4 days for CQ and 7 days for Kafka.

      We then start losing data. To recover from this situation, we first fix the cause of the failure (disk full, service down, ...),
      then we do a rolling restart of the Nuxeo instance in order to restart the processor.

      But because the retention period is exhausted, the last persisted position (committed offset) is no longer valid: the records for day D have been deleted by the retention policy.
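
      The timing behind this can be sketched as follows. This is only an illustration using the default retention periods quoted above; the class RetentionCheck, the method isExpired and the example date are hypothetical and not part of Nuxeo, and real retention works on whole files or segments rather than on individual records.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper, not part of Nuxeo: illustrates why a committed offset
// can point to deleted records once the retention period is exhausted.
final class RetentionCheck {
    // Default retention periods as stated in this issue.
    static final Duration CQ_RETENTION = Duration.ofDays(4);
    static final Duration KAFKA_RETENTION = Duration.ofDays(7);

    /** Returns true when a record appended at recordTime is older than the retention at time now. */
    static boolean isExpired(Instant recordTime, Instant now, Duration retention) {
        return Duration.between(recordTime, now).compareTo(retention) > 0;
    }

    public static void main(String[] args) {
        Instant dayD = Instant.parse("2019-09-01T00:00:00Z"); // arbitrary example date
        Instant restart = dayD.plus(Duration.ofDays(5));      // processor restarted 5 days later
        System.out.println("CQ record from day D expired: "
                + isExpired(dayD, restart, CQ_RETENTION));
        System.out.println("Kafka record from day D expired: "
                + isExpired(dayD, restart, KAFKA_RETENTION));
    }
}
```

      With a restart 5 days after day D, the record is past the 4-day CQ default but still within the 7-day Kafka default.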

      On CQ, this raises an error because it is impossible to move to the last committed offset, which prevents Nuxeo from starting properly (NXP-28020):

      ERROR [main] [org.nuxeo.osgi.OSGiAdapter] Error during Framework Listener execution : class org.nuxeo.runtime.osgi.OSGiRuntimeService
      java.lang.IllegalStateException: Unable to move to the last committed offset, ChronicleLogTailer{basePath='/opt/nuxeo-server-10.10-tomcat/nxserver/data/stream/audit/audit', id=AuditLogWriter:audit-00, closed=false, codec=org.nuxeo.lib.stream.codec.NoCodec@43165282} offset: 77584289235576
      	at org.nuxeo.lib.stream.log.chronicle.ChronicleLogTailer.toLastCommitted(ChronicleLogTailer.java:175) ~[nuxeo-stream-10.10-HF06.jar:?]
      	at org.nuxeo.lib.stream.log.chronicle.ChronicleLogTailer.<init>(ChronicleLogTailer.java:82) ~[nuxeo-stream-10.10-HF06.jar:?]
      	at org.nuxeo.lib.stream.log.chronicle.ChronicleLogAppender.createTailer(ChronicleLogAppender.java:315) ~[nuxeo-stream-10.10-HF06.jar:?]
      	at org.nuxeo.lib.stream.log.chronicle.ChronicleLogManager.lambda$doCreateTailer$3(ChronicleLogManager.java:215) ~[nuxeo-stream-10.10-HF06.jar:?]
      	at java.util.ArrayList.forEach(ArrayList.java:1257) ~[?:1.8.0_201]
      	at org.nuxeo.lib.stream.log.chronicle.ChronicleLogManager.doCreateTailer(ChronicleLogManager.java:214) ~[nuxeo-stream-10.10-HF06.jar:?]
      	at org.nuxeo.lib.stream.log.internals.AbstractLogManager.createTailer(AbstractLogManager.java:96) ~[nuxeo-stream-10.10-HF06.jar:?]
      	at org.nuxeo.lib.stream.computation.log.LogStreamManager.createTailer(LogStreamManager.java:117) ~[nuxeo-stream-10.10-HF06.jar:?]
      	at org.nuxeo.lib.stream.computation.log.ComputationRunner.<init>(ComputationRunner.java:113) ~[nuxeo-stream-10.10-HF06.jar:?]
      	at org.nuxeo.lib.stream.computation.log.ComputationPool.lambda$start$0(ComputationPool.java:88) ~[nuxeo-stream-10.10-HF06.jar:?]
      	at java.util.ArrayList.forEach(ArrayList.java:1257) ~[?:1.8.0_201]
      	at org.nuxeo.lib.stream.computation.log.ComputationPool.start(ComputationPool.java:87) ~[nuxeo-stream-10.10-HF06.jar:?]
      	at java.util.ArrayList.forEach(ArrayList.java:1257) ~[?:1.8.0_201]
      	at org.nuxeo.lib.stream.computation.log.LogStreamProcessor.start(LogStreamProcessor.java:97) ~[nuxeo-stream-10.10-HF06.jar:?]

      On Kafka, the consumer option auto.offset.reset is always set to earliest, so a consumer starts from the beginning when the committed position points to a deleted record.
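
      For reference, a minimal sketch of what such a consumer configuration looks like when built by hand. The property keys are standard Kafka consumer settings; the class name, the broker address and the group name are placeholders, and how Nuxeo sets this option internally is not shown here.

```java
import java.util.Properties;

// Hypothetical sketch: builds plain Kafka consumer properties.
// It does not reproduce how Nuxeo configures its consumers.
final class KafkaConsumerConfigSketch {

    static Properties consumerProperties(String bootstrapServers, String groupId) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        props.setProperty("group.id", groupId);
        // When the committed offset no longer exists on the broker,
        // restart from the earliest available record instead of failing.
        props.setProperty("auto.offset.reset", "earliest");
        return props;
    }

    public static void main(String[] args) {
        // Placeholder values for illustration only.
        Properties props = consumerProperties("localhost:9092", "example-group");
        System.out.println(props.getProperty("auto.offset.reset"));
    }
}
```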

      On CQ, the consumer position needs to be reset manually, either with the stream.sh position command or by removing the CQ offset files on disk.

      This should be improved so that no further intervention is needed: an error should be logged and the consumer should start from the beginning (as Kafka does).
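
      The requested behavior could look roughly like the sketch below. It is an assumption about the shape of a fix, not the actual change made for this issue: the Tailer interface is a hypothetical stand-in with only the two operations needed here, and positionTailer is an invented name.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch of the fallback described above; not the Nuxeo implementation.
final class TailerRecoverySketch {
    private static final Logger LOG = Logger.getLogger(TailerRecoverySketch.class.getName());

    /** Hypothetical minimal tailer contract used only for this illustration. */
    interface Tailer {
        /** Moves to the last committed offset; throws IllegalStateException if it no longer exists. */
        void toLastCommitted();

        /** Moves to the first record still available. */
        void toStart();
    }

    /**
     * Positions the tailer on its committed offset, or on the beginning of the
     * stream when that offset has been deleted by the retention policy.
     * Returns true if the committed offset was usable, false if the fallback was taken.
     */
    static boolean positionTailer(Tailer tailer) {
        try {
            tailer.toLastCommitted();
            return true;
        } catch (IllegalStateException e) {
            // Records were lost: report it loudly, but do not block startup.
            LOG.log(Level.SEVERE,
                    "Committed offset is no longer available, restarting from the beginning", e);
            tailer.toStart();
            return false;
        }
    }

    public static void main(String[] args) {
        // Simulated tailer whose committed offset has been purged.
        Tailer purged = new Tailer() {
            @Override
            public void toLastCommitted() {
                throw new IllegalStateException("Unable to move to the last committed offset");
            }

            @Override
            public void toStart() {
                System.out.println("Tailer moved to the start of the stream");
            }
        };
        System.out.println("Committed offset usable: " + positionTailer(purged));
    }
}
```

      Logging at error level keeps the data loss visible while letting the processor restart without a manual position reset.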


                  Time Tracking

                  Original Estimate - Not Specified
                  Not Specified
                  Remaining Estimate - 0 minutes
                  Time Spent - 3 hours