  1. Nuxeo Platform
  2. NXP-28077

CQ Ease processor recovery after stream retention period is exhausted

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 9.10, 10.10
    • Fix Version/s: 10.10-HF17, 11.1, 2021.0
    • Component/s: Streams

      Description

      The recovery procedure for the following case needs to be improved:

      • On day D, a processor fails and stops.
      • Producers continue to append records.

      Despite the errors reported in logs, metrics (NXP-27471), or probes (NXP-27164), nothing is done during the stream retention period, which by default is 4 days for CQ and 7 days for Kafka.

      Once the retention period has elapsed, we start losing data. To recover from this situation, we first fix the cause of the failure (disk full, service down ...), then do a rolling restart of the Nuxeo instance in order to restart the processor.

      But because the retention period is exhausted, the last persisted position (committed offset) is no longer valid: the records for day D have been deleted by the retention policy.
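
To make the timing concrete, here is a minimal sketch of the retention arithmetic, using the default periods quoted above. The class `RetentionCheck` and its method are hypothetical and not part of nuxeo-stream; the real backends delete data at their own granularity, so treat this as an approximation only.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical helper, not part of nuxeo-stream. */
final class RetentionCheck {

    // Default retention periods as stated in this issue.
    static final Duration CQ_RETENTION = Duration.ofDays(4);
    static final Duration KAFKA_RETENTION = Duration.ofDays(7);

    /** True when a record appended at recordTime is older than the retention window at time now. */
    static boolean isExpired(Instant recordTime, Instant now, Duration retention) {
        return Duration.between(recordTime, now).compareTo(retention) > 0;
    }

    public static void main(String[] args) {
        // Illustrative dates only: the processor stops on day D and is restarted 5 days later.
        Instant dayD = Instant.parse("2019-09-01T00:00:00Z");
        Instant restart = dayD.plus(Duration.ofDays(5));
        System.out.println("CQ record expired: " + isExpired(dayD, restart, CQ_RETENTION));
        System.out.println("Kafka record expired: " + isExpired(dayD, restart, KAFKA_RETENTION));
    }
}
```

With a restart 5 days after day D, the record behind the committed offset is already gone under the CQ default but would still be available under the Kafka default.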

      On CQ, this raises an error because it is impossible to move to the last committed offset, which prevents Nuxeo from starting properly (NXP-28020):

      ERROR [main] [org.nuxeo.osgi.OSGiAdapter] Error during Framework Listener execution : class org.nuxeo.runtime.osgi.OSGiRuntimeService
      java.lang.IllegalStateException: Unable to move to the last committed offset, ChronicleLogTailer{basePath='/opt/nuxeo-server-10.10-tomcat/nxserver/data/stream/audit/audit', id=AuditLogWriter:audit-00, closed=false, codec=org.nuxeo.lib.stream.codec.NoCodec@43165282} offset: 77584289235576
      	at org.nuxeo.lib.stream.log.chronicle.ChronicleLogTailer.toLastCommitted(ChronicleLogTailer.java:175) ~[nuxeo-stream-10.10-HF06.jar:?]
      	at org.nuxeo.lib.stream.log.chronicle.ChronicleLogTailer.<init>(ChronicleLogTailer.java:82) ~[nuxeo-stream-10.10-HF06.jar:?]
      	at org.nuxeo.lib.stream.log.chronicle.ChronicleLogAppender.createTailer(ChronicleLogAppender.java:315) ~[nuxeo-stream-10.10-HF06.jar:?]
      	at org.nuxeo.lib.stream.log.chronicle.ChronicleLogManager.lambda$doCreateTailer$3(ChronicleLogManager.java:215) ~[nuxeo-stream-10.10-HF06.jar:?]
      	at java.util.ArrayList.forEach(ArrayList.java:1257) ~[?:1.8.0_201]
      	at org.nuxeo.lib.stream.log.chronicle.ChronicleLogManager.doCreateTailer(ChronicleLogManager.java:214) ~[nuxeo-stream-10.10-HF06.jar:?]
      	at org.nuxeo.lib.stream.log.internals.AbstractLogManager.createTailer(AbstractLogManager.java:96) ~[nuxeo-stream-10.10-HF06.jar:?]
      	at org.nuxeo.lib.stream.computation.log.LogStreamManager.createTailer(LogStreamManager.java:117) ~[nuxeo-stream-10.10-HF06.jar:?]
      	at org.nuxeo.lib.stream.computation.log.ComputationRunner.<init>(ComputationRunner.java:113) ~[nuxeo-stream-10.10-HF06.jar:?]
      	at org.nuxeo.lib.stream.computation.log.ComputationPool.lambda$start$0(ComputationPool.java:88) ~[nuxeo-stream-10.10-HF06.jar:?]
      	at java.util.ArrayList.forEach(ArrayList.java:1257) ~[?:1.8.0_201]
      	at org.nuxeo.lib.stream.computation.log.ComputationPool.start(ComputationPool.java:87) ~[nuxeo-stream-10.10-HF06.jar:?]
      	at java.util.ArrayList.forEach(ArrayList.java:1257) ~[?:1.8.0_201]
      	at org.nuxeo.lib.stream.computation.log.LogStreamProcessor.start(LogStreamProcessor.java:97) ~[nuxeo-stream-10.10-HF06.jar:?]
      

      On Kafka, the consumer option auto.offset.reset is always set to earliest, so a consumer starts from the beginning when the committed position points to a deleted record.
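
For reference, the Kafka behaviour described above comes down to a single consumer property. The sketch below builds a plain `java.util.Properties` object with standard Kafka consumer keys; the class name, the broker address, and the group id are placeholders, and this is not how Nuxeo itself assembles its configuration.

```java
import java.util.Properties;

/** Hypothetical example, not the Nuxeo configuration code. */
final class ConsumerDefaults {

    /** Builds consumer properties; only auto.offset.reset reflects what this issue states. */
    static Properties consumerProperties(String bootstrapServers, String groupId) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        props.setProperty("group.id", groupId);
        // When the committed offset no longer exists on the broker,
        // restart from the earliest record still available.
        props.setProperty("auto.offset.reset", "earliest");
        return props;
    }

    public static void main(String[] args) {
        // Placeholder values for illustration.
        Properties props = consumerProperties("localhost:9092", "example-group");
        System.out.println(props.getProperty("auto.offset.reset"));
    }
}
```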

      On CQ, the consumer position needs to be reset manually, either with the stream.sh position command or by removing the CQ offset files on disk.

      This should be improved so that no additional intervention is needed: an error should be logged and the consumer should start from the beginning (as Kafka does).
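
The requested behaviour can be sketched as follows. This is only an illustration of the idea, not the actual fix: the `Tailer` interface and the `restore` method are hypothetical and do not match the nuxeo-stream API.

```java
import java.util.logging.Logger;

/** Hypothetical sketch of the requested fallback, not the nuxeo-stream implementation. */
final class TailerRecovery {

    private static final Logger LOG = Logger.getLogger(TailerRecovery.class.getName());

    /** Hypothetical minimal tailer abstraction. */
    interface Tailer {
        /** Returns false when the offset no longer exists, for example after retention cleanup. */
        boolean moveTo(long offset);

        /** Positions the tailer on the oldest record still available. */
        void toStart();
    }

    /** Returns true when the committed offset was usable, false when the tailer fell back to the start. */
    static boolean restore(Tailer tailer, long committedOffset) {
        if (tailer.moveTo(committedOffset)) {
            return true;
        }
        // Log loudly instead of throwing, so that startup is not blocked.
        LOG.severe("Unable to move to the last committed offset " + committedOffset
                + ", records may have been lost; restarting from the beginning");
        tailer.toStart();
        return false;
    }
}
```

The point of the design is that a stale offset becomes a logged error rather than an exception thrown during startup, which matches what the Kafka backend already does.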

    Time Tracking

    • Original Estimate: Not Specified
    • Remaining Estimate: 0m
    • Time Spent: 3h