  1. Nuxeo Platform
  2. NXP-25400

Chronicle Queue retention conflict with offset tracker

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: 10.2
    • Fix Version/s: 10.10
    • Component/s: Streams
    • Tags:
    • Story Points: 3

      Description

      It appears, possibly since the Chronicle Queue (CQ) upgrade in NXP-25231, that if a consumer never commits its position, a conflict can occur on startup once the CQ retention has purged some data.

      When a consumer is created, a tailer is created and searches for the last committed position by reading an offset log in the backward direction.
      Because there is no committed position, it reads all the records, and if the purge has deleted the oldest cq4 file, an error is raised.
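
      The failure mode described above can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: `OffsetLogModel`, its methods, the (partition, offset) record layout, and the idea that the scan's lower bound is not updated by the purge are all hypothetical and do not reproduce the Chronicle Queue or Nuxeo Stream APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy model only: one in-memory "file" per daily cycle. Not the Chronicle Queue API.
class OffsetLogModel {
    // cycle -> records, each record is {partition, offset} (assumed layout)
    private final TreeMap<Integer, List<long[]>> files = new TreeMap<>();
    // Oldest cycle the log has ever seen; assumed NOT to be updated by the purge.
    private int oldestKnownCycle = Integer.MAX_VALUE;

    // Roll to a cycle, creating its (possibly empty) file.
    void roll(int cycle) {
        files.putIfAbsent(cycle, new ArrayList<>());
        oldestKnownCycle = Math.min(oldestKnownCycle, cycle);
    }

    // Append a committed offset for a partition in the given cycle.
    void commit(int cycle, int partition, long offset) {
        roll(cycle);
        files.get(cycle).add(new long[] {partition, offset});
    }

    // Retention: delete the oldest cycle file.
    void purgeOldest() {
        files.pollFirstEntry();
    }

    // Backward scan: newest record first, stop at the first record for the partition.
    long readLastCommittedOffset(int partition) {
        for (int cycle = files.lastKey(); cycle >= oldestKnownCycle; cycle--) {
            List<long[]> records = files.get(cycle);
            if (records == null) {
                throw new IllegalStateException("Expected file to exist for cycle: " + cycle);
            }
            for (int i = records.size() - 1; i >= 0; i--) {
                if (records.get(i)[0] == partition) {
                    return records.get(i)[1];
                }
            }
        }
        return 0L; // no commit found: start from the beginning
    }

    public static void main(String[] args) {
        OffsetLogModel log = new OffsetLogModel();
        for (int cycle = 17720; cycle <= 17724; cycle++) {
            log.roll(cycle);
        }
        log.commit(17723, 0, 42L); // partition 0 commits, partition 1 never does
        log.purgeOldest();         // retention removes cycle 17720
        System.out.println(log.readLastCommittedOffset(0)); // stops at cycle 17723
        try {
            log.readLastCommittedOffset(1); // scans back into the purged cycle
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

      In this model, a partition with a recent commit stops the backward scan early and never touches the purged cycle, whereas a partition that never committed walks all the way back and hits the missing file.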

      Example traceback:

      2018-07-12 10:30:36,961 ERROR [localhost-startStop-1] [org.nuxeo.osgi.OSGiAdapter] Error during Framework Listener execution : class org.nuxeo.runtime.osgi.OSGiRuntimeService
      java.lang.IllegalStateException: Expected file to exist for cycle: 17720, file: /var/lib/nuxeo/stream/bulk/counter/offset-bulkCounter/20180708.cq4.
      minCycle: 17721, maxCycle: 17724
      Available files: [20180711.cq4, 20180710.cq4, 20180709.cq4, 20180712.cq4]
          at net.openhft.chronicle.queue.impl.single.SingleChronicleQueue$StoreSupplier.nextCycle(SingleChronicleQueue.java:935)
          at net.openhft.chronicle.queue.impl.WireStorePool.nextCycle(WireStorePool.java:107)
          at net.openhft.chronicle.queue.impl.single.SingleChronicleQueue.nextCycle(SingleChronicleQueue.java:432)
          at net.openhft.chronicle.queue.impl.single.SingleChronicleQueueExcerpts$StoreTailer.nextIndexWithNextAvailableCycle0(SingleChronicleQueueExcerpts.java:1278)
          at net.openhft.chronicle.queue.impl.single.SingleChronicleQueueExcerpts$StoreTailer.nextIndexWithNextAvailableCycle(SingleChronicleQueueExcerpts.java:1234)
          at net.openhft.chronicle.queue.impl.single.SingleChronicleQueueExcerpts$StoreTailer.beyondStartOfCycleBackward(SingleChronicleQueueExcerpts.java:1110)
          at net.openhft.chronicle.queue.impl.single.SingleChronicleQueueExcerpts$StoreTailer.beyondStartOfCycle(SingleChronicleQueueExcerpts.java:1068)
          at net.openhft.chronicle.queue.impl.single.SingleChronicleQueueExcerpts$StoreTailer.next0(SingleChronicleQueueExcerpts.java:1033)
          at net.openhft.chronicle.queue.impl.single.SingleChronicleQueueExcerpts$StoreTailer.readingDocument(SingleChronicleQueueExcerpts.java:956)
          at net.openhft.chronicle.queue.impl.single.SingleChronicleQueueExcerpts$StoreTailer.readingDocument(SingleChronicleQueueExcerpts.java:891)
          at net.openhft.chronicle.wire.MarshallableIn.readBytes(MarshallableIn.java:63)
          at org.nuxeo.lib.stream.log.chronicle.ChronicleLogOffsetTracker.readLastCommittedOffset(ChronicleLogOffsetTracker.java:128)
          at org.nuxeo.lib.stream.log.chronicle.ChronicleLogOffsetTracker.getLastCommittedOffset(ChronicleLogOffsetTracker.java:109)
          at org.nuxeo.lib.stream.log.chronicle.ChronicleLogTailer.toLastCommitted(ChronicleLogTailer.java:171)
          at org.nuxeo.lib.stream.log.chronicle.ChronicleLogTailer.<init>(ChronicleLogTailer.java:83)
          at org.nuxeo.lib.stream.log.chronicle.ChronicleLogAppender.createTailer(ChronicleLogAppender.java:207)
          at org.nuxeo.lib.stream.log.chronicle.ChronicleLogManager.lambda$doCreateTailer$3(ChronicleLogManager.java:208)
          at java.util.ArrayList.forEach(ArrayList.java:1257)
          at org.nuxeo.lib.stream.log.chronicle.ChronicleLogManager.doCreateTailer(ChronicleLogManager.java:207)
          at org.nuxeo.lib.stream.log.internals.AbstractLogManager.createTailer(AbstractLogManager.java:96)
          at org.nuxeo.lib.stream.computation.log.ComputationRunner.<init>(ComputationRunner.java:117)
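
      To read the cycle numbers in the error message, it helps to map them to file names. The sketch below assumes a daily roll cycle in which the cycle number is the count of days since 1970-01-01, which is consistent with the values in the traceback; `CycleNames` and `fileNameForCycle` are hypothetical names, not part of Chronicle Queue.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

class CycleNames {
    // Assumes a daily roll cycle: cycle number == days since 1970-01-01.
    static String fileNameForCycle(int cycle) {
        return LocalDate.ofEpochDay(cycle).format(DateTimeFormatter.BASIC_ISO_DATE) + ".cq4";
    }

    public static void main(String[] args) {
        System.out.println(fileNameForCycle(17720)); // 20180708.cq4 (the purged file)
        System.out.println(fileNameForCycle(17721)); // 20180709.cq4 (minCycle)
        System.out.println(fileNameForCycle(17724)); // 20180712.cq4 (maxCycle)
    }
}
```

      Under that assumption, the tailer asked for cycle 17720, one day older than the oldest file still listed as available.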
      

      Restarting the computation (Nuxeo) fixes the problem because the purge has already been done.

      But restarting the next day raises the same problem.

      Note that so far this case does not occur in Nuxeo.
      The problem was visible in 10.2-SNAP because of an incomplete implementation of BAF, which is now deactivated in 10.2.

              People

              • Assignee: Unassigned
              • Reporter: Benoit Delbosc (bdelbosc)

                  Time Tracking

                  • Estimated: Not Specified
                  • Remaining: 0m
                  • Logged: 1h