[NXP-25496] MongoDB replication is missing some documents - Nuxeo Issue Tracker

Details

Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Fixed
Affects Version/s: 9.10-HF13
Fix Version/s: 9.10-HF14, 10.3
Component/s: Replication

Epic Link: HA/DR Nuxeo deployment on Openshift

Description

In a DR test with more than 9k documents, the secondary site received only 1.5k documents.
Since the Elasticsearch replication received all 9k+ documents, the problem appears to be on the source side of the MongoDB replication, which misses documents.

Here is the explanation of the problem:

1. A processTimer call is in progress, consuming the MongoDB oplog.
2. A Kafka rebalance happens.
3. MongoDBComputation#init() is called.
4. The start timestamp of the query is reset to the last timestamp committed to Kafka.
5. The in-progress processTimer call finishes, consuming more oplog entries.
6. A new processTimer call starts with the updated query (TS > lastCommitedTimestamp), *but* it resumes at the n-th page of that query because the page number was not reset.
7. The first n pages of the updated query are skipped, so those documents are never replicated.

Here is the relevant part of server.log:

2018-07-29 22:29:12,264 DEBUG [MongoDBComputation] ProduceRecord at : Timestamp{seconds=1532903352, inc=12}
2018-07-29 22:29:12,267 DEBUG [MongoDBComputation] ProduceRecord at : Timestamp{seconds=1532903352, inc=13}
2018-07-29 22:29:12,267 DEBUG [MongoDBComputation] Committing after 500 documents to replicate, page: 1
2018-07-29 22:29:12,267 DEBUG [MongoDBComputation] CheckPoint at : Timestamp{seconds=1532903352, inc=13}
-------------------- From Kafka -------------------------------------------------------------------------
2018-07-29 22:29:13,406 INFO [GroupCoordinator 0]: Preparing to rebalance group nuxeo-mongodb-oplog with old generation 2 (__consumer_offsets-42) (kafka.coordinator.group.GroupCoordinator)
---------------------------------------------------------------------------------------------------------
2018-07-29 22:29:16,411 INFO  [MongoDBComputation] Initializing MongoDBComputation
2018-07-29 22:29:16,411 INFO  [MongoDBComputation] Fetching MonogDB start timestamp
2018-07-29 22:29:17,629 INFO  [MongoDBComputation] Fetched MonogDB start timestamp: Timestamp{seconds=1532903352, inc=13}

            ^
            |  1000 lost records = 2 pages of batchSize
            v


2018-07-29 22:29:43,722 DEBUG [MongoDBComputation] ProduceRecord at : Timestamp{seconds=1532903383, inc=97}
2018-07-29 22:29:43,743 DEBUG [MongoDBComputation] ProduceRecord at : Timestamp{seconds=1532903383, inc=98}
2018-07-29 22:29:43,811 DEBUG [MongoDBComputation] ProduceRecord at : Timestamp{seconds=1532903383, inc=99}

The fix is to reset the page number when the computation is initialized, so that a query restarted after a rebalance begins at its first page.
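
The failure mode and the effect of resetting the page number can be illustrated with a small, self-contained simulation. This is only a sketch under simplifying assumptions, not the actual Nuxeo code: the class `OplogPagingSketch`, the `PagedReader` helper, the `resetPageOnInit` flag and the in-memory list standing in for the oplog are all hypothetical, and the rebalance is simulated by a second call to `init()` between two pages rather than in the middle of a processTimer call. The batch size of 500 is taken from the log excerpt above.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative simulation only; none of these names come from the Nuxeo code base.
final class OplogPagingSketch {

    /** Hypothetical paged reader over an in-memory "oplog" of ascending timestamps. */
    static final class PagedReader {
        private final List<Integer> oplog;
        private final int batchSize;
        private final boolean resetPageOnInit;
        private int startTs; // the query selects entries with ts > startTs
        private int page;    // number of pages of the current query already read

        PagedReader(List<Integer> oplog, int batchSize, boolean resetPageOnInit) {
            this.oplog = oplog;
            this.batchSize = batchSize;
            this.resetPageOnInit = resetPageOnInit;
        }

        /** Called at startup and again after a (simulated) rebalance. */
        void init(int lastCommittedTs) {
            startTs = lastCommittedTs;
            if (resetPageOnInit) {
                page = 0; // the fix: a new query must start at its first page
            }
        }

        /** Returns the next page of the current query (empty when exhausted). */
        List<Integer> nextPage() {
            List<Integer> matching = new ArrayList<>();
            for (int ts : oplog) {
                if (ts > startTs) {
                    matching.add(ts);
                }
            }
            int from = Math.min(page * batchSize, matching.size());
            int to = Math.min(from + batchSize, matching.size());
            page++;
            return new ArrayList<>(matching.subList(from, to));
        }
    }

    /** Replicates 9000 entries, with one rebalance after two committed pages. */
    static int replicate(boolean resetPageOnInit) {
        List<Integer> oplog = new ArrayList<>();
        for (int ts = 1; ts <= 9000; ts++) {
            oplog.add(ts);
        }
        PagedReader reader = new PagedReader(oplog, 500, resetPageOnInit);
        reader.init(0);

        int produced = 0;
        int lastCommittedTs = 0;
        for (int i = 0; i < 2; i++) {
            List<Integer> batch = reader.nextPage();
            produced += batch.size();
            lastCommittedTs = batch.get(batch.size() - 1);
        }

        reader.init(lastCommittedTs); // simulated rebalance: query restarts here

        List<Integer> batch = reader.nextPage();
        while (!batch.isEmpty()) {
            produced += batch.size();
            batch = reader.nextPage();
        }
        return produced;
    }
}
```

In this simulation, `OplogPagingSketch.replicate(false)` returns 8000: the restarted query resumes at page 2, so two pages of 500 entries are skipped. `OplogPagingSketch.replicate(true)` returns 9000, because the restarted query is read from its first page.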

People

Assignee: Damien Metzler
Reporter: Damien Metzler
Participants: Anahide Tchertchian, Damien Metzler, Jenkins

Votes: 0
Watchers: 3

Dates

Created: 2018-07-27 07:58
Updated: 2018-10-08 12:33
Resolved: 2018-10-08 12:33