Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Fixed
Affects Version/s: 9.10-HF13
Component/s: Replication
Epic Link:

During a DR test with 9k+ documents, the secondary site received only 1.5k documents.
Since the Elasticsearch replication received all 9k+ documents, the problem appears to be on the source side: documents are being missed when reading the MongoDB oplog.
Here is the explanation of the problem:
1. We are in a processTimer call, consuming the MongoDB oplog.
2. A Kafka rebalance happens.
3. MongoDBComputation#init() is called.
4. The start timestamp of the query is updated to the last timestamp already committed to Kafka.
5. The current processTimer call finishes, consuming more oplog entries.
6. A new processTimer call starts with the updated query (TS > lastCommittedTimestamp) *but* reads the n-th page of that query.
7. We miss the first n pages.
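The sequence above can be reproduced with a simplified, in-memory model. This is a sketch under stated assumptions, not the actual MongoDBComputation code: the oplog is modeled as a list of increasing long timestamps, pagination is assumed to be skip(page * batchSize) / limit(batchSize) on the query "ts > startTs", and the names PagedOplogReader, startTs, page and lostAfterRebalance are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class OplogPagingBug {

    // Hypothetical, simplified stand-in for the paginated oplog query.
    static final class PagedOplogReader {
        private final List<Long> oplog; // increasing timestamps
        private final int batchSize;
        private long startTs = 0;       // query is "ts > startTs"
        private int page = 0;           // pages already consumed

        PagedOplogReader(List<Long> oplog, int batchSize) {
            this.oplog = oplog;
            this.batchSize = batchSize;
        }

        // Buggy init: moves the start timestamp but keeps the page counter.
        void init(long committedTs) {
            startTs = committedTs;
        }

        // Returns the next page: skip(page * batchSize).limit(batchSize).
        List<Long> nextPage() {
            List<Long> result = new ArrayList<>();
            long toSkip = (long) page * batchSize;
            for (long ts : oplog) {
                if (ts <= startTs) {
                    continue; // filtered out by "ts > startTs"
                }
                if (toSkip > 0) {
                    toSkip--; // skipped by the page offset
                    continue;
                }
                result.add(ts);
                if (result.size() == batchSize) {
                    break;
                }
            }
            page++;
            return result;
        }
    }

    // Counts oplog entries never returned when a rebalance happens after
    // two pages have been read and the last one has been committed.
    static int lostAfterRebalance(int batchSize) {
        List<Long> oplog = new ArrayList<>();
        for (long ts = 1; ts <= 6L * batchSize; ts++) {
            oplog.add(ts);
        }
        PagedOplogReader reader = new PagedOplogReader(oplog, batchSize);
        reader.nextPage();
        List<Long> second = reader.nextPage();
        long committed = second.get(second.size() - 1); // checkpoint
        reader.init(committed); // rebalance: startTs moves, page stays at 2
        long firstAfter = reader.nextPage().get(0);
        return (int) (firstAfter - committed - 1); // size of the gap
    }

    public static void main(String[] args) {
        System.out.println("lost records: " + lostAfterRebalance(500));
    }
}
```

Running main prints `lost records: 1000`, i.e. two pages of 500, which is the same size as the gap annotated in the server.log excerpt. The records disappear because the stale page offset is applied to a query whose lower bound already excludes everything that was committed.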
Here is the relevant part of server.log:
2018-07-29 22:29:12,264 DEBUG [MongoDBComputation] ProduceRecord at : Timestamp{seconds=1532903352, inc=12}
2018-07-29 22:29:12,267 DEBUG [MongoDBComputation] ProduceRecord at : Timestamp{seconds=1532903352, inc=13}
2018-07-29 22:29:12,267 DEBUG [MongoDBComputation] Committing after 500 documents to replicate, page: 1
2018-07-29 22:29:12,267 DEBUG [MongoDBComputation] CheckPoint at : Timestamp{seconds=1532903352, inc=13}
-------------------- From Kafka -------------------------------------------------------------------------
2018-07-29 22:29:13,406 INFO [GroupCoordinator 0]: Preparing to rebalance group nuxeo-mongodb-oplog with old generation 2 (__consumer_offsets-42) (kafka.coordinator.group.GroupCoordinator)
---------------------------------------------------------------------------------------------------------
2018-07-29 22:29:16,411 INFO [MongoDBComputation] Initializing MongoDBComputation
2018-07-29 22:29:16,411 INFO [MongoDBComputation] Fetching MonogDB start timestamp
2018-07-29 22:29:17,629 INFO [MongoDBComputation] Fetched MonogDB start timestamp: Timestamp{seconds=1532903352, inc=13}
    ^
    |   1000 lost records = 2 pages of batchSize
    v
2018-07-29 22:29:43,722 DEBUG [MongoDBComputation] ProduceRecord at : Timestamp{seconds=1532903383, inc=97}
2018-07-29 22:29:43,743 DEBUG [MongoDBComputation] ProduceRecord at : Timestamp{seconds=1532903383, inc=98}
2018-07-29 22:29:43,811 DEBUG [MongoDBComputation] ProduceRecord at : Timestamp{seconds=1532903383, inc=99}
The fix is to reset the page number when the computation is initialized (in MongoDBComputation#init()), together with the start timestamp.
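A minimal sketch of the fix, using a simplified in-memory model rather than the real MongoDBComputation: the oplog is assumed to be a list of increasing long timestamps, pagination is assumed to be skip(page * batchSize) / limit(batchSize) on "ts > startTs", and PagedOplogReader, startTs, page and lostAfterRebalance are hypothetical names. The point it illustrates is that init() resets the page counter together with the start timestamp, because a new lower bound means a new query.

```java
import java.util.ArrayList;
import java.util.List;

public class OplogPagingFix {

    // Hypothetical, simplified stand-in for the paginated oplog query.
    static final class PagedOplogReader {
        private final List<Long> oplog; // increasing timestamps
        private final int batchSize;
        private long startTs = 0;       // query is "ts > startTs"
        private int page = 0;           // pages consumed with the current query

        PagedOplogReader(List<Long> oplog, int batchSize) {
            this.oplog = oplog;
            this.batchSize = batchSize;
        }

        // Fixed init: a new start timestamp means a new query, so paging restarts.
        void init(long committedTs) {
            startTs = committedTs;
            page = 0;
        }

        // Returns the next page: skip(page * batchSize).limit(batchSize).
        List<Long> nextPage() {
            List<Long> result = new ArrayList<>();
            long toSkip = (long) page * batchSize;
            for (long ts : oplog) {
                if (ts <= startTs) {
                    continue; // filtered out by "ts > startTs"
                }
                if (toSkip > 0) {
                    toSkip--; // skipped by the page offset
                    continue;
                }
                result.add(ts);
                if (result.size() == batchSize) {
                    break;
                }
            }
            page++;
            return result;
        }
    }

    // Counts oplog entries skipped when a rebalance happens after two pages
    // have been read and the last one has been committed.
    static int lostAfterRebalance(int batchSize) {
        List<Long> oplog = new ArrayList<>();
        for (long ts = 1; ts <= 6L * batchSize; ts++) {
            oplog.add(ts);
        }
        PagedOplogReader reader = new PagedOplogReader(oplog, batchSize);
        reader.nextPage();
        List<Long> second = reader.nextPage();
        long committed = second.get(second.size() - 1); // checkpoint
        reader.init(committed); // rebalance: startTs moves, page back to 0
        long firstAfter = reader.nextPage().get(0);
        return (int) (firstAfter - committed - 1); // size of the gap
    }

    public static void main(String[] args) {
        System.out.println("lost records: " + lostAfterRebalance(500));
    }
}
```

With the page counter reset, the first record read after the rebalance is the one immediately following the committed checkpoint (1001 for a batch size of 500), so main prints `lost records: 0`.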