Resync should not send operations without sequence number #40433

dnhatn · 2019-03-26T02:20:07Z

In a mixed cluster of 6.x and 5.6 nodes, a primary-replica resync can send operations without sequence numbers to a replica that has already processed operations with sequence numbers. The replica then fails, because writing resync operations without sequence numbers to the translog trips the sequence number assertion.
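For readers unfamiliar with the issue, here is a minimal sketch of the idea of skipping operations that carry no sequence number before they are handed to a resync. It is not the code from this PR: the `Op` class and the `opsToResync` helper are hypothetical stand-ins, and the constant only mirrors the `-2` sentinel that Elasticsearch uses for `UNASSIGNED_SEQ_NO`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ResyncFilterSketch {
    // Assumption: mirrors the sentinel for "no sequence number assigned".
    static final long UNASSIGNED_SEQ_NO = -2L;

    // Hypothetical stand-in for a translog operation; only the seqno matters here.
    static final class Op {
        final long seqNo;
        final String id;

        Op(long seqNo, String id) {
            this.seqNo = seqNo;
            this.id = id;
        }
    }

    // Keep only the operations that carry an assigned sequence number.
    static List<Op> opsToResync(List<Op> snapshot) {
        List<Op> kept = new ArrayList<>();
        for (Op op : snapshot) {
            if (op.seqNo != UNASSIGNED_SEQ_NO) {
                kept.add(op);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Op> snapshot = Arrays.asList(
            new Op(UNASSIGNED_SEQ_NO, "written-by-5.6"),
            new Op(0L, "doc-0"),
            new Op(1L, "doc-1"));
        for (Op op : opsToResync(snapshot)) {
            System.out.println(op.seqNo + " " + op.id);
        }
    }
}
```

In this sketch the operation written without a sequence number is dropped, so a replica that already tracks sequence numbers never receives it.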

elasticmachine · 2019-03-26T02:20:08Z

Pinging @elastic/es-distributed

server/src/main/java/org/elasticsearch/index/translog/Translog.java

bleskes · 2019-03-26T15:55:42Z

@dnhatn I think I'm missing something. The resync is in charge of verifying operations above the global checkpoint. We currently start syncing from the global checkpoint + 1, which seems to be negative only if the global checkpoint is unknown on the new primary. I think we talked before (although I don't remember the details) about bootstrapping the global checkpoint in these cases. I think the right solution is there (and also asserting we always have a starting sequence number for a resync). Is there any reason for you to go down the route of explicit filtering rather than checking the starting sequence number?
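The arithmetic behind this comment can be sketched as follows. This is an illustration only, not the resyncer's code: the method names are hypothetical, and the two constants are assumed to mirror the `-2` (`UNASSIGNED_SEQ_NO`) and `-1` (`NO_OPS_PERFORMED`) sentinels in Elasticsearch.

```java
final class ResyncStartSketch {
    // Assumption: mirror the sentinel values used by Elasticsearch.
    static final long UNASSIGNED_SEQ_NO = -2L;
    static final long NO_OPS_PERFORMED = -1L;

    // The resync starts right above the global checkpoint.
    static long startingSeqNo(long globalCheckpoint) {
        return globalCheckpoint + 1;
    }

    // Hypothetical check in the spirit of "always have a starting sequence number".
    static long checkedStartingSeqNo(long globalCheckpoint) {
        long start = startingSeqNo(globalCheckpoint);
        if (start < 0) {
            throw new IllegalStateException(
                "resync has no starting sequence number; global checkpoint=" + globalCheckpoint);
        }
        return start;
    }

    public static void main(String[] args) {
        System.out.println(startingSeqNo(NO_OPS_PERFORMED));  // known checkpoint, no ops yet
        System.out.println(startingSeqNo(UNASSIGNED_SEQ_NO)); // unknown checkpoint
    }
}
```

With a known checkpoint, even one where no operations have been performed, the start is non-negative; only the unassigned sentinel produces a negative start, which the checked variant rejects.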

dnhatn · 2019-03-26T16:52:28Z

@bleskes Thank you for looking :)

Bootstrapping the global checkpoint would work too - I can make that change. I did not have any preference for my implementation, merely followed the pattern we did in #27580. Moreover, if we have an assigned global checkpoint then we will never use operations without seqno; therefore I think we should exclude them from translog snapshot.

bleskes · 2019-03-27T11:20:16Z

> Bootstrapping the global checkpoint would work too - I can make that change. I did not have any preference for my implementation, merely followed the pattern we did in #27580

I'm not sure I follow the reference to the PR as it's about flushing, but that aside, the more "normal" we can make the situation, the better. Under normal operation (i.e., no BWC mode) we always have a global checkpoint, which is why I favor that direction.

ywelsch · 2019-03-29T15:21:26Z

There is another related issue here: https://discuss.elastic.co/t/extremely-large-translog-files-per-shard-in-elasticsearch-6-2-4/174041/5 as a gcp of UNASSIGNED_SEQ_NO can cause a resync to send the full translog, which can cause high loads during a rolling upgrade.
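To illustrate why an unassigned global checkpoint can turn a resync into a full translog replay, here is a small sketch under simplifying assumptions: every operation in the translog carries a sequence number, the resync selects everything at or above the global checkpoint + 1, and `opsSelected` is a hypothetical helper rather than anything in Elasticsearch.

```java
final class ResyncVolumeSketch {
    // Assumption: mirrors the sentinel for an unknown global checkpoint.
    static final long UNASSIGNED_SEQ_NO = -2L;

    // Count the operations a resync starting at (global checkpoint + 1) would select.
    static int opsSelected(long[] translogSeqNos, long globalCheckpoint) {
        long start = globalCheckpoint + 1;
        int selected = 0;
        for (long seqNo : translogSeqNos) {
            if (seqNo >= start) {
                selected++;
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        long[] translog = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
        System.out.println(opsSelected(translog, 7L));                // only the tail
        System.out.println(opsSelected(translog, UNASSIGNED_SEQ_NO)); // everything
    }
}
```

With a checkpoint of 7 only the two operations above it are selected, while the unassigned sentinel gives a start of -1 and selects all ten, which matches the load concern raised here.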

> Bootstrapping the global checkpoint would work too - I can make that change.

I'm a little uncomfortable making that change in 6.7.1, as it might have a broader impact, especially in the presence of 5.x indices with no sequence numbers. I wonder if we can have the more contained change here for now, to be backported to 6.7, and follow up with a more comprehensive investigation of getting rid of UNASSIGNED_SEQ_NO in various places, as we know that in 7.x all docs will have sequence numbers. WDYT @bleskes?

bleskes · 2019-03-29T15:25:30Z

> WDYT @bleskes?

+1

dnhatn · 2019-03-31T01:57:24Z

@bleskes and @ywelsch This is ready again. Can you please have another look? Thank you!

server/src/main/java/org/elasticsearch/index/shard/PrimaryReplicaSyncer.java

dnhatn · 2019-04-03T00:53:08Z

@ywelsch I pushed 61ad34d to use a weaker form which won't send operations without sequence numbers. Can you have another look? Thank you!

ywelsch

LGTM

server/src/main/java/org/elasticsearch/index/shard/PrimaryReplicaSyncer.java

dnhatn · 2019-04-04T02:16:51Z

Thanks @bleskes and @ywelsch.

Primary-replica resyncer should send only operations with seqno

15df15d

dnhatn added the >bug, WIP, :Distributed Indexing/Recovery, v7.0.0, v8.0.0, v7.2.0, and v6.7.1 labels Mar 26, 2019

dnhatn requested a review from bleskes March 26, 2019 02:20

dnhatn requested a review from ywelsch March 26, 2019 02:20

dnhatn commented Mar 26, 2019

server/src/main/java/org/elasticsearch/index/translog/Translog.java (outdated, resolved)

dnhatn added 2 commits March 26, 2019 03:57

add simple test for translog

182a4c7

assertion

5e11143

colings86 added v6.7.2 and removed v6.7.1 labels Mar 30, 2019

dnhatn added 3 commits March 30, 2019 14:07

Merge branch 'master' into resync-ops

f7fc025

Revert changes

d944dbb

don’t send any operation if gcp is still unassigned

07f66d6

dnhatn changed the title ~~Primary-replica resync should not send ops without seqno~~ Resync should not send ops if global checkpoint unassigned Mar 31, 2019

dnhatn removed the WIP label Mar 31, 2019

dnhatn requested review from bleskes and removed request for bleskes and ywelsch March 31, 2019 01:57

dnhatn requested a review from ywelsch March 31, 2019 01:57

always returns the mock snapshot

00c0007

ywelsch reviewed Apr 2, 2019

server/src/main/java/org/elasticsearch/index/shard/PrimaryReplicaSyncer.java (outdated, resolved)

dnhatn added 2 commits April 2, 2019 20:50

use weaker form

38195d8

Merge branch 'master' into resync-ops

61ad34d

dnhatn changed the title ~~Resync should not send ops if global checkpoint unassigned~~ Resync should not send operations without sequence number Apr 3, 2019

dnhatn requested a review from ywelsch April 3, 2019 00:53

ywelsch approved these changes Apr 3, 2019

server/src/main/java/org/elasticsearch/index/shard/PrimaryReplicaSyncer.java (outdated, resolved)

dnhatn added 2 commits April 3, 2019 18:48

Merge branch 'master' into resync-ops

84d4ebb

stale comment

874495c

dnhatn merged commit c737943 into elastic:master Apr 4, 2019

dnhatn deleted the resync-ops branch April 4, 2019 02:17

dnhatn added the backport pending label Apr 4, 2019

dnhatn mentioned this pull request Apr 5, 2019

Primary replica resync should not send ops without seqno #40881

Merged

dnhatn removed the backport pending label Apr 5, 2019

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021
