
Conversation

@Tim-Brooks
Contributor

Currently a failed replication action will fail an entire replica. This includes when replication fails due to potentially short-lived transient issues such as network disruptions or circuit breaking errors.

This commit implements retries using the retryable action.

Currently a failed replication action will fail an entire replica. This includes when replication fails due to potentially short-lived transient issues such as network disruptions or circuit breaking errors.

This commit adds the concept of a retryable action. A retryable action will be retried in the face of certain errors. The action will be retried after an exponentially increasing backoff period. After a defined period of time, the action will time out.
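
For readers skimming the thread, here is a minimal, self-contained sketch of the retry-with-backoff idea described above. It is not the PR's actual RetryableAction class; the names, the scheduler, and the backoff/timeout handling are illustrative assumptions only.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;
import java.util.function.Predicate;

// Illustrative sketch only: retry an action on retryable failures with an
// exponentially increasing backoff, and give up once an overall timeout is hit.
final class SimpleRetryableAction<T> {
    private final ScheduledExecutorService scheduler;
    private final Callable<T> action;                // the work to attempt
    private final Predicate<Exception> isRetryable;  // e.g. circuit breaking or connection errors
    private final Consumer<T> onResponse;
    private final Consumer<Exception> onFailure;
    private final long initialDelayMillis;
    private final long timeoutMillis;
    private final long startNanos = System.nanoTime();

    SimpleRetryableAction(ScheduledExecutorService scheduler, Callable<T> action,
                          Predicate<Exception> isRetryable, Consumer<T> onResponse,
                          Consumer<Exception> onFailure, long initialDelayMillis, long timeoutMillis) {
        this.scheduler = scheduler;
        this.action = action;
        this.isRetryable = isRetryable;
        this.onResponse = onResponse;
        this.onFailure = onFailure;
        this.initialDelayMillis = initialDelayMillis;
        this.timeoutMillis = timeoutMillis;
    }

    void run() {
        attempt(initialDelayMillis);
    }

    private void attempt(long nextDelayMillis) {
        try {
            onResponse.accept(action.call());
        } catch (Exception e) {
            long elapsedMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
            if (isRetryable.test(e) == false || elapsedMillis + nextDelayMillis > timeoutMillis) {
                onFailure.accept(e); // non-retryable error, or the overall timeout would be exceeded
            } else {
                // exponential backoff: double the delay before each subsequent attempt
                scheduler.schedule(() -> attempt(nextDelayMillis * 2), nextDelayMillis, TimeUnit.MILLISECONDS);
            }
        }
    }
}
```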
@Tim-Brooks Tim-Brooks added >non-issue :Distributed Indexing/CRUD A catch all label for issues around indexing, updating and getting a doc by id. Not search. v8.0.0 v7.8.0 labels Apr 22, 2020
@Tim-Brooks Tim-Brooks requested a review from ywelsch April 22, 2020 22:53
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/CRUD)

@Tim-Brooks Tim-Brooks added the WIP label Apr 23, 2020
@Tim-Brooks
Contributor Author

Hey @ywelsch, this is a POC for handling the replication group logic. I know that we discussed the cluster state listener approach, but as I was working on that I felt like I was duplicating the ReplicationTracker logic. So I wanted to first try an approach that reuses the ReplicationTracker logic.

Contributor

@ywelsch ywelsch left a comment


Thanks Tim. Going with the replication group listener is a viable option, I think. I have left some comments on the structure and on ways to possibly simplify things a bit.


private final Map<String, Map<Object, RetryableAction<?>>> onGoingReplicationActions = ConcurrentCollections.newConcurrentMap();

public void addPendingAction(String nodeId, Object actionKey, RetryableAction<?> replicationAction) {
Contributor


should we just use the RetryableAction as key?
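
As a rough illustration of this suggestion (not code from the PR), keying the per-node tracking by the action object itself could look roughly like this, with Object standing in for RetryableAction<?>; all names here are assumptions:

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: track in-flight actions per node in a set, so the action itself is
// the key and no separate actionKey object is needed. Names are illustrative.
final class PendingActionsSketch {
    private final Map<String, Set<Object>> onGoingReplicationActions = new ConcurrentHashMap<>();

    void addPendingAction(String nodeId, Object replicationAction) {
        onGoingReplicationActions
            .computeIfAbsent(nodeId, n -> ConcurrentHashMap.newKeySet())
            .add(replicationAction);
    }

    void removePendingAction(String nodeId, Object replicationAction) {
        Set<Object> actions = onGoingReplicationActions.get(nodeId);
        if (actions != null) {
            actions.remove(replicationAction);
        }
    }
}
```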

}
}
}
this.replicationGroup = newReplicationGroup;
Contributor


I think that I would prefer for this to be set before we call the methods on pendingReplication, just to be sure that an exception there does not mess up the state in this class.

Contributor Author

@Tim-Brooks Tim-Brooks Apr 23, 2020


This was done to prevent the race where the ReplicationTracker returns a ReplicationGroup containing a new node to a ReplicationOperation. The operation then attempts to start a replication request, but because the listener on PendingReplicationActions has not yet been called, the request is immediately cancelled.

I made your change, but I think spurious cancellations are possible.

Contributor


Ok, I see now. One option is to turn it around then, but use try/finally to make sure that the state is updated even when the listener throws.

Another option (currently my preferred one, for its generality) is to have a versioning concept on ReplicationGroup (i.e. knowing which one is newer than another) and to explicitly update PendingReplicationActions whenever we capture the ReplicationGroup in IndexShard (most of the time the update will be a no-op and should not need any locking on PendingReplicationActions).
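
A rough sketch of the versioning idea, under the assumption that ReplicationGroup carries a monotonically increasing version; the class name and double-checked pattern here are illustrative, not the PR's final code:

```java
// Sketch: only apply a ReplicationGroup update if it is newer than the last
// one applied, so stale or out-of-order notifications become no-ops.
final class VersionedGroupListener {
    private volatile long appliedVersion = -1;

    void accept(long incomingVersion, Runnable applyUpdate) {
        if (incomingVersion > appliedVersion) {
            synchronized (this) {
                // re-check under the lock: another thread may have applied a newer group meanwhile
                if (incomingVersion > appliedVersion) {
                    appliedVersion = incomingVersion;
                    applyUpdate.run();
                }
            }
        }
    }
}
```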

@Tim-Brooks Tim-Brooks requested a review from ywelsch April 23, 2020 23:41
@Tim-Brooks
Contributor Author

@ywelsch - As an FYI I plan to add tests once we decide on the specific approach.

Contributor

@ywelsch ywelsch left a comment


I see two options:

  • tracking replication group changes during the whole replication (i.e. from beginning where we send replication request first time to eventually successfully/unsuccessfully completing after X retries). The advantage of this approach is that we can react more quickly to replication group changes (and don't even need to wait for the respective replication request to complete before we can reach a conclusion about its success). The disadvantage is that we will have to track every outgoing request in the system, which comes with a certain overhead (listener registration / condition checking / deregistration).
  • tracking replication group changes only while retries are locally scheduled. As soon as the replication request is resent, we would remove the replication group tracking (and re-add it before rescheduling). The advantage of this approach is that no tracking and listener registration would be required under normal operation when there are no retries. The disadvantage is that we are less reactive to nodes that might have long GCs and which the master has already failed (i.e. thrown out of the replication group).

Currently I'm leaning towards option 1, if we can make it fast (i.e. keep the overhead of the listener registration minimal). Let's discuss this when you're online.
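
A minimal sketch of what option 1's per-request tracking could look like, assuming a register-before-send / deregister-on-completion scheme; every name here is illustrative rather than the PR's code:

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of option 1: every outgoing replica request is registered before it is
// sent and deregistered when it completes, so a replication group change can
// cancel it at any point in its lifetime.
final class TrackedReplicaRequests {
    private final Map<String, Set<Object>> pendingByNode = new ConcurrentHashMap<>();

    void performOnReplica(String nodeId, Runnable sendRequest) {
        Object handle = new Object(); // unique handle for this request
        pendingByNode.computeIfAbsent(nodeId, n -> ConcurrentHashMap.newKeySet()).add(handle);
        try {
            sendRequest.run();
        } finally {
            // in real code deregistration would happen in the response/failure listener;
            // it is done synchronously here only to keep the sketch short
            Set<Object> handles = pendingByNode.get(nodeId);
            if (handles != null) {
                handles.remove(handle);
            }
        }
    }

    // called when a node drops out of the replication group; real code would
    // also cancel/fail each tracked action rather than just forget it
    void cancelAllFor(String nodeId) {
        pendingByNode.remove(nodeId);
    }
}
```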

@Tim-Brooks
Contributor Author

@ywelsch @dnhatn

One test that is failing here is PrimaryAllocationIT#testPrimaryReplicaResyncFailed. This is my understanding of the test:

  1. We index some stuff
  2. Purposefully create a partition between two replicas
  3. Kill the primary
  4. Check that one of the replicas is failed due to resync failing.

This test is now failing intermittently because the assertBusy times out after 1 minute, but our retries for the resync also time out after 1 minute (defined by the timeout for the request, which is the default). This raises the question of what we want the timeout to be. I set it to the request timeout, which is configurable for external bulk shard requests. But I assume this is not configurable for internal operations? Do we want some type of setting like we introduced for peer recovery?

I could fix the test by introducing a different, non-retryable error as the failure reason. But I assume the timeout conversation is something we wanted to have anyway.

@Tim-Brooks
Contributor Author

@ywelsch - In the most recent test run I got a test failure related to an issue I described here.

  1. ReplicationTracker gets updated with a new replication group
  2. Different thread performs a ReplicationOperation and gets that replication group.
  3. Operation calls PendingReplicationActions#addPendingAction, but the operation is immediately cancelled because the PendingReplicationActions#accept method has not been called.
  4. PendingReplicationActions#accept is called.

I can get this to fail consistently by adding a sleep in PendingReplicationActions#addPendingAction. If I cache a version of the ReplicationGroup in PendingReplicationActions and use that, the test succeeds even with the sleep.

Thoughts on the correct approach? I assume we have to fix this at this point since it is failing tests.

@ywelsch
Contributor

ywelsch commented Apr 29, 2020

This raises the question of what we want the timeout to be. I set this to be the request timeout which is configurable for external bulk shard requests. But I assume this is not configurable for internal operations? Do we want some type of setting like we introduced with peer recovery?

Yes, a cluster-level setting like for peer recoveries would be best here, I think. Given that the timeout determines how failure-resilient the cluster will be (i.e. how quickly it will start to fail shards), it's probably best not to mix it with the request-level timeout, which is more about how long the cluster should try to bring the given request to execution.

For our integration tests (ESIntegTestCase), we should then inject a random timeout in all tests (and explicitly use a higher timeout for those tests where we care about the added resilience).
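
For illustration, a cluster-level timeout in the style of the peer recovery settings might be declared roughly as below; the setting key and default shown here are assumptions, not necessarily what the PR ended up using:

```java
import org.elasticsearch.common.settings.Setting;
import org.elasticsearch.common.unit.TimeValue;

public class ReplicationRetrySettings {
    // Hypothetical key and default; dynamic so operators can tune resilience at runtime.
    public static final Setting<TimeValue> RETRY_TIMEOUT_SETTING =
        Setting.timeSetting(
            "indices.replication.retry_timeout",
            TimeValue.timeValueSeconds(60),
            Setting.Property.Dynamic,
            Setting.Property.NodeScope);
}
```

Integration tests could then randomize this value (e.g. via ESIntegTestCase's node settings), along the lines suggested above.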

@Tim-Brooks Tim-Brooks requested a review from ywelsch April 29, 2020 15:19
@Tim-Brooks
Contributor Author

@ywelsch I have updated this PR with the versioning approach and adding the timeout setting.

@Tim-Brooks Tim-Brooks requested a review from ywelsch April 29, 2020 18:21
Contributor

@ywelsch ywelsch left a comment


LGTM. Thanks Tim! As a follow-up, we can look at possibly exposing some information in the tasks API about a replication request that is pending replication (after indexing into the primary has completed), just so that we can see that we're possibly waiting for a retry (see TransportReplication.setPhase).

Member

@dnhatn dnhatn left a comment


Nice work! I left some minor comments. Thanks Tim!

}
}

threadPool.executor(ThreadPool.Names.GENERIC).execute(() -> toCancel.stream()
Member


Maybe share this logic with close?
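
One possible shape for the sharing this comment suggests (a sketch with simplified stand-in types, not the PR's code): both accept() and close() collect the actions to cancel and pass them to a single helper that cancels them on a background executor.

```java
import java.util.Collection;
import java.util.concurrent.Executor;
import java.util.function.Consumer;

// Sketch: a single cancellation path shared by accept() and close(). The
// caller supplies the exception/reason via the cancelWithReason callback.
final class PendingActionsCanceller<A> {
    private final Executor backgroundExecutor; // stands in for the GENERIC thread pool

    PendingActionsCanceller(Executor backgroundExecutor) {
        this.backgroundExecutor = backgroundExecutor;
    }

    void cancelActions(Collection<A> toCancel, Consumer<A> cancelWithReason) {
        // cancel off the calling thread, mirroring the GENERIC pool usage in the hunk above
        backgroundExecutor.execute(() -> toCancel.forEach(cancelWithReason));
    }
}
```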


@Override
public void accept(ReplicationGroup replicationGroup) {
if (replicationGroup.getVersion() - replicationGroupVersion > 0) {
Member


maybe just replicationGroup.getVersion() > replicationGroupVersion?

public void accept(ReplicationGroup replicationGroup) {
if (replicationGroup.getVersion() - replicationGroupVersion > 0) {
synchronized (this) {
if (replicationGroup.getVersion() - replicationGroupVersion > 0) {
Member


maybe just replicationGroup.getVersion() > replicationGroupVersion ?
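
As an aside that is not raised in the thread: the two comparisons behave identically for version counters that never overflow; the only behavioural difference is that the subtraction form, like sequence-number comparisons, would still order values correctly across a wrap-around of the long counter. A tiny demo of that difference, purely for illustration:

```java
public class VersionCompareDemo {
    public static void main(String[] args) {
        long current = Long.MAX_VALUE;       // last applied version
        long incoming = Long.MIN_VALUE + 1;  // hypothetical "newer" value after overflow
        System.out.println(incoming > current);      // false: direct comparison treats it as older
        System.out.println(incoming - current > 0);  // true: subtraction still treats it as newer
    }
}
```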

@Tim-Brooks Tim-Brooks merged commit b2b32d7 into elastic:master Apr 30, 2020
Tim-Brooks added a commit to Tim-Brooks/elasticsearch that referenced this pull request May 5, 2020
Currently a failed replication action will fail an entire replica. This includes when replication fails due to potentially short-lived transient issues such as network disruptions or circuit breaking errors.

This commit implements retries using the retryable action.
@Tim-Brooks Tim-Brooks added v7.9.0 and removed v7.8.0 labels Jun 8, 2020

Labels

:Distributed Indexing/CRUD A catch all label for issues around indexing, updating and getting a doc by id. Not search. >enhancement v7.9.0 v8.0.0-alpha1


5 participants