
Conversation

@martijnvg
Member

  • Changed the shard changes api to include special metadata in the exception being thrown
    to indicate that the ops are no longer there.
  • Changed ShardFollowNodeTask to handle this exception with special metadata
    and mark a shard as fallen behind its leader shard. The shard follow task
    will then abort its ongoing replication.

The code that does the restore from the CCR repository still needs to be added.
This change should make that work a bit easier.

Relates to #35975
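The marker-metadata pattern this PR describes can be pictured with a plain-Java stand-in. The class and key names below (`MarkedException`, `REQUESTED_OPS_MISSING_KEY`) are simplified illustrations, not the actual `ElasticsearchException` API or the real `Ccr` metadata key:

```java
import java.util.HashSet;
import java.util.Set;

// Simplified stand-in for ElasticsearchException's metadata mechanism:
// the leader tags the failure with a marker key, and the follower checks
// for the key instead of parsing the exception message.
public class MetadataMarkerDemo {

    static final String REQUESTED_OPS_MISSING_KEY = "es.requested_operations_missing";

    // Wrapper exception that can carry marker metadata keys.
    static class MarkedException extends RuntimeException {
        private final Set<String> metadataKeys = new HashSet<>();

        MarkedException(String message, Throwable cause) {
            super(message, cause);
        }

        void addMetadata(String key) {
            metadataKeys.add(key);
        }

        Set<String> getMetadataKeys() {
            return metadataKeys;
        }
    }

    public static void main(String[] args) {
        // Leader side: wrap the original failure and tag it with the marker key.
        MarkedException wrapper = new MarkedException("ops no longer available", null);
        wrapper.addMetadata(REQUESTED_OPS_MISSING_KEY);

        // Follower side: check for the marker key rather than the message text.
        boolean fallenBehind = wrapper.getMetadataKeys().contains(REQUESTED_OPS_MISSING_KEY);
        System.out.println(fallenBehind); // prints true
    }
}
```

The design point is that a string key survives serialization regardless of exception type, which is what the rest of this thread ends up relying on.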

@martijnvg martijnvg added >non-issue v7.0.0 :Distributed Indexing/CCR Issues around the Cross Cluster State Replication features v6.7.0 labels Jan 17, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

String message = "Operations are no longer available for replicating. Maybe increase the retention setting [" +
IndexSettings.INDEX_SOFT_DELETES_RETENTION_OPERATIONS_SETTING.getKey() + "]?";
listener.onFailure(new ElasticsearchException(message, e));
// Make it easy to detect this error in ShardFollowNodeTask:
Member


Maybe introduce a new exception (extends from IllegalStateException) which indicates that the requesting history is no longer available. We can throw that exception in the LuceneChangesSnapshot. WDYT?

Member Author


Yes, that makes sense! I will make sure we throw that new exception in LuceneChangesSnapshot, and instead of the message check that happens here, just do an instanceof check for this new exception.

Member


Thanks @martijnvg :)

@martijnvg
Member Author

@dnhatn I've updated the PR.

Member

@dnhatn dnhatn left a comment


Thanks @martijnvg. I left two comments to discuss.

*/
public final class OperationsMissingException extends IllegalStateException {

OperationsMissingException(String message) {
Member


Maybe name it MissingHistoryOperationsException?

Member Author


👍

failedReadRequests++;
fetchExceptions.put(from, Tuple.tuple(retryCounter, ExceptionsHelper.convertToElastic(e)));
}
if (e instanceof ElasticsearchException) {
Member


Can we serialize the newly added exception and use it here?

Member Author


But then MissingHistoryOperationsException needs to extend ElasticsearchException, otherwise it cannot be serialized.

Regardless of that, I always understood that we want to minimize the number of exceptions that are serializable. This is why I added a header to ElasticsearchException instead of creating a new one. Does this still apply? Or do we really want to introduce a new serializable exception here?

Member


Regardless of that I always understood that we want to minimize the number of exceptions that are serializable.

If this is the case, let's use the header or metadata here.

Member

@dnhatn dnhatn left a comment


@martijnvg I left two comments.

// Make it easy to detect this error in ShardFollowNodeTask:
// (adding a metadata header instead of introducing a new exception that extends ElasticsearchException)
ElasticsearchException wrapper = new ElasticsearchException(message, e);
wrapper.addMetadata(Ccr.FALLEN_BEHIND_LEADER_SHARD_METADATA_KEY);
Member


I think we should name this key to indicate that the requesting changes range is no longer available rather than "fallen behind leader shard" because this service should not know anything about the follower.

}

void handleFallenBehindLeaderShard(Exception e) {
if (fallenBehindLeaderShard.compareAndSet(false, true)) {
Member


Let's make this method noop for now (i.e., without the state fallenBehindLeaderShard). I feel we need a more robust approach to avoid the scenario where an outstanding request can trigger another restore while the shard was restored already.

@martijnvg
Member Author

@dnhatn I've updated the PR.

@martijnvg
Member Author

run the gradle build tests 1

@martijnvg
Member Author

@elasticmachine run elasticsearch-ci/2

@martijnvg
Member Author

@elasticmachine run elasticsearch-ci/1

This reverts commit 80c0efe.
@martijnvg
Member Author

@dnhatn As discussed, I have removed the fallen_behind_leader_shard stats field and the fallenBehindLeaderShard field, because it is unknown whether a simple flag is sufficient
to indicate that we need to fall back to a file-based restore.

Member

@dnhatn dnhatn left a comment


LGTM. Thanks @martijnvg.

}
if (e instanceof ElasticsearchException) {
ElasticsearchException elasticsearchException = (ElasticsearchException) e;
if (elasticsearchException.getMetadataKeys().contains(Ccr.REQUESTED_OPS_MISSING_METADATA_KEY)) {
Contributor


I have been working with this PR locally alongside some other work of mine. When the listener is called, the exception is wrapped in a RemoteTransportException, so the check for whether the metadata is present fails even if the underlying exception has the metadata. I think this needs to be unwrapped further. Blocking on a future unwraps the exception, which is why your test passes (as opposed to the listener infrastructure).
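The wrapping problem this comment describes can be sketched in plain Java. The classes below (`RemoteWrapper`, `MissingHistoryOperations`) are simplified stand-ins for RemoteTransportException and the marker exception, and `unwrapCause` only mimics the spirit of ExceptionsHelper.unwrapCause:

```java
// A transport-level wrapper hides the marker exception, so a direct
// instanceof check on the outer exception fails until the cause chain
// is walked to the innermost "real" cause.
public class UnwrapDemo {

    static class RemoteWrapper extends RuntimeException {
        RemoteWrapper(Throwable cause) {
            super(cause);
        }
    }

    static class MissingHistoryOperations extends IllegalStateException {
        MissingHistoryOperations(String message) {
            super(message);
        }
    }

    // Walk wrapper layers to the innermost cause.
    static Throwable unwrapCause(Throwable t) {
        while (t instanceof RemoteWrapper && t.getCause() != null) {
            t = t.getCause();
        }
        return t;
    }

    public static void main(String[] args) {
        Throwable e = new RemoteWrapper(new MissingHistoryOperations("ops missing"));
        System.out.println(e instanceof MissingHistoryOperations);              // prints false
        System.out.println(unwrapCause(e) instanceof MissingHistoryOperations); // prints true
    }
}
```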

Member Author


I think it is also because the MissingHistoryOperationsException was not handled properly in the shard changes action that the ElasticsearchException with the metadata never made it here.

ActionListener<Response> wrappedListener = ActionListener.wrap(listener::onResponse, e -> {
Throwable cause = ExceptionsHelper.unwrapCause(e);
if (cause instanceof IllegalStateException && cause.getMessage().contains("Not all operations between from_seqno [")) {
if (cause instanceof MissingHistoryOperationsException) {
Contributor


I don't think this is serializing properly. I find this exception at a breakpoint here. I think that our exception serializing code is doing an instanceof check, seeing that this is an IllegalStateException and it is identified as an IllegalStateException on deserialization.

java.lang.IllegalStateException: Not all operations between from_seqno [17] and to_seqno [34] found; expected seqno [17]; found [Index{id='6', type='doc', seqNo=18, primaryTerm=1, version=2, autoGeneratedIdTimestamp=-1}]
	at org.elasticsearch.index.engine.LuceneChangesSnapshot.rangeCheck(LuceneChangesSnapshot.java:155)
	at org.elasticsearch.index.engine.LuceneChangesSnapshot.next(LuceneChangesSnapshot.java:138)
	at org.elasticsearch.xpack.ccr.action.ShardChangesAction.getOperations(ShardChangesAction.java:527)
	at org.elasticsearch.xpack.ccr.action.ShardChangesAction$TransportAction.shardOperation(ShardChangesAction.java:340)
	at org.elasticsearch.xpack.ccr.action.ShardChangesAction$TransportAction.shardOperation(ShardChangesAction.java:319)
	at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$1.doRun(TransportSingleShardAction.java:117)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
	at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:41)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:732)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.lang.Thread.run(Thread.java:834)
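The type loss described in the comment above can be pictured with a minimal round-trip sketch. The `roundTrip` method here is a deliberately simplified stand-in, assuming the wire format only recognizes a fixed set of exception types; it is not Elasticsearch's actual serialization code:

```java
// If exception serialization only knows a fixed set of types, an unknown
// subclass is written and read back as its nearest known supertype, so
// subclass instanceof checks fail on the receiving side.
public class WireRoundTripDemo {

    static class MissingHistoryOperationsException extends IllegalStateException {
        MissingHistoryOperationsException(String message) {
            super(message);
        }
    }

    // Stand-in: an unknown IllegalStateException subclass survives the
    // round trip only as a plain IllegalStateException with the same message.
    static Exception roundTrip(Exception e) {
        if (e instanceof IllegalStateException) {
            return new IllegalStateException(e.getMessage());
        }
        return e;
    }

    public static void main(String[] args) {
        Exception sent = new MissingHistoryOperationsException("ops missing");
        Exception received = roundTrip(sent);
        System.out.println(received instanceof MissingHistoryOperationsException); // prints false
        System.out.println(received instanceof IllegalStateException);             // prints true
    }
}
```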

Member Author


I made a wrong assumption here. I thought that the exception here didn't get serialized, but it does get serialized (when the request gets sent to a different shard copy than the one on the local node).

This means that throwing the MissingHistoryOperationsException here does not work. We need to throw an exception that extends from ElasticsearchException and that sets specific metadata in LuceneSnapshot.java. I will make this change now.

…xception with

a special header instead. MissingHistoryOperationsException couldn't be used
because it ended up getting serialized in certain cases, and this exception
was then not handled correctly.
@martijnvg
Member Author

@dnhatn @tbrooks8 I've updated the PR. I removed the MissingHistoryOperationsException and instead now throw a ResourceNotFoundException with headers from LuceneSnapshot#rangeCheck(...). This is needed because MissingHistoryOperationsException was being serialized, then wrapped in a RemoteTransportException, and then we couldn't handle it properly.

@tbrooks8 Can you verify whether this change works in your PR?

a different place where it never gets serialized and there
convert it into the wrapper exception.
@martijnvg
Member Author

@dnhatn and I talked about keeping the MissingHistoryOperationsException that was removed. We think we can still use it, but we need to handle the exception in a different place.

So I reverted my previous commit, and instead of handling the MissingHistoryOperationsException in the doExecute(...) method, the MissingHistoryOperationsException is now handled in getOperations(...), which is always executed locally on the node that has the shard copy.

The only minor downside of this approach is that we will potentially wrap a MissingHistoryOperationsException more than once: a shard copy may not have the requested ops while another shard copy does have them. In the end that shouldn't be a problem, because TransportSingleShardAction always keeps track of the latest failure.
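The final approach described above can be sketched in plain Java. All class and key names below are simplified stand-ins for the Elasticsearch classes discussed in this thread, not the real implementations:

```java
import java.util.HashSet;
import java.util.Set;

// Catch the local marker exception at the point that always runs on the
// node holding the shard copy, and convert it there into a metadata-tagged
// wrapper, so the non-serializable marker type never crosses the wire.
public class LocalWrapDemo {

    static final String REQUESTED_OPS_MISSING_KEY = "es.requested_operations_missing";

    static class MissingHistoryOperationsException extends IllegalStateException {
        MissingHistoryOperationsException(String message) {
            super(message);
        }
    }

    static class MarkedException extends RuntimeException {
        final Set<String> metadataKeys = new HashSet<>();

        MarkedException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    // Stand-in for getOperations(...): always executes locally, so the
    // marker exception is caught before any serialization can strip its type.
    static void getOperations(boolean opsAvailable) {
        try {
            if (!opsAvailable) {
                throw new MissingHistoryOperationsException("ops no longer available");
            }
        } catch (MissingHistoryOperationsException e) {
            MarkedException wrapper = new MarkedException("resource not found", e);
            wrapper.metadataKeys.add(REQUESTED_OPS_MISSING_KEY);
            throw wrapper;
        }
    }

    public static void main(String[] args) {
        try {
            getOperations(false);
        } catch (MarkedException e) {
            System.out.println(e.metadataKeys.contains(REQUESTED_OPS_MISSING_KEY)); // prints true
        }
    }
}
```

Wrapping twice (once per shard copy that lacks the ops) is harmless in this sketch for the same reason given above: only the latest failure is kept.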

@Tim-Brooks
Contributor

I reran my test and found that handleFallenBehindLeaderShard was called this time.

Member

@dnhatn dnhatn left a comment


Thanks @martijnvg for an extra iteration.

@martijnvg martijnvg merged commit 4e1a779 into elastic:master Jan 28, 2019
martijnvg added a commit that referenced this pull request Jan 28, 2019
…hard (#37562)

* Changed `LuceneSnapshot` to throw an `OperationsMissingException` if the requested ops are missing.
* Changed the shard changes api to handle the `OperationsMissingException` and wrap the exception into `ResourceNotFound` exception and include metadata to indicate the requested range can no longer be retrieved.
* Changed `ShardFollowNodeTask` to handle this `ResourceNotFound` exception with the included metadata header.

Relates to #35975
