
Conversation

@jasontedor (Member)
While opening a connection to a node, a channel can subsequently close. If this happens, a future callback whose purpose is to close all other channels and disconnect from the node will fire. However, this future will not be ready to close all the channels because the connection will not be exposed to the future callback yet. Since this callback is run once, we will never try to disconnect from this node again and we will be left with a closed channel. This commit adds a check that all channels are open before exposing the channel and throws a general connection exception. In this case, the usual connection retry logic will take over.
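The fix described above can be sketched in isolation. This is a minimal, hypothetical model (the class and method names are mine, not the actual Elasticsearch code): after the channels are created but before the connection is exposed, verify that every channel is still open; if one closed concurrently, throw a connection exception so the normal retry logic takes over instead of leaving a half-dead connection behind.

```java
import java.util.Arrays;

public class OpenCheckSketch {
    // Stand-in for a transport channel that may be closed concurrently.
    static class Channel {
        private volatile boolean open = true;
        boolean isOpen() { return open; }
        void close() { open = false; }
    }

    static class ConnectTransportException extends RuntimeException {
        ConnectTransportException(String message) { super(message); }
    }

    static void exposeConnection(Channel[] channels) {
        // A channel may have closed while connecting; the close callback
        // cannot clean up yet because the connection is not visible to it,
        // so the closed channel must be detected here and the connect failed.
        if (Arrays.stream(channels).allMatch(Channel::isOpen) == false) {
            throw new ConnectTransportException("a channel closed while connecting");
        }
        // ...the connection would be published to the rest of the transport here
    }

    public static void main(String[] args) {
        Channel[] channels = { new Channel(), new Channel() };
        channels[1].close(); // simulate a concurrent close during connect
        try {
            exposeConnection(channels);
            throw new AssertionError("expected a ConnectTransportException");
        } catch (ConnectTransportException expected) {
            System.out.println("connection rejected: " + expected.getMessage());
        }
    }
}
```

Failing the connect here hands the problem back to the caller's retry logic, which is already equipped to handle a node that could not be connected to.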

* master:
  update Lucene version for 6.0-RC2 version
  Calculate and cache result when advanceExact is called (elastic#26920)
  Test query builder bwc against previous supported versions instead of just the current version.
  Set minimum_master_nodes on rolling-upgrade test (elastic#26911)
  Return List instead of an array from settings (elastic#26903)
  remove _primary and _replica shard preferences (elastic#26791)
  fixing typo in datehistogram-aggregation.asciidoc (elastic#26924)
  [API] Added the `terminate_after` parameter to the REST spec for "Count" API
  Setup debug logging for qa.full-cluster-restart
  Enable BWC testing against other remotes
@bleskes (Contributor) left a comment


Thx @jasontedor. I left one important comment.

@Override
protected void close(NioChannel nioChannel) {
    try {
        nioChannel.getCloseFuture().awaitClose();
Contributor:

Out of curiosity - this reads weird - we wait on close but not actively close? is that OK?

@jasontedor (Member, Author):

Oh, that was silly. I pushed d69c92e.

};
-nodeChannels = connectToChannels(node, connectionProfile, onClose);
+nodeChannels = connectToChannels(node, connectionProfile, this::onChannelOpen, onClose);
+if (!Arrays.stream(nodeChannels.channels).allMatch(this::isOpen)) {
Contributor:

I think this needs to be moved to after connectionRef.set(nodeChannels); is called. Otherwise we still have a race condition when things go wrong later on?

@jasontedor (Member, Author):

Yes, I pushed c4ae4c6.

@jasontedor (Member, Author)

@bleskes This is ready for you again.

nodeChannels = new NodeChannels(nodeChannels, version); // clone the channels - we now have the correct version
transportService.onConnectionOpened(nodeChannels);
connectionRef.set(nodeChannels);
if (!Arrays.stream(nodeChannels.channels).allMatch(this::isOpen)) {
Contributor:

can we use == false?
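For context, the `== false` form is an Elasticsearch code-style convention: an explicit comparison is harder to overlook than a leading `!` when reading. A tiny illustration (the variable name is mine) showing the two forms are equivalent:

```java
public class NegationStyleSketch {
    public static void main(String[] args) {
        boolean allOpen = false;
        // Equivalent negations; the explicit comparison is preferred in the
        // Elasticsearch codebase because a leading "!" is easy to overlook.
        System.out.println(!allOpen);          // true
        System.out.println(allOpen == false);  // true
    }
}
```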

@jasontedor (Member, Author):

I pushed 1445272.

*
* @param channel the opened channel
*/
protected void onChannelOpen(final Channel channel) {
Contributor:

I am not sure I understand this change. Why do we have to have this onChannelOpen callback? Is it for testing only? I wonder if we can find a different way to do this; it's a no-op in production.

@jasontedor (Member, Author):

Yes, it's for testing so we can hook in and quietly close a channel after it's been successfully opened to trigger the issue. I'm open to suggestions on an alternative approach to achieve the same, but this is the cleanest that I found.

@jasontedor (Member, Author):

Yesterday I had tried to use the connection listener to hook in and disconnect; the problem with this is that it was not reliable because it happens on a background thread and would only sporadically succeed in closing a channel in time. @s1monw had the clever and sneaky idea of hooking into the executor so we can replace it with one that runs synchronously on the same thread, and thus we ensure it happens before the connection finishes. I have pushed this change.
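The executor trick described above can be sketched generically. With a "direct" executor that runs tasks on the calling thread, a callback submitted during the test completes synchronously, so the test deterministically closes the channel before connecting finishes. This is an illustrative standalone example, not the actual Elasticsearch test code:

```java
import java.util.concurrent.Executor;

public class DirectExecutorSketch {
    // Runs each task on the calling thread instead of a background event loop.
    static final Executor DIRECT = Runnable::run;

    public static void main(String[] args) {
        StringBuilder order = new StringBuilder();
        order.append("before;");
        // With a real event-loop executor this callback could run at any time;
        // with the direct executor it completes before execute() returns.
        DIRECT.execute(() -> order.append("callback;"));
        order.append("after");
        System.out.println(order); // before;callback;after — fully deterministic
    }
}
```

Because the callback runs on the same thread, there is no race for the test to lose: the "close the channel" hook is guaranteed to have run by the time the connection attempt proceeds.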

@jasontedor (Member, Author)

Thanks @bleskes and @s1monw I've responded to all of your feedback.

@Tim-Brooks (Contributor) left a comment


So generally LGTM, but I did have one concern about whether we have addressed all the potential issues. Also, our method contracts are getting kind of complicated, and we depend on both nio and netty properly implementing similar connection-failure logic.

I'm not sure if we should maybe move to a different contract here at some point, such as:

protected abstract List<Future<Channel>> connectToChannels(DiscoveryNode node,
                                                               ConnectionProfile connectionProfile,
                                                               Consumer<Channel> onChannelClose) throws IOException;

In this scenario, the list of futures returned are the connection futures.

And all the implementation does is initiate the connection process and attach the close listener. It is up to TcpTransport to block on a connection and do all the validation ensuring that the connections are set up properly. I'm not sure I've thought through all the implications of that, but it is something to consider in the future.
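The contract suggested above can be sketched with simplified stand-in types (these names are illustrative, not the Elasticsearch API): the implementation only initiates connections and returns futures, while the caller — playing the TcpTransport role — blocks on each future and validates, so connection-failure handling lives in exactly one place.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class FutureContractSketch {
    // Stand-in for a low-level transport channel.
    static class Channel {
        final int id;
        Channel(int id) { this.id = id; }
    }

    // Implementation side: start the connects, attach nothing else.
    // (A real nio/netty implementation would complete these asynchronously.)
    static List<CompletableFuture<Channel>> connectToChannels(int count) {
        List<CompletableFuture<Channel>> futures = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            futures.add(CompletableFuture.completedFuture(new Channel(i)));
        }
        return futures;
    }

    // Caller side (the TcpTransport role): block on every future so that a
    // failed connect surfaces in one place, regardless of the implementation.
    static List<Channel> awaitAll(List<CompletableFuture<Channel>> futures) {
        List<Channel> channels = new ArrayList<>();
        for (CompletableFuture<Channel> f : futures) {
            channels.add(f.join()); // join() rethrows any connect failure
        }
        return channels;
    }

    public static void main(String[] args) {
        List<Channel> channels = awaitAll(connectToChannels(3));
        System.out.println("connected channels: " + channels.size());
    }
}
```

The appeal of this shape is that nio and netty would no longer each need to duplicate the validation and failure logic; they would only produce futures.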

final AtomicBoolean first = new AtomicBoolean();
service =
buildService("service", version0, clusterSettings, Settings.EMPTY, true, true, channel -> {
if (!first.compareAndSet(false, true)) {
Contributor:

I understand that this works. But the booleans (as labelled) seem backwards to me.

Isn't it:

final AtomicBoolean first = new AtomicBoolean(true)
...
if (!first.compareAndSet(true, false)) {
}

?
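The compareAndSet semantics behind this suggestion are easy to demonstrate: with the flag initialized to true ("the first event has not been consumed yet"), compareAndSet(true, false) succeeds exactly once, which matches the variable name "first" more naturally than starting from false.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class FirstFlagSketch {
    public static void main(String[] args) {
        AtomicBoolean first = new AtomicBoolean(true);
        // The first caller wins the CAS and handles the first event...
        boolean firstCall = first.compareAndSet(true, false);
        // ...every subsequent caller fails the CAS and does nothing.
        boolean secondCall = first.compareAndSet(true, false);
        System.out.println(firstCall + " " + secondCall); // true false
    }
}
```

Either polarity works because compareAndSet is atomic; the review point is purely about readability.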

@jasontedor (Member, Author):

I pushed 53a9b2c.

nodeChannels = new NodeChannels(nodeChannels, version); // clone the channels - we now have the correct version
transportService.onConnectionOpened(nodeChannels);
connectionRef.set(nodeChannels);
if (Arrays.stream(nodeChannels.channels).allMatch(this::isOpen) == false) {
Contributor:

I guess this solution solves the majority of the cases where the original issue arises, but don't we still have an issue if, immediately after this check succeeds, the other side concurrently disconnects? The event loop is a different thread, so the close listener could still be executed prior to the node's channels being returned from this scope.

In order to be safe don't we need to check if all the channels are still open AFTER we put it in the connectedNodes map in the connectToNode method?

@jasontedor (Member, Author):

I think it's enough to check that they are open after we've set the reference to the channels, because the problem arises only if a channel closes before that reference is set; after the reference is set, we are covered by the close listener?

Contributor:

Oh yeah that’s probably right. I was thinking about the connection map that is not manipulated until later.

@jasontedor (Member, Author)

@tbrooks8 I think your suggestion of returning the futures and lifting the channel connection handling logic into TcpTransport is a good one, and I had considered something similar. The reason I decided against it is that this change needs to go into 5.6, where I want to avoid a bigger refactoring, and I think what we have here is good enough? We can explore the futures suggestion in a follow-up?

@jasontedor (Member, Author)

@tbrooks8 @bleskes @s1monw This is ready for you again.

@jasontedor (Member, Author)

@s1monw This is ready for you.

* master:
  Fix handling of paths containing parentheses
  Allow only a fixed-size receive predictor (elastic#26165)
  Add Homebrew instructions to getting started
  ingest: Fix bug that prevent date_index_name processor from accepting timestamps specified as a json number
  Scripting: Fix expressions to temporarily support filter scripts (elastic#26824)
  Docs: Add note to contributing docs warning against tool based refactoring (elastic#26936)
  Fix thread context handling of headers overriding (elastic#26068)
  SearchWhileCreatingIndexIT: remove usage of _only_nodes
@s1monw (Contributor) left a comment


LGTM

@jasontedor (Member, Author)

I discussed with @bleskes via another channel and he is okay with this being merged.

@jasontedor jasontedor merged commit 4c06b8f into elastic:master Oct 10, 2017
jasontedor added a commit that referenced this pull request Oct 10, 2017
While opening a connection to a node, a channel can subsequently
close. If this happens, a future callback whose purpose is to close all
other channels and disconnect from the node will fire. However, this
future will not be ready to close all the channels because the
connection will not be exposed to the future callback yet. Since this
callback is run once, we will never try to disconnect from this node
again and we will be left with a closed channel. This commit adds a
check that all channels are open before exposing the channel and throws
a general connection exception. In this case, the usual connection retry
logic will take over.

Relates #26932
jasontedor added a commit that referenced this pull request Oct 10, 2017
jasontedor added a commit that referenced this pull request Oct 10, 2017
@jasontedor jasontedor deleted the check-for-close branch October 10, 2017 17:57
@ywelsch ywelsch mentioned this pull request Oct 13, 2017
@lcawl lcawl added v6.0.0-rc2 and removed v6.0.0 labels Oct 30, 2017
@lcawl lcawl removed the v6.1.0 label Dec 12, 2017

6 participants