
Conversation

@jasontedor (Member)
While opening a connection to a node, a channel can subsequently close. If this happens, a future callback whose purpose is to close all other channels and disconnect from the node will fire. However, this future will not be ready to close all the channels because the connection will not be exposed to the future callback yet. Since this callback is run once, we will never try to disconnect from this node again and we will be left with a closed channel. This commit adds a check that all channels are open before exposing the channel and throws a general connection exception. In this case, the usual connection retry logic will take over.
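The fix described above can be sketched in isolation. This is a minimal, hypothetical model (the class and method names are mine, not the actual Elasticsearch code): after the channels are created but before the connection is exposed, verify that every channel is still open; if one closed concurrently, throw a connection exception so the normal retry logic takes over instead of leaving a half-dead connection behind.

```java
import java.util.Arrays;

public class OpenCheckSketch {
    // Stand-in for a transport channel that may be closed concurrently.
    static class Channel {
        private volatile boolean open = true;
        boolean isOpen() { return open; }
        void close() { open = false; }
    }

    static class ConnectTransportException extends RuntimeException {
        ConnectTransportException(String message) { super(message); }
    }

    static void exposeConnection(Channel[] channels) {
        // A channel may have closed while connecting; the close callback
        // cannot clean up yet because the connection is not visible to it,
        // so the closed channel must be detected here and the connect failed.
        if (Arrays.stream(channels).allMatch(Channel::isOpen) == false) {
            throw new ConnectTransportException("a channel closed while connecting");
        }
        // ...the connection would be published to the rest of the transport here
    }

    public static void main(String[] args) {
        Channel[] channels = { new Channel(), new Channel() };
        channels[1].close(); // simulate a concurrent close during connect
        try {
            exposeConnection(channels);
            throw new AssertionError("expected a ConnectTransportException");
        } catch (ConnectTransportException expected) {
            System.out.println("connection rejected: " + expected.getMessage());
        }
    }
}
```

Failing the connect here hands the problem back to the caller's retry logic, which is already equipped to handle a node that could not be connected to.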

* master:
  update Lucene version for 6.0-RC2 version
  Calculate and cache result when advanceExact is called (elastic#26920)
  Test query builder bwc against previous supported versions instead of just the current version.
  Set minimum_master_nodes on rolling-upgrade test (elastic#26911)
  Return List instead of an array from settings (elastic#26903)
  remove _primary and _replica shard preferences (elastic#26791)
  fixing typo in datehistogram-aggregation.asciidoc (elastic#26924)
  [API] Added the `terminate_after` parameter to the REST spec for "Count" API
  Setup debug logging for qa.full-cluster-restart
  Enable BWC testing against other remotes
@bleskes (Contributor) left a comment


Thx @jasontedor. I left one important comment.

@Override
protected void close(NioChannel nioChannel) {
    try {
        nioChannel.getCloseFuture().awaitClose();
Contributor:

Out of curiosity - this reads weird - we wait on close but not actively close? is that OK?

@jasontedor (Member, Author):

Oh, that was silly. I pushed d69c92e.

};
-nodeChannels = connectToChannels(node, connectionProfile, onClose);
+nodeChannels = connectToChannels(node, connectionProfile, this::onChannelOpen, onClose);
+if (!Arrays.stream(nodeChannels.channels).allMatch(this::isOpen)) {
Contributor:

I think this needs to be moved to after connectionRef.set(nodeChannels); is called. Otherwise we still have a race condition when things go wrong later on?

@jasontedor (Member, Author):

Yes, I pushed c4ae4c6.

@jasontedor (Member, Author)

@bleskes This is ready for you again.

nodeChannels = new NodeChannels(nodeChannels, version); // clone the channels - we now have the correct version
transportService.onConnectionOpened(nodeChannels);
connectionRef.set(nodeChannels);
if (!Arrays.stream(nodeChannels.channels).allMatch(this::isOpen)) {
Contributor:

can we use == false?
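For context, the `== false` form is an Elasticsearch code-style convention: an explicit comparison is harder to overlook than a leading `!` when reading. A tiny illustration (the variable name is mine) showing the two forms are equivalent:

```java
public class NegationStyleSketch {
    public static void main(String[] args) {
        boolean allOpen = false;
        // Equivalent negations; the explicit comparison is preferred in the
        // Elasticsearch codebase because a leading "!" is easy to overlook.
        System.out.println(!allOpen);          // true
        System.out.println(allOpen == false);  // true
    }
}
```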

@jasontedor (Member, Author):

I pushed 1445272.

*
* @param channel the opened channel
*/
protected void onChannelOpen(final Channel channel) {
Contributor:

I am not sure I understand this change. Why do we have to have this onChannelOpen callback? Is it for testing only? I wonder if we can find a different way to do this; it's a no-op in production.

@jasontedor (Member, Author):

Yes, it's for testing so we can hook in and quietly close a channel after it's been successfully opened to trigger the issue. I'm open to suggestions on an alternative approach to achieve the same, but this is the cleanest that I found.

@jasontedor (Member, Author):

Yesterday I had tried to use the connection listener to hook in and disconnect; the problem with this is that it was not reliable because it happens on a background thread and would only sporadically succeed in closing a channel in time. @s1monw had the clever and sneaky idea of hooking into the executor so we can replace it with one that runs synchronously on the same thread, and thus we ensure it happens before the connection finishes. I have pushed this change.
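The executor trick described above can be sketched generically. With a "direct" executor that runs tasks on the calling thread, a callback submitted during the test completes synchronously, so the test deterministically closes the channel before connecting finishes. This is an illustrative standalone example, not the actual Elasticsearch test code:

```java
import java.util.concurrent.Executor;

public class DirectExecutorSketch {
    // Runs each task on the calling thread instead of a background event loop.
    static final Executor DIRECT = Runnable::run;

    public static void main(String[] args) {
        StringBuilder order = new StringBuilder();
        order.append("before;");
        // With a real event-loop executor this callback could run at any time;
        // with the direct executor it completes before execute() returns.
        DIRECT.execute(() -> order.append("callback;"));
        order.append("after");
        System.out.println(order); // before;callback;after — fully deterministic
    }
}
```

Because the callback runs on the same thread, there is no race for the test to lose: the "close the channel" hook is guaranteed to have run by the time the connection attempt proceeds.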

@jasontedor (Member, Author)

Thanks @bleskes and @s1monw I've responded to all of your feedback.

@Tim-Brooks (Contributor) left a comment


So generally LGTM, but I did have one concern about whether we have addressed all the potential issues. Also, our method contracts are getting kind of complicated, and we depend on both nio and netty properly implementing similar connection-failure logic.

I'm not sure if we should maybe move to a different contract here at some point, such as:

protected abstract List<Future<Channel>> connectToChannels(DiscoveryNode node,
                                                               ConnectionProfile connectionProfile,
                                                               Consumer<Channel> onChannelClose) throws IOException;

In this scenario, the list of futures returned are the connection futures.

And all the implementation does is initiate the connection process and attach the close listener. It is up to TcpTransport to block on a connection and do all the validation ensuring that the connections are set up properly. I'm not sure I've thought through all the implications of that, but it is something to consider in the future.
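The contract suggested above can be sketched with simplified stand-in types (these names are illustrative, not the Elasticsearch API): the implementation only initiates connections and returns futures, while the caller — playing the TcpTransport role — blocks on each future and validates, so connection-failure handling lives in exactly one place.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class FutureContractSketch {
    // Stand-in for a low-level transport channel.
    static class Channel {
        final int id;
        Channel(int id) { this.id = id; }
    }

    // Implementation side: start the connects, attach nothing else.
    // (A real nio/netty implementation would complete these asynchronously.)
    static List<CompletableFuture<Channel>> connectToChannels(int count) {
        List<CompletableFuture<Channel>> futures = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            futures.add(CompletableFuture.completedFuture(new Channel(i)));
        }
        return futures;
    }

    // Caller side (the TcpTransport role): block on every future so that a
    // failed connect surfaces in one place, regardless of the implementation.
    static List<Channel> awaitAll(List<CompletableFuture<Channel>> futures) {
        List<Channel> channels = new ArrayList<>();
        for (CompletableFuture<Channel> f : futures) {
            channels.add(f.join()); // join() rethrows any connect failure
        }
        return channels;
    }

    public static void main(String[] args) {
        List<Channel> channels = awaitAll(connectToChannels(3));
        System.out.println("connected channels: " + channels.size());
    }
}
```

The appeal of this shape is that nio and netty would no longer each need to duplicate the validation and failure logic; they would only produce futures.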

final AtomicBoolean first = new AtomicBoolean();
service =
buildService("service", version0, clusterSettings, Settings.EMPTY, true, true, channel -> {
if (!first.compareAndSet(false, true)) {
Contributor:

I understand that this works. But the booleans (as labelled) seem backwards to me.

Isn't it:

final AtomicBoolean first = new AtomicBoolean(true)
...
if (!first.compareAndSet(true, false)) {
}

?
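The compareAndSet semantics behind this suggestion are easy to demonstrate: with the flag initialized to true ("the first event has not been consumed yet"), compareAndSet(true, false) succeeds exactly once, which matches the variable name "first" more naturally than starting from false.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class FirstFlagSketch {
    public static void main(String[] args) {
        AtomicBoolean first = new AtomicBoolean(true);
        // The first caller wins the CAS and handles the first event...
        boolean firstCall = first.compareAndSet(true, false);
        // ...every subsequent caller fails the CAS and does nothing.
        boolean secondCall = first.compareAndSet(true, false);
        System.out.println(firstCall + " " + secondCall); // true false
    }
}
```

Either polarity works because compareAndSet is atomic; the review point is purely about readability.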

@jasontedor (Member, Author):

I pushed 53a9b2c.

nodeChannels = new NodeChannels(nodeChannels, version); // clone the channels - we now have the correct version
transportService.onConnectionOpened(nodeChannels);
connectionRef.set(nodeChannels);
if (Arrays.stream(nodeChannels.channels).allMatch(this::isOpen) == false) {
Contributor:

I guess this solution solves the majority of the cases where the original issue arises, but don't we still have an issue if, immediately after this check succeeds, the other side concurrently disconnects? The event loop is a different thread, so the close listener could still be executed prior to the node's channels being returned from this scope.

In order to be safe don't we need to check if all the channels are still open AFTER we put it in the connectedNodes map in the connectToNode method?

@jasontedor (Member, Author):

I think it's enough to check that they are open after we've set the reference to the channels, because the problem arises only if a channel closes before that reference is set; after the reference is set, we are covered by the close listener?

Contributor:

Oh yeah that’s probably right. I was thinking about the connection map that is not manipulated until later.

@jasontedor (Member, Author)

@tbrooks8 I think your suggestion of returning the futures and lifting the channel connection handling logic into TcpTransport is a good one, and I had considered something similar. The reason I decided against it is that this change needs to go into 5.6, where I want to avoid a bigger refactoring, and I think what we have here is good enough? We can explore the futures suggestion in a follow-up?

@jasontedor (Member, Author)

@tbrooks8 @bleskes @s1monw This is ready for you again.

@jasontedor (Member, Author)

@s1monw This is ready for you.

* master:
  Fix handling of paths containing parentheses
  Allow only a fixed-size receive predictor (elastic#26165)
  Add Homebrew instructions to getting started
  ingest: Fix bug that prevent date_index_name processor from accepting timestamps specified as a json number
  Scripting: Fix expressions to temporarily support filter scripts (elastic#26824)
  Docs: Add note to contributing docs warning against tool based refactoring (elastic#26936)
  Fix thread context handling of headers overriding (elastic#26068)
  SearchWhileCreatingIndexIT: remove usage of _only_nodes
@s1monw (Contributor) left a comment


LGTM

@jasontedor (Member, Author)

I discussed with @bleskes via another channel and he is okay with this being merged.

@jasontedor jasontedor merged commit 4c06b8f into elastic:master Oct 10, 2017
jasontedor added a commit that referenced this pull request Oct 10, 2017
While opening a connection to a node, a channel can subsequently
close. If this happens, a future callback whose purpose is to close all
other channels and disconnect from the node will fire. However, this
future will not be ready to close all the channels because the
connection will not be exposed to the future callback yet. Since this
callback is run once, we will never try to disconnect from this node
again and we will be left with a closed channel. This commit adds a
check that all channels are open before exposing the channel and throws
a general connection exception. In this case, the usual connection retry
logic will take over.

Relates #26932
jasontedor added a commit that referenced this pull request Oct 10, 2017
jasontedor added a commit that referenced this pull request Oct 10, 2017
@jasontedor jasontedor deleted the check-for-close branch October 10, 2017 17:57
@ywelsch ywelsch mentioned this pull request Oct 13, 2017
@lcawl lcawl added v6.0.0-rc2 and removed v6.0.0 labels Oct 30, 2017
@lcawl lcawl removed the v6.1.0 label Dec 12, 2017

6 participants