-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Check for closed connection while opening #26932
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
While opening a connection to a node, a channel can subsequently close. If this happens, a future callback whose purpose is to close all other channels and disconnect from the node will fire. However, this future will not be ready to close all the channels because the connection will not be exposed to the future callback yet. Since this callback is run once, we will never try to disconnect from this node again and we will be left with a closed channel. This commit adds a check that all channels are open before exposing the channel and throws a general connection exception. In this case, the usual connection retry logic will take over.
* master: update Lucene version for 6.0-RC2 version Calculate and cache result when advanceExact is called (elastic#26920) Test query builder bwc against previous supported versions instead of just the current version. Set minimum_master_nodes on rolling-upgrade test (elastic#26911) Return List instead of an array from settings (elastic#26903) remove _primary and _replica shard preferences (elastic#26791) fixing typo in datehistogram-aggregation.asciidoc (elastic#26924) [API] Added the `terminate_after` parameter to the REST spec for "Count" API Setup debug logging for qa.full-cluster-restart Enable BWC testing against other remotes
bleskes
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thx @jasontedor . I left one important comment.
| @Override | ||
| protected void close(NioChannel nioChannel) { | ||
| try { | ||
| nioChannel.getCloseFuture().awaitClose(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity - this reads weird - we wait on close but not actively close? is that OK?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, that was silly. I pushed d69c92e.
| }; | ||
| nodeChannels = connectToChannels(node, connectionProfile, onClose); | ||
| nodeChannels = connectToChannels(node, connectionProfile, this::onChannelOpen, onClose); | ||
| if (!Arrays.stream(nodeChannels.channels).allMatch(this::isOpen)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I this this to be moved to after connectionRef.set(nodeChannels); is called. Otherwise we still have a race condition when things go wrong later on?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I pushed c4ae4c6.
|
@bleskes This is ready for you again. |
| nodeChannels = new NodeChannels(nodeChannels, version); // clone the channels - we now have the correct version | ||
| transportService.onConnectionOpened(nodeChannels); | ||
| connectionRef.set(nodeChannels); | ||
| if (!Arrays.stream(nodeChannels.channels).allMatch(this::isOpen)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we use == false?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed 1445272.
| * | ||
| * @param channel the opened channel | ||
| */ | ||
| protected void onChannelOpen(final Channel channel) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure I understand this change. Why do we have to have this onChannelOpen callback? Is it for testing only? I wonder if we can find a different way to do this, it's a no-op in production
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's for testing so we can hook in and quietly close a channel after it's been successfully opened to trigger the issue. I'm open to suggestions on an alternative approach to achieve the same, but this is the cleanest that I found.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yesterday I had tried to use the connection listener to hook in and disconnect; the problem with this is that it was not reliable because it happens on a background thread and would only sporadically succeed to close a channel in time. @s1monw had the clever and sneaky idea to hook into the executor so we can replace it with one that happens synchronously on the same thread and thus we are ensured it happens before the connection finishes. I have pushed this change.
Tim-Brooks
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So generally LGTM be I did have one concern about if we had address all the potential issues. Also our method contracts are getting kind of complicated and we depend on both nio and netty properly implementing similar connection failure logic.
I'm not sure if we should maybe move to a different contact here at some point. Such as:
protected abstract List<Future<Channel>> connectToChannels(DiscoveryNode node,
ConnectionProfile connectionProfile,
Consumer<Channel> onChannelClose) throws IOException;
In this scenarios the list of futures returned are the connection futures.
And all the implementation does is initiate the connection process and attach the close listener. It is up to TcpTransport to block on a connection and do all the validation ensuring that the connections are setup properly. I'm not sure I've thought through all the implications of that. But it is something to consider in the future.
| final AtomicBoolean first = new AtomicBoolean(); | ||
| service = | ||
| buildService("service", version0, clusterSettings, Settings.EMPTY, true, true, channel -> { | ||
| if (!first.compareAndSet(false, true)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand that this works. But the booleans (as labelled) seem backwards to me.
Isn't it:
final AtomicBoolean first = new AtomicBoolean(true)
...
if (!first.compareAndSet(true, false)) {
}
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed 53a9b2c.
| nodeChannels = new NodeChannels(nodeChannels, version); // clone the channels - we now have the correct version | ||
| transportService.onConnectionOpened(nodeChannels); | ||
| connectionRef.set(nodeChannels); | ||
| if (Arrays.stream(nodeChannels.channels).allMatch(this::isOpen) == false) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this solution solves the majority of the cases where the original issue arises but don't we still have an issue if this check occurs and immediately after it succeeds, concurrently the other side disconnects? The event loop is a different thread so the close listener could still be executed prior the nodes channels being returned from this scope.
In order to be safe don't we need to check if all the channels are still open AFTER we put it in the connectedNodes map in the connectToNode method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's enough to check they are open after we've set the reference to the channels because the problem arises if we close before that reference is set; after that reference is set we are covered by the close listener?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yeah that’s probably right. I was thinking about the connection map that is not manipulated until later.
|
@tbrooks8 I think your suggestion for returning the futures and lifting the channel connection handling logic into |
|
@s1monw This is ready for you. |
* master: Fix handling of paths containing parentheses Allow only a fixed-size receive predictor (elastic#26165) Add Homebrew instructions to getting started ingest: Fix bug that prevent date_index_name processor from accepting timestamps specified as a json number Scripting: Fix expressions to temporarily support filter scripts (elastic#26824) Docs: Add note to contributing docs warning against tool based refactoring (elastic#26936) Fix thread context handling of headers overriding (elastic#26068) SearchWhileCreatingIndexIT: remove usage of _only_nodes
s1monw
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
I discussed with @bleskes via another channel and he is okay with this being merged. |
While opening a connection to a node, a channel can subsequently close. If this happens, a future callback whose purpose is to close all other channels and disconnect from the node will fire. However, this future will not be ready to close all the channels because the connection will not be exposed to the future callback yet. Since this callback is run once, we will never try to disconnect from this node again and we will be left with a closed channel. This commit adds a check that all channels are open before exposing the channel and throws a general connection exception. In this case, the usual connection retry logic will take over. Relates #26932
While opening a connection to a node, a channel can subsequently close. If this happens, a future callback whose purpose is to close all other channels and disconnect from the node will fire. However, this future will not be ready to close all the channels because the connection will not be exposed to the future callback yet. Since this callback is run once, we will never try to disconnect from this node again and we will be left with a closed channel. This commit adds a check that all channels are open before exposing the channel and throws a general connection exception. In this case, the usual connection retry logic will take over. Relates #26932
While opening a connection to a node, a channel can subsequently close. If this happens, a future callback whose purpose is to close all other channels and disconnect from the node will fire. However, this future will not be ready to close all the channels because the connection will not be exposed to the future callback yet. Since this callback is run once, we will never try to disconnect from this node again and we will be left with a closed channel. This commit adds a check that all channels are open before exposing the channel and throws a general connection exception. In this case, the usual connection retry logic will take over. Relates #26932
While opening a connection to a node, a channel can subsequently close. If this happens, a future callback whose purpose is to close all other channels and disconnect from the node will fire. However, this future will not be ready to close all the channels because the connection will not be exposed to the future callback yet. Since this callback is run once, we will never try to disconnect from this node again and we will be left with a closed channel. This commit adds a check that all channels are open before exposing the channel and throws a general connection exception. In this case, the usual connection retry logic will take over.