-
Notifications
You must be signed in to change notification settings - Fork 25.6k
SniffNodesSampler should close connection after handling responses #24632
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
With the current implementation, SniffNodesSampler might close the current connection right after a request is sent but before the response is correctly handled. This causes to timeouts in the transport client when the sniffing is activated. closes elastic#24575 closes elastic#24557
|
@bleskes @jasontedor the PR does not have tests, I created it to point it to you what I think is the cause of #24575 (and also the deleted #24557). I'd be happy if you could confirm or invalidate that this is the cause of transport client exceptions. |
s1monw
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the fix looks good to me, left some comments. good catch.
| Transport.Connection connectionToClose = null; | ||
|
|
||
| @Override | ||
| public void onAfter() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we also need to close the connection in public void onFailure(Exception e) { since we might get rejected or something like this.
| @@ -0,0 +1,7 @@ | |||
| appender.console.type = Console | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is unrelated?
| /** | ||
| * Created by tanguy on 11/05/17. | ||
| */ | ||
| public class TestClient { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is unrelated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this should not have been commited, thanks.
|
|
||
| void closeConnection() { | ||
| IOUtils.closeWhileHandlingException(connectionToClose); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
testing will be tricky but doable. I have some ideas here similar to what I did on RemoteClusterConnectionTests where we basically mock the calls to clusterstate and return a pre-build state but we can also put some sleeps into it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wrote a test which would have failed before the fix. That would be great if you can have a look.
|
one thing that I am puzzled about is why this causes timeouts instead of triggering onException on the handler? I think the reason is that we don't notify the TransportService when a connection is closed but only if a node is disconnected that is a different bug here. Both are independent and should be handled independently. so I think your fix is sufficient for the issues referenced in the description. |
…port handlers Today we prune transport handlers in TransporService when a node is disconnected. This can cause connections to starve in the TransportService if the connection is opened as a short living connection ie. without sharing the connection to a node via registering in the transport itself. This change now moves to pruning based on the connections cache key to ensure we notify handlers as soon as the connection is closed for all connections not just for registered connections. Relates to elastic#24632 Relates to elastic#24575 Relates to elastic#24557
|
I opened #24639 for the notification part |
|
I also marked this as a blocker for 5.4.1 |
bleskes
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great catch. LGTM (although I left one suggestion).
| public void handleResponse(ClusterStateResponse response) { | ||
| clusterStateResponses.put(nodeToPing, response); | ||
| latch.countDown(); | ||
| closeConnection(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we should unify the latch.countDown() and closeConnection() into a single method called "onDone" on the AbstractRunnable that everyone calls? this it's less trappy and people wouldn't forget to do one but not the other.
I agree - I didn't spot this problem but my knowledge of the TransportService is limited, I'm glad you already proposed a fix. I updated the PR according to your comments. |
s1monw
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 2 thx for the test
…port handlers (#24639) Today we prune transport handlers in TransportService when a node is disconnected. This can cause connections to starve in the TransportService if the connection is opened as a short living connection ie. without sharing the connection to a node via registering in the transport itself. This change now moves to pruning based on the connections cache key to ensure we notify handlers as soon as the connection is closed for all connections not just for registered connections. Relates to #24632 Relates to #24575 Relates to #24557
…port handlers (#24639) Today we prune transport handlers in TransportService when a node is disconnected. This can cause connections to starve in the TransportService if the connection is opened as a short living connection ie. without sharing the connection to a node via registering in the transport itself. This change now moves to pruning based on the connections cache key to ensure we notify handlers as soon as the connection is closed for all connections not just for registered connections. Relates to #24632 Relates to #24575 Relates to #24557
…port handlers (#24639) Today we prune transport handlers in TransportService when a node is disconnected. This can cause connections to starve in the TransportService if the connection is opened as a short living connection ie. without sharing the connection to a node via registering in the transport itself. This change now moves to pruning based on the connections cache key to ensure we notify handlers as soon as the connection is closed for all connections not just for registered connections. Relates to #24632 Relates to #24575 Relates to #24557
| @Override | ||
| public void onAfter() { | ||
| void onDone() { | ||
| IOUtils.closeWhileHandlingException(connectionToClose); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we call the latch in a finally block just to be absolutely sure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure
|
I'm getting exactly this error with the version 5.4.1 across the board. What's the quickest way for me to recover? |
|
@konste I just ran another test this morning with a fresh 5.4.1 installation and a PreBuiltTransportClient with sniff option set to true and everything worked as expected (while the error was really obvious and appears at startup time). Can you please provide the logs of both transport client and elasticsearch node please? As well as the transport client settings? |
|
@tlrx Sorry I had to restore functionality ASAP and lost the repro. |
With the current implementation,
SniffNodesSamplermight close thecurrent connection right after a request is sent but before the response
is correctly handled. This causes to timeouts in the transport client
when the sniffing is activated in all versions since #22828.
closes #24575
closes #24557