@tlrx
Member

@tlrx tlrx commented May 11, 2017

With the current implementation, SniffNodesSampler might close the
current connection right after a request is sent but before the response
is correctly handled. This causes timeouts in the transport client
when sniffing is enabled, in all versions since #22828.

closes #24575
closes #24557
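
The ordering problem described above can be illustrated with a small, self-contained simulation. This is only a sketch and not the Elasticsearch code: `SniffCloseOrdering`, `sample` and the boolean "connection" flag are hypothetical stand-ins, and the latch is there only to make the ordering deterministic. It shows that if the sender closes the connection right after sending, the response can no longer be handled, whereas closing from the handler side lets the response through.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class SniffCloseOrdering {

    /**
     * Simulates one sampling round. The "response" is delivered on another
     * thread only after the sender has finished its post-send work.
     * Returns true if the response was handled on a still-open connection.
     */
    static boolean sample(boolean closeRightAfterSend) {
        ExecutorService remote = Executors.newSingleThreadExecutor();
        AtomicBoolean connectionClosed = new AtomicBoolean();   // hypothetical connection state
        AtomicBoolean responseHandled = new AtomicBoolean();
        CountDownLatch requestSent = new CountDownLatch(1);
        CountDownLatch done = new CountDownLatch(1);
        try {
            remote.execute(() -> {
                try {
                    requestSent.await();                        // response arrives "later"
                    if (!connectionClosed.get()) {
                        responseHandled.set(true);              // handler sees the response
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    connectionClosed.set(true);                 // close once the handler is done
                    done.countDown();
                }
            });
            if (closeRightAfterSend) {
                connectionClosed.set(true);                     // problematic: close right after send
            }
            requestSent.countDown();
            done.await(5, TimeUnit.SECONDS);
            return responseHandled.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            remote.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("close right after send: handled=" + sample(true));
        System.out.println("close from the handler: handled=" + sample(false));
    }
}
```

In the real client the lost response shows up as a timeout rather than a boolean, but the ordering issue is the same.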

@tlrx tlrx added :Distributed Coordination/Network Http and internode communication implementations review v5.4.1 v5.5.0 v6.0.0-alpha1 labels May 11, 2017
@tlrx
Member Author

tlrx commented May 11, 2017

@bleskes @jasontedor the PR does not have tests; I created it to show you what I think is the cause of #24575 (and also the deleted #24557). I'd be happy if you could confirm or invalidate that this is the cause of the transport client exceptions.

Contributor

@s1monw s1monw left a comment

the fix looks good to me, left some comments. good catch.

Transport.Connection connectionToClose = null;

@Override
public void onAfter() {
Contributor

we also need to close the connection in public void onFailure(Exception e), since we might get rejected or something like that.

@@ -0,0 +1,7 @@
appender.console.type = Console
Contributor

this is unrelated?

/**
* Created by tanguy on 11/05/17.
*/
public class TestClient {
Contributor

this is unrelated?

Member Author

Yes, this should not have been committed, thanks.


void closeConnection() {
IOUtils.closeWhileHandlingException(connectionToClose);
}
Contributor

testing will be tricky but doable. I have some ideas here, similar to what I did in RemoteClusterConnectionTests, where we basically mock the calls to cluster state and return a pre-built state, but we can also put some sleeps into it.

Member Author

I wrote a test which would have failed before the fix. It would be great if you could have a look.

@s1monw
Contributor

s1monw commented May 12, 2017

one thing that I am puzzled about is why this causes timeouts instead of triggering onException on the handler. I think the reason is that we don't notify the TransportService when a connection is closed, but only when a node is disconnected. That is a different bug. Both are independent and should be handled independently, so I think your fix is sufficient for the issues referenced in the description.

s1monw added a commit to s1monw/elasticsearch that referenced this pull request May 12, 2017
…port handlers

Today we prune transport handlers in TransportService when a node is disconnected.
This can cause connections to starve in the TransportService if the connection is
opened as a short-lived connection, i.e. without sharing the connection to a node
via registering it in the transport itself. This change moves to pruning based
on the connection's cache key to ensure we notify handlers as soon as the connection
is closed, for all connections and not just for registered connections.

Relates to elastic#24632
Relates to elastic#24575
Relates to elastic#24557
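
The idea in this commit message, notifying pending handlers when their connection is closed rather than only when a node disconnects, can be sketched as follows. This is an illustrative assumption and not the actual TransportService code: `PendingHandlers`, `ResponseHandler`, `register` and `onConnectionClosed` are hypothetical names, and a plain `Object` stands in for the connection's cache key.

```java
import java.io.IOException;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

public class PendingHandlers {

    /** Hypothetical callback pair for one in-flight request. */
    interface ResponseHandler {
        void handleResponse(String response);
        void handleException(Exception e);
    }

    /** An in-flight request, remembered together with the connection it was sent on. */
    private static final class Pending {
        final Object connectionKey;
        final ResponseHandler handler;

        Pending(Object connectionKey, ResponseHandler handler) {
            this.connectionKey = connectionKey;
            this.handler = handler;
        }
    }

    private final Map<Long, Pending> pending = new ConcurrentHashMap<>();
    private final AtomicLong requestIds = new AtomicLong();

    long register(Object connectionKey, ResponseHandler handler) {
        long id = requestIds.incrementAndGet();
        pending.put(id, new Pending(connectionKey, handler));
        return id;
    }

    void onResponse(long requestId, String response) {
        Pending p = pending.remove(requestId);
        if (p != null) {
            p.handler.handleResponse(response);
        }
    }

    /** Fails every handler registered on the closed connection instead of letting it time out. */
    void onConnectionClosed(Object connectionKey) {
        for (Map.Entry<Long, Pending> entry : pending.entrySet()) {
            Pending p = entry.getValue();
            // remove(key, value) ensures each handler is notified at most once
            if (p.connectionKey.equals(connectionKey) && pending.remove(entry.getKey(), p)) {
                p.handler.handleException(new IOException("connection closed"));
            }
        }
    }

    int pendingCount() {
        return pending.size();
    }

    /** Test helper: a handler that records which callback was invoked. */
    static ResponseHandler recording(String name, List<String> events) {
        return new ResponseHandler() {
            @Override
            public void handleResponse(String response) {
                events.add(name + ":response");
            }

            @Override
            public void handleException(Exception e) {
                events.add(name + ":exception");
            }
        };
    }
}
```

Keying on the connection rather than on the node means that short-lived, unregistered connections are covered as well, which is the case the sniffer runs into.
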
@s1monw
Contributor

s1monw commented May 12, 2017

I opened #24639 for the notification part

@s1monw s1monw added the blocker label May 12, 2017
@s1monw
Contributor

s1monw commented May 12, 2017

I also marked this as a blocker for 5.4.1

Contributor

@bleskes bleskes left a comment

Great catch. LGTM (although I left one suggestion).

public void handleResponse(ClusterStateResponse response) {
clusterStateResponses.put(nodeToPing, response);
latch.countDown();
closeConnection();
Contributor

maybe we should unify the latch.countDown() and closeConnection() into a single method called "onDone" on the AbstractRunnable that everyone calls? this way it's less trappy and people wouldn't forget to do one but not the other.

@tlrx
Member Author

tlrx commented May 12, 2017

one thing that I am puzzled about is why this causes timeouts instead of triggering onException on the handler. I think the reason is that we don't notify the TransportService when a connection is closed, but only when a node is disconnected. That is a different bug. Both are independent and should be handled independently, so I think your fix is sufficient for the issues referenced in the description.

I agree - I didn't spot this problem but my knowledge of the TransportService is limited, I'm glad you already proposed a fix.

I updated the PR according to your comments.

Contributor

@s1monw s1monw left a comment

LGTM 2 thx for the test

s1monw added a commit that referenced this pull request May 12, 2017
…port handlers (#24639)

Today we prune transport handlers in TransportService when a node is disconnected.
This can cause connections to starve in the TransportService if the connection is
opened as a short-lived connection, i.e. without sharing the connection to a node
via registering it in the transport itself. This change moves to pruning based
on the connection's cache key to ensure we notify handlers as soon as the connection
is closed, for all connections and not just for registered connections.

Relates to #24632
Relates to #24575
Relates to #24557
@Override
public void onAfter() {
void onDone() {
IOUtils.closeWhileHandlingException(connectionToClose);
Contributor

can we count down the latch in a finally block, just to be absolutely sure?

Member Author

Sure
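
A minimal sketch of what the two review suggestions amount to, a single onDone that every completion path calls and that counts down the latch in a finally block, might look like this. It is not the merged code: `OnDoneSketch` and `NodeTask` are hypothetical, a plain `Closeable` stands in for Transport.Connection, and the try/catch stands in for IOUtils.closeWhileHandlingException.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.concurrent.CountDownLatch;

public class OnDoneSketch {

    /** Hypothetical per-node task: the latch is shared by all tasks, the connection is per task. */
    static final class NodeTask {
        private final CountDownLatch latch;
        private final Closeable connectionToClose;

        NodeTask(CountDownLatch latch, Closeable connectionToClose) {
            this.latch = latch;
            this.connectionToClose = connectionToClose;
        }

        // Every completion path funnels into onDone, so none of them can
        // forget to close the connection or to count down the latch.
        void handleResponse(String response) {
            onDone();
        }

        void handleException(Exception e) {
            onDone();
        }

        void onFailure(Exception e) {
            onDone();
        }

        void onDone() {
            try {
                if (connectionToClose != null) {
                    connectionToClose.close();
                }
            } catch (IOException e) {
                // swallowed on purpose: a failed close must not fail the task
            } finally {
                latch.countDown(); // always runs, even if close() throws
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch latch = new CountDownLatch(1);
        new NodeTask(latch, () -> System.out.println("connection closed")).handleResponse("state");
        latch.await();
        System.out.println("all nodes sampled");
    }
}
```

With the countDown in finally, a thread waiting on the latch is released even when closing the connection fails.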

@tlrx tlrx merged commit f8df2a2 into elastic:master May 12, 2017
@tlrx
Member Author

tlrx commented May 12, 2017

Thanks @s1monw @bleskes

tlrx added a commit that referenced this pull request May 12, 2017
…24632)

With the current implementation, SniffNodesSampler might close the
current connection right after a request is sent but before the response
is correctly handled. This causes timeouts in the transport client
when sniffing is enabled.

closes #24575
closes #24557
@tlrx tlrx deleted the closing-connection-in-sniffer branch May 15, 2017 09:18
@konste

konste commented Jun 7, 2017

I'm getting exactly this error with the version 5.4.1 across the board. What's the quickest way for me to recover?

@tlrx
Member Author

tlrx commented Jun 7, 2017

@konste I just ran another test this morning with a fresh 5.4.1 installation and a PreBuiltTransportClient with the sniff option set to true, and everything worked as expected (whereas the error was really obvious and appeared at startup time).

Can you please provide the logs of both the transport client and the Elasticsearch node, as well as the transport client settings?

@konste

konste commented Jun 7, 2017

@tlrx Sorry I had to restore functionality ASAP and lost the repro.
