SniffNodesSampler should close connection after handling responses #24632

tlrx · 2017-05-11T20:53:30Z

With the current implementation, SniffNodesSampler might close the
current connection right after a request is sent but before the response
is correctly handled. This causes timeouts in the transport client
when sniffing is enabled, in all versions since #22828.

closes #24575
closes #24557
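The ordering problem described above can be illustrated with a minimal, self-contained sketch. It does not use any Elasticsearch classes: `FakeConnection`, `send`, and `deliver` are hypothetical stand-ins, and the assumption is simply that a response arriving on an already-closed connection is dropped, so the caller would only ever see a timeout.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-ins for illustration only; not the Elasticsearch transport API.
final class CloseAfterResponseSketch {

    /** A fake connection that only delivers a response while it is still open. */
    static final class FakeConnection implements AutoCloseable {
        private final AtomicBoolean open = new AtomicBoolean(true);

        /** "Sends" a request and returns a future that deliver() may complete later. */
        CompletableFuture<String> send(String request) {
            return new CompletableFuture<>();
        }

        /** Simulates the response arriving; it is silently dropped if the connection is closed. */
        void deliver(CompletableFuture<String> pending, String response) {
            if (open.get()) {
                pending.complete(response);
            }
        }

        boolean isOpen() {
            return open.get();
        }

        @Override
        public void close() {
            open.set(false);
        }
    }

    /** Problematic order: close right after sending. Returns whether the response was handled. */
    static boolean closeRightAfterSend() {
        FakeConnection connection = new FakeConnection();
        CompletableFuture<String> pending = connection.send("cluster-state");
        connection.close();                    // closed before the response arrives
        connection.deliver(pending, "state");  // dropped, the future never completes
        return pending.isDone();
    }

    /** Fixed order: the connection is closed from the response callback. */
    static boolean closeAfterHandling() {
        FakeConnection connection = new FakeConnection();
        CompletableFuture<String> pending = connection.send("cluster-state");
        CompletableFuture<String> handled =
            pending.whenComplete((response, error) -> connection.close());
        connection.deliver(pending, "state");
        return handled.isDone() && !connection.isOpen();
    }
}
```

In this sketch `closeRightAfterSend()` leaves the request pending forever, while `closeAfterHandling()` both handles the response and releases the connection.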

tlrx · 2017-05-11T20:56:22Z

@bleskes @jasontedor the PR does not have tests; I created it to show you what I think is the cause of #24575 (and also the deleted #24557). I'd be happy if you could confirm or rule out that this is the cause of the transport client exceptions.

s1monw

the fix looks good to me, left some comments. good catch.

s1monw · 2017-05-12T06:02:29Z

core/src/main/java/org/elasticsearch/client/transport/TransportClientNodesService.java

                        Transport.Connection connectionToClose = null;

-                        @Override
-                        public void onAfter() {


we also need to close the connection in public void onFailure(Exception e), since we might get rejected or something like that.
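A rough sketch of why the failure path matters, assuming a task abstraction with separate `doRun` and `onFailure` callbacks. `FailureAwareTask` and `submit` are hypothetical names used for illustration, not the Elasticsearch `AbstractRunnable` itself. When the executor rejects the task, `doRun` never executes, so any cleanup placed only on the success path is skipped.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch; the names below are not taken from the PR.
final class CloseOnFailureSketch {

    /** A task with an explicit failure callback. */
    interface FailureAwareTask {
        void doRun() throws Exception;
        void onFailure(Exception e);
    }

    /** Runs the task on the executor and routes both rejections and exceptions to onFailure. */
    static void submit(ExecutorService executor, FailureAwareTask task) {
        try {
            executor.execute(() -> {
                try {
                    task.doRun();
                } catch (Exception e) {
                    task.onFailure(e);
                }
            });
        } catch (RejectedExecutionException e) {
            task.onFailure(e); // the task body never ran
        }
    }

    /** Returns true if the resource was released even though the task was rejected. */
    static boolean closedAfterRejection() {
        AtomicBoolean closed = new AtomicBoolean(false);
        ExecutorService executor = Executors.newSingleThreadExecutor();
        executor.shutdown(); // a shut-down executor rejects every new task
        submit(executor, new FailureAwareTask() {
            @Override
            public void doRun() {
                // would send the request here
            }

            @Override
            public void onFailure(Exception e) {
                closed.set(true); // stands in for closing the connection
            }
        });
        return closed.get();
    }
}
```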

s1monw · 2017-05-12T06:02:38Z

client/transport/src/main/resources/log4j2.properties

@@ -0,0 +1,7 @@
+appender.console.type = Console


this is unrelated?

s1monw · 2017-05-12T06:02:42Z

client/transport/src/main/java/TestClient.java

+/**
+ * Created by tanguy on 11/05/17.
+ */
+public class TestClient {


this is unrelated?

Yes, this should not have been committed, thanks.

s1monw · 2017-05-12T06:04:13Z

core/src/main/java/org/elasticsearch/client/transport/TransportClientNodesService.java

+
+                                    void closeConnection() {
+                                        IOUtils.closeWhileHandlingException(connectionToClose);
+                                    }


testing will be tricky but doable. I have some ideas here, similar to what I did in RemoteClusterConnectionTests, where we basically mock the calls to the cluster state and return a pre-built state, but we can also put some sleeps into it.

I wrote a test which would have failed before the fix. It would be great if you could have a look.

s1monw · 2017-05-12T06:10:36Z

one thing that I am puzzled about is why this causes timeouts instead of triggering onException on the handler. I think the reason is that we don't notify the TransportService when a connection is closed, but only when a node is disconnected; that is a different bug. The two are independent and should be handled independently, so I think your fix is sufficient for the issues referenced in the description.
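To make the distinction concrete, here is a small sketch of tracking pending handlers per connection, so that closing a connection fails its waiting handlers immediately instead of leaving them to time out. This is an illustration under assumptions, not the TransportService implementation: the class, the string connection ids, and the `Consumer<Exception>` callbacks are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical sketch; not the actual TransportService bookkeeping.
final class PendingHandlersSketch {

    // connection id -> failure callbacks of requests still waiting for a response
    private final Map<String, List<Consumer<Exception>>> pending = new HashMap<>();

    /** Registers a failure callback for a request sent over the given connection. */
    void register(String connectionId, Consumer<Exception> onException) {
        pending.computeIfAbsent(connectionId, id -> new ArrayList<>()).add(onException);
    }

    /** Fails every handler waiting on the closed connection; returns how many were notified. */
    int onConnectionClosed(String connectionId) {
        List<Consumer<Exception>> handlers = pending.remove(connectionId);
        if (handlers == null) {
            return 0;
        }
        for (Consumer<Exception> handler : handlers) {
            handler.accept(new IllegalStateException("connection closed: " + connectionId));
        }
        return handlers.size();
    }
}
```

Keying by connection rather than by node means short-lived connections that were never registered for a node still get their handlers notified.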

…port handlers Today we prune transport handlers in TransportService when a node is disconnected. This can cause connections to starve in the TransportService if the connection is opened as a short-lived connection, i.e. without sharing the connection to a node via registering in the transport itself. This change now moves to pruning based on the connection's cache key to ensure we notify handlers as soon as the connection is closed, for all connections and not just for registered connections. Relates to elastic#24632 Relates to elastic#24575 Relates to elastic#24557

s1monw · 2017-05-12T07:16:43Z

I opened #24639 for the notification part

s1monw · 2017-05-12T07:28:44Z

I also marked this as a blocker for 5.4.1

bleskes

Great catch. LGTM (although I left one suggestion).

bleskes · 2017-05-12T09:44:42Z

core/src/main/java/org/elasticsearch/client/transport/TransportClientNodesService.java

                                    public void handleResponse(ClusterStateResponse response) {
                                        clusterStateResponses.put(nodeToPing, response);
                                        latch.countDown();
+                                        closeConnection();


maybe we should unify the latch.countDown() and closeConnection() into a single method called "onDone" on the AbstractRunnable that everyone calls? This way it's less trappy and people wouldn't forget to do one but not the other.
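A minimal sketch of that suggestion, assuming a per-node task that owns a connection and shares a latch with its siblings. `PingTask` and its constructor are hypothetical; only the name `onDone` comes from the comment above. Every completion path goes through one method, so neither step can be forgotten.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.concurrent.CountDownLatch;

// Hypothetical sketch; only the method name onDone is taken from the review comment.
final class OnDoneSketch {

    static final class PingTask {
        private final Closeable connection;
        private final CountDownLatch latch;

        PingTask(Closeable connection, CountDownLatch latch) {
            this.connection = connection;
            this.latch = latch;
        }

        /** Single exit point: release the connection, then signal completion. */
        void onDone() {
            try {
                connection.close();
            } catch (IOException e) {
                // ignored in this sketch
            }
            latch.countDown();
        }

        void handleResponse(String response) {
            // record the response here, then finish
            onDone();
        }

        void handleException(Exception e) {
            // log the failure here, then finish
            onDone();
        }
    }
}
```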

tlrx · 2017-05-12T12:49:51Z

one thing that I am puzzled about is why this causes timeouts instead of triggering onException on the handler. I think the reason is that we don't notify the TransportService when a connection is closed, but only when a node is disconnected; that is a different bug. The two are independent and should be handled independently, so I think your fix is sufficient for the issues referenced in the description.

I agree - I didn't spot this problem but my knowledge of the TransportService is limited, I'm glad you already proposed a fix.

I updated the PR according to your comments.

s1monw

LGTM too, thanks for the test

…port handlers (#24639) Today we prune transport handlers in TransportService when a node is disconnected. This can cause connections to starve in the TransportService if the connection is opened as a short-lived connection, i.e. without sharing the connection to a node via registering in the transport itself. This change now moves to pruning based on the connection's cache key to ensure we notify handlers as soon as the connection is closed, for all connections and not just for registered connections. Relates to #24632 Relates to #24575 Relates to #24557

s1monw · 2017-05-12T14:27:15Z

core/src/main/java/org/elasticsearch/client/transport/TransportClientNodesService.java

-                        @Override
-                        public void onAfter() {
+                        void onDone() {
                            IOUtils.closeWhileHandlingException(connectionToClose);


can we count down the latch in a finally block, just to be absolutely sure?
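The point of the finally block can be sketched as follows, assuming a free-standing `onDone` helper (hypothetical signature, not the code in the PR). Even if closing throws something unexpected, the latch is still counted down, so a thread waiting on it is not left blocked until its timeout.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.concurrent.CountDownLatch;

// Hypothetical sketch of the close-then-count-down ordering with a finally block.
final class FinallyLatchSketch {

    static void onDone(Closeable connection, CountDownLatch latch) {
        try {
            connection.close();
        } catch (IOException e) {
            // ignored in this sketch
        } finally {
            // runs even when close() throws an unchecked exception
            latch.countDown();
        }
    }
}
```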

tlrx · 2017-05-12T14:39:07Z

Thanks @s1monw @bleskes

…24632) With the current implementation, SniffNodesSampler might close the current connection right after a request is sent but before the response is correctly handled. This causes timeouts in the transport client when sniffing is enabled. closes #24575 closes #24557

konste · 2017-06-07T02:39:41Z

I'm getting exactly this error with the version 5.4.1 across the board. What's the quickest way for me to recover?

tlrx · 2017-06-07T07:00:16Z

@konste I just ran another test this morning with a fresh 5.4.1 installation and a PreBuiltTransportClient with the sniff option set to true, and everything worked as expected (whereas the error was really obvious and appeared at startup time).

Can you please provide the logs of both the transport client and the Elasticsearch node, as well as the transport client settings?

konste · 2017-06-07T14:52:19Z

@tlrx Sorry, I had to restore functionality ASAP and lost the repro.

tlrx added :Distributed Coordination/Network Http and internode communication implementations review v5.4.1 v5.5.0 v6.0.0-alpha1 labels May 11, 2017

s1monw suggested changes May 12, 2017

s1monw mentioned this pull request May 12, 2017

Notify onConnectionClosed rather than onNodeDisconnect to prune transport handlers #24639

Merged

s1monw added the blocker label May 12, 2017

bleskes approved these changes May 12, 2017

Apply feedback

8b7d5c8

s1monw approved these changes May 12, 2017

Close connection before counting down the latch

8843201

s1monw reviewed May 12, 2017

add finally block

1cc3f1c

tlrx merged commit f8df2a2 into elastic:master May 12, 2017

tlrx deleted the closing-connection-in-sniffer branch May 15, 2017 09:18

clintongormley added the >bug label May 16, 2017

spring-projects-issues mentioned this pull request Dec 31, 2020

Using spring-boot-starter-web jar to connect elastic search is causing issue NoNodeAvailableException [DATAES-339] spring-projects/spring-data-elasticsearch#913

Closed
