Conversation

@obourgain
Contributor

There seems to be a connection leak in the TransportClient when using removeTransportAddress()

Stating the problem

When adding an address with TransportClient.addTransportAddresses(), the TransportClientNodesService adds a DiscoveryNode to listedNodes and ends up calling nodeSampler.sample(); in my case the sampler is a SimpleNodeSampler.
For a new address (not already connected), the sampling performs a 'light connection' with TransportService.connectToNodeLight(), which delegates to a Transport instance. In my case the Transport implementation is NettyTransport, which opens a Channel to the target node.

So at this point, the newly created DiscoveryNode is referenced in the listedNodes list of the TransportClientNodesService, and a Channel is cached in the NettyTransport's connectedNodes map.
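For reference, here is a minimal sketch of the client-side calls involved, assuming the 2.4-era TransportClient API; the address below is illustrative:

```java
import java.net.InetAddress;

import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

// Minimal sketch (2.4-era API); the address is illustrative.
TransportClient client = TransportClient.builder().build();
InetSocketTransportAddress address =
        new InetSocketTransportAddress(InetAddress.getByName("10.0.0.1"), 9300);

// TransportClientNodesService adds a DiscoveryNode to listedNodes, and the
// SimpleNodeSampler opens a light connection (a Channel in NettyTransport).
client.addTransportAddress(address);

// The DiscoveryNode leaves listedNodes, but the light connection's Channel
// stays cached in NettyTransport's connectedNodes map -> leaked connection.
client.removeTransportAddress(address);
```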

When removing an address from the TransportClient, the TransportClientNodesService is doing the following:

  1. removing the address from the listedNodes
  2. removing the address from the nodes and disconnecting
  3. calling the NodeSampler to refresh

What I believe is missing is closing the connection when the node is removed from listedNodes, because an open Channel remains cached in NettyTransport.
This PR fixes that by asking the Transport to disconnect from the listed node upon removal, which closes the light connection.
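To make the intent concrete, here is a rough sketch of what the change amounts to; it is not the exact diff of this PR, and the helper method and variable names are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;

import org.elasticsearch.cluster.node.DiscoveryNode;
import org.elasticsearch.common.transport.TransportAddress;
import org.elasticsearch.transport.TransportService;

// Hypothetical helper illustrating the idea: when dropping a node from listedNodes,
// also ask the TransportService to disconnect so the cached light connection is closed.
static List<DiscoveryNode> removeListedNode(List<DiscoveryNode> listedNodes,
                                            TransportAddress addressToRemove,
                                            TransportService transportService) {
    List<DiscoveryNode> remaining = new ArrayList<>();
    for (DiscoveryNode node : listedNodes) {
        if (node.getAddress().equals(addressToRemove)) {
            // closes the Channel cached in NettyTransport's connectedNodes map
            transportService.disconnectFromNode(node);
        } else {
            remaining.add(node);
        }
    }
    return remaining;
}
```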

How to reproduce

I am using an Elastic Cloud cluster, periodically resolving its DNS name and adding/removing addresses on the TransportClient accordingly.
For each address removed, a connection is leaked; this is visible with tools like lsof or with a debugger.
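A rough sketch of that reconciliation step, with a hypothetical host name and helper method, looks like this (run periodically, e.g. from a scheduled task):

```java
import java.net.InetAddress;
import java.util.HashSet;
import java.util.Set;

import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.elasticsearch.common.transport.TransportAddress;

// Hypothetical reconciliation step: align the client's address list with the current DNS answer.
static void syncAddresses(TransportClient client, String clusterHost, int port) throws Exception {
    Set<TransportAddress> wanted = new HashSet<>();
    for (InetAddress ip : InetAddress.getAllByName(clusterHost)) {
        wanted.add(new InetSocketTransportAddress(ip, port));
    }
    Set<TransportAddress> current = new HashSet<>(client.transportAddresses());

    for (TransportAddress address : wanted) {
        if (!current.contains(address)) {
            client.addTransportAddress(address);
        }
    }
    for (TransportAddress address : current) {
        if (!wanted.contains(address)) {
            // without the fix, the light connection opened for this address stays cached
            // in NettyTransport -> one leaked socket per removed address
            client.removeTransportAddress(address);
        }
    }
}
```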

Closing note

I am using Elasticsearch 2.4, and while I know that upgrading to 5 is the way forward, I don't have time to do it right now. I would therefore appreciate a backport of this fix to the 2.4 branch, to avoid having to build and deploy a fork of the client.

@elasticmachine
Collaborator

Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?

@bleskes bleskes self-assigned this Sep 6, 2017
@rjernst
Member

rjernst commented Sep 7, 2017

@elasticmachine test this please

@obourgain
Contributor Author

After further investigation on the 5.x branch, the issue appears to have already been fixed there.
Looking at the code, it was fixed in #22828, so my PR is not relevant for master. Still, I would appreciate it being merged into 2.4, to avoid having to maintain a custom fork until I can migrate to 5.x.

@javanna
Member

javanna commented Nov 6, 2017

Sorry, but we don't plan on backporting #22828 to 2.4. Thanks a lot for your great analysis and for sending your PR. Closing, as this has already been fixed from 5.4 onwards.

@javanna javanna closed this Nov 6, 2017