This is a failure I have seen recently on the zen2 branch. It is quite rare, and I have been unable to reproduce it on master. For instance, on commit b01d321 I ran this in a loop:
./gradlew :server:integTest -Dtests.class=org.elasticsearch.action.admin.indices.stats.IndicesStatsBlocksIT -Dtests.iters=200 -Dtests.failfast=true
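A loop along these lines would do the job. This is only a sketch: `run_until_failure` is a hypothetical helper, not part of the Elasticsearch build, and the attempt limit in the usage comment is an arbitrary choice.

```shell
#!/bin/sh
# Hypothetical helper (not part of the Elasticsearch build): rerun a command
# until it fails or the attempt limit is reached.
run_until_failure() {
  max_attempts="$1"
  shift
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    echo "attempt $attempt"
    # Stop at the first non-zero exit status so the failing run's output is the last thing printed.
    if ! "$@"; then
      echo "failed on attempt $attempt"
      return 1
    fi
    attempt=$((attempt + 1))
  done
  echo "no failure in $max_attempts attempts"
  return 0
}

# Usage (the attempt limit of 20 is an arbitrary choice):
# run_until_failure 20 ./gradlew :server:integTest \
#   -Dtests.class=org.elasticsearch.action.admin.indices.stats.IndicesStatsBlocksIT \
#   -Dtests.iters=200 -Dtests.failfast=true
```

Stopping at the first failure keeps the `REPRODUCE WITH` line and stack trace of the failing run at the end of the output.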
On the 7th time round one of the tests failed with the following stack trace:
2> REPRODUCE WITH: ./gradlew :server:integTest -Dtests.seed=B5E24AD11854BA9B -Dtests.class=org.elasticsearch.action.admin.indices.stats.IndicesStatsBlocksIT -Dtests.method="testIndicesStatsWithBlocks {seed=[B5E24AD11854BA9B:99DFE9CCD7C698F7]}" -Dtests.security.manager=true -Dtests.locale=ig-NG -Dtests.timezone=Africa/Freetown -Dcompiler.java=11 -Druntime.java=11
FAILURE 1.40s | IndicesStatsBlocksIT.testIndicesStatsWithBlocks {seed=[B5E24AD11854BA9B:99DFE9CCD7C698F7]} <<< FAILURES!
> Throwable #1: java.lang.AssertionError: still open connections: {{127.0.0.1:37253}{nNBmVQAAQACCxD28_____w}{127.0.0.1}{127.0.0.1:37253}=[org.elasticsearch.test.transport.StubbableTransport$WrappedConnection@702fbe9b]}
> at __randomizedtesting.SeedInfo.seed([B5E24AD11854BA9B:99DFE9CCD7C698F7]:0)
> at org.elasticsearch.test.transport.MockTransportService.doClose(MockTransportService.java:625)
> at org.elasticsearch.common.component.AbstractLifecycleComponent.close(AbstractLifecycleComponent.java:100)
> at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:103)
> at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:85)
> at org.elasticsearch.node.Node.close(Node.java:862)
> at org.elasticsearch.test.InternalTestCluster$NodeAndClient.closeNode(InternalTestCluster.java:916)
> at org.elasticsearch.test.InternalTestCluster$NodeAndClient.close(InternalTestCluster.java:993)
> at org.elasticsearch.core.internal.io.IOUtils.closeWhileHandlingException(IOUtils.java:145)
> at org.elasticsearch.test.InternalTestCluster.close(InternalTestCluster.java:810)
> at org.elasticsearch.test.ESIntegTestCase.afterInternal(ESIntegTestCase.java:587)
> at org.elasticsearch.test.ESIntegTestCase.cleanUpCluster(ESIntegTestCase.java:2195)
> at jdk.internal.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
> at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.base/java.lang.reflect.Method.invoke(Method.java:566)
> at java.base/java.lang.Thread.run(Thread.java:834)
This seems to occur when nodes are shutting down. In particular, when the master shuts down, the remaining nodes attempt to elect a new master, which first involves reconnecting to each node (not strictly necessary, see below). The open connection in question is one of these probe connections, since the remote node's name is just its transport address {127.0.0.1:37253} and not its real name.
Line 72 in a127805:
// TODO if transportService is already connected to this address then skip the handshaking
As far as I can tell there is machinery in place to prevent this from happening, so I don't think it's specifically a Zen2 issue. Zen2 creates a bunch of new connections when a node shuts down, whereas Zen1 does not, which might be why this doesn't reproduce so easily on master.
@tbrooks8 my main question to you is whether you think this is a problem in the networking infrastructure or a problem with how we're using it in Zen2.