The test SnapshotDisruptionIT.testDisruptionOnSnapshotInitialization() failed on CI today on Windows Server 2012 R2 6.3 amd64/Oracle Corporation 1.8.0_92 (64-bit).
I first thought it was a snapshot/restore regression due to my recent changes in #28078 or #27931, but after looking at the test execution log I'm not so sure.
I wonder whether the cluster ends up in a situation where a listener.onFailure() call causes a StackOverflowError on a network thread, an error that is not caught by the usual Elasticsearch UncaughtExceptionHandler and that in turn causes the NIO selector to be closed and to stop listening for incoming requests.
The test starts 3 master-only nodes and 1 data-only node. Once the cluster is stable, it sets up a snapshot repository and creates a first snapshot to check that everything works correctly. It then sets up a disruption scheme designed to isolate the master node as soon as a snapshot-in-progress entry in INIT state is found in the cluster state. The next step in the test is to create a second snapshot, which triggers the disruption scheme, and to wait for the cluster to elect a new master that terminates this second snapshot.
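For context, the disruption trigger boils down to a cluster state listener that fires as soon as it sees a snapshot entry in INIT state. Roughly something like the sketch below (written from memory of the cluster state API, not the actual test code, so treat the names as approximations):

```java
import org.elasticsearch.cluster.ClusterChangedEvent;
import org.elasticsearch.cluster.ClusterStateListener;
import org.elasticsearch.cluster.SnapshotsInProgress;

// Sketch only: start the network disruption once a snapshot-in-progress entry
// reaches INIT state. The real test wires an equivalent listener into the
// master's cluster service; this is an approximation, not the test code.
public class InitSnapshotTrigger implements ClusterStateListener {

    private final Runnable startDisruption; // e.g. () -> networkDisruption.startDisrupting()

    public InitSnapshotTrigger(Runnable startDisruption) {
        this.startDisruption = startDisruption;
    }

    @Override
    public void clusterChanged(ClusterChangedEvent event) {
        SnapshotsInProgress snapshots = event.state().custom(SnapshotsInProgress.TYPE);
        if (snapshots == null) {
            return;
        }
        boolean hasInitEntry = snapshots.entries().stream()
            .anyMatch(entry -> entry.state() == SnapshotsInProgress.State.INIT);
        if (hasInitEntry) {
            startDisruption.run(); // isolate the current master
        }
    }
}
```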
In this execution's logs, node_tm0, node_tm1 and node_tm2 are started as master-only nodes. node_tm1 is elected as master and adds the data-only node node_td3 to the cluster.
The test runs correctly to the point where the second snapshot is created:
[2018-01-10T05:23:10,388][INFO ][o.e.d.SnapshotDisruptionIT] --> starting disruption
...
[2018-01-10T05:23:10,395][INFO ][o.e.s.SnapshotsService ] [node_tm1] snapshot [test-repo:test-snap-2/vyIDc0GFSIKVtGd--HP_hQ] started
Once the disruption is started, the master node node_tm1 is isolated and the other nodes consider it gone:
[2018-01-10T05:23:12,061][INFO ][o.e.t.d.TestZenDiscovery ] [node_tm0] master_left [{node_tm1}{05tuxEPVT5ypY7Mz6o_VRA}{p_hm4MVNRoOvl-n281Co2A}{127.0.0.1}{127.0.0.1:30101}], reason [failed to ping, tried [1] times, each with maximum [1s] timeout]
...
[2018-01-10T05:23:12,061][INFO ][o.e.t.d.TestZenDiscovery ] [node_tm2] master_left [{node_tm1}{05tuxEPVT5ypY7Mz6o_VRA}{p_hm4MVNRoOvl-n281Co2A}{127.0.0.1}{127.0.0.1:30101}], reason [failed to ping, tried [1] times, each with maximum [1s] timeout]
...
[2018-01-10T05:23:12,125][INFO ][o.e.t.d.TestZenDiscovery ] [node_td3] master_left [{node_tm1}{05tuxEPVT5ypY7Mz6o_VRA}{p_hm4MVNRoOvl-n281Co2A}{127.0.0.1}{127.0.0.1:30101}], reason [failed to ping, tried [1] times, each with maximum [1s] timeout]
So the remaining nodes elect node_tm0 as the new master node. It uses the last committed cluster state, version 25:
[2018-01-10T05:23:16,069][INFO ][o.e.c.s.ClusterApplierService] [node_tm0] new_master {node_tm0}{JvhDBfzERq6UD13DUGBUJg}{Grp3AjfZR4m4Wb4Re-p5MA}{127.0.0.1}{127.0.0.1:30100}, reason: apply cluster state (from master [master {node_tm0}{JvhDBfzERq6UD13DUGBUJg}{Grp3AjfZR4m4Wb4Re-p5MA}{127.0.0.1}{127.0.0.1:30100} committed version [25] source [zen-disco-elected-as-master ([1] nodes joined)[{node_tm2}{nPk7Z6bvRI-MufskXAhGTw}{-NZTnhd1SrSolB7FjkDwcg}{127.0.0.1}{127.0.0.1:30102}]]])
The new master node then updates the cluster state to announce that it is now the master. But the publication of cluster state version 26 is not processed by the old master node, which is still isolated:
[2018-01-10T05:23:17,075][WARN ][o.e.d.z.PublishClusterStateAction] [node_tm0] timed out waiting for all nodes to process published state [26] (timeout [1s], pending nodes: [{node_tm1} ....)
So the new master gives up, removes the old master node from the cluster state and publishes version 27 of the cluster state, in which node_tm1 is removed:
[2018-01-10T05:23:17,089][INFO ][o.e.c.s.ClusterApplierService] [node_td3] removed {{node_tm1}[redacted], reason: apply cluster state (from master [master {node_tm0}[redacted] committed version [27]])
[2018-01-10T05:23:17,089][INFO ][o.e.c.s.ClusterApplierService] [node_tm2] removed {{node_tm1}[redacted], reason: apply cluster state (from master [master {node_tm0}[redacted] committed version [27]])
[2018-01-10T05:23:17,092][INFO ][o.e.c.s.ClusterApplierService] [node_tm0] removed {{node_tm1}[redacted], reason: apply cluster state (from master [master {node_tm0}[redacted] committed version [27] source [zen-disco-node-failed({node_tm1}
And the new master node cleans up the second snapshot as expected:
[2018-01-10T05:23:17,378][INFO ][o.e.s.SnapshotsService ] [node_tm0] snapshot [test-repo:test-snap-2/vyIDc0GFSIKVtGd--HP_hQ] completed with state [SUCCESS]
[2018-01-10T05:23:17,523][INFO ][o.e.s.SnapshotsService ] [node_tm0] snapshot [test-repo:test-snap-2/vyIDc0GFSIKVtGd--HP_hQ] deleted
Up to this point the test behaves as expected, since its primary purpose is to verify that the snapshot is correctly terminated by the new master. Before the test ends, it stops the disruption and waits for the cluster to be stable again:
[2018-01-10T05:23:18,605][INFO ][o.e.d.SnapshotDisruptionIT] --> stopping disrupting
The old master node detects the timeout of its attempt to publish cluster state version 25 (the state in which the second snapshot was STARTED):
[2018-01-10T05:23:40,406][WARN ][o.e.t.d.TestZenDiscovery ] [node_tm1] zen-disco-failed-to-publish,
[2018-01-10T05:23:40,407][WARN ][o.e.c.s.MasterService ] [node_tm1] failing [update_snapshot [test-repo:test-snap-2/vyIDc0GFSIKVtGd--HP_hQ]]: failed to commit cluster state version [25]
And the old master fails the snapshot locally, which is also expected:
[2018-01-10T05:23:40,408][WARN ][o.e.s.SnapshotsService ] [node_tm1] [test-snap-2/vyIDc0GFSIKVtGd--HP_hQ] failed to create snapshot
And the old master rejoins the cluster...
[2018-01-10T05:23:45,588][INFO ][o.e.c.s.MasterService ] [node_tm0] zen-disco-node-join[{node_tm1}
... and this is where things get blurry for me.
It seems that node_tm1 cannot ping the other nodes, and the ESSelector loop is closed:
[2018-01-10T05:23:45,792][WARN ][o.e.t.n.MockNioTransport ] [node_tm1] send message failed [channel: NioSocketChannel{localAddress=/127.0.0.1:30101, remoteAddress=/127.0.0.1:49349}]
1> java.nio.channels.ClosedChannelException: null
1> at org.elasticsearch.nio.NioSocketChannel.closeFromSelector(NioSocketChannel.java:53) ~[elasticsearch-nio-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
1> at org.elasticsearch.nio.EventHandler.handleClose(EventHandler.java:72) ~[elasticsearch-nio-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
1> at org.elasticsearch.nio.ESSelector.closePendingChannels(ESSelector.java:239) ~[elasticsearch-nio-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
1> at org.elasticsearch.nio.ESSelector.cleanupAndCloseChannels(ESSelector.java:135) ~[elasticsearch-nio-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
1> at org.elasticsearch.nio.ESSelector.runLoop(ESSelector.java:81) ~[elasticsearch-nio-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
1> at java.lang.Thread.run(Thread.java:745) [?:1.8.0_92]
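If I read that trace correctly, the selector's run loop has exited and closed every channel it owned on the way out (ESSelector.cleanupAndCloseChannels()), so any later send on those channels can only fail with ClosedChannelException. A generic JDK-NIO sketch of that failure mode (plain java.nio, deliberately not the Elasticsearch nio module) looks like this:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedChannelException;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;
import java.util.Collections;
import java.util.List;

// Generic illustration: if the event-loop thread dies with an uncaught Error,
// its cleanup path closes the selector and every channel it owned, so any
// later write on those channels fails with ClosedChannelException.
public class SelectorLoopSketch {

    public static void main(String[] args) throws Exception {
        Selector selector = Selector.open();
        SocketChannel channel = SocketChannel.open();
        channel.configureBlocking(false);
        List<SocketChannel> ownedChannels = Collections.singletonList(channel);

        Thread loop = new Thread(() -> {
            try {
                selector.select(100);
                // imagine an event handler blowing up here, e.g. with a StackOverflowError
                throw new StackOverflowError("simulated uncaught error in a handler");
            } catch (Throwable t) {
                // mirrors what cleanupAndCloseChannels() does: close everything on the way out
                for (SocketChannel c : ownedChannels) {
                    try { c.close(); } catch (IOException ignored) {}
                }
                try { selector.close(); } catch (IOException ignored) {}
            }
        }, "nio-selector-loop");
        loop.start();
        loop.join();

        try {
            channel.write(ByteBuffer.wrap(new byte[] { 1 })); // channel was closed by the dying loop
        } catch (ClosedChannelException e) {
            System.out.println("send failed: " + e); // same symptom as the "send message failed" above
        }
    }
}
```

That would explain why node_tm1 keeps failing to send messages even after the disruption is stopped.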
From then on, all connection attempts from node_tm1 fail, so the cluster cannot recover to 4 nodes and the test suite times out. I think that all the errors after that point are caused by the test framework trying to stop the nodes.
What I'm worried about is the StackOverflowError in the logs:
jan 10, 2018 5:23:45 AM com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException
WARNING: Uncaught exception in thread: Thread[elasticsearch[node_tm1][es_nio_transport_worker][T#2],5,TGRP-SnapshotDisruptionIT]
java.lang.StackOverflowError
at __randomizedtesting.SeedInfo.seed([F1A2A7B82C4D4B21]:0)
at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$1.onResponse(TransportMasterNodeAction.java:167)
at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$1.onResponse(TransportMasterNodeAction.java:164)
at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$1.onResponse(TransportMasterNodeAction.java:167)
at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$1.onResponse(TransportMasterNodeAction.java:164)
at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$1.onResponse(TransportMasterNodeAction.java:167)
at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$1.onResponse(TransportMasterNodeAction.java:164)
at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$1.onResponse(TransportMasterNodeAction.java:167)
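The alternating line numbers (167/164) look like two ActionListener wrappers that delegate onResponse to each other. I haven't confirmed where such a cycle could be created in TransportMasterNodeAction, but as a pure hypothesis, the pattern that produces this kind of trace is trivially reproducible (plain Java, not the actual Elasticsearch code):

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical illustration only, not the real TransportMasterNodeAction code:
// two listeners that forward onResponse to each other recurse until the stack
// overflows, producing alternating stack frames like the 167/164 pattern above.
public class ListenerLoopSketch {

    interface ActionListener<T> {
        void onResponse(T response);
    }

    public static void main(String[] args) {
        AtomicReference<ActionListener<String>> holderA = new AtomicReference<>();
        AtomicReference<ActionListener<String>> holderB = new AtomicReference<>();

        // Each listener simply forwards to the other one: a delegation cycle.
        ActionListener<String> listenerA = response -> holderB.get().onResponse(response);
        ActionListener<String> listenerB = response -> holderA.get().onResponse(response);
        holderA.set(listenerA);
        holderB.set(listenerB);

        listenerA.onResponse("ok"); // the two onResponse frames alternate until StackOverflowError
    }
}
```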
I'm also worried about the NPEs at the beginning of the test:
[2018-01-10T05:23:03,033][WARN ][o.e.d.z.UnicastZenPing ] [node_tm0] unexpected error while pinging
java.lang.NullPointerException: null
at java.util.stream.Collectors.lambda$toMap$58(Collectors.java:1320) ~[?:1.8.0_92]
at java.util.stream.ReduceOps$3ReducingSink.accept(ReduceOps.java:169) ~[?:1.8.0_92]
at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1374) ~[?:1.8.0_92]
at java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:742) ~[?:1.8.0_92]
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) ~[?:1.8.0_92]
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) ~[?:1.8.0_92]
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) ~[?:1.8.0_92]
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:1.8.0_92]
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499) ~[?:1.8.0_92]
at org.elasticsearch.discovery.zen.UnicastZenPing.sendPings(UnicastZenPing.java:475) ~[main/:?]
at org.elasticsearch.discovery.zen.UnicastZenPing$1.doRun(UnicastZenPing.java:332) [main/:?]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:635) [main/:?]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [main/:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_92]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_92]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_92]
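For what it's worth, the classic way to get a NullPointerException out of Collectors.toMap is a null value: the collector accumulates through Map.merge, which rejects null values. I haven't checked whether that is actually what happens in UnicastZenPing.sendPings, but the failure mode is easy to reproduce and goes through the same toMap/ReduceOps pipeline as the trace above:

```java
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

// Minimal reproduction of the suspected failure mode (not the UnicastZenPing code
// itself): Collectors.toMap accumulates via Map.merge, which throws a
// NullPointerException as soon as the value mapper returns null.
public class ToMapNullValueSketch {

    public static void main(String[] args) {
        Map<String, String> result = Arrays.asList("node_tm0", "node_tm1").stream()
            .collect(Collectors.toMap(
                node -> node,   // key mapper
                node -> null    // value mapper returning null -> NPE in toMap's accumulator
            ));
        System.out.println(result); // never reached
    }
}
```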
@tbrooks8 @bleskes I'm having some trouble digging further into this failure, so I'd be happy to have your opinion / gut feeling on it. Just a quick look would be helpful at this stage.