The test SnapshotDisruptionIT.testDisruptionOnSnapshotInitialization() failed on CI today on Windows Server 2012 R2 6.3 amd64/Oracle Corporation 1.8.0_92 (64-bit).
I first thought it was a snapshot/restore regression due to my recent changes in #28078 or #27931, but after looking at the test execution log I'm not so sure.
I wonder whether the cluster ends up in a situation where a listener.onFailure() call causes a StackOverflowError on a network thread, an error that is not caught by the usual Elasticsearch UncaughtExceptionHandler and that in turn causes the NIO selector to be closed and to stop listening for incoming requests.
The test starts 3 master-only nodes and 1 data-only node. Once the cluster is stable, it sets up a snapshot repository and creates a first snapshot to check that everything works correctly. It then sets up a disruption scheme designed to isolate the master node as soon as a snapshot-in-progress entry in INIT state is found in the cluster state. The next step in the test is to create a second snapshot, which triggers the disruption scheme, and to wait for the cluster to elect a new master that terminates this second snapshot.
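For context, the disruption trigger boils down to a cluster state listener that fires as soon as it sees a snapshot entry in INIT state. Roughly something like the sketch below (written from memory of the cluster state API, not the actual test code, so treat the names as approximations):

```java
import org.elasticsearch.cluster.ClusterChangedEvent;
import org.elasticsearch.cluster.ClusterStateListener;
import org.elasticsearch.cluster.SnapshotsInProgress;

// Sketch only: start the network disruption once a snapshot-in-progress entry
// reaches INIT state. The real test wires an equivalent listener into the
// master's cluster service; this is an approximation, not the test code.
public class InitSnapshotTrigger implements ClusterStateListener {

    private final Runnable startDisruption; // e.g. () -> networkDisruption.startDisrupting()

    public InitSnapshotTrigger(Runnable startDisruption) {
        this.startDisruption = startDisruption;
    }

    @Override
    public void clusterChanged(ClusterChangedEvent event) {
        SnapshotsInProgress snapshots = event.state().custom(SnapshotsInProgress.TYPE);
        if (snapshots == null) {
            return;
        }
        boolean hasInitEntry = snapshots.entries().stream()
            .anyMatch(entry -> entry.state() == SnapshotsInProgress.State.INIT);
        if (hasInitEntry) {
            startDisruption.run(); // isolate the current master
        }
    }
}
```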
In this execution's logs, node_tm0, node_tm1 and node_tm2 are started as master-only nodes. node_tm1 is elected as master and adds the data-only node node_td3 to the cluster.
The test runs correctly to the point where the second snapshot is created:
[2018-01-10T05:23:10,388][INFO ][o.e.d.SnapshotDisruptionIT] --> starting disruption
...
[2018-01-10T05:23:10,395][INFO ][o.e.s.SnapshotsService ] [node_tm1] snapshot [test-repo:test-snap-2/vyIDc0GFSIKVtGd--HP_hQ] started
Once the disruption is started, the master node node_tm1 is isolated and the other nodes consider it gone:
[2018-01-10T05:23:12,061][INFO ][o.e.t.d.TestZenDiscovery ] [node_tm0] master_left [{node_tm1}{05tuxEPVT5ypY7Mz6o_VRA}{p_hm4MVNRoOvl-n281Co2A}{127.0.0.1}{127.0.0.1:30101}], reason [failed to ping, tried [1] times, each with maximum [1s] timeout]
...
[2018-01-10T05:23:12,061][INFO ][o.e.t.d.TestZenDiscovery ] [node_tm2] master_left [{node_tm1}{05tuxEPVT5ypY7Mz6o_VRA}{p_hm4MVNRoOvl-n281Co2A}{127.0.0.1}{127.0.0.1:30101}], reason [failed to ping, tried [1] times, each with maximum [1s] timeout]
...
[2018-01-10T05:23:12,125][INFO ][o.e.t.d.TestZenDiscovery ] [node_td3] master_left [{node_tm1}{05tuxEPVT5ypY7Mz6o_VRA}{p_hm4MVNRoOvl-n281Co2A}{127.0.0.1}{127.0.0.1:30101}], reason [failed to ping, tried [1] times, each with maximum [1s] timeout]
So the remaining nodes elect node_tm0 as the new master node. It uses the last committed cluster state, version 25:
[2018-01-10T05:23:16,069][INFO ][o.e.c.s.ClusterApplierService] [node_tm0] new_master {node_tm0}{JvhDBfzERq6UD13DUGBUJg}{Grp3AjfZR4m4Wb4Re-p5MA}{127.0.0.1}{127.0.0.1:30100}, reason: apply cluster state (from master [master {node_tm0}{JvhDBfzERq6UD13DUGBUJg}{Grp3AjfZR4m4Wb4Re-p5MA}{127.0.0.1}{127.0.0.1:30100} committed version [25] source [zen-disco-elected-as-master ([1] nodes joined)[{node_tm2}{nPk7Z6bvRI-MufskXAhGTw}{-NZTnhd1SrSolB7FjkDwcg}{127.0.0.1}{127.0.0.1:30102}]]])
The new master node then updates the cluster state to announce that it is now the master. But the publication of cluster state version 26 is not processed by the old master node, which is still isolated:
[2018-01-10T05:23:17,075][WARN ][o.e.d.z.PublishClusterStateAction] [node_tm0] timed out waiting for all nodes to process published state [26] (timeout [1s], pending nodes: [{node_tm1} ....)
So the new master gives up, removes the old master node from the cluster state and publishes version 27 of the cluster state, in which node_tm1 is removed:
[2018-01-10T05:23:17,089][INFO ][o.e.c.s.ClusterApplierService] [node_td3] removed {{node_tm1}[redacted], reason: apply cluster state (from master [master {node_tm0}[redacted] committed version [27]])
[2018-01-10T05:23:17,089][INFO ][o.e.c.s.ClusterApplierService] [node_tm2] removed {{node_tm1}[redacted], reason: apply cluster state (from master [master {node_tm0}[redacted] committed version [27]])
[2018-01-10T05:23:17,092][INFO ][o.e.c.s.ClusterApplierService] [node_tm0] removed {{node_tm1}[redacted], reason: apply cluster state (from master [master {node_tm0}[redacted] committed version [27] source [zen-disco-node-failed({node_tm1}
And the new master node cleans up the second snapshot as expected:
[2018-01-10T05:23:17,378][INFO ][o.e.s.SnapshotsService ] [node_tm0] snapshot [test-repo:test-snap-2/vyIDc0GFSIKVtGd--HP_hQ] completed with state [SUCCESS]
[2018-01-10T05:23:17,523][INFO ][o.e.s.SnapshotsService ] [node_tm0] snapshot [test-repo:test-snap-2/vyIDc0GFSIKVtGd--HP_hQ] deleted
Up to this point the test behaves as expected, since its primary purpose is to verify that the snapshot is correctly terminated by the new master. Before the test ends, it stops the disruption and waits for the cluster to be stable again:
[2018-01-10T05:23:18,605][INFO ][o.e.d.SnapshotDisruptionIT] --> stopping disrupting
The old master node detects the timeout of its attempt to publish cluster state version 25 (the state in which the second snapshot was STARTED):
[2018-01-10T05:23:40,406][WARN ][o.e.t.d.TestZenDiscovery ] [node_tm1] zen-disco-failed-to-publish,
[2018-01-10T05:23:40,407][WARN ][o.e.c.s.MasterService ] [node_tm1] failing [update_snapshot [test-repo:test-snap-2/vyIDc0GFSIKVtGd--HP_hQ]]: failed to commit cluster state version [25]
And the old master fails the snapshot locally, which is also expected:
[2018-01-10T05:23:40,408][WARN ][o.e.s.SnapshotsService ] [node_tm1] [test-snap-2/vyIDc0GFSIKVtGd--HP_hQ] failed to create snapshot
And the old master rejoins the cluster...
[2018-01-10T05:23:45,588][INFO ][o.e.c.s.MasterService ] [node_tm0] zen-disco-node-join[{node_tm1}
... and this is where things get blurry for me.
It seems that node_tm1 cannot ping the other nodes, and the ESSelector loop is closed:
[2018-01-10T05:23:45,792][WARN ][o.e.t.n.MockNioTransport ] [node_tm1] send message failed [channel: NioSocketChannel{localAddress=/127.0.0.1:30101, remoteAddress=/127.0.0.1:49349}]
1> java.nio.channels.ClosedChannelException: null
1> at org.elasticsearch.nio.NioSocketChannel.closeFromSelector(NioSocketChannel.java:53) ~[elasticsearch-nio-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
1> at org.elasticsearch.nio.EventHandler.handleClose(EventHandler.java:72) ~[elasticsearch-nio-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
1> at org.elasticsearch.nio.ESSelector.closePendingChannels(ESSelector.java:239) ~[elasticsearch-nio-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
1> at org.elasticsearch.nio.ESSelector.cleanupAndCloseChannels(ESSelector.java:135) ~[elasticsearch-nio-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
1> at org.elasticsearch.nio.ESSelector.runLoop(ESSelector.java:81) ~[elasticsearch-nio-7.0.0-alpha1-SNAPSHOT.jar:7.0.0-alpha1-SNAPSHOT]
1> at java.lang.Thread.run(Thread.java:745) [?:1.8.0_92]
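If I read that trace correctly, the selector's run loop has exited and closed every channel it owned on the way out (ESSelector.cleanupAndCloseChannels()), so any later send on those channels can only fail with ClosedChannelException. A generic JDK-NIO sketch of that failure mode (plain java.nio, deliberately not the Elasticsearch nio module) looks like this:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedChannelException;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;
import java.util.Collections;
import java.util.List;

// Generic illustration: if the event-loop thread dies with an uncaught Error,
// its cleanup path closes the selector and every channel it owned, so any
// later write on those channels fails with ClosedChannelException.
public class SelectorLoopSketch {

    public static void main(String[] args) throws Exception {
        Selector selector = Selector.open();
        SocketChannel channel = SocketChannel.open();
        channel.configureBlocking(false);
        List<SocketChannel> ownedChannels = Collections.singletonList(channel);

        Thread loop = new Thread(() -> {
            try {
                selector.select(100);
                // imagine an event handler blowing up here, e.g. with a StackOverflowError
                throw new StackOverflowError("simulated uncaught error in a handler");
            } catch (Throwable t) {
                // mirrors what cleanupAndCloseChannels() does: close everything on the way out
                for (SocketChannel c : ownedChannels) {
                    try { c.close(); } catch (IOException ignored) {}
                }
                try { selector.close(); } catch (IOException ignored) {}
            }
        }, "nio-selector-loop");
        loop.start();
        loop.join();

        try {
            channel.write(ByteBuffer.wrap(new byte[] { 1 })); // channel was closed by the dying loop
        } catch (ClosedChannelException e) {
            System.out.println("send failed: " + e); // same symptom as the "send message failed" above
        }
    }
}
```

That would explain why node_tm1 keeps failing to send messages even after the disruption is stopped.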
From then on, all connection attempts from node_tm1 fail, so the cluster cannot recover to 4 nodes and the test suite times out. I think that all the errors after that point are caused by the test framework trying to stop the nodes.
What I'm worried about is the StackOverflowError in the logs:
jan 10, 2018 5:23:45 AM com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException
WARNING: Uncaught exception in thread: Thread[elasticsearch[node_tm1][es_nio_transport_worker][T#2],5,TGRP-SnapshotDisruptionIT]
java.lang.StackOverflowError
at __randomizedtesting.SeedInfo.seed([F1A2A7B82C4D4B21]:0)
at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$1.onResponse(TransportMasterNodeAction.java:167)
at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$1.onResponse(TransportMasterNodeAction.java:164)
at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$1.onResponse(TransportMasterNodeAction.java:167)
at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$1.onResponse(TransportMasterNodeAction.java:164)
at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$1.onResponse(TransportMasterNodeAction.java:167)
at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$1.onResponse(TransportMasterNodeAction.java:164)
at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$1.onResponse(TransportMasterNodeAction.java:167)
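The alternating line numbers (167/164) look like two ActionListener wrappers that delegate onResponse to each other. I haven't confirmed where such a cycle could be created in TransportMasterNodeAction, but as a pure hypothesis, the pattern that produces this kind of trace is trivially reproducible (plain Java, not the actual Elasticsearch code):

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical illustration only, not the real TransportMasterNodeAction code:
// two listeners that forward onResponse to each other recurse until the stack
// overflows, producing alternating stack frames like the 167/164 pattern above.
public class ListenerLoopSketch {

    interface ActionListener<T> {
        void onResponse(T response);
    }

    public static void main(String[] args) {
        AtomicReference<ActionListener<String>> holderA = new AtomicReference<>();
        AtomicReference<ActionListener<String>> holderB = new AtomicReference<>();

        // Each listener simply forwards to the other one: a delegation cycle.
        ActionListener<String> listenerA = response -> holderB.get().onResponse(response);
        ActionListener<String> listenerB = response -> holderA.get().onResponse(response);
        holderA.set(listenerA);
        holderB.set(listenerB);

        listenerA.onResponse("ok"); // the two onResponse frames alternate until StackOverflowError
    }
}
```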
I'm also worried about the NPEs at the beginning of the test:
[2018-01-10T05:23:03,033][WARN ][o.e.d.z.UnicastZenPing ] [node_tm0] unexpected error while pinging
java.lang.NullPointerException: null
at java.util.stream.Collectors.lambda$toMap$58(Collectors.java:1320) ~[?:1.8.0_92]
at java.util.stream.ReduceOps$3ReducingSink.accept(ReduceOps.java:169) ~[?:1.8.0_92]
at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1374) ~[?:1.8.0_92]
at java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:742) ~[?:1.8.0_92]
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) ~[?:1.8.0_92]
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) ~[?:1.8.0_92]
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) ~[?:1.8.0_92]
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:1.8.0_92]
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499) ~[?:1.8.0_92]
at org.elasticsearch.discovery.zen.UnicastZenPing.sendPings(UnicastZenPing.java:475) ~[main/:?]
at org.elasticsearch.discovery.zen.UnicastZenPing$1.doRun(UnicastZenPing.java:332) [main/:?]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:635) [main/:?]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [main/:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_92]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_92]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_92]
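For what it's worth, the classic way to get a NullPointerException out of Collectors.toMap is a null value: the collector accumulates through Map.merge, which rejects null values. I haven't checked whether that is actually what happens in UnicastZenPing.sendPings, but the failure mode is easy to reproduce and goes through the same toMap/ReduceOps pipeline as the trace above:

```java
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

// Minimal reproduction of the suspected failure mode (not the UnicastZenPing code
// itself): Collectors.toMap accumulates via Map.merge, which throws a
// NullPointerException as soon as the value mapper returns null.
public class ToMapNullValueSketch {

    public static void main(String[] args) {
        Map<String, String> result = Arrays.asList("node_tm0", "node_tm1").stream()
            .collect(Collectors.toMap(
                node -> node,   // key mapper
                node -> null    // value mapper returning null -> NPE in toMap's accumulator
            ));
        System.out.println(result); // never reached
    }
}
```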
@tbrooks8 @bleskes I'm having some trouble digging further into this failure, so I'd be happy to have your opinion / gut feeling on it. Just a quick look would be helpful at this stage.