Description
There have been a number of failures in the x-pack:qa:rolling-upgrade suite recently with what appears to be the same error.
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+multijob-unix-compatibility/os=sles/1155
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.3+matrix-java-periodic/ES_BUILD_JAVA=java10,ES_RUNTIME_JAVA=java10,nodes=virtual&&linux/126
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+matrix-java-periodic/ES_BUILD_JAVA=java10,ES_RUNTIME_JAVA=java8,nodes=virtual&&linux/142
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.3+periodic/388
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.3+bwc-tests/175
The tests are failing in ESRestTestCase.waitForClusterStateUpdatesToFinish with the assertion:
{time_in_queue=15ms, time_in_queue_millis=15, source=shard-failed, executing=true, priority=HIGH, insert_order=197}
at org.junit.Assert.fail(Assert.java:88)
at org.elasticsearch.test.rest.ESRestTestCase.lambda$waitForClusterStateUpdatesToFinish$0(ESRestTestCase.java:347)
at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:755)
... 37 more
There are lots of these tasks queued, with time_in_queue going up to 4 seconds. The assertBusy waits for 30 seconds, so either something has blocked the updates or the cluster is very slow. Note source=shard-failed. Sometimes the ML tests fail with a timeout while updating the persistent task state, which suggests they are also affected by the slow cluster state updates.
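For reference, the same queue the assertion is checking can be inspected through the cluster pending tasks API. A minimal sketch in the YAML REST test style (not part of the failing suite; this assumes the wait is essentially polling for an empty pending-task list):

- do:
    cluster.pending_tasks: {}
- length: { tasks: 0 }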
In all cases it is the twoThirdsUpgradedTestRunner that fails (2 out of 3 nodes have been upgraded). This runner has 3 extra tests that don't run in the oneThirdUpgradedTestRunner, so let's start with the idea that one of these tests is causing the failure:
'mixed_cluster/10_basic/Start scroll in mixed cluster on upgraded node that we will continue after upgrade',
'mixed_cluster/30_ml_jobs_crud/Create a job in the mixed cluster and write some data',
'mixed_cluster/40_ml_datafeed_crud/Put job and datafeed in mixed cluster'
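The first of those tests presumably starts its scroll with a request along these lines and deliberately leaves it open so it can be continued after the upgrade (a rough sketch of the request shape, not copied from the suite; the index creation part of the same test is shown further down):

- do:
    search:
      index: upgraded_scroll
      scroll: 5m
      body:
        query:
          match_all: {}
- set: { _scroll_id: scroll_id }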
When the tests fail, this message is repeated in the logs approximately every 5 seconds:
[2018-07-01T20:31:31,474][WARN ][o.e.i.c.IndicesClusterStateService] [upgraded-node-1] [[upgraded_scroll][1]] marking and sending shard failed due to [failed to create shard]
java.io.IOException: failed to obtain in-memory shard lock
at org.elasticsearch.index.IndexService.createShard(IndexService.java:392) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:565) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:152) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
at org.elasticsearch.indices.cluster.IndicesClusterStateService.createShard(IndicesClusterStateService.java:550) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
at org.elasticsearch.indices.cluster.IndicesClusterStateService.createOrUpdateShards(IndicesClusterStateService.java:527) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:230) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
at org.elasticsearch.cluster.service.ClusterApplierService.lambda$callClusterStateAppliers$6(ClusterApplierService.java:495) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
at java.lang.Iterable.forEach(Iterable.java:75) [?:?]
at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:492) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:479) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:430) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:160) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:624) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:244) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:207) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.lang.Thread.run(Thread.java:844) [?:?]
Caused by: org.elasticsearch.env.ShardLockObtainFailedException: [upgraded_scroll][1]: obtaining shard lock timed out after 5000ms
at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:678) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:597) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
at org.elasticsearch.index.IndexService.createShard(IndexService.java:329) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
... 17 more
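If the shard-failed loop needs more diagnosis, the allocation explain API should show why [upgraded_scroll][1] cannot be assigned. A minimal sketch of such a request (a diagnostic suggestion rather than something the suite runs; index and shard number taken from the log above):

- do:
    cluster.allocation_explain:
      body:
        index: upgraded_scroll
        shard: 1
        primary: true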
This appears to be related to the test mixed_cluster/10_basic/Start scroll in mixed cluster on upgraded node that we will continue after upgrade, which creates the upgraded_scroll index:
- do:
    indices.create:
      index: upgraded_scroll
      wait_for_active_shards: all
      body:
        settings:
          number_of_replicas: 0
          index.routing.allocation.include.upgraded: true
The upgraded nodes have node.attr.upgraded: true set, so at this point the index can only be allocated to 2 of the 3 nodes in the cluster.
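For illustration, the index.routing.allocation.include.upgraded setting above matches nodes whose configuration carries the corresponding attribute; a minimal sketch of that node-side setting (assumed shape, not copied from the build scripts):

# elasticsearch.yml on the two upgraded nodes
node.attr.upgraded: true

With number_of_replicas: 0 each shard of upgraded_scroll is a lone primary, and that primary has to be assigned to one of the two upgraded nodes.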
It's not clear whether this test or the ML jobs tests are the root cause.