X-pack rolling upgrade failures #31827

@davidkyle

Description

There have been a number of failures in the x-pack:qa:rolling-upgrade suite recently with what appears to be the same error.

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+multijob-unix-compatibility/os=sles/1155
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.3+matrix-java-periodic/ES_BUILD_JAVA=java10,ES_RUNTIME_JAVA=java10,nodes=virtual&&linux/126
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+matrix-java-periodic/ES_BUILD_JAVA=java10,ES_RUNTIME_JAVA=java8,nodes=virtual&&linux/142
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.3+periodic/388
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.3+bwc-tests/175

The tests are failing in ESRestTestCase.waitForClusterStateUpdatesToFinish with the assertion:

{time_in_queue=15ms, time_in_queue_millis=15, source=shard-failed, executing=true, priority=HIGH, insert_order=197}
        at org.junit.Assert.fail(Assert.java:88)
        at org.elasticsearch.test.rest.ESRestTestCase.lambda$waitForClusterStateUpdatesToFinish$0(ESRestTestCase.java:347)
        at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:755)
        ... 37 more

There are lots of these tasks queued, and time_in_queue climbs to around 4 seconds. The assertBusy allows 30 seconds, so either something has blocked the cluster state updates or the cluster is extremely slow. Note source=shard-failed. Sometimes the ML tests instead time out updating the persistent task state, which suggests they are also victims of the slow cluster state updates.
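
The failure message is the JSON of a still-pending cluster state task, so the check is essentially "the pending task queue must drain within 30 seconds". As a rough standalone approximation of that wait (not the actual test code, which uses the suite's REST client; the localhost:9200 endpoint here is just an assumption for illustration):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.stream.Collectors;

public class WaitForPendingTasks {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint; the real test talks to the test cluster's HTTP port.
        URL url = new URL("http://localhost:9200/_cluster/pending_tasks");
        long deadline = System.currentTimeMillis() + 30_000; // same 30s budget as the assertBusy

        while (System.currentTimeMillis() < deadline) {
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            String body;
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
                body = reader.lines().collect(Collectors.joining());
            }
            // Naive check: an idle cluster returns {"tasks":[]}.
            if (body.replace(" ", "").contains("\"tasks\":[]")) {
                System.out.println("cluster state update queue is empty");
                return;
            }
            Thread.sleep(500); // poll and retry until the deadline
        }
        throw new AssertionError("cluster state updates still pending after 30s");
    }
}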

In all cases it is the twoThirdsUpgradedTestRunner that fails (2 out of 3 nodes have been upgraded). This runner has 3 extra tests that don't run in the oneThirdUpgradedTestRunner, so let's start with the hypothesis that one of these tests is causing the failure:

 'mixed_cluster/10_basic/Start scroll in mixed cluster on upgraded node that we will continue after upgrade',
 'mixed_cluster/30_ml_jobs_crud/Create a job in the mixed cluster and write some data',
 'mixed_cluster/40_ml_datafeed_crud/Put job and datafeed in mixed cluster'

When the tests fail, this message is repeated in the logs approximately every 5 seconds:

[2018-07-01T20:31:31,474][WARN ][o.e.i.c.IndicesClusterStateService] [upgraded-node-1] [[upgraded_scroll][1]] marking and sending shard failed due to [failed to create shard]
java.io.IOException: failed to obtain in-memory shard lock
    at org.elasticsearch.index.IndexService.createShard(IndexService.java:392) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:565) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:152) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at org.elasticsearch.indices.cluster.IndicesClusterStateService.createShard(IndicesClusterStateService.java:550) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at org.elasticsearch.indices.cluster.IndicesClusterStateService.createOrUpdateShards(IndicesClusterStateService.java:527) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:230) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at org.elasticsearch.cluster.service.ClusterApplierService.lambda$callClusterStateAppliers$6(ClusterApplierService.java:495) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at java.lang.Iterable.forEach(Iterable.java:75) [?:?]
    at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:492) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:479) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:430) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:160) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:624) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:244) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:207) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
    at java.lang.Thread.run(Thread.java:844) [?:?]
Caused by: org.elasticsearch.env.ShardLockObtainFailedException: [upgraded_scroll][1]: obtaining shard lock timed out after 5000ms
    at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:678) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:597) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at org.elasticsearch.index.IndexService.createShard(IndexService.java:329) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    ... 17 more
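
The 5000ms in the cause lines up with the ~5 second cadence of the warning: each attempt to create the shard tries a timed acquire of the per-shard lock, gives up after 5 seconds because something still holds it, and another shard-failed task is queued. As a loose, self-contained illustration of that timed-acquire pattern (not the actual NodeEnvironment code, and the never-releasing holder is purely hypothetical):

import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class ShardLockTimeoutDemo {
    public static void main(String[] args) throws Exception {
        ReentrantLock shardLock = new ReentrantLock();

        // Hypothetical stand-in for whatever still holds the old shard copy's in-memory lock.
        Thread holder = new Thread(() -> {
            shardLock.lock();
            try {
                Thread.sleep(60_000); // never lets go within the window we care about
            } catch (InterruptedException ignored) {
            } finally {
                shardLock.unlock();
            }
        });
        holder.start();
        Thread.sleep(100); // let the holder grab the lock first

        // Mirrors the failing step: a timed acquire that gives up after 5000ms.
        if (!shardLock.tryLock(5000, TimeUnit.MILLISECONDS)) {
            System.out.println("obtaining shard lock timed out after 5000ms");
            // In the real code this surfaces as ShardLockObtainFailedException and a
            // shard-failed cluster state task, hence the warning repeating roughly every 5s.
        }
        holder.interrupt();
    }
}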

This appears to be related to the test mixed_cluster/10_basic/Start scroll in mixed cluster on upgraded node that we will continue after upgrade, which creates the upgraded_scroll index:

  - do:
      indices.create:
        index: upgraded_scroll
        wait_for_active_shards: all
        body:
          settings:
            number_of_replicas: 0
            index.routing.allocation.include.upgraded: true

The upgraded nodes have node.attr.upgraded: true set, so at this point the index can only be allocated to 2 of the 3 nodes in the cluster.
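
To see which nodes the filter leaves eligible and why [upgraded_scroll][1] is not being assigned, one hand-run diagnostic (not something the test itself does) is the allocation explain API. A minimal sketch, again assuming a hypothetical localhost:9200 endpoint:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

public class ExplainUpgradedScroll {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint; point this at one of the test cluster's HTTP ports.
        URL url = new URL("http://localhost:9200/_cluster/allocation/explain");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);

        // Ask specifically about the shard from the warning: [upgraded_scroll][1].
        String body = "{\"index\": \"upgraded_scroll\", \"shard\": 1, \"primary\": true}";
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }

        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            // The response lists each node's decision, including whether the
            // index.routing.allocation.include.upgraded filter excluded it.
            System.out.println(reader.lines().collect(Collectors.joining("\n")));
        }
    }
}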

It's not clear whether this test or the ML jobs are the root cause.

Labels

:Distributed Indexing/Recovery  >test-failure  v6.3.2  v6.4.0
