X-pack rolling upgrade failures #31827

@davidkyle

Description

There have been a number of failures in the x-pack:qa:rolling-upgrade suite recently with what appears to be the same error.

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+multijob-unix-compatibility/os=sles/1155
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.3+matrix-java-periodic/ES_BUILD_JAVA=java10,ES_RUNTIME_JAVA=java10,nodes=virtual&&linux/126
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+matrix-java-periodic/ES_BUILD_JAVA=java10,ES_RUNTIME_JAVA=java8,nodes=virtual&&linux/142
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.3+periodic/388
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.3+bwc-tests/175

The tests are failing in ESRestTestCase.waitForClusterStateUpdatesToFinish with the assertion:

{time_in_queue=15ms, time_in_queue_millis=15, source=shard-failed, executing=true, priority=HIGH, insert_order=197}
        at org.junit.Assert.fail(Assert.java:88)
        at org.elasticsearch.test.rest.ESRestTestCase.lambda$waitForClusterStateUpdatesToFinish$0(ESRestTestCase.java:347)
        at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:755)
        ... 37 more

There are lots of these tasks queued, and time_in_queue climbs to around 4 seconds. The assertBusy allows 30 seconds, so either something has blocked the cluster state updates or the cluster is extremely slow. Note source=shard-failed. Sometimes the ML tests instead time out updating the persistent task state, which suggests they are also victims of the slow cluster state updates.
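
The failure message is the JSON of a still-pending cluster state task, so the check is essentially "the pending task queue must drain within 30 seconds". As a rough standalone approximation of that wait (not the actual test code, which uses the suite's REST client; the localhost:9200 endpoint here is just an assumption for illustration):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.stream.Collectors;

public class WaitForPendingTasks {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint; the real test talks to the test cluster's HTTP port.
        URL url = new URL("http://localhost:9200/_cluster/pending_tasks");
        long deadline = System.currentTimeMillis() + 30_000; // same 30s budget as the assertBusy

        while (System.currentTimeMillis() < deadline) {
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            String body;
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
                body = reader.lines().collect(Collectors.joining());
            }
            // Naive check: an idle cluster returns {"tasks":[]}.
            if (body.replace(" ", "").contains("\"tasks\":[]")) {
                System.out.println("cluster state update queue is empty");
                return;
            }
            Thread.sleep(500); // poll and retry until the deadline
        }
        throw new AssertionError("cluster state updates still pending after 30s");
    }
}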

In all cases it is the twoThirdsUpgradedTestRunner that fails (2 out of 3 nodes have been upgraded). This runner has 3 extra tests that don't run in the oneThirdUpgradedTestRunner, so let's start with the hypothesis that one of these tests is causing the failure:

 'mixed_cluster/10_basic/Start scroll in mixed cluster on upgraded node that we will continue after upgrade',
 'mixed_cluster/30_ml_jobs_crud/Create a job in the mixed cluster and write some data',
 'mixed_cluster/40_ml_datafeed_crud/Put job and datafeed in mixed cluster'

When the tests fail, this message is repeated in the logs approximately every 5 seconds:

[2018-07-01T20:31:31,474][WARN ][o.e.i.c.IndicesClusterStateService] [upgraded-node-1] [[upgraded_scroll][1]] marking and sending shard failed due to [failed to create shard]
java.io.IOException: failed to obtain in-memory shard lock
    at org.elasticsearch.index.IndexService.createShard(IndexService.java:392) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:565) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:152) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at org.elasticsearch.indices.cluster.IndicesClusterStateService.createShard(IndicesClusterStateService.java:550) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at org.elasticsearch.indices.cluster.IndicesClusterStateService.createOrUpdateShards(IndicesClusterStateService.java:527) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:230) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at org.elasticsearch.cluster.service.ClusterApplierService.lambda$callClusterStateAppliers$6(ClusterApplierService.java:495) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at java.lang.Iterable.forEach(Iterable.java:75) [?:?]
    at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:492) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:479) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:430) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:160) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:624) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:244) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:207) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
    at java.lang.Thread.run(Thread.java:844) [?:?]
Caused by: org.elasticsearch.env.ShardLockObtainFailedException: [upgraded_scroll][1]: obtaining shard lock timed out after 5000ms
    at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:678) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:597) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    at org.elasticsearch.index.IndexService.createShard(IndexService.java:329) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
    ... 17 more
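
The 5000ms in the cause lines up with the ~5 second cadence of the warning: each attempt to create the shard tries a timed acquire of the per-shard lock, gives up after 5 seconds because something still holds it, and another shard-failed task is queued. As a loose, self-contained illustration of that timed-acquire pattern (not the actual NodeEnvironment code, and the never-releasing holder is purely hypothetical):

import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class ShardLockTimeoutDemo {
    public static void main(String[] args) throws Exception {
        ReentrantLock shardLock = new ReentrantLock();

        // Hypothetical stand-in for whatever still holds the old shard copy's in-memory lock.
        Thread holder = new Thread(() -> {
            shardLock.lock();
            try {
                Thread.sleep(60_000); // never lets go within the window we care about
            } catch (InterruptedException ignored) {
            } finally {
                shardLock.unlock();
            }
        });
        holder.start();
        Thread.sleep(100); // let the holder grab the lock first

        // Mirrors the failing step: a timed acquire that gives up after 5000ms.
        if (!shardLock.tryLock(5000, TimeUnit.MILLISECONDS)) {
            System.out.println("obtaining shard lock timed out after 5000ms");
            // In the real code this surfaces as ShardLockObtainFailedException and a
            // shard-failed cluster state task, hence the warning repeating roughly every 5s.
        }
        holder.interrupt();
    }
}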

This appears to be related to the test mixed_cluster/10_basic/Start scroll in mixed cluster on upgraded node that we will continue after upgrade, which creates the upgraded_scroll index:

  - do:
      indices.create:
        index: upgraded_scroll
        wait_for_active_shards: all
        body:
          settings:
            number_of_replicas: 0
            index.routing.allocation.include.upgraded: true

The upgraded nodes have node.attr.upgraded: true set, so at this point the index can only be allocated to 2 of the 3 nodes in the cluster.
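
To see which nodes the filter leaves eligible and why [upgraded_scroll][1] is not being assigned, one hand-run diagnostic (not something the test itself does) is the allocation explain API. A minimal sketch, again assuming a hypothetical localhost:9200 endpoint:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

public class ExplainUpgradedScroll {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint; point this at one of the test cluster's HTTP ports.
        URL url = new URL("http://localhost:9200/_cluster/allocation/explain");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);

        // Ask specifically about the shard from the warning: [upgraded_scroll][1].
        String body = "{\"index\": \"upgraded_scroll\", \"shard\": 1, \"primary\": true}";
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }

        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            // The response lists each node's decision, including whether the
            // index.routing.allocation.include.upgraded filter excluded it.
            System.out.println(reader.lines().collect(Collectors.joining("\n")));
        }
    }
}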

It's not clear whether this test or the ML jobs are the root cause.

Labels

:Distributed Indexing/Recovery  >test-failure  v6.3.2  v6.4.0
