testDataNodeRestartAfterShardSnapshotFailure fails by leaking a shard snapshot on a failed node #48526

@DaveCTurner

Description

This test failed in this CI run: https://gradle-enterprise.elastic.co/s/rlbqh5tapantg/ (in 7.x). It shuts down a node while that node is snapshotting a shard. The shard snapshot does fail locally on the node, but the failure notification is lost because the transport service is already closed, so the shard snapshot is never marked as failed in the cluster state:

  1> [2019-10-25T01:22:40,150][INFO ][o.e.s.DedicatedClusterSnapshotRestoreIT] [testDataNodeRestartAfterShardSnapshotFailure] -->  snapshot
  1> [2019-10-25T01:22:40,160][INFO ][o.e.s.SnapshotsService   ] [node_tm0] snapshot [test-repo:test-snap/8CMeKZucSfKdomiV2gX9Lw] started
  1> [2019-10-25T01:22:40,217][INFO ][o.e.s.DedicatedClusterSnapshotRestoreIT] [testDataNodeRestartAfterShardSnapshotFailure] -->  restarting first data node, which should cause the primary shard on it to be failed
  1> [2019-10-25T01:22:40,217][INFO ][o.e.t.InternalTestCluster] [testDataNodeRestartAfterShardSnapshotFailure] Restarting node [node_td1] 
  1> [2019-10-25T01:22:40,218][INFO ][o.e.n.Node               ] [testDataNodeRestartAfterShardSnapshotFailure] stopping ...
  1> [2019-10-25T01:22:40,220][INFO ][o.e.c.c.Coordinator      ] [node_td1] master node [{node_tm0}{ztB71aT-Qfmpv14TI36OOg}{JRG4MuoYREqTFOGkDH3D2g}{127.0.0.1}{127.0.0.1:34059}{m}] failed, restarting discovery
  1> org.elasticsearch.transport.NodeDisconnectedException: [node_tm0][127.0.0.1:34059][disconnected] disconnected
  1> [2019-10-25T01:22:40,223][INFO ][o.e.s.m.MockRepository   ] [node_td2] [test-repo] blocking I/O operation for file [__Gee36ShzRIWE9EIdskMBfw.part0] at path [[indices][Md-z-wvRQPuh-C_U1reAfQ][1]]
  1> [2019-10-25T01:22:40,224][INFO ][o.e.c.s.MasterService    ] [node_tm0] node-left[{node_td1}{Kn1zpH90QPqXMtL1NqqVOg}{DKlTH9pkRymZtUbRoe8_rA}{127.0.0.1}{127.0.0.1:37309}{di} reason: disconnected], term: 1, version: 12, delta: removed {{node_td1}{Kn1zpH90QPqXMtL1NqqVOg}{DKlTH9pkRymZtUbRoe8_rA}{127.0.0.1}{127.0.0.1:37309}{di}}
  1> [2019-10-25T01:22:40,224][WARN ][o.e.s.SnapshotShardsService] [node_td1] [[test-idx][0]][test-repo:test-snap/8CMeKZucSfKdomiV2gX9Lw] failed to snapshot shard
  1> org.elasticsearch.index.snapshots.IndexShardSnapshotFailedException: Aborted
  1>  at org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotShard(BlobStoreRepository.java:1112) ~[main/:?]
  1>  at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:337) ~[main/:?]
  1>  at org.elasticsearch.snapshots.SnapshotShardsService.lambda$startNewShards$1(SnapshotShardsService.java:285) ~[main/:?]
  1>  at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:703) [main/:?]
  1>  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_221]
  1>  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_221]
  1>  at java.lang.Thread.run(Thread.java:748) [?:1.8.0_221]
  1> [2019-10-25T01:22:40,226][INFO ][o.e.n.Node               ] [testDataNodeRestartAfterShardSnapshotFailure] stopped
  1> [2019-10-25T01:22:40,227][INFO ][o.e.n.Node               ] [testDataNodeRestartAfterShardSnapshotFailure] closing ...
  1> [2019-10-25T01:22:40,229][WARN ][o.e.s.SnapshotShardsService] [node_td1] [test-repo:test-snap/8CMeKZucSfKdomiV2gX9Lw] [ShardSnapshotStatus[state=FAILED, nodeId=Kn1zpH90QPqXMtL1NqqVOg, reason=[test-idx/l-lv0KpjQX6MGCl5PTKXUw][[test-idx][0]] IndexShardSnapshotFailedException[Aborted]
  1>  at org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotShard(BlobStoreRepository.java:1112)
  1>  at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:337)
  1>  at org.elasticsearch.snapshots.SnapshotShardsService.lambda$startNewShards$1(SnapshotShardsService.java:285)
  1>  at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:703)
  1>  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  1>  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  1>  at java.lang.Thread.run(Thread.java:748)
  1> , generation=null]] failed to update snapshot state
  1> org.elasticsearch.transport.SendRequestTransportException: [node_td1][127.0.0.1:37309][internal:cluster/snapshot/update_snapshot_status]
  1>  at org.elasticsearch.transport.TransportService.sendRequestInternal(TransportService.java:704) ~[main/:?]
  1>  at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:602) ~[main/:?]
  1>  at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:577) ~[main/:?]
  1>  at org.elasticsearch.snapshots.SnapshotShardsService.lambda$sendSnapshotShardUpdate$2(SnapshotShardsService.java:471) ~[main/:?]
  1>  at org.elasticsearch.transport.TransportRequestDeduplicator.executeOnce(TransportRequestDeduplicator.java:52) ~[main/:?]
  1>  at org.elasticsearch.snapshots.SnapshotShardsService.sendSnapshotShardUpdate(SnapshotShardsService.java:457) ~[main/:?]
  1>  at org.elasticsearch.snapshots.SnapshotShardsService.notifyFailedSnapshotShard(SnapshotShardsService.java:451) ~[main/:?]
  1>  at org.elasticsearch.snapshots.SnapshotShardsService.access$400(SnapshotShardsService.java:91) ~[main/:?]
  1>  at org.elasticsearch.snapshots.SnapshotShardsService$1.onFailure(SnapshotShardsService.java:301) ~[main/:?]
  1>  at org.elasticsearch.action.ActionListener$5.onFailure(ActionListener.java:256) ~[main/:?]
  1>  at org.elasticsearch.repositories.blobstore.BlobStoreRepository.lambda$snapshotShard$32(BlobStoreRepository.java:1061) ~[main/:?]
  1>  at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:71) ~[main/:?]
  1>  at org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotShard(BlobStoreRepository.java:1239) ~[main/:?]
  1>  at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:337) ~[main/:?]
  1>  at org.elasticsearch.snapshots.SnapshotShardsService.lambda$startNewShards$1(SnapshotShardsService.java:285) ~[main/:?]
  1>  at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:703) ~[main/:?]
  1>  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_221]
  1>  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_221]
  1>  at java.lang.Thread.run(Thread.java:748) [?:1.8.0_221]
  1> Caused by: org.elasticsearch.transport.TransportException: TransportService is closed stopped can't send request
  1>  at org.elasticsearch.transport.TransportService.sendRequestInternal(TransportService.java:686) ~[main/:?]
  1>  ... 18 more
  1> [2019-10-25T01:22:40,232][INFO ][o.e.n.Node               ] [testDataNodeRestartAfterShardSnapshotFailure] closed
  1> [2019-10-25T01:22:40,240][INFO ][o.e.e.NodeEnvironment    ] [testDataNodeRestartAfterShardSnapshotFailure] using [1] data paths, mounts [[/ (/dev/sda1)]], net usable_space [307gb], net total_space [349.9gb], types [xfs]
  1> [2019-10-25T01:22:40,240][INFO ][o.e.e.NodeEnvironment    ] [testDataNodeRestartAfterShardSnapshotFailure] heap size [491mb], compressed ordinary object pointers [true]
  1> [2019-10-25T01:22:40,249][INFO ][o.e.n.Node               ] [testDataNodeRestartAfterShardSnapshotFailure] node name [node_td1], node ID [Kn1zpH90QPqXMtL1NqqVOg], cluster name [TEST-TEST_WORKER_VM=[435]-CLUSTER_SEED=[-2213907199717514849]-HASH=[17AAA92C130]-cluster]
  1> [2019-10-25T01:22:40,249][INFO ][o.e.n.Node               ] [testDataNodeRestartAfterShardSnapshotFailure] version[7.6.0], pid[217518], build[unknown/unknown/unknown/unknown], OS[Linux/4.14.35-1902.6.6.el7uek.x86_64/amd64], JVM[Oracle Corporation/Java HotSpot(TM) 64-Bit Server VM/1.8.0_221/25.221-b11]
  1> [2019-10-25T01:22:40,249][INFO ][o.e.n.Node               ] [testDataNodeRestartAfterShardSnapshotFailure] JVM home [/var/lib/jenkins/.java/oracle-8u221-linux/jre]
  1> [2019-10-25T01:22:40,249][INFO ][o.e.n.Node               ] [testDataNodeRestartAfterShardSnapshotFailure] JVM arguments [-Dfile.encoding=UTF8, -Dcompiler.java=12, -Des.scripting.update.ctx_in_params=false, -Des.transport.cname_in_publish_address=true, -Dgradle.dist.lib=/var/lib/jenkins/.gradle/wrapper/dists/gradle-5.6.2-all/9st6wgf78h16so49nn74lgtbb/gradle-5.6.2/lib, -Dgradle.user.home=/var/lib/jenkins/.gradle, -Dgradle.worker.jar=/var/lib/jenkins/.gradle/caches/5.6.2/workerMain/gradle-worker.jar, -Dio.netty.noKeySetOptimization=true, -Dio.netty.noUnsafe=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Djava.awt.headless=true, -Djava.locale.providers=SPI,JRE, -Djava.security.manager=worker.org.gradle.process.internal.worker.child.BootstrapSecurityManager, -Djna.nosys=true, -Dorg.gradle.native=false, -Druntime.java=8, -Dtests.artifact=server, -Dtests.gradle=true, -Dtests.logger.level=WARN, -Dtests.security.manager=true, -Dtests.seed=14AC7937BAB45A6E, -Dtests.task=:server:integTest, -esa, -XX:+HeapDumpOnOutOfMemoryError, -XX:HeapDumpPath=/var/lib/jenkins/workspace/elastic+elasticsearch+7.x+multijob-unix-compatibility/os/oraclelinux-7&&immutable/server/build/heapdump, -Xms512m, -Xmx512m, -Dfile.encoding=UTF-8, -Djava.io.tmpdir=./temp, -Duser.country=US, -Duser.language=en, -Duser.variant, -ea]
  1> [2019-10-25T01:22:40,250][INFO ][o.e.p.PluginsService     ] [testDataNodeRestartAfterShardSnapshotFailure] no modules loaded
  1> [2019-10-25T01:22:40,250][INFO ][o.e.p.PluginsService     ] [testDataNodeRestartAfterShardSnapshotFailure] loaded plugin [org.elasticsearch.snapshots.DedicatedClusterSnapshotRestoreIT$BrokenSettingPlugin]
  1> [2019-10-25T01:22:40,250][INFO ][o.e.p.PluginsService     ] [testDataNodeRestartAfterShardSnapshotFailure] loaded plugin [org.elasticsearch.snapshots.DedicatedClusterSnapshotRestoreIT$TestCustomMetaDataPlugin]
  1> [2019-10-25T01:22:40,250][INFO ][o.e.p.PluginsService     ] [testDataNodeRestartAfterShardSnapshotFailure] loaded plugin [org.elasticsearch.snapshots.mockstore.MockRepository$Plugin]
  1> [2019-10-25T01:22:40,250][INFO ][o.e.p.PluginsService     ] [testDataNodeRestartAfterShardSnapshotFailure] loaded plugin [org.elasticsearch.test.ESIntegTestCase$AssertActionNamePlugin]
  1> [2019-10-25T01:22:40,250][INFO ][o.e.p.PluginsService     ] [testDataNodeRestartAfterShardSnapshotFailure] loaded plugin [org.elasticsearch.test.ESIntegTestCase$TestSeedPlugin]
  1> [2019-10-25T01:22:40,250][INFO ][o.e.p.PluginsService     ] [testDataNodeRestartAfterShardSnapshotFailure] loaded plugin [org.elasticsearch.test.MockHttpTransport$TestPlugin]
  1> [2019-10-25T01:22:40,250][INFO ][o.e.p.PluginsService     ] [testDataNodeRestartAfterShardSnapshotFailure] loaded plugin [org.elasticsearch.transport.nio.MockNioTransportPlugin]
  1> [2019-10-25T01:22:40,269][INFO ][o.e.d.DiscoveryModule    ] [testDataNodeRestartAfterShardSnapshotFailure] using discovery type [zen] and seed hosts providers [settings, file]
  1> [2019-10-25T01:22:40,290][INFO ][o.e.n.Node               ] [testDataNodeRestartAfterShardSnapshotFailure] initialized
  1> [2019-10-25T01:22:40,290][INFO ][o.e.n.Node               ] [testDataNodeRestartAfterShardSnapshotFailure] starting ...
  1> [2019-10-25T01:22:40,294][INFO ][o.e.t.TransportService   ] [testDataNodeRestartAfterShardSnapshotFailure] publish_address {127.0.0.1:38475}, bound_addresses {[::1]:35853}, {127.0.0.1:38475}
  1> [2019-10-25T01:22:40,301][INFO ][o.e.c.s.ClusterApplierService] [node_td2] removed {{node_td1}{Kn1zpH90QPqXMtL1NqqVOg}{DKlTH9pkRymZtUbRoe8_rA}{127.0.0.1}{127.0.0.1:37309}{di}}, term: 1, version: 12, reason: ApplyCommitRequest{term=1, version=12, sourceNode={node_tm0}{ztB71aT-Qfmpv14TI36OOg}{JRG4MuoYREqTFOGkDH3D2g}{127.0.0.1}{127.0.0.1:34059}{m}}
  1> [2019-10-25T01:22:40,305][INFO ][o.e.c.s.ClusterApplierService] [node_tm0] removed {{node_td1}{Kn1zpH90QPqXMtL1NqqVOg}{DKlTH9pkRymZtUbRoe8_rA}{127.0.0.1}{127.0.0.1:37309}{di}}, term: 1, version: 12, reason: Publication{term=1, version=12}
  1> [2019-10-25T01:22:40,309][INFO ][o.e.c.c.Coordinator      ] [testDataNodeRestartAfterShardSnapshotFailure] cluster UUID [6WUNaFxyS9e_GZgDQ7k3FQ]
  1> [2019-10-25T01:22:40,356][INFO ][o.e.c.s.MasterService    ] [node_tm0] node-join[{node_td1}{Kn1zpH90QPqXMtL1NqqVOg}{AdrMl80YTKu-MHo2vIUEhw}{127.0.0.1}{127.0.0.1:38475}{di} join existing leader], term: 1, version: 14, delta: added {{node_td1}{Kn1zpH90QPqXMtL1NqqVOg}{AdrMl80YTKu-MHo2vIUEhw}{127.0.0.1}{127.0.0.1:38475}{di}}
  1> [2019-10-25T01:22:40,359][INFO ][o.e.c.s.ClusterApplierService] [node_td2] added {{node_td1}{Kn1zpH90QPqXMtL1NqqVOg}{AdrMl80YTKu-MHo2vIUEhw}{127.0.0.1}{127.0.0.1:38475}{di}}, term: 1, version: 14, reason: ApplyCommitRequest{term=1, version=14, sourceNode={node_tm0}{ztB71aT-Qfmpv14TI36OOg}{JRG4MuoYREqTFOGkDH3D2g}{127.0.0.1}{127.0.0.1:34059}{m}}
  1> [2019-10-25T01:22:40,359][INFO ][o.e.c.s.ClusterApplierService] [node_td1] master node changed {previous [], current [{node_tm0}{ztB71aT-Qfmpv14TI36OOg}{JRG4MuoYREqTFOGkDH3D2g}{127.0.0.1}{127.0.0.1:34059}{m}]}, added {{node_td2}{mpwldoJOQTucDLY79bMP6w}{Ge9Glg2TTyCthw9_1wtfnQ}{127.0.0.1}{127.0.0.1:36515}{di},{node_tm0}{ztB71aT-Qfmpv14TI36OOg}{JRG4MuoYREqTFOGkDH3D2g}{127.0.0.1}{127.0.0.1:34059}{m}}, term: 1, version: 14, reason: ApplyCommitRequest{term=1, version=14, sourceNode={node_tm0}{ztB71aT-Qfmpv14TI36OOg}{JRG4MuoYREqTFOGkDH3D2g}{127.0.0.1}{127.0.0.1:34059}{m}}
  1> [2019-10-25T01:22:40,362][INFO ][o.e.s.m.MockRepository   ] [node_td1] starting mock repository with random prefix default
  1> [2019-10-25T01:22:40,442][WARN ][o.e.s.SnapshotShardsService] [node_td1] [[test-idx][0]][test-repo:test-snap/8CMeKZucSfKdomiV2gX9Lw] failed to snapshot shard
  1> org.elasticsearch.index.IndexNotFoundException: no such index [test-idx]
  1>  at org.elasticsearch.indices.IndicesService.indexServiceSafe(IndicesService.java:458) ~[main/:?]
  1>  at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:317) ~[main/:?]
  1>  at org.elasticsearch.snapshots.SnapshotShardsService.lambda$startNewShards$1(SnapshotShardsService.java:285) ~[main/:?]
  1>  at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:703) [main/:?]
  1>  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_221]
  1>  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_221]
  1>  at java.lang.Thread.run(Thread.java:748) [?:1.8.0_221]
  1> [2019-10-25T01:22:40,444][INFO ][o.e.n.Node               ] [testDataNodeRestartAfterShardSnapshotFailure] started
  1> [2019-10-25T01:22:40,444][INFO ][o.e.c.s.ClusterApplierService] [node_tm0] added {{node_td1}{Kn1zpH90QPqXMtL1NqqVOg}{AdrMl80YTKu-MHo2vIUEhw}{127.0.0.1}{127.0.0.1:38475}{di}}, term: 1, version: 14, reason: Publication{term=1, version=14}
  1> [2019-10-25T01:22:40,447][INFO ][o.e.s.DedicatedClusterSnapshotRestoreIT] [testDataNodeRestartAfterShardSnapshotFailure] -->  wait for shard snapshot of first primary to show as failed
  1> [2019-10-25T01:22:40,632][INFO ][o.e.c.r.a.AllocationService] [node_tm0] Cluster health status changed from [RED] to [GREEN] (reason: [shards started [[test-idx][0]]]).
  1> [2019-10-25T01:23:46,034][INFO ][o.e.a.a.c.r.c.TransportCleanupRepositoryAction] [node_tm0] Running cleanup operations on repository [test-repo][-1]
  1> [2019-10-25T01:23:46,036][WARN ][o.e.a.a.c.r.c.TransportCleanupRepositoryAction] [node_tm0] Failed to run repository cleanup operations on [test-repo][-1]
  1> java.lang.IllegalStateException: Cannot cleanup [test-repo] - a snapshot is currently running
  1>  at org.elasticsearch.action.admin.cluster.repositories.cleanup.TransportCleanupRepositoryAction$2.execute(TransportCleanupRepositoryAction.java:187) ~[main/:?]
  1>  at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:47) ~[main/:?]
  1>  at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:702) ~[main/:?]
  1>  at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:324) ~[main/:?]
  1>  at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:219) [main/:?]
  1>  at org.elasticsearch.cluster.service.MasterService.access$000(MasterService.java:73) [main/:?]
  1>  at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:151) [main/:?]
  1>  at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [main/:?]
  1>  at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [main/:?]
  1>  at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:703) [main/:?]
  1>  at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) [main/:?]
  1>  at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) [main/:?]
  1>  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_221]
  1>  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_221]
  1>  at java.lang.Thread.run(Thread.java:748) [?:1.8.0_221]
  1> [2019-10-25T01:23:46,039][INFO ][o.e.s.DedicatedClusterSnapshotRestoreIT] [testDataNodeRestartAfterShardSnapshotFailure] [DedicatedClusterSnapshotRestoreIT#testDataNodeRestartAfterShardSnapshotFailure]: cleaning up after test
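Reading the log above: the shard snapshot on `node_td1` is aborted locally (`IndexShardSnapshotFailedException[Aborted]`), but the `update_snapshot_status` request to the master cannot be sent because the node's `TransportService` is already closed. The master therefore still considers the shard snapshot in progress, which is why the later repository cleanup fails with "a snapshot is currently running". A minimal toy sketch of this lost-notification race (all names here are hypothetical illustrations, not Elasticsearch APIs):

```python
class ClosedTransportError(Exception):
    """Raised when a send is attempted after the transport has shut down."""

class Transport:
    """Toy stand-in for a node's transport layer."""
    def __init__(self):
        self._open = True

    def close(self):
        self._open = False

    def send(self, message, master):
        if not self._open:
            # Mirrors "TransportService is closed stopped can't send request"
            raise ClosedTransportError("transport closed, can't send request")
        master.receive(message)

class Master:
    """Toy master node: tracks only the shard states it has been told about."""
    def __init__(self):
        self.shard_states = {}

    def receive(self, message):
        shard, state = message
        self.shard_states[shard] = state

def abort_and_notify(transport, master, shard):
    """Shard snapshot was aborted locally; try to tell the master.

    Returns True if the failure was delivered, False if it was lost.
    """
    try:
        transport.send((shard, "FAILED"), master)
        return True
    except ClosedTransportError:
        # Notification lost: the master still believes the shard
        # snapshot is running -- the "leak" the test trips over.
        return False

master = Master()
transport = Transport()
transport.close()  # the node is shutting down
delivered = abort_and_notify(transport, master, "[test-idx][0]")
# delivered is False; master.shard_states remains empty
```

This is only an illustration of the ordering problem, not the fix: the real question for the production code is how a shard-snapshot failure gets reconciled into the cluster state when the reporting node leaves before its status update is delivered.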
