
Snapshot stuck in IN_PROGRESS #29118

@scottsom

Description


Elasticsearch version (bin/elasticsearch --version): 6.1.1

Plugins installed: [analysis-icu, analysis-kuromoji, analysis-phonetic, analysis-smartcn, analysis-stempel, analysis-ukrainian, mapper-size, repository-s3]

There are also a few custom plugins: ones that add monitoring and security, and one that initiates a snapshot periodically.

JVM version (java -version): 1.8.0_144

OS version (uname -a if on a Unix-like system): Amazon Linux

Description of the problem including expected versus actual behavior:

We initiate an S3 snapshot request and wait for it to complete. These snapshots run daily and usually take only about 20 minutes. In this case, the snapshot never returned. Our backups are now effectively useless: the IN_PROGRESS snapshot cannot be deleted, the repository cannot be deleted, and no subsequent snapshots can be created (even in a different repository).
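
For reference, the snapshot is started by one of our custom plugins; the request boils down to something like the following (the snapshot name is the UTC date, and wait_for_completion is an approximation of what the plugin does rather than the exact call):

PUT _snapshot/repo-name/2018-02-23t00:00:00.000z?wait_for_completion=true

The cleanup attempts that fail in the stuck state are the usual ones:

DELETE _snapshot/repo-name/2018-02-23t00:00:00.000z
DELETE _snapshot/repo-name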

The snapshot appears to be out of sync with the cluster state: the cluster state says the snapshot was ABORTED, but the snapshot itself still reports IN_PROGRESS.

I have restarted all 3 dedicated master nodes and the snapshot is still stuck. I have yet to try a full cluster restart.

Steps to reproduce:

Unfortunately, I haven't been able to identify the root cause or a way to reproduce this.

Some context around the time the snapshot was initiated:

- The cluster was GREEN before and after the snapshot started (17 hours later it went YELLOW).
- 4 shards were relocating leading up to and during the snapshot.
- The cluster.routing.allocation.disk.watermark.low setting was breached on a number of nodes (see the note after this list).
- 17 hours after the snapshot was initiated, I replaced the master node.
- All indexes were originally created in ES 5.5.0; the current version was reached through a series of rolling upgrades (5.5.0 -> 5.6.2 -> 6.1.1).
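
For the watermark point above, per-node disk usage can be checked with:

GET _cat/allocation?v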

Provide logs (if relevant):

GET _cat/snapshots?repository=repo-name

2018-02-16t00:00:00.000z     SUCCESS 1518739205 00:00:05 1518740265 00:17:45 17.6m 17 895 0 895
2018-02-17t00:00:00.000z     SUCCESS 1518825640 00:00:40 1518826775 00:19:35 18.9m 17 895 0 895
2018-02-18t00:00:00.000z     SUCCESS 1518912034 00:00:34 1518912988 00:16:28 15.9m 17 895 0 895
2018-02-19t00:00:00.000z     SUCCESS 1518998435 00:00:35 1518999342 00:15:42 15.1m 17 895 0 895
2018-02-20t00:00:00.000z     SUCCESS 1519084807 00:00:07 1519085803 00:16:43 16.6m 17 895 0 895
2018-02-21t00:00:00.000z     SUCCESS 1519171226 00:00:26 1519172313 00:18:33 18.1m 17 895 0 895
2018-02-22t00:00:00.000z     SUCCESS 1519257620 00:00:20 1519258622 00:17:02 16.7m 17 895 0 895
2018-02-23t00:00:00.000z IN_PROGRESS 1519344013 00:00:13 0          00:00:00 20.6d 17   0 0   0

GET _snapshot/repo-name

{
    "repo-name": {
        "type": "s3",
        "settings": {
            "bucket": "some-bucket",
            "base_path": "repo-name",
            "storage_class": "standard_ia"
        }
    }
}

GET _snapshot/repo-name/_status

{
    "snapshots": [
        {
            "snapshot": "2018-02-23t00:00:00.000z",
            "repository": "repo-name",
            "uuid": "YtXSACL5Tm-kPadJ1I8JOw",
            "state": "ABORTED",
            "shards_stats": {
                "initializing": 0,
                "started": 0,
                "finalizing": 0,
                "done": 0,
                "failed": 2,
                "total": 2
            },
            "stats": {
                "number_of_files": 0,
                "processed_files": 0,
                "total_size_in_bytes": 0,
                "processed_size_in_bytes": 0,
                "start_time_in_millis": 0,
                "time_in_millis": 0
            },
            "indices": {
                "some_index": {
                    "shards_stats": {
                        "initializing": 0,
                        "started": 0,
                        "finalizing": 0,
                        "done": 0,
                        "failed": 2,
                        "total": 2
                    },
                    "stats": {
                        "number_of_files": 0,
                        "processed_files": 0,
                        "total_size_in_bytes": 0,
                        "processed_size_in_bytes": 0,
                        "start_time_in_millis": 0,
                        "time_in_millis": 0
                    },
                    "shards": {
                        "61": {
                            "stage": "FAILURE",
                            "stats": {
                                "number_of_files": 0,
                                "processed_files": 0,
                                "total_size_in_bytes": 0,
                                "processed_size_in_bytes": 0,
                                "start_time_in_millis": 0,
                                "time_in_millis": 0
                            }
                        },
                        "269": {
                            "stage": "FAILURE",
                            "stats": {
                                "number_of_files": 0,
                                "processed_files": 0,
                                "total_size_in_bytes": 0,
                                "processed_size_in_bytes": 0,
                                "start_time_in_millis": 0,
                                "time_in_millis": 0
                            }
                        }
                    }
                }
            }
        }
    ]
}

GET _snapshot/repo-name/_current

{
    "snapshots": [
        {
            "snapshot": "2018-02-23t00:00:00.000z",
            "uuid": "YtXSACL5Tm-kPadJ1I8JOw",
            "version_id": 6010199,
            "version": "6.1.1",
            "indices": [ ... ],
            "state": "IN_PROGRESS",
            "start_time": "2018-02-23T00:00:13.754Z",
            "start_time_in_millis": 1519344013754,
            "end_time": "1970-01-01T00:00:00.000Z",
            "end_time_in_millis": 0,
            "duration_in_millis": -1519344013754,
            "failures": [],
            "shards": {
                "total": 0,
                "failed": 0,
                "successful": 0
            }
        }
    ]
}

GET _cluster/state

{
    "snapshots": {
        "snapshots": [
            {
                "repository": "repo-name",
                "snapshot": "2018-02-23t00:00:00.000z",
                "uuid": "YtXSACL5Tm-kPadJ1I8JOw",
                "include_global_state": true,
                "partial": false,
                "state": "ABORTED",
                "indices": [ ... ],
                "start_time_millis": 1519344013754,
                "repository_state_id": 54,
                "shards": [
                    {
                        "index": {
                            "index_name": "some_index",
                            "index_uuid": "IDRyEj-lTSmr7imXFqdkgQ"
                        },
                        "shard": 61,
                        "state": "FAILED",
                        "node": "0sAovqveTv6q70IqfpYGDQ"
                    },
                    {
                        "index": {
                            "index_name": "some_index",
                            "index_uuid": "IDRyEj-lTSmr7imXFqdkgQ"
                        },
                        "shard": 269,
                        "state": "ABORTED",
                        "node": "arLNOMBtRnSu_dMZMxVxiw"
                    }
                ]
            }
        ]
    }
}

The only log statements I could find about one of the affected shards are:

[2018-02-23T17:43:22,336][WARN ][o.e.c.s.ClusterApplierService] [host_a] cluster state applier task [indices_store ([[some_index][61]] active fully on other nodes)] took [57.8s] above the warn threshold of 30s
[2018-02-23T17:41:13,470][WARN ][o.e.a.b.TransportShardBulkAction] [host_b] [[some_index][61]] failed to perform indices:data/write/bulk[s] on replica [some_index][61], node[86N5Eln9SsOj4f58HTaJzQ], relocating [0sAovqveTv6q70IqfpYGDQ], [P], recovery_source[peer recovery], s[INITIALIZING], a[id=uAJap9BDTTyTUL2iDhx9aA, rId=JNI46MMfQBaTKQkFrb5Rkg], expected_shard_size[74700244115]

The cluster state says the snapshot was ABORTED, with the two shards of one of the indexes marked FAILED and ABORTED respectively.

I have run hot_threads with threads=1000 across all nodes and cannot find any indication of the snapshot running. I also cannot find any references to snapshots in _tasks.
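
For completeness, the checks were:

GET _nodes/hot_threads?threads=1000
GET _tasks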
