Elasticsearch version (bin/elasticsearch --version):
curl -X GET "localhost:9200"
{
"name" : "redacted",
"cluster_name" : "redacted",
"cluster_uuid" : "UJ8NSWbXSQmErxP9IfbvNA",
"version" : {
"number" : "6.3.0",
"build_flavor" : "default",
"build_type" : "rpm",
"build_hash" : "424e937",
"build_date" : "2018-06-11T23:38:03.357887Z",
"build_snapshot" : false,
"lucene_version" : "7.3.1",
"minimum_wire_compatibility_version" : "5.6.0",
"minimum_index_compatibility_version" : "5.0.0"
},
"tagline" : "You Know, for Search"
}
Plugins installed: [repository-s3, discovery-ec2]
JVM version (java -version):
java version "1.8.0_171"
Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode)
OS version (uname -a if on a Unix-like system):
Linux es-client-1 4.14.88-72.76.amzn1.x86_64 #1 SMP Mon Jan 7 19:47:07 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Description of the problem including expected versus actual behavior:
Snapshots are failing for us with the following error:
{
"duration_in_millis": 345001,
"end_time": "2019-02-02T00:05:48.480Z",
"end_time_in_millis": 1549065948480,
"failures": [
{
"index": "redacted",
"index_uuid": "redacted",
"node_id": "lxtigl9JRvm1dLX0RurmUg",
"reason": "IndexShardSnapshotFailedException[com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to redacted.s3.amazonaws.com:443 [redacted.s3.amazonaws.com/52.216.176.27] failed: connect timed out]; nested: SdkClientException[Unable to execute HTTP request: Connect to redacted.s3.amazonaws.com:443 [redacted.s3.amazonaws.com/52.216.176.27] failed: connect timed out]; nested: ConnectTimeoutException[Connect to redacted.s3.amazonaws.com:443 [redacted.s3.amazonaws.com/52.216.176.27] failed: connect timed out]; nested: SocketTimeoutException[connect timed out]; ",
"shard_id": 11,
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "redacted",
"index_uuid": "redacted",
"node_id": "lxtigl9JRvm1dLX0RurmUg",
"reason": "IndexShardSnapshotFailedException[com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to redacted.s3.amazonaws.com:443 [redacted.s3.amazonaws.com/52.216.176.27] failed: connect timed out]; nested: SdkClientException[Unable to execute HTTP request: Connect to redacted.s3.amazonaws.com:443 [redacted.s3.amazonaws.com/52.216.176.27] failed: connect timed out]; nested: ConnectTimeoutException[Connect to redacted.s3.amazonaws.com:443 [redacted.s3.amazonaws.com/52.216.176.27] failed: connect timed out]; nested: SocketTimeoutException[connect timed out]; ",
"shard_id": 13,
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "redacted",
"index_uuid": "redacted",
"node_id": "lxtigl9JRvm1dLX0RurmUg",
"reason": "IndexShardSnapshotFailedException[com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to redacted.s3.amazonaws.com:443 [redacted.s3.amazonaws.com/52.216.176.27] failed: connect timed out]; nested: SdkClientException[Unable to execute HTTP request: Connect to redacted.s3.amazonaws.com:443 [redacted.s3.amazonaws.com/52.216.176.27] failed: connect timed out]; nested: ConnectTimeoutException[Connect to redacted.s3.amazonaws.com:443 [redacted.s3.amazonaws.com/52.216.176.27] failed: connect timed out]; nested: SocketTimeoutException[connect timed out]; ",
"shard_id": 11,
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "redacted",
"index_uuid": "redacted",
"node_id": "lxtigl9JRvm1dLX0RurmUg",
"reason": "IndexShardSnapshotFailedException[com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to redacted.s3.amazonaws.com:443 [redacted.s3.amazonaws.com/52.216.176.27] failed: connect timed out]; nested: SdkClientException[Unable to execute HTTP request: Connect to redacted.s3.amazonaws.com:443 [redacted.s3.amazonaws.com/52.216.176.27] failed: connect timed out]; nested: ConnectTimeoutException[Connect to redacted.s3.amazonaws.com:443 [redacted.s3.amazonaws.com/52.216.176.27] failed: connect timed out]; nested: SocketTimeoutException[connect timed out]; ",
"shard_id": 14,
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "redacted",
"index_uuid": "redacted",
"node_id": "lxtigl9JRvm1dLX0RurmUg",
"reason": "IndexShardSnapshotFailedException[com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to redacted.s3.amazonaws.com:443 [redacted.s3.amazonaws.com/52.216.176.27] failed: connect timed out]; nested: SdkClientException[Unable to execute HTTP request: Connect to redacted.s3.amazonaws.com:443 [redacted.s3.amazonaws.com/52.216.176.27] failed: connect timed out]; nested: ConnectTimeoutException[Connect to redacted.s3.amazonaws.com:443 [redacted.s3.amazonaws.com/52.216.176.27] failed: connect timed out]; nested: SocketTimeoutException[connect timed out]; ",
"shard_id": 12,
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "redacted",
"index_uuid": "redacted",
"node_id": "lxtigl9JRvm1dLX0RurmUg",
"reason": "IndexShardSnapshotFailedException[com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to redacted.s3.amazonaws.com:443 [redacted.s3.amazonaws.com/52.216.176.27] failed: connect timed out]; nested: SdkClientException[Unable to execute HTTP request: Connect to redacted.s3.amazonaws.com:443 [redacted.s3.amazonaws.com/52.216.176.27] failed: connect timed out]; nested: ConnectTimeoutException[Connect to redacted.s3.amazonaws.com:443 [redacted.s3.amazonaws.com/52.216.176.27] failed: connect timed out]; nested: SocketTimeoutException[connect timed out]; ",
"shard_id": 19,
"status": "INTERNAL_SERVER_ERROR"
}
],
"include_global_state": true,
"indices": [
"redacted",
"redacted",
"redacted",
"redacted",
"redacted",
"redacted",
"redacted",
"redacted",
"redacted",
"redacted"
],
"shards": {
"failed": 6,
"successful": 194,
"total": 200
},
"snapshot": "redacted_2019-02-02t00:00:03z",
"start_time": "2019-02-02T00:00:03.479Z",
"start_time_in_millis": 1549065603479,
"state": "PARTIAL",
"uuid": "mbCeCSWuTXWkauPLgNq3Hg",
"version": "6.3.0",
"version_id": 6030099
}
Note: All 6 shards that failed to snapshot were on the same node, lxtigl9JRvm1dLX0RurmUg. Every subsequent snapshot attempt fails with the exact same error, and only this single node times out connecting to S3.
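For anyone who wants to check the same thing on their own cluster, here is a rough sketch of how the failures in a snapshot response can be grouped by node. The function name `failures_by_node` and the `sample` dict are illustrative only; `sample` is a trimmed stand-in for the response above and keeps just the fields the function reads.

```python
from collections import defaultdict


def failures_by_node(snapshot_info):
    """Group (index, shard_id) pairs from a snapshot's failures by node_id."""
    grouped = defaultdict(list)
    for failure in snapshot_info.get("failures", []):
        grouped[failure["node_id"]].append((failure["index"], failure["shard_id"]))
    return dict(grouped)


# Trimmed stand-in for the snapshot response above (index names are redacted).
sample = {
    "failures": [
        {"index": "redacted", "node_id": "lxtigl9JRvm1dLX0RurmUg", "shard_id": shard}
        for shard in (11, 13, 11, 14, 12, 19)
    ]
}

for node_id, shards in failures_by_node(sample).items():
    print(f"{node_id}: {len(shards)} failed shard(s)")
```

A single key in the result means every failure came from one node, which is what we see here.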
If we restart the Elasticsearch process on this host and wait for the cluster to go green, the next snapshot succeeds.
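The "wait for green" step of that workaround can be scripted roughly as below. This is only a sketch: it assumes the cluster is reachable at `http://localhost:9200` without authentication, and the helper names `http_get_json` and `cluster_is_green` are made up for illustration. It uses the cluster health API with `wait_for_status=green`. The 408 handling is an assumption about how a health-check timeout is reported over HTTP.

```python
import json
import urllib.error
import urllib.request


def http_get_json(url):
    """GET a URL and decode the JSON body."""
    try:
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)
    except urllib.error.HTTPError as err:
        # Assumption: a health wait that times out comes back as HTTP 408
        # with the usual JSON body, so decode it instead of failing.
        if err.code == 408:
            return json.load(err)
        raise


def cluster_is_green(base_url="http://localhost:9200", timeout="60s", fetch=http_get_json):
    """Block until the cluster reports green or the server-side timeout expires."""
    url = f"{base_url}/_cluster/health?wait_for_status=green&timeout={timeout}"
    health = fetch(url)
    return health.get("status") == "green" and not health.get("timed_out", False)
```

After restarting the node, calling `cluster_is_green()` in a loop until it returns `True` tells us when it is safe to trigger the snapshot again. The `fetch` parameter exists only so the check can be exercised without a live cluster.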
This is a pretty serious problem for us: we need to be able to reliably take snapshots every 24 hours. If you need more information from us to get this triaged, please let us know.