Description
Google Cloud Support was contacted by a customer who sees many 503 errors when taking snapshots to Google Cloud Storage (GCS). We believe we have traced the bug to an incorrect initialization of HttpRequest in GoogleCloudStorageService.
The example code in https://cloud.google.com/storage/transfer/create-client#retry attaches new instances of HttpBackOffUnsuccessfulResponseHandler and HttpBackOffIOExceptionHandler to every request. The plugin, however, creates only one instance of each class in the constructor (lines 142-143) and attaches those same instances to every request.
Please also consult the JavaDoc of both handler classes:
- HttpBackOffUnsuccessfulResponseHandler: "you MUST create a new instance of HttpBackOffUnsuccessfulResponseHandler with a new instance of BackOff for each instance of HttpRequest."
- HttpBackOffIOExceptionHandler: "you MUST create a new instance of HttpBackOffIOExceptionHandler with a new instance of BackOff for each instance of HttpRequest."
We believe that reusing the failure handlers causes the client to stop retrying failed requests once the number of allowed backoffs (or the backoff time) has been exhausted for a given handler. From then on, requests fail immediately on the first error. This is especially problematic because the snapshot logic in an ES cluster causes a sudden spike of write requests and is therefore very prone to temporary failures.
Please move the initialization of the failure handlers from the constructor of DefaultHttpRequestInitializer into its initialize method, so that every request gets fresh handler instances.
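To illustrate the suspected failure mode, here is a minimal, self-contained sketch. It does not use the google-http-client classes: `CountingBackOff` and `RetryHandler` are hypothetical, simplified stand-ins for a stateful BackOff and a backoff failure handler, and the retry budget is modeled as a plain counter rather than elapsed time. It compares one handler shared across all requests with a fresh handler per request.

```java
import java.util.function.Supplier;

// Hypothetical, simplified stand-ins; these are NOT the google-http-client classes.
class BackOffReuseDemo {

    /** Stand-in for a stateful BackOff: allows a fixed number of retries, then stops for good. */
    static final class CountingBackOff {
        private int remaining;

        CountingBackOff(int maxRetries) {
            this.remaining = maxRetries;
        }

        /** Returns true if another retry is allowed, consuming one unit of the budget. */
        boolean next() {
            if (remaining == 0) {
                return false;
            }
            remaining--;
            return true;
        }
    }

    /** Stand-in for a failure handler that owns exactly one back-off and never resets it. */
    static final class RetryHandler {
        private final CountingBackOff backOff;

        RetryHandler(CountingBackOff backOff) {
            this.backOff = backOff;
        }

        boolean shouldRetry() {
            return backOff.next();
        }
    }

    /**
     * Simulates `requests` requests that each hit `failuresPerRequest` transient
     * errors before they would succeed. Returns how many requests succeeded.
     */
    static int run(int requests, int failuresPerRequest, Supplier<RetryHandler> handlerForRequest) {
        int succeeded = 0;
        for (int r = 0; r < requests; r++) {
            RetryHandler handler = handlerForRequest.get();
            boolean ok = true;
            for (int f = 0; f < failuresPerRequest; f++) {
                if (!handler.shouldRetry()) {
                    ok = false; // retry budget exhausted: the request fails immediately
                    break;
                }
            }
            if (ok) {
                succeeded++;
            }
        }
        return succeeded;
    }

    public static void main(String[] args) {
        int maxRetries = 3;

        // Pattern described in this issue: one handler created once and attached to every request.
        RetryHandler shared = new RetryHandler(new CountingBackOff(maxRetries));
        int sharedOk = run(5, 2, () -> shared);

        // Requested pattern: a new handler with a new back-off for each request.
        int freshOk = run(5, 2, () -> new RetryHandler(new CountingBackOff(maxRetries)));

        System.out.println("shared handler:      " + sharedOk + " of 5 requests succeeded");
        System.out.println("per-request handler: " + freshOk + " of 5 requests succeeded");
    }
}
```

With these numbers the shared handler lets only the first request through, while the per-request handlers let all five succeed. In the plugin itself, the corresponding change would presumably be to construct `new HttpBackOffUnsuccessfulResponseHandler(new ExponentialBackOff())` and `new HttpBackOffIOExceptionHandler(new ExponentialBackOff())` inside `initialize(HttpRequest)` and pass them to `request.setUnsuccessfulResponseHandler(...)` and `request.setIOExceptionHandler(...)`.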
Thank you for your consideration!
Elasticsearch version (bin/elasticsearch --version): 5.4.3
** plugins:
analysis-icu
analysis-smartcn
repository-gcs
repository-s3
** details for GCS plugin:
Name: repository-gcs
Description: The GCS repository plugin adds Google Cloud Storage support for repositories.
Version: 5.4.3
Native Controller: false
- Classname: org.elasticsearch.repositories.gcs.GoogleCloudStoragePlugin
** OS: Debian 3.16.43-2+deb8u2
JVM version (java -version): 1.8.0_131
Description of the problem including expected versus actual behavior:
Steps to reproduce:
- Use the GCS snapshot repository for multiple snapshots, without restarting the nodes in between, in an ES cluster of more than 6 GCE VMs with an index that takes around 30-40s to snapshot.
Provide logs (if relevant):
Log snippet from the search-es-data-b-scale-1709191530 VM instance:
[2017-10-10T03:30:03,139][WARN ][o.e.s.SnapshotShardsService] [search-es-data-b-scale-1709191530] [[listings_v7][14]] [gcs-search-es-snapshots:search-es-10102017_1129_29688/Hy6gbA22Q0WVv9i8KL9PvQ] failed to create snapshot
org.elasticsearch.index.snapshots.IndexShardSnapshotFailedException: Failed to perform snapshot (index files)
at org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext.snapshot(BlobStoreRepository.java:1376) ~[elasticsearch-5.4.3.jar:5.4.3]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotShard(BlobStoreRepository.java:971) ~[elasticsearch-5.4.3.jar:5.4.3]
at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:382) ~[elasticsearch-5.4.3.jar:5.4.3]
at org.elasticsearch.snapshots.SnapshotShardsService.access$200(SnapshotShardsService.java:88) ~[elasticsearch-5.4.3.jar:5.4.3]
at org.elasticsearch.snapshots.SnapshotShardsService$1.doRun(SnapshotShardsService.java:335) [elasticsearch-5.4.3.jar:5.4.3]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) [elasticsearch-5.4.3.jar:5.4.3]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.4.3.jar:5.4.3]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException: 503 Service Unavailable
Service Unavailable
at com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:145) ~[?:?]
at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113) ~[?:?]
at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40) ~[?:?]
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:432) ~[?:?]
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352) ~[?:?]
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469) ~[?:?]
at org.elasticsearch.repositories.gcs.GoogleCloudStorageBlobStore.lambda$writeBlob$5(GoogleCloudStorageBlobStore.java:219) ~[?:?]
at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_131]
at org.elasticsearch.repositories.gcs.GoogleCloudStorageBlobStore.doPrivileged(GoogleCloudStorageBlobStore.java:333) ~[?:?]
at org.elasticsearch.repositories.gcs.GoogleCloudStorageBlobStore.writeBlob(GoogleCloudStorageBlobStore.java:213) ~[?:?]
at org.elasticsearch.repositories.gcs.GoogleCloudStorageBlobContainer.writeBlob(GoogleCloudStorageBlobContainer.java:72) ~[?:?]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext.snapshotFile(BlobStoreRepository.java:1432) ~[elasticsearch-5.4.3.jar:5.4.3]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext.snapshot(BlobStoreRepository.java:1374) ~[elasticsearch-5.4.3.jar:5.4.3]
... 9 more