Description
Elasticsearch version: 5.4.0, master
Plugins installed: repository-s3
JVM version: 1.8.0_131-b11
OS version: Darwin Kernel Version 15.4.0: Fri Feb 26 22:08:05 PST 2016; root:xnu-3248.40.184~3/RELEASE_X86_64 x86_64
Description of the problem including expected versus actual behavior:
We are performing hourly snapshots of 10 daily indices. Every day there are 2 to 5 failures, on either the snapshot or the delete-snapshot operation, caused by an exception like this:
AmazonS3Exception: MD5 local [882AktFOIjwJNFwKqtfj+Q==], remote [LMtTgWbPjpF+Ia25VQtfwg==] are not equal...
We never experienced such issues on 2.4.5, and S3 uploads via the AWS SDK should be quite reliable.
Analysis:
AmazonHttpClient uses mark/reset on the InputStream to retry failed uploads.
The checksum used by the repository-s3 code is calculated in DefaultS3OutputStream.doUpload() using a java.security.DigestInputStream. That stream delegates mark/reset to the underlying stream but never rewinds its digest.
When AmazonHttpClient retries after some bytes have already been streamed through the digest, those bytes are hashed twice and the checksum is corrupted.
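The corruption can be demonstrated with plain JDK classes, independent of the AWS SDK (the class name and payload below are illustrative only):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.util.Arrays;

public class DigestResetDemo {

    // Streams the data through a DigestInputStream with a mark/reset in the
    // middle (as AmazonHttpClient does on a retry) and returns true only if
    // the resulting digest still matches a clean single-pass MD5.
    static boolean digestSurvivesRetry(byte[] data) {
        try {
            byte[] expected = MessageDigest.getInstance("MD5").digest(data);

            MessageDigest md = MessageDigest.getInstance("MD5");
            DigestInputStream in =
                    new DigestInputStream(new ByteArrayInputStream(data), md);
            in.mark(data.length);
            in.read(new byte[8], 0, 8);  // first (failed) attempt hashes some bytes
            in.reset();                  // retry rewinds the stream, NOT the digest
            byte[] buf = new byte[16];
            while (in.read(buf, 0, buf.length) != -1) {
                // second attempt re-reads everything; the digest sees it twice
            }
            return Arrays.equals(expected, md.digest());
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = "hourly snapshot payload".getBytes(StandardCharsets.UTF_8);
        System.out.println(digestSurvivesRetry(data)); // prints false
    }
}
```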
The checksum calculation in the repository-s3 code is redundant: AWS' AmazonS3Client.putObject() already calculates a checksum locally and compares it with the server-side one, and AWS' implementation, MD5DigestCalculatingInputStream, does support mark/reset across retries.
The checksum calculation in the repository-s3 code also contains a second bug: it compares putObjectResult.getContentMd5() with its own local checksum, but getContentMd5() is itself the local checksum calculated by the S3 client, so the comparison never sees the remote value. The remotely calculated MD5 is returned in the ETag property.
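As an aside, the two values are not even expressed in the same encoding: the Content-MD5 header (echoed back by getContentMd5()) carries a base64 digest, while S3's ETag for a simple, non-multipart PUT is the lowercase hex digest. A small sketch, with an illustrative class name and payload:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Base64;

public class Md5Encodings {

    static byte[] md5(byte[] data) {
        try {
            return MessageDigest.getInstance("MD5").digest(data);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Base64 form, as sent in the Content-MD5 request header.
    static String contentMd5(byte[] data) {
        return Base64.getEncoder().encodeToString(md5(data));
    }

    // Lowercase hex form, as S3 returns it in the ETag for a simple PUT.
    static String etagStyle(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : md5(data)) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] blob = "snapshot blob".getBytes(StandardCharsets.UTF_8);
        System.out.println(contentMd5(blob)); // 24-character base64 string
        System.out.println(etagStyle(blob));  // 32-character hex string
    }
}
```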
The solution is to remove the faulty MD5 code from repository-s3, because the S3 client already takes care of the checksum comparison in a retry-safe way.
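For illustration only, a retry-safe digest stream would snapshot the digest state on mark() and restore it on reset(), which is essentially how AWS' MD5DigestCalculatingInputStream stays correct across retries. The class below is a hypothetical sketch, not the actual fix (the fix simply deletes the redundant check):

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;

// Hypothetical sketch: a digest stream whose mark/reset also saves and
// restores the digest state, so a retried upload re-hashes from the mark.
public class ResettableDigestInputStream extends FilterInputStream {

    private MessageDigest digest;
    private MessageDigest markedState;

    public ResettableDigestInputStream(InputStream in, MessageDigest digest) {
        super(in);
        this.digest = digest;
    }

    @Override
    public int read() throws IOException {
        int b = in.read();
        if (b != -1) {
            digest.update((byte) b);
        }
        return b;
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        int n = in.read(b, off, len);
        if (n > 0) {
            digest.update(b, off, n);
        }
        return n;
    }

    @Override
    public synchronized void mark(int readlimit) {
        in.mark(readlimit);
        try {
            // The JDK's MD5 implementation supports clone().
            markedState = (MessageDigest) digest.clone();
        } catch (CloneNotSupportedException e) {
            markedState = null;
        }
    }

    @Override
    public synchronized void reset() throws IOException {
        if (markedState == null) {
            throw new IOException("digest state was not cloneable at mark()");
        }
        in.reset();
        try {
            digest = (MessageDigest) markedState.clone();
        } catch (CloneNotSupportedException e) {
            throw new IOException(e);
        }
    }

    public byte[] digestValue() {
        return digest.digest();
    }

    // Round-trip check: mark, partially read, reset, read fully; returns true
    // if the final digest matches a clean single-pass MD5 of the data.
    static boolean demoRoundTrip(byte[] data) {
        try {
            byte[] expected = MessageDigest.getInstance("MD5").digest(data);
            ResettableDigestInputStream in = new ResettableDigestInputStream(
                    new ByteArrayInputStream(data), MessageDigest.getInstance("MD5"));
            in.mark(data.length);
            in.read(new byte[8], 0, 8);  // simulated failed first attempt
            in.reset();                  // rewinds the stream AND the digest
            byte[] buf = new byte[16];
            while (in.read(buf, 0, buf.length) != -1) { /* retry reads all */ }
            return Arrays.equals(expected, in.digestValue());
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = "retried upload payload".getBytes(StandardCharsets.UTF_8);
        System.out.println(demoRoundTrip(data)); // prints true
    }
}
```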
Steps to reproduce:
In production the error occurs eventually, depending on how stable the network connection to S3 is.
In a test environment it can be provoked by creating snapshots in a loop.
The behavior can also be exhibited deterministically with a debugger:
- Enable remote debugging on a local ES installation
- Create an index
- Register S3 repository
- Put a break-point into the InputStream chain after the java.security.DigestInputStream, on the return statement of a public int read(byte[] b, int off, int len) method, e.g. in MD5DigestCalculatingInputStream line 102 (for master).
- Perform a snapshot.
- When the breakpoint is hit, the DigestInputStream has already updated its hash. Disable your network connection and continue the program.
- The next time the breakpoint is hit, the S3 client is retrying. Enable the network connection and continue the program.
- Observe the Exception in the logs.
Provide logs (if relevant):
[2017-06-09T16:50:02,172][INFO ][o.e.s.SnapshotShardsService] [xrZgqIf] snapshot [elasticsearch-local:20170609t1525-140/4UI1XpYuRhutCG82cfUThA] is done
[2017-06-09T16:50:13,913][WARN ][o.e.s.SnapshotShardsService] [xrZgqIf] [[index4][3]] [elasticsearch-local:20170609t1525-141/SYTtvRUGSiyEBtsMmBvWMw] failed to create snapshot
org.elasticsearch.index.snapshots.IndexShardSnapshotFailedException: Failed to write commit point
at org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext.snapshot(BlobStoreRepository.java:1231) ~[elasticsearch-6.0.0-alpha3-SNAPSHOT.jar:6.0.0-alpha3-SNAPSHOT]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotShard(BlobStoreRepository.java:815) ~[elasticsearch-6.0.0-alpha3-SNAPSHOT.jar:6.0.0-alpha3-SNAPSHOT]
at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:380) ~[elasticsearch-6.0.0-alpha3-SNAPSHOT.jar:6.0.0-alpha3-SNAPSHOT]
at org.elasticsearch.snapshots.SnapshotShardsService.access$200(SnapshotShardsService.java:88) ~[elasticsearch-6.0.0-alpha3-SNAPSHOT.jar:6.0.0-alpha3-SNAPSHOT]
at org.elasticsearch.snapshots.SnapshotShardsService$1.doRun(SnapshotShardsService.java:334) [elasticsearch-6.0.0-alpha3-SNAPSHOT.jar:6.0.0-alpha3-SNAPSHOT]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) [elasticsearch-6.0.0-alpha3-SNAPSHOT.jar:6.0.0-alpha3-SNAPSHOT]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.0.0-alpha3-SNAPSHOT.jar:6.0.0-alpha3-SNAPSHOT]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
Caused by: java.io.IOException: Unable to upload object elasticsearch/local600/indices/5v3fWPPXQdiTuLkWoXgs1Q/3/snap-SYTtvRUGSiyEBtsMmBvWMw.dat
at org.elasticsearch.repositories.s3.DefaultS3OutputStream.upload(DefaultS3OutputStream.java:112) ~[?:?]
at org.elasticsearch.repositories.s3.DefaultS3OutputStream.flush(DefaultS3OutputStream.java:99) ~[?:?]
at org.elasticsearch.repositories.s3.S3OutputStream.flushBuffer(S3OutputStream.java:69) ~[?:?]
at org.elasticsearch.repositories.s3.S3OutputStream.close(S3OutputStream.java:87) ~[?:?]
at org.elasticsearch.repositories.s3.S3BlobContainer.lambda$writeBlob$2(S3BlobContainer.java:98) ~[?:?]
at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_131]
at org.elasticsearch.repositories.s3.SocketAccess.doPrivilegedIOException(SocketAccess.java:48) ~[?:?]
at org.elasticsearch.repositories.s3.S3BlobContainer.writeBlob(S3BlobContainer.java:95) ~[?:?]
at org.elasticsearch.repositories.blobstore.ChecksumBlobStoreFormat.writeBlob(ChecksumBlobStoreFormat.java:187) ~[elasticsearch-6.0.0-alpha3-SNAPSHOT.jar:6.0.0-alpha3-SNAPSHOT]
at org.elasticsearch.repositories.blobstore.ChecksumBlobStoreFormat.write(ChecksumBlobStoreFormat.java:157) ~[elasticsearch-6.0.0-alpha3-SNAPSHOT.jar:6.0.0-alpha3-SNAPSHOT]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext.snapshot(BlobStoreRepository.java:1229) ~[elasticsearch-6.0.0-alpha3-SNAPSHOT.jar:6.0.0-alpha3-SNAPSHOT]
... 9 more
Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: MD5 local [882AktFOIjwJNFwKqtfj+Q==], remote [LMtTgWbPjpF+Ia25VQtfwg==] are not equal... (Service: null; Status Code: 0; Error Code: null; Request ID: null)
at org.elasticsearch.repositories.s3.DefaultS3OutputStream.doUpload(DefaultS3OutputStream.java:145) ~[?:?]
at org.elasticsearch.repositories.s3.DefaultS3OutputStream.upload(DefaultS3OutputStream.java:110) ~[?:?]
at org.elasticsearch.repositories.s3.DefaultS3OutputStream.flush(DefaultS3OutputStream.java:99) ~[?:?]
at org.elasticsearch.repositories.s3.S3OutputStream.flushBuffer(S3OutputStream.java:69) ~[?:?]
at org.elasticsearch.repositories.s3.S3OutputStream.close(S3OutputStream.java:87) ~[?:?]
at org.elasticsearch.repositories.s3.S3BlobContainer.lambda$writeBlob$2(S3BlobContainer.java:98) ~[?:?]
at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_131]
at org.elasticsearch.repositories.s3.SocketAccess.doPrivilegedIOException(SocketAccess.java:48) ~[?:?]
at org.elasticsearch.repositories.s3.S3BlobContainer.writeBlob(S3BlobContainer.java:95) ~[?:?]
at org.elasticsearch.repositories.blobstore.ChecksumBlobStoreFormat.writeBlob(ChecksumBlobStoreFormat.java:187) ~[elasticsearch-6.0.0-alpha3-SNAPSHOT.jar:6.0.0-alpha3-SNAPSHOT]
at org.elasticsearch.repositories.blobstore.ChecksumBlobStoreFormat.write(ChecksumBlobStoreFormat.java:157) ~[elasticsearch-6.0.0-alpha3-SNAPSHOT.jar:6.0.0-alpha3-SNAPSHOT]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext.snapshot(BlobStoreRepository.java:1229) ~[elasticsearch-6.0.0-alpha3-SNAPSHOT.jar:6.0.0-alpha3-SNAPSHOT]
... 9 more
[2017-06-09T16:50:16,866][INFO ][o.e.s.SnapshotShardsService] [xrZgqIf] snapshot [elasticsearch-local:20170609t1525-141/SYTtvRUGSiyEBtsMmBvWMw] is done