
intermittent MD5 checksum failures when snapshotting using repository-s3 #25269

@joachimdraeger

Description


Elasticsearch version: 5.4.0, master

Plugins installed: repository-s3

JVM version: 1.8.0_131-b11

OS version: Darwin Kernel Version 15.4.0: Fri Feb 26 22:08:05 PST 2016; root:xnu-3248.40.184~3/RELEASE_X86_64 x86_64

Description of the problem including expected versus actual behavior:

We are performing hourly snapshots of 10 daily indices. Every day there are 2 to 5 failures, on either the snapshot or the delete-snapshot operation, caused by an exception like this:
AmazonS3Exception: MD5 local [882AktFOIjwJNFwKqtfj+Q==], remote [LMtTgWbPjpF+Ia25VQtfwg==] are not equal...
We never experienced such issues on 2.4.5, and S3 uploads using the AWS library should be quite reliable.

Analysis:
AmazonHttpClient uses mark/reset on the InputStream to retry failed uploads.
The checksum used by the repository-s3 code is calculated in DefaultS3OutputStream.doUpload() using a java.security.DigestInputStream, which is not retry-safe: mark/reset is delegated to the wrapped stream, but the MessageDigest is never rewound.
When AmazonHttpClient retries after some bytes have already been streamed through the digest, those bytes are hashed a second time and the resulting checksum is corrupted.
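
To make this concrete, here is a minimal standalone sketch (plain JDK, not the repository-s3 code itself) that simulates a retried read with mark/reset and shows the digest coming out wrong:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.util.Base64;

public class DigestResetDemo {
    public static void main(String[] args) throws Exception {
        byte[] payload = "some blob content".getBytes(StandardCharsets.UTF_8);

        MessageDigest md = MessageDigest.getInstance("MD5");
        DigestInputStream in = new DigestInputStream(new ByteArrayInputStream(payload), md);

        in.mark(payload.length);
        in.read(new byte[8]);          // first attempt: 8 bytes streamed and hashed
        in.reset();                    // retry: stream position rewinds, the digest does not
        while (in.read(new byte[8]) != -1) {
            // second attempt re-reads everything, so the first 8 bytes are hashed twice
        }

        String afterRetry = Base64.getEncoder().encodeToString(md.digest());
        String expected = Base64.getEncoder().encodeToString(
                MessageDigest.getInstance("MD5").digest(payload));
        System.out.println("digest after retry: " + afterRetry);
        System.out.println("expected digest   : " + expected);  // the two differ
    }
}
```

The same double-hashing happens inside doUpload() whenever AmazonHttpClient resets the stream for a retry.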

The checksum calculation in the repository-s3 code is redundant, because AWS' AmazonS3Client.putObject() already calculates a checksum locally and compares it with the one computed on the server side. AWS' own implementation, MD5DigestCalculatingInputStream, does support mark/reset correctly on retries.

The checksum calculation in the repository-s3 code also contains another bug: it compares putObjectResult.getContentMd5() with its local checksum, but getContentMd5() is itself the local checksum calculated by the S3 client. The remotely calculated MD5 is returned in the ETag property.
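
For illustration, a simplified sketch of that comparison (not the exact DefaultS3OutputStream source; AWS SDK for Java 1.x types, bucket/key are placeholders):

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.ObjectMetadata;
import com.amazonaws.services.s3.model.PutObjectResult;
import java.io.ByteArrayInputStream;
import java.security.MessageDigest;
import java.util.Base64;

class Md5ComparisonSketch {
    static void upload(AmazonS3 s3, String bucket, String key, byte[] blob) throws Exception {
        ObjectMetadata meta = new ObjectMetadata();
        meta.setContentLength(blob.length);
        PutObjectResult result = s3.putObject(bucket, key, new ByteArrayInputStream(blob), meta);

        String localMd5 = Base64.getEncoder()
                .encodeToString(MessageDigest.getInstance("MD5").digest(blob));

        // Checking against getContentMd5() only compares two locally computed digests:
        // the SDK fills this field with the Base64 MD5 it calculated on the client side.
        boolean comparesLocalWithLocal = localMd5.equals(result.getContentMd5());

        // The server-computed MD5 (for a single-part, non-KMS upload) is the ETag, in hex.
        String remoteMd5Hex = result.getETag();
    }
}
```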

The solution is to remove the faulty MD5 code from repository-s3, because the S3 client already takes care of the checksum comparison in a retry-safe way.
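
A rough sketch of that retry-safe direction (not the actual patch): hand the stream to putObject() and rely on the SDK's own validation, with no extra DigestInputStream wrapping:

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.ObjectMetadata;
import com.amazonaws.services.s3.model.PutObjectRequest;
import java.io.InputStream;

class RetrySafeUploadSketch {
    static void uploadBlob(AmazonS3 s3, String bucket, String key, InputStream blob, long length) {
        ObjectMetadata meta = new ObjectMetadata();
        meta.setContentLength(length);
        // The SDK wraps the stream in its own MD5DigestCalculatingInputStream (which
        // restores the digest state on reset), compares the result with the returned
        // ETag, and raises an exception if they do not match, so no checksum code is
        // needed here.
        s3.putObject(new PutObjectRequest(bucket, key, blob, meta));
    }
}
```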

Steps to reproduce:

The error will eventually occur in production, depending on how stable your network connection to S3 is.

In a test environment the error will eventually occur when creating snapshots in a loop.

There is also a way to exhibit the behavior deterministically using a debugger:

  1. Enable remote debugging on a local ES installation
  2. Create an index
  3. Register S3 repository
  4. Put a breakpoint into the InputStream chain after the java.security.DigestInputStream, on the return statement of a public int read(byte[] b, int off, int len) method, e.g. in MD5DigestCalculatingInputStream line 102 (for master).
  5. Perform a snapshot
  6. When the breakpoint is hit, the DigestInputStream has already updated its hash. Disable your network connection and continue the program.
  7. The next time the breakpoint is hit, the S3 client is retrying. Re-enable the network connection and continue the program.
  8. Observe the Exception in the logs.

Provide logs (if relevant):

[2017-06-09T16:50:02,172][INFO ][o.e.s.SnapshotShardsService] [xrZgqIf] snapshot [elasticsearch-local:20170609t1525-140/4UI1XpYuRhutCG82cfUThA] is done
[2017-06-09T16:50:13,913][WARN ][o.e.s.SnapshotShardsService] [xrZgqIf] [[index4][3]] [elasticsearch-local:20170609t1525-141/SYTtvRUGSiyEBtsMmBvWMw] failed to create snapshot
org.elasticsearch.index.snapshots.IndexShardSnapshotFailedException: Failed to write commit point
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext.snapshot(BlobStoreRepository.java:1231) ~[elasticsearch-6.0.0-alpha3-SNAPSHOT.jar:6.0.0-alpha3-SNAPSHOT]
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotShard(BlobStoreRepository.java:815) ~[elasticsearch-6.0.0-alpha3-SNAPSHOT.jar:6.0.0-alpha3-SNAPSHOT]
	at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:380) ~[elasticsearch-6.0.0-alpha3-SNAPSHOT.jar:6.0.0-alpha3-SNAPSHOT]
	at org.elasticsearch.snapshots.SnapshotShardsService.access$200(SnapshotShardsService.java:88) ~[elasticsearch-6.0.0-alpha3-SNAPSHOT.jar:6.0.0-alpha3-SNAPSHOT]
	at org.elasticsearch.snapshots.SnapshotShardsService$1.doRun(SnapshotShardsService.java:334) [elasticsearch-6.0.0-alpha3-SNAPSHOT.jar:6.0.0-alpha3-SNAPSHOT]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) [elasticsearch-6.0.0-alpha3-SNAPSHOT.jar:6.0.0-alpha3-SNAPSHOT]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.0.0-alpha3-SNAPSHOT.jar:6.0.0-alpha3-SNAPSHOT]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
Caused by: java.io.IOException: Unable to upload object elasticsearch/local600/indices/5v3fWPPXQdiTuLkWoXgs1Q/3/snap-SYTtvRUGSiyEBtsMmBvWMw.dat
	at org.elasticsearch.repositories.s3.DefaultS3OutputStream.upload(DefaultS3OutputStream.java:112) ~[?:?]
	at org.elasticsearch.repositories.s3.DefaultS3OutputStream.flush(DefaultS3OutputStream.java:99) ~[?:?]
	at org.elasticsearch.repositories.s3.S3OutputStream.flushBuffer(S3OutputStream.java:69) ~[?:?]
	at org.elasticsearch.repositories.s3.S3OutputStream.close(S3OutputStream.java:87) ~[?:?]
	at org.elasticsearch.repositories.s3.S3BlobContainer.lambda$writeBlob$2(S3BlobContainer.java:98) ~[?:?]
	at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_131]
	at org.elasticsearch.repositories.s3.SocketAccess.doPrivilegedIOException(SocketAccess.java:48) ~[?:?]
	at org.elasticsearch.repositories.s3.S3BlobContainer.writeBlob(S3BlobContainer.java:95) ~[?:?]
	at org.elasticsearch.repositories.blobstore.ChecksumBlobStoreFormat.writeBlob(ChecksumBlobStoreFormat.java:187) ~[elasticsearch-6.0.0-alpha3-SNAPSHOT.jar:6.0.0-alpha3-SNAPSHOT]
	at org.elasticsearch.repositories.blobstore.ChecksumBlobStoreFormat.write(ChecksumBlobStoreFormat.java:157) ~[elasticsearch-6.0.0-alpha3-SNAPSHOT.jar:6.0.0-alpha3-SNAPSHOT]
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext.snapshot(BlobStoreRepository.java:1229) ~[elasticsearch-6.0.0-alpha3-SNAPSHOT.jar:6.0.0-alpha3-SNAPSHOT]
	... 9 more
Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: MD5 local [882AktFOIjwJNFwKqtfj+Q==], remote [LMtTgWbPjpF+Ia25VQtfwg==] are not equal... (Service: null; Status Code: 0; Error Code: null; Request ID: null)
	at org.elasticsearch.repositories.s3.DefaultS3OutputStream.doUpload(DefaultS3OutputStream.java:145) ~[?:?]
	at org.elasticsearch.repositories.s3.DefaultS3OutputStream.upload(DefaultS3OutputStream.java:110) ~[?:?]
	at org.elasticsearch.repositories.s3.DefaultS3OutputStream.flush(DefaultS3OutputStream.java:99) ~[?:?]
	at org.elasticsearch.repositories.s3.S3OutputStream.flushBuffer(S3OutputStream.java:69) ~[?:?]
	at org.elasticsearch.repositories.s3.S3OutputStream.close(S3OutputStream.java:87) ~[?:?]
	at org.elasticsearch.repositories.s3.S3BlobContainer.lambda$writeBlob$2(S3BlobContainer.java:98) ~[?:?]
	at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_131]
	at org.elasticsearch.repositories.s3.SocketAccess.doPrivilegedIOException(SocketAccess.java:48) ~[?:?]
	at org.elasticsearch.repositories.s3.S3BlobContainer.writeBlob(S3BlobContainer.java:95) ~[?:?]
	at org.elasticsearch.repositories.blobstore.ChecksumBlobStoreFormat.writeBlob(ChecksumBlobStoreFormat.java:187) ~[elasticsearch-6.0.0-alpha3-SNAPSHOT.jar:6.0.0-alpha3-SNAPSHOT]
	at org.elasticsearch.repositories.blobstore.ChecksumBlobStoreFormat.write(ChecksumBlobStoreFormat.java:157) ~[elasticsearch-6.0.0-alpha3-SNAPSHOT.jar:6.0.0-alpha3-SNAPSHOT]
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext.snapshot(BlobStoreRepository.java:1229) ~[elasticsearch-6.0.0-alpha3-SNAPSHOT.jar:6.0.0-alpha3-SNAPSHOT]
	... 9 more
[2017-06-09T16:50:16,866][INFO ][o.e.s.SnapshotShardsService] [xrZgqIf] snapshot [elasticsearch-local:20170609t1525-141/SYTtvRUGSiyEBtsMmBvWMw] is done
