
Conversation

@mustafaiman
Contributor

NOTICE

Please create an issue in ASF JIRA before opening a pull request,
and you need to set the title of the pull request which starts with
the corresponding JIRA issue number. (e.g. HADOOP-XXXXX. Fix a typo in YYY.)
For more details, please see https://cwiki.apache.org/confluence/display/HADOOP/How+To+Contribute

@mustafaiman mustafaiman force-pushed the HADOOP-16792 branch 2 times, most recently from a1faf69 to 980fd0f Compare January 11, 2020 00:17
@bgaborg bgaborg self-requested a review January 17, 2020 12:10

@bgaborg bgaborg left a comment


Thanks for working on this @mustafaiman!
Looking great as a start, but there are some things still to work on.
testRequestTimeout looks good, but it does not test the actual functionality, only that the parameter has been passed to the S3 client. Could you add an integration test case that exercises the functionality, so we can assert that a real request actually times out?

@bgaborg bgaborg requested a review from steveloughran January 17, 2020 12:36
@steveloughran
Contributor

sorry, I had commented but I hadn't hit the "complete review" button, so GitHub didn't submit it

@apache apache deleted a comment from hadoop-yetus Jan 17, 2020
Contributor

@steveloughran steveloughran left a comment


There are some consequences of this feature: requests are more likely to time out. How does that affect our retry code?

Is the retry all handled in the AWS library, or is it passed up to the caller for us to handle? If it's sent up to us, what exception gets raised, and will we react to it properly?

A real problem here will be how to react to failures of non-idempotent operations. Right now we view ~all ops as idempotent, even DELETE, which isn't quite true. But if there's a timeout on the POST to complete a big upload, and that timeout happens after the request has been processed, how are things going to fail on the retry? See HADOOP-14028 there.

Anyway, like you say, it's going to be configurable. But I am curious what happens to a full -Dscale test run if you set it to a small value such as 1ms. Please run this and tell me what happens.

Probably the best stress tests will be ITestS3AHugeFiles*, with the size of the file to upload and the size of each upload part ramped up to represent real-world numbers. This is where HADOOP-14028 surfaced.

You can set some larger numbers in auth-keys.xml:

fs.s3a.scale.test.huge.filesize = 4G
fs.s3a.scale.test.huge.partitionsize = 128M
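
For anyone reproducing this, those settings go inside the <configuration> element of auth-keys.xml, roughly like this (illustrative snippet):

    <property>
      <name>fs.s3a.scale.test.huge.filesize</name>
      <value>4G</value>
    </property>
    <property>
      <name>fs.s3a.scale.test.huge.partitionsize</name>
      <value>128M</value>
    </property>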

@apache apache deleted a comment from hadoop-yetus Jan 21, 2020
@mustafaiman
Contributor Author

@steveloughran thanks for the detailed review, I'll address the comments shortly.

@mustafaiman mustafaiman force-pushed the HADOOP-16792 branch 3 times, most recently from 543228c to 2501d26 Compare January 22, 2020 01:15
@mustafaiman
Contributor Author

@steveloughran I ran ITestS3AHugeFilesDiskBlocks#test_010_CreateHugeFile with some combinations.

The first experiment used the default file size and partition size for huge files, with the request timeout set to 1 ms. The test file system failed to initialize, because the verifyBuckets call at startup times out repeatedly. This call is retried within the AWS SDK up to com.amazonaws.ClientConfiguration#maxErrorRetry times, a value configurable from the Hadoop side via the property fs.s3a.attempts.maximum. All of these retries are opaque to Hadoop. At the end of this retry cycle the AWS SDK returns the failure to Hadoop's Invoker, which then evaluates whether to retry the operation according to its configured retry policies. I saw that the verifyBuckets call was not retried at the Invoker level.

In a follow-up experiment, I set the request timeout to 200 ms, which is long enough for the verifyBuckets call to succeed but short enough that multipart uploads fail. Here again the AWS SDK retries the HTTP requests up to maxErrorRetry times; once a request has failed that many times, the Invoker's retry mechanism kicks in. I observed the Invoker retrying these operations up to fs.s3a.retry.limit times, conforming to the configured exponential back-off retry policy. After all these fs.s3a.retry.limit * maxErrorRetry retries, the Invoker bubbles up an AWSClientIOException to the user as shown below:

org.apache.hadoop.fs.s3a.AWSClientIOException: upload part on tests3ascale/disk/hugefile: com.amazonaws.SdkClientException: Unable to execute HTTP request: Request did not complete before the request timeout configuration.: Unable to execute HTTP request: Request did not complete before the request timeout configuration.
	at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:205)
	at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:112)
	at org.apache.hadoop.fs.s3a.Invoker.lambda$retry$4(Invoker.java:315)
	at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:407)
	at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:311)
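
As a rough illustration of how those two retry layers multiply (the property names are the real ones; the class, default values, and arithmetic below are only a sketch, not code from this patch):

    import java.util.concurrent.TimeUnit;
    import org.apache.hadoop.conf.Configuration;

    public class RetryBudgetSketch {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Defaults below are placeholders; the real defaults live in org.apache.hadoop.fs.s3a.Constants.
        long requestTimeoutMs = conf.getTimeDuration(
            "fs.s3a.connection.request.timeout", 0, TimeUnit.MILLISECONDS);
        int sdkRetries = conf.getInt("fs.s3a.attempts.maximum", 10); // maps to ClientConfiguration#maxErrorRetry
        int s3aRetries = conf.getInt("fs.s3a.retry.limit", 7);       // used by the S3A Invoker retry policy
        // Every Invoker attempt can itself be retried inside the AWS SDK, so one operation
        // may issue up to (s3aRetries + 1) * (sdkRetries + 1) HTTP requests, each bounded
        // by the request timeout (back-off sleeps not counted).
        long worstCaseRequests = (long) (s3aRetries + 1) * (sdkRetries + 1);
        System.out.println("Worst-case HTTP requests per operation: " + worstCaseRequests
            + ", each capped at " + requestTimeoutMs + " ms");
      }
    }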

Later, I ran the test with a 256M file size and 32M partition size, and set the request timeout to 5s. My goal was to force a few timeouts while still letting the upload complete through retries, and that is what happened: some part uploads failed due to the short request timeout, but they were retried and the upload completed successfully. The test still failed, because it also expects TRANSFER_PART_FAILED_EVENT to be 0, which is obviously not the case when some transfers fail and are retried. I checked S3 and verified that the file was there, and that the temporary partition files on my local drive were cleaned up.

When I ran the same experiment with an 8GB file and 128M partitions but with a small request timeout, the test failed because the uploads could not complete.

I also ran a soak test with 8GB files and a large request timeout. This passed as expected, since the timeout was high enough to let the uploads complete.

@bgaborg
I did not add a functional test here; I am repeating what we discussed offline to leave a record of the reasoning. The retry mechanism is entirely within the AWS SDK, as explained earlier in this comment. To introduce a functional test, we would need a mechanism to selectively delay or fail some requests, because we want file system initialization to succeed but a subsequent dummy operation (like getFileStatus) to be delayed. Introducing such test support is very hard, if not impossible, since hadoop-aws has no visibility into that mechanism. So we depend on the AWS SDK to maintain this behaviour, for which it exposes a configuration option. I think this is a very reasonable assumption. If the AWS SDK drops support for this feature in the future, the configuration test I added here will fail anyway: I expect the SDK either to throw an exception when an unsupported option is set on the ClientConfiguration object, or to remove the method entirely, which would cause a compile error.
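
For context, the wiring that a configuration-level test can assert on is roughly the following (an illustrative sketch only: the class name is made up, and the actual patch may parse the property differently):

    import java.util.concurrent.TimeUnit;
    import com.amazonaws.ClientConfiguration;
    import org.apache.hadoop.conf.Configuration;

    public class RequestTimeoutWiringSketch {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.set("fs.s3a.connection.request.timeout", "5s");
        // Hadoop's duration parsing handles the suffix; the AWS SDK setter wants milliseconds.
        int timeoutMs = (int) conf.getTimeDuration(
            "fs.s3a.connection.request.timeout", 0, TimeUnit.MILLISECONDS);
        ClientConfiguration awsConf = new ClientConfiguration();
        awsConf.setRequestTimeout(timeoutMs); // per-request timeout enforced inside the SDK
        // A configuration test can only assert this propagation, not the timeout behaviour itself.
        System.out.println("requestTimeout=" + awsConf.getRequestTimeout() + " ms");
      }
    }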

@steveloughran @bgaborg
I addressed the other review comments in the code. This final version has passed the hadoop-aws test suite against us-west-1.

@mustafaiman
Contributor Author

The build failed earlier due to a compile error in trunk. That now seems to be resolved; I am pushing again to trigger the automated tests.

@hadoop-yetus

🎊 +1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 1m 6s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 markdownlint 0m 0s markdownlint was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 1 new or modified test files.
_ trunk Compile Tests _
+1 💚 mvninstall 20m 28s trunk passed
+1 💚 compile 0m 32s trunk passed
+1 💚 checkstyle 0m 23s trunk passed
+1 💚 mvnsite 0m 35s trunk passed
+1 💚 shadedclient 14m 42s branch has no errors when building and testing our client artifacts.
+1 💚 javadoc 0m 25s trunk passed
+0 🆗 spotbugs 0m 58s Used deprecated FindBugs config; considering switching to SpotBugs.
+1 💚 findbugs 0m 55s trunk passed
_ Patch Compile Tests _
+1 💚 mvninstall 0m 31s the patch passed
+1 💚 compile 0m 26s the patch passed
+1 💚 javac 0m 26s the patch passed
+1 💚 checkstyle 0m 17s the patch passed
+1 💚 mvnsite 0m 31s the patch passed
+1 💚 whitespace 0m 0s The patch has no whitespace issues.
+1 💚 shadedclient 15m 5s patch has no errors when building and testing our client artifacts.
+1 💚 javadoc 0m 21s the patch passed
+1 💚 findbugs 1m 3s the patch passed
_ Other Tests _
+1 💚 unit 1m 27s hadoop-aws in the patch passed.
+1 💚 asflicense 0m 27s The patch does not generate ASF License warnings.
60m 33s
Subsystem Report/Notes
Docker Client=19.03.5 Server=19.03.5 base: https://builds.apache.org/job/hadoop-multibranch/job/PR-1795/5/artifact/out/Dockerfile
GITHUB PR #1795
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle markdownlint
uname Linux a14cf564963d 4.15.0-74-generic #84-Ubuntu SMP Thu Dec 19 08:06:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality personality/hadoop.sh
git revision trunk / 9520b2a
Default Java 1.8.0_232
Test Results https://builds.apache.org/job/hadoop-multibranch/job/PR-1795/5/testReport/
Max. process+thread count 341 (vs. ulimit of 5500)
modules C: hadoop-tools/hadoop-aws U: hadoop-tools/hadoop-aws
Console output https://builds.apache.org/job/hadoop-multibranch/job/PR-1795/5/console
versions git=2.7.4 maven=3.3.9 findbugs=3.1.0-RC1
Powered by Apache Yetus 0.11.1 https://yetus.apache.org

This message was automatically generated.

@apache apache deleted a comment from hadoop-yetus Jan 23, 2020
@apache apache deleted a comment from hadoop-yetus Jan 23, 2020
Contributor

@steveloughran steveloughran left a comment


OK, minor nits left, and I've suggested text for the documentation.

Thanks for the (rigorous) testing.

It's clear that what makes a good timeout is going to depend on which operation is taking place, and choosing a value which may trigger fast failover on simple HTTP queries is going to hurt bulk uploads.

I'm happy to have it in there, but I worry that people are going to choose values which give fast failover on some operations at the expense of bulk IO, and run the risk of generating more S3 load. People will need to be careful.
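
To put rough, purely illustrative numbers on that trade-off: uploading a 128 MB part at 10 MB/s takes about 13 seconds, so a timeout tuned to fail over a simple HEAD or GET within a second or two guarantees that such a part upload can never complete, however many times it is retried, while every retry adds more load.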

To introduce a functional test, we would need a mechanism to selectively delay or fail some requests, because we want file system initialization to succeed but a subsequent dummy operation (like getFileStatus) to be delayed. Introducing such test support is very hard, if not impossible, since hadoop-aws has no visibility into that mechanism.

We have the InconsistentAmazonS3Client to simulate failures against S3; it's used for S3Guard. I've been wondering what it would take to actually simulate throttling there as well: some random probability of the client considering itself overloaded, and then a window where it blocks. Or even better, could we let you configure a throttle load and have it trigger when the request rate exceeds it?

Simulating request timeouts would be simpler, but as our fault-injecting client sits above the AWS SDK, it won't be testing the SDK's internals.

Contributor


log uses {} for entries
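
For reference, the SLF4J parameterized style this nit refers to looks like the following (an illustrative snippet, not the patch's actual log line):

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    class TimeoutLogging {
      private static final Logger LOG = LoggerFactory.getLogger(TimeoutLogging.class);

      void report(long requestTimeoutMs) {
        // SLF4J substitutes {} lazily, so no string concatenation happens when DEBUG is off.
        LOG.debug("Using request timeout of {} ms", requestTimeoutMs);
      }
    }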

Contributor


can you make the default unit TimeUnit.SECONDS, even if you take the range in millis? People should be using seconds for this timeout.
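
One possible reading of that suggestion, as a sketch only (the helper below is hypothetical, not the patch's code): unsuffixed values are parsed as seconds and then converted to the milliseconds the SDK setter expects.

    import java.util.concurrent.TimeUnit;
    import org.apache.hadoop.conf.Configuration;

    class RequestTimeoutOption {
      static int requestTimeoutMillis(Configuration conf) {
        // Unsuffixed values ("5") are read as seconds; suffixed values ("2m") still work,
        // though sub-second settings ("500ms") would be truncated by this naive read.
        long seconds = conf.getTimeDuration("fs.s3a.connection.request.timeout", 0, TimeUnit.SECONDS);
        return (int) TimeUnit.SECONDS.toMillis(seconds);
      }
    }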

@mustafaiman
Contributor Author

@steveloughran

We have the InconsistentAmazonS3Client to simulate failures against S3; it's used for S3Guard. I've been wondering what it would take to actually simulate throttling there as well: some random probability of the client considering itself overloaded, and then a window where it blocks. Or even better, could we let you configure a throttle load and have it trigger when the request rate exceeds it?

Simulating request timeouts would be simpler, but as our fault-injecting client sits above the AWS SDK, it won't be testing the SDK's internals.

Simulating throttling looks achievable. The AWS SDK passes throttle errors up to hadoop-aws, and that is where we catch the throttling error and implement the retry mechanism, so you can simulate throttling in InconsistentAmazonS3Client. However, fs.s3a.connection.request.timeout affects the retry mechanism below the AWS SDK, so, as you already pointed out, simulating faults at that level is not useful for testing this particular configuration.

@mustafaiman mustafaiman force-pushed the HADOOP-16792 branch 3 times, most recently from 5c8bb4b to 5965b63 Compare January 23, 2020 22:18
@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 1m 8s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 markdownlint 0m 0s markdownlint was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 1 new or modified test files.
_ trunk Compile Tests _
+0 🆗 mvndep 0m 21s Maven dependency ordering for branch
+1 💚 mvninstall 20m 37s trunk passed
+1 💚 compile 17m 26s trunk passed
+1 💚 checkstyle 2m 47s trunk passed
+1 💚 mvnsite 2m 4s trunk passed
+1 💚 shadedclient 19m 53s branch has no errors when building and testing our client artifacts.
+1 💚 javadoc 1m 56s trunk passed
+0 🆗 spotbugs 1m 5s Used deprecated FindBugs config; considering switching to SpotBugs.
+1 💚 findbugs 3m 6s trunk passed
_ Patch Compile Tests _
+0 🆗 mvndep 0m 21s Maven dependency ordering for patch
+1 💚 mvninstall 1m 20s the patch passed
+1 💚 compile 16m 46s the patch passed
+1 💚 javac 16m 46s the patch passed
+1 💚 checkstyle 2m 48s the patch passed
+1 💚 mvnsite 2m 4s the patch passed
+1 💚 whitespace 0m 0s The patch has no whitespace issues.
+1 💚 xml 0m 2s The patch has no ill-formed XML file.
+1 💚 shadedclient 13m 58s patch has no errors when building and testing our client artifacts.
+1 💚 javadoc 1m 54s the patch passed
+1 💚 findbugs 3m 26s the patch passed
_ Other Tests _
-1 ❌ unit 9m 14s hadoop-common in the patch failed.
+1 💚 unit 1m 34s hadoop-aws in the patch passed.
+1 💚 asflicense 0m 46s The patch does not generate ASF License warnings.
123m 26s
Reason Tests
Failed junit tests hadoop.fs.TestHarFileSystem
Subsystem Report/Notes
Docker Client=19.03.5 Server=19.03.5 base: https://builds.apache.org/job/hadoop-multibranch/job/PR-1795/7/artifact/out/Dockerfile
GITHUB PR #1795
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient xml findbugs checkstyle markdownlint
uname Linux e04c8d8cf424 4.15.0-74-generic #84-Ubuntu SMP Thu Dec 19 08:06:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality personality/hadoop.sh
git revision trunk / 978c487
Default Java 1.8.0_232
unit https://builds.apache.org/job/hadoop-multibranch/job/PR-1795/7/artifact/out/patch-unit-hadoop-common-project_hadoop-common.txt
Test Results https://builds.apache.org/job/hadoop-multibranch/job/PR-1795/7/testReport/
Max. process+thread count 1346 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common hadoop-tools/hadoop-aws U: .
Console output https://builds.apache.org/job/hadoop-multibranch/job/PR-1795/7/console
versions git=2.7.4 maven=3.3.9 findbugs=3.1.0-RC1
Powered by Apache Yetus 0.11.1 https://yetus.apache.org

This message was automatically generated.

@apache apache deleted a comment from hadoop-yetus Jan 24, 2020
@steveloughran
Contributor

test failure looks like a regression of mine; it should have been picked up earlier... interesting

@steveloughran
Contributor

test failure unrelated;

+1

committed to trunk with the warning text in the commit message. I don't want to be the one fielding support calls about writes not working...

@mustafaiman
Contributor Author

@steveloughran @bgaborg thank you for the reviews.
