HADOOP-16792: Make S3 client request timeout configurable #1795
Conversation
Force-pushed a1faf69 to 980fd0f
Thanks for working on this @mustafaiman!
Looking great as a start, but there are some things still to work on.
testRequestTimeout looks great, but it does not test the actual functionality, only that the parameter has been passed to the S3 client. Could you add an integration test case that actually exercises the functionality, so we can assert that a real request times out?
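A minimal sketch of what such an integration test might look like, assuming the new timeout is exposed as a configuration key; the key name, value, and failure mode here are assumptions for illustration, not the final implementation:

```java
import static org.junit.Assert.fail;

import java.io.FileNotFoundException;
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.junit.Test;

/**
 * Hypothetical integration test: set an unrealistically small request
 * timeout and expect an S3A operation (or the filesystem initialization
 * itself) to fail with an IOException from the timed-out request.
 */
public class ITestS3ARequestTimeoutSketch {

  @Test
  public void testTinyRequestTimeoutFails() throws Exception {
    Configuration conf = new Configuration();
    // assumed key name; the real constant would live in o.a.h.fs.s3a.Constants
    conf.set("fs.s3a.connection.request.timeout", "1ms");
    try (FileSystem fs =
             FileSystem.newInstance(new URI("s3a://example-bucket/"), conf)) {
      fs.getFileStatus(new Path("/never-created"));
      fail("expected the request to time out");
    } catch (FileNotFoundException notTimedOut) {
      // the call actually completed, so the timeout never triggered
      throw notTimedOut;
    } catch (IOException expected) {
      // the SDK's request timeout surfaces as an IOException once translated by S3A
    }
  }
}
```

As the later testing in this thread shows, with a 1 ms timeout even the bucket-existence check during initialization times out, so the failure may come from the constructor rather than the getFileStatus call.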
Review comments (since resolved) on:
hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/index.md
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AUtils.java
hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/ITestS3AConfiguration.java
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Constants.java
sorry, I had commented but I hadn't hit the "complete review" button, so GitHub didn't submit it
There are some consequences of this feature: requests are more likely to time out. How does that affect our retry code?
Is the retry all handled in the AWS library, or is it passed up to the caller for us to handle? If it's sent up to us, what exception gets raised, and will we react to it properly?
A real problem here will be how to react to failures of non-idempotent operations. Right now we view ~all ops as idempotent, even DELETE, which isn't quite true. But if there's a timeout on the POST to complete a big upload, and that timeout happens after the request has been processed, how are things going to fail on the retry? See HADOOP-14028 there.
Anyway, like you say, it's going to be configurable. But I am curious as to what happens to a full -Dscale test run if you set it to a small value such as 1ms. Please run this and tell me what happened.
Probably the best stress tests will be ITestS3AHugeFiles*, with the size of the file to upload and the size of each upload ramped up to represent real-world numbers. This is where HADOOP-14028 surfaced.
You can set some larger numbers in auth-keys.xml:
fs.s3a.scale.test.huge.filesize = 4G
fs.s3a.scale.test.huge.partitionsize = 128M
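Equivalently, when driving these tests from code rather than auth-keys.xml, the same keys can be set on a Configuration; a hypothetical sketch (key names from the comment above, values illustrative):

```java
import org.apache.hadoop.conf.Configuration;

// Sketch only: the usual route is auth-keys.xml, but the same scale-test
// keys can be set programmatically when experimenting.
public class ScaleTestConfig {
  public static Configuration hugeFilesConf() {
    Configuration conf = new Configuration();
    conf.set("fs.s3a.scale.test.huge.filesize", "4G");        // file to upload
    conf.set("fs.s3a.scale.test.huge.partitionsize", "128M"); // size of each part
    return conf;
  }
}
```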
Further review comments (since resolved) on:
hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/ITestS3AConfiguration.java
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AUtils.java
@steveloughran thanks for the detailed review, I'll address the comments shortly
Force-pushed 543228c to 2501d26
@steveloughran I ran ITestS3AHugeFilesDiskBlocks#test_010_CreateHugeFile with some combinations.

The first experiments used the default file size and partition size for huge files. With a request timeout of 1 ms, the test file system failed to initialize: the verifyBuckets call at the beginning times out repeatedly, and the AWS SDK retries it internally up to its retry limit before giving up.

In a follow-up experiment I set the request timeout to 200 ms, which is enough for the verifyBuckets call to succeed but short enough that multipart uploads fail. Again, the AWS SDK retries these HTTP requests up to its retry limit.

Later, I ran the test with a 256M file size and a 32M partition size, with the request timeout set to 5s. My goal was to trigger a few retries due to the short request timeout but still complete the upload with the help of those retries, and that is what happened: I saw some retries due to the short timeout, but the requests were retried and the upload completed successfully. The test still failed because of another expectation in the test.

When I run the same experiment with an 8GB file and 128M partitions but a small request timeout, the test fails because the uploads cannot complete. I also ran a soak test with 8GB files and a large request timeout; this passed as expected because the timeout was high enough to let the uploads complete.

@bgaborg @steveloughran
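For anyone wanting to reproduce the combination that completed with a few retries, a hypothetical way to express it in code (the request-timeout key name is an assumption; the scale-test keys and values are the ones described above):

```java
import org.apache.hadoop.conf.Configuration;

// Sketch of the 256M/32M/5s combination described above. The request
// timeout key name is assumed for illustration.
public class RetryExperimentConfig {
  public static Configuration retriesButCompletes() {
    Configuration conf = new Configuration();
    conf.set("fs.s3a.connection.request.timeout", "5s");   // assumed key name
    conf.set("fs.s3a.scale.test.huge.filesize", "256M");
    conf.set("fs.s3a.scale.test.huge.partitionsize", "32M");
    return conf;
  }
}
```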
The build failed earlier due to a compile error in trunk. That now seems to be resolved, so I am pushing again to trigger the automated tests.
Force-pushed 2501d26 to 34913f9
🎊 +1 overall
OK, minor nits left, and I've suggested text for the documentation.
Thanks for the (rigorous) testing.
It's clear that what makes a good timeout is going to depend on which operation is taking place, and choosing a value that gives fast failover on simple HTTP queries is going to hurt bulk uploads.
I'm happy to have it in there, but I worry that people are going to choose values which give fast failover on some operations at the expense of support for bulk IO, and run the risk of generating more S3 load. People will need to be careful.
To introduce a functional test, we need a mechanism to selectively delay or fail some requests, because we want the file system initialization to succeed but a subsequent dummy operation (like getFileStatus) to be delayed. Introducing such test support is very hard, if not impossible, since hadoop-aws does not have any visibility into that mechanism.
We have the InconsistentS3AClient to simulate failures against S3; it's used for S3Guard. I've been wondering what it would take to actually simulate throttling there as well: some random probability of the client considering itself overloaded, and then a window where it blocks. Or, even better, could we actually let you configure a throttle load and have it trigger when the request rate exceeds it?
Simulating request timeouts would be simpler, but as our fault-injecting client sits above the AWS SDK, it won't be testing their internals.
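A rough sketch of the kind of probabilistic throttling injection being suggested here; this is a hypothetical helper, not the existing InconsistentS3AClient API, and the class and method names are made up:

```java
import java.util.Random;

import com.amazonaws.services.s3.model.AmazonS3Exception;

/**
 * Hypothetical fault injector: with a configurable probability, fail a
 * request with the 503 "SlowDown" error that S3 uses to signal throttling.
 */
public class ThrottlingInjector {
  private final Random random = new Random();
  private final double throttleProbability;

  public ThrottlingInjector(double throttleProbability) {
    this.throttleProbability = throttleProbability;
  }

  /** Throw a simulated throttling error on a fraction of calls. */
  public void maybeThrottle(String operation) {
    if (random.nextDouble() < throttleProbability) {
      AmazonS3Exception ex =
          new AmazonS3Exception("Simulated throttling of " + operation);
      ex.setStatusCode(503);
      ex.setErrorCode("SlowDown");
      throw ex;
    }
  }
}
```

An inconsistent-client subclass could call maybeThrottle() at the start of operations such as putObject or getObject, and the S3A retry handling should then see the 503 and back off.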
log uses {} for entries
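For context, SLF4J uses {} placeholders rather than string concatenation; a tiny illustrative example (message and variable are made up):

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class LoggingExample {
  private static final Logger LOG =
      LoggerFactory.getLogger(LoggingExample.class);

  void logTimeout(long requestTimeoutMillis) {
    // the {} placeholder is filled in only if the log level is enabled,
    // avoiding needless string building
    LOG.debug("Using S3 client request timeout of {} ms", requestTimeoutMillis);
  }
}
```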
can you make the default unit TimeUnit.SECONDS, even if you take the range in millis? People should be using seconds for this timeout
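A minimal sketch of what that could look like when the option is wired into the AWS client configuration, assuming the key and constant names shown (Configuration.getTimeDuration and ClientConfiguration.setRequestTimeout are existing APIs; everything else here is a placeholder):

```java
import java.util.concurrent.TimeUnit;

import com.amazonaws.ClientConfiguration;

import org.apache.hadoop.conf.Configuration;

public class RequestTimeoutWiring {
  // placeholder names; the real constants would live in o.a.h.fs.s3a.Constants
  static final String REQUEST_TIMEOUT_KEY = "fs.s3a.connection.request.timeout";
  static final long DEFAULT_REQUEST_TIMEOUT_SECONDS = 0; // 0 = no request timeout

  static void applyRequestTimeout(Configuration conf, ClientConfiguration awsConf) {
    // unsuffixed values are read as seconds; the SDK wants milliseconds.
    // (This simplified sketch truncates sub-second values such as "500ms".)
    long timeoutSeconds = conf.getTimeDuration(
        REQUEST_TIMEOUT_KEY, DEFAULT_REQUEST_TIMEOUT_SECONDS, TimeUnit.SECONDS);
    long timeoutMillis = TimeUnit.SECONDS.toMillis(timeoutSeconds);
    if (timeoutMillis > 0) {
      awsConf.setRequestTimeout((int) timeoutMillis);
    }
  }
}
```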
Further review comment (since resolved) on:
hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/index.md
Simulating throttling looks achievable. The AWS SDK passes throttle errors up to hadoop-aws, and that is where we catch the throttling error and implement the retry mechanism, so you can simulate throttling in InconsistentAmazonS3Client. However,
Force-pushed 5c8bb4b to 5965b63
Force-pushed 5965b63 to 27e5d26
💔 -1 overall
test failure looks like a regression of mine; should have been picked up earlier... interesting
test failure unrelated; +1, committed to trunk with the warning text in the commit message. I don't want to be the one fielding support calls about writes not working...
@steveloughran @bgaborg thank you for the reviews.
NOTICE
Please create an issue in ASF JIRA before opening a pull request, and set the title of the pull request to start with the corresponding JIRA issue number (e.g. HADOOP-XXXXX. Fix a typo in YYY.)
For more details, please see https://cwiki.apache.org/confluence/display/HADOOP/How+To+Contribute