
Conversation

@mustafaiman
Contributor

NOTICE

Please create an issue in ASF JIRA before opening a pull request,
and you need to set the title of the pull request which starts with
the corresponding JIRA issue number. (e.g. HADOOP-XXXXX. Fix a typo in YYY.)
For more details, please see https://cwiki.apache.org/confluence/display/HADOOP/How+To+Contribute

@mustafaiman mustafaiman force-pushed the HADOOP-16792 branch 2 times, most recently from a1faf69 to 980fd0f Compare January 11, 2020 00:17
@bgaborg bgaborg self-requested a review January 17, 2020 12:10

@bgaborg bgaborg left a comment


Thanks for working on this @mustafaiman!
Looking great as a start, but there are some things still to work on.
testRequestTimeout looks good, but it does not test the actual functionality, only that the parameter has been passed to the S3 client. Could you add an integration test case that exercises the functionality, so we can assert that a real request actually times out?

@bgaborg bgaborg requested a review from steveloughran January 17, 2020 12:36
@steveloughran
Contributor

sorry, I had commented but I hadn't hit the "complete review" button, so GitHub didn't submit it

@apache apache deleted a comment from hadoop-yetus Jan 17, 2020
Contributor

@steveloughran steveloughran left a comment


There are some consequences of this feature: requests are more likely to time out. How does that affect our retry code?

Is the retry all handled in the AWS library, or is it passed up to the caller for us to handle? If it's sent up to us, what exception gets raised, and will we react to it properly?

A real problem here will be how to react to failures of non-idempotent operations. Right now we view ~all ops as idempotent, even DELETE, which isn't quite true. But if there's a timeout on the POST to complete a big upload, and that timeout happens after the request has been processed, how are things going to fail on the retry? See HADOOP-14028 there.

Anyway, like you say, it's going to be configurable. But I am curious what happens to a full -Dscale test run if you set it to a small value such as 1ms. Please run this and tell me what happens.

Probably the best stress tests will be ITestS3AHugeFiles*, with the size of the file to upload and the size of each upload part ramped up to represent real-world numbers. This is where HADOOP-14028 surfaced.

You can set some larger numbers in auth-keys.xml:

fs.s3a.scale.test.huge.filesize = 4G
fs.s3a.scale.test.huge.partitionsize = 128M
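
For anyone reproducing this, those settings go inside the <configuration> element of auth-keys.xml, roughly like this (illustrative snippet):

    <property>
      <name>fs.s3a.scale.test.huge.filesize</name>
      <value>4G</value>
    </property>
    <property>
      <name>fs.s3a.scale.test.huge.partitionsize</name>
      <value>128M</value>
    </property>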

@apache apache deleted a comment from hadoop-yetus Jan 21, 2020
@mustafaiman
Contributor Author

@steveloughran thanks for the detailed review, I'll address the comments shortly.

@mustafaiman mustafaiman force-pushed the HADOOP-16792 branch 3 times, most recently from 543228c to 2501d26 Compare January 22, 2020 01:15
@mustafaiman
Contributor Author

@steveloughran I ran ITestS3AHugeFilesDiskBlocks#test_010_CreateHugeFile with some combinations.

The first experiment used the default file size and partition size for huge files, with the request timeout set to 1 ms. The test file system failed to initialize, because the verifyBuckets call at startup times out repeatedly. This call is retried within the AWS SDK up to com.amazonaws.ClientConfiguration#maxErrorRetry times, a value configurable from the Hadoop side via the property fs.s3a.attempts.maximum. All of these retries are opaque to Hadoop. At the end of this retry cycle the AWS SDK returns the failure to Hadoop's Invoker, which then evaluates whether to retry the operation according to its configured retry policies. I saw that the verifyBuckets call was not retried at the Invoker level.

In a follow-up experiment, I set the request timeout to 200 ms, which is long enough for the verifyBuckets call to succeed but short enough that multipart uploads fail. Here again the AWS SDK retries the HTTP requests up to maxErrorRetry times; once a request has failed that many times, the Invoker's retry mechanism kicks in. I observed the Invoker retrying these operations up to fs.s3a.retry.limit times, conforming to the configured exponential back-off retry policy. After all these fs.s3a.retry.limit * maxErrorRetry retries, the Invoker bubbles up an AWSClientIOException to the user as shown below:

org.apache.hadoop.fs.s3a.AWSClientIOException: upload part on tests3ascale/disk/hugefile: com.amazonaws.SdkClientException: Unable to execute HTTP request: Request did not complete before the request timeout configuration.: Unable to execute HTTP request: Request did not complete before the request timeout configuration.
	at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:205)
	at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:112)
	at org.apache.hadoop.fs.s3a.Invoker.lambda$retry$4(Invoker.java:315)
	at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:407)
	at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:311)
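
As a rough illustration of how those two retry layers multiply (the property names are the real ones; the class, default values, and arithmetic below are only a sketch, not code from this patch):

    import java.util.concurrent.TimeUnit;
    import org.apache.hadoop.conf.Configuration;

    public class RetryBudgetSketch {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Defaults below are placeholders; the real defaults live in org.apache.hadoop.fs.s3a.Constants.
        long requestTimeoutMs = conf.getTimeDuration(
            "fs.s3a.connection.request.timeout", 0, TimeUnit.MILLISECONDS);
        int sdkRetries = conf.getInt("fs.s3a.attempts.maximum", 10); // maps to ClientConfiguration#maxErrorRetry
        int s3aRetries = conf.getInt("fs.s3a.retry.limit", 7);       // used by the S3A Invoker retry policy
        // Every Invoker attempt can itself be retried inside the AWS SDK, so one operation
        // may issue up to (s3aRetries + 1) * (sdkRetries + 1) HTTP requests, each bounded
        // by the request timeout (back-off sleeps not counted).
        long worstCaseRequests = (long) (s3aRetries + 1) * (sdkRetries + 1);
        System.out.println("Worst-case HTTP requests per operation: " + worstCaseRequests
            + ", each capped at " + requestTimeoutMs + " ms");
      }
    }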

Later, I ran the test with a 256M file size and 32M partition size, and set the request timeout to 5s. My goal was to force a few timeouts while still letting the upload complete through retries, and that is what happened: some part uploads failed due to the short request timeout, but they were retried and the upload completed successfully. The test still failed, because it also expects TRANSFER_PART_FAILED_EVENT to be 0, which is obviously not the case when some transfers fail and are retried. I checked S3 and verified that the file was there, and that the temporary partition files on my local drive were cleaned up.

When I ran the same experiment with an 8GB file and 128M partitions but with a small request timeout, the test failed because the uploads could not complete.

I also ran a soak test with 8GB files and a large request timeout. This passed as expected, since the timeout was high enough to let the uploads complete.

@bgaborg
I did not add a functional test here; I am repeating what we discussed offline to leave a record of the reasoning. The retry mechanism is entirely within the AWS SDK, as explained earlier in this comment. To introduce a functional test, we would need a mechanism to selectively delay or fail some requests, because we want file system initialization to succeed but a subsequent dummy operation (like getFileStatus) to be delayed. Introducing such test support is very hard, if not impossible, since hadoop-aws has no visibility into that mechanism. So we depend on the AWS SDK to maintain this behaviour, for which it exposes a configuration option. I think this is a very reasonable assumption. If the AWS SDK drops support for this feature in the future, the configuration test I added here will fail anyway: I expect the SDK either to throw an exception when an unsupported option is set on the ClientConfiguration object, or to remove the method entirely, which would cause a compile error.
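
For context, the wiring that a configuration-level test can assert on is roughly the following (an illustrative sketch only: the class name is made up, and the actual patch may parse the property differently):

    import java.util.concurrent.TimeUnit;
    import com.amazonaws.ClientConfiguration;
    import org.apache.hadoop.conf.Configuration;

    public class RequestTimeoutWiringSketch {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.set("fs.s3a.connection.request.timeout", "5s");
        // Hadoop's duration parsing handles the suffix; the AWS SDK setter wants milliseconds.
        int timeoutMs = (int) conf.getTimeDuration(
            "fs.s3a.connection.request.timeout", 0, TimeUnit.MILLISECONDS);
        ClientConfiguration awsConf = new ClientConfiguration();
        awsConf.setRequestTimeout(timeoutMs); // per-request timeout enforced inside the SDK
        // A configuration test can only assert this propagation, not the timeout behaviour itself.
        System.out.println("requestTimeout=" + awsConf.getRequestTimeout() + " ms");
      }
    }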

@steveloughran @bgaborg
I addressed the other review comments in the code. This final version has passed the hadoop-aws test suite against us-west-1.

@mustafaiman
Contributor Author

The build failed earlier due to a compile error in trunk. That now seems to be resolved; I am pushing again to trigger the automated tests.

@hadoop-yetus

🎊 +1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 1m 6s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 markdownlint 0m 0s markdownlint was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 1 new or modified test files.
_ trunk Compile Tests _
+1 💚 mvninstall 20m 28s trunk passed
+1 💚 compile 0m 32s trunk passed
+1 💚 checkstyle 0m 23s trunk passed
+1 💚 mvnsite 0m 35s trunk passed
+1 💚 shadedclient 14m 42s branch has no errors when building and testing our client artifacts.
+1 💚 javadoc 0m 25s trunk passed
+0 🆗 spotbugs 0m 58s Used deprecated FindBugs config; considering switching to SpotBugs.
+1 💚 findbugs 0m 55s trunk passed
_ Patch Compile Tests _
+1 💚 mvninstall 0m 31s the patch passed
+1 💚 compile 0m 26s the patch passed
+1 💚 javac 0m 26s the patch passed
+1 💚 checkstyle 0m 17s the patch passed
+1 💚 mvnsite 0m 31s the patch passed
+1 💚 whitespace 0m 0s The patch has no whitespace issues.
+1 💚 shadedclient 15m 5s patch has no errors when building and testing our client artifacts.
+1 💚 javadoc 0m 21s the patch passed
+1 💚 findbugs 1m 3s the patch passed
_ Other Tests _
+1 💚 unit 1m 27s hadoop-aws in the patch passed.
+1 💚 asflicense 0m 27s The patch does not generate ASF License warnings.
60m 33s
Subsystem Report/Notes
Docker Client=19.03.5 Server=19.03.5 base: https://builds.apache.org/job/hadoop-multibranch/job/PR-1795/5/artifact/out/Dockerfile
GITHUB PR #1795
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle markdownlint
uname Linux a14cf564963d 4.15.0-74-generic #84-Ubuntu SMP Thu Dec 19 08:06:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality personality/hadoop.sh
git revision trunk / 9520b2a
Default Java 1.8.0_232
Test Results https://builds.apache.org/job/hadoop-multibranch/job/PR-1795/5/testReport/
Max. process+thread count 341 (vs. ulimit of 5500)
modules C: hadoop-tools/hadoop-aws U: hadoop-tools/hadoop-aws
Console output https://builds.apache.org/job/hadoop-multibranch/job/PR-1795/5/console
versions git=2.7.4 maven=3.3.9 findbugs=3.1.0-RC1
Powered by Apache Yetus 0.11.1 https://yetus.apache.org

This message was automatically generated.

@apache apache deleted a comment from hadoop-yetus Jan 23, 2020
@apache apache deleted a comment from hadoop-yetus Jan 23, 2020
Contributor

@steveloughran steveloughran left a comment


OK, minor nits left, and I've suggested text for the documentation.

Thanks for the (rigorous) testing.

It's clear that what makes a good timeout is going to depend on which operation is taking place, and choosing a value which may trigger fast failover on simple HTTP queries is going to hurt bulk uploads.

I'm happy to have it in there, but I worry that people are going to choose values which give fast failover on some operations at the expense of bulk IO, and run the risk of generating more S3 load. People will need to be careful.
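
To put rough, purely illustrative numbers on that trade-off: uploading a 128 MB part at 10 MB/s takes about 13 seconds, so a timeout tuned to fail over a simple HEAD or GET within a second or two guarantees that such a part upload can never complete, however many times it is retried, while every retry adds more load.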

To introduce a functional test, we would need a mechanism to selectively delay or fail some requests, because we want file system initialization to succeed but a subsequent dummy operation (like getFileStatus) to be delayed. Introducing such test support is very hard, if not impossible, since hadoop-aws has no visibility into that mechanism.

We have the InconsistentAmazonS3Client to simulate failures against S3; it's used for S3Guard. I've been wondering what it would take to actually simulate throttling there as well: some random probability of the client considering itself overloaded, and then a window where it blocks. Or even better, could we let you configure a throttle load and have it trigger when the request rate exceeds it?

Simulating request timeouts would be simpler, but as our fault-injecting client sits above the AWS SDK, it won't be testing the SDK's internals.

Contributor


log uses {} for entries
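
For reference, the SLF4J parameterized style this nit refers to looks like the following (an illustrative snippet, not the patch's actual log line):

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    class TimeoutLogging {
      private static final Logger LOG = LoggerFactory.getLogger(TimeoutLogging.class);

      void report(long requestTimeoutMs) {
        // SLF4J substitutes {} lazily, so no string concatenation happens when DEBUG is off.
        LOG.debug("Using request timeout of {} ms", requestTimeoutMs);
      }
    }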

Contributor


can you make the default unit TimeUnit.SECONDS, even if you take the range in millis? People should be using seconds for this timeout.
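
One possible reading of that suggestion, as a sketch only (the helper below is hypothetical, not the patch's code): unsuffixed values are parsed as seconds and then converted to the milliseconds the SDK setter expects.

    import java.util.concurrent.TimeUnit;
    import org.apache.hadoop.conf.Configuration;

    class RequestTimeoutOption {
      static int requestTimeoutMillis(Configuration conf) {
        // Unsuffixed values ("5") are read as seconds; suffixed values ("2m") still work,
        // though sub-second settings ("500ms") would be truncated by this naive read.
        long seconds = conf.getTimeDuration("fs.s3a.connection.request.timeout", 0, TimeUnit.SECONDS);
        return (int) TimeUnit.SECONDS.toMillis(seconds);
      }
    }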

@mustafaiman
Contributor Author

@steveloughran

We have the InconsistentAmazonS3Client to simulate failures against S3; it's used for S3Guard. I've been wondering what it would take to actually simulate throttling there as well: some random probability of the client considering itself overloaded, and then a window where it blocks. Or even better, could we let you configure a throttle load and have it trigger when the request rate exceeds it?

Simulating request timeouts would be simpler, but as our fault-injecting client sits above the AWS SDK, it won't be testing the SDK's internals.

Simulating throttling looks achievable. The AWS SDK passes throttle errors up to hadoop-aws, and that is where we catch the throttling error and implement the retry mechanism, so you can simulate throttling in InconsistentAmazonS3Client. However, fs.s3a.connection.request.timeout affects the retry mechanism below the AWS SDK, so, as you already pointed out, simulating faults at that level is not useful for testing this particular configuration.

@mustafaiman mustafaiman force-pushed the HADOOP-16792 branch 3 times, most recently from 5c8bb4b to 5965b63 Compare January 23, 2020 22:18
@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 1m 8s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 markdownlint 0m 0s markdownlint was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 1 new or modified test files.
_ trunk Compile Tests _
+0 🆗 mvndep 0m 21s Maven dependency ordering for branch
+1 💚 mvninstall 20m 37s trunk passed
+1 💚 compile 17m 26s trunk passed
+1 💚 checkstyle 2m 47s trunk passed
+1 💚 mvnsite 2m 4s trunk passed
+1 💚 shadedclient 19m 53s branch has no errors when building and testing our client artifacts.
+1 💚 javadoc 1m 56s trunk passed
+0 🆗 spotbugs 1m 5s Used deprecated FindBugs config; considering switching to SpotBugs.
+1 💚 findbugs 3m 6s trunk passed
_ Patch Compile Tests _
+0 🆗 mvndep 0m 21s Maven dependency ordering for patch
+1 💚 mvninstall 1m 20s the patch passed
+1 💚 compile 16m 46s the patch passed
+1 💚 javac 16m 46s the patch passed
+1 💚 checkstyle 2m 48s the patch passed
+1 💚 mvnsite 2m 4s the patch passed
+1 💚 whitespace 0m 0s The patch has no whitespace issues.
+1 💚 xml 0m 2s The patch has no ill-formed XML file.
+1 💚 shadedclient 13m 58s patch has no errors when building and testing our client artifacts.
+1 💚 javadoc 1m 54s the patch passed
+1 💚 findbugs 3m 26s the patch passed
_ Other Tests _
-1 ❌ unit 9m 14s hadoop-common in the patch failed.
+1 💚 unit 1m 34s hadoop-aws in the patch passed.
+1 💚 asflicense 0m 46s The patch does not generate ASF License warnings.
123m 26s
Reason Tests
Failed junit tests hadoop.fs.TestHarFileSystem
Subsystem Report/Notes
Docker Client=19.03.5 Server=19.03.5 base: https://builds.apache.org/job/hadoop-multibranch/job/PR-1795/7/artifact/out/Dockerfile
GITHUB PR #1795
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient xml findbugs checkstyle markdownlint
uname Linux e04c8d8cf424 4.15.0-74-generic #84-Ubuntu SMP Thu Dec 19 08:06:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality personality/hadoop.sh
git revision trunk / 978c487
Default Java 1.8.0_232
unit https://builds.apache.org/job/hadoop-multibranch/job/PR-1795/7/artifact/out/patch-unit-hadoop-common-project_hadoop-common.txt
Test Results https://builds.apache.org/job/hadoop-multibranch/job/PR-1795/7/testReport/
Max. process+thread count 1346 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common hadoop-tools/hadoop-aws U: .
Console output https://builds.apache.org/job/hadoop-multibranch/job/PR-1795/7/console
versions git=2.7.4 maven=3.3.9 findbugs=3.1.0-RC1
Powered by Apache Yetus 0.11.1 https://yetus.apache.org

This message was automatically generated.

@apache apache deleted a comment from hadoop-yetus Jan 24, 2020
@steveloughran
Contributor

test failure looks like a regression of mine; it should have been picked up earlier... interesting

@steveloughran
Contributor

test failure unrelated;

+1

committed to trunk with the warning text in the commit message. I don't want to be the one fielding support calls about writes not working...

@mustafaiman
Contributor Author

@steveloughran @bgaborg thank you for the reviews.
