HADOOP-18310 Add option and make 400 bad request retryable #4483
base: trunk
Conversation
🎊 +1 overall
This message was automatically generated.
which s3 endpoint did you run the hadoop-aws integration tests against, and what was the full mvn command line used? thanks
made some comments. i do want that s3 endpoint test run declaration before i look at it again
- needs to be all lower case with "." between words
- and javadocs with {@value}
- and something in the documentation
ack and thanks, I will update it soon.
should the normal retry policy (which is expected to handle network errors) be applied here, or something else?
Correct me if I'm wrong, but before our change, the response surfaced as AWSBadRequestException is in fact coming back with an HTTP 400 error code. It is different from the other network failures to which fail/RetryPolicies.TRY_ONCE_THEN_FAIL has been applied.
@steveloughran I should have provided the integration test results in the description; they're not perfect, but we can discuss how to move forward.
💔 -1 overall
This message was automatically generated.
🎊 +1 overall
This message was automatically generated.
Do we really need to introduce this config? Seems like overkill.
Reading from the hadoop-aws page: that does not match the code, which in fact does not retry on a 400 bad request (please correct me if I'm incorrect). In the case I'm reporting, and sorry that it's not as common as others, retrying did help and let us use the beauty of the retry policy API.
Historically 400 errors have always been completely unrecoverable, apart from one or two stack traces we haven't seen again. Failing fast is the correct action in all those situations. We have had problems in the past where bad credentials caused things to hang for a long time as they kept retrying, and that is completely the wrong behaviour.
I understand the situation here is where a provider is issuing credentials which are out of date, and this does not appear to be visible until the request is made to AWS. The retrying is being attempted in the hope that eventually the credential provider will notice that it needs to get new credentials, and so a reattempted request will get through. As such it should help correct for some clock skew problem where the client is still caching credentials and has not noticed that they are out of date.
However, we have a few things to consider here.
The normal retry policy is intended to cope with all failures other than throttling, and is targeted at transient network failures. It does a basic backoff with increasing delays and some jitter. Is this really the right recovery strategy for ID broker, or should it immediately go for a delay of a few seconds and keep retrying with a fixed interval of 10s for, say, a minute or two?
Next question: do we only retry on calls we consider idempotent, or do we retry on everything? The policy your PR has put up says idempotent only, whereas I believe it should be something similar to connection failures, where we assume that the request never got as far as the object store and can be retried.
Finally, it would be better if we could somehow explicitly tell the credential providers that the request was rejected and that they need to trigger a refresh. The current PR is hoping that if we retry with delays things will sort themselves out; it may be better to go through each of the credential providers in turn and tell them to refresh their values, letting them know that the last request was rejected. For EC2 IAM credentials provided to the VM, that would involve making a new GET request to the URL serving up credentials. For the ID broker client, we would want it to send a request to ID broker notifying IDB that its last request was rejected.
We could even think about doing this always, but precisely once. If your credential provider can't refresh when it's been told the previous call failed, then there is no hope for anyone. We could be slightly more efficient here: look to see if any of the credential providers in the list is capable of refreshing, and only force a refresh and retry if so. But: we would have to think hard about how to wire this up, as a retry policy simply catches exceptions and decides what to do, while credential providers and signing are wired up deep inside the AWS code. We would probably need to add something into the callbacks the AWS SDK provides to force that refresh on a 400 failure, and have the retry policy only enable retries on 400s if the policy was instantiated in an S3A client whose authentication chain is potentially recoverable.
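A minimal sketch of that refresh idea, assuming the AWS SDK v1 AWSCredentialsProvider interface; the helper class is hypothetical (not part of this PR), and whether refresh() actually fetches new credentials depends entirely on the provider.

import java.util.List;

import com.amazonaws.auth.AWSCredentialsProvider;

/**
 * Illustrative only: on a 400 response, walk the provider chain and ask
 * every provider to drop any cached credentials before the request is retried.
 */
public final class CredentialChainRefresher {

  private CredentialChainRefresher() {
  }

  /** Hint to every provider in the chain that its last credentials were rejected. */
  public static void refreshAll(List<AWSCredentialsProvider> providers) {
    for (AWSCredentialsProvider provider : providers) {
      // refresh() is part of the SDK v1 interface; static providers
      // typically treat it as a no-op.
      provider.refresh();
    }
  }
}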
Testing. You know, I was wondering if we could actually add a real integration test here. Seriously. We would need a new credential provider which would be programmed to return some invalid credentials every N requests, and the rest of the time provide no credentials at all. If this provider was placed at the head of the provider chain and N was greater than the retry count of the policy, then the entire hadoop-aws test suite would be expected to run to completion. This would let us catch problems like retry around retry, and even failure to retry.
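A rough sketch of such a fault-injecting provider. Caveats: the class and its strings are hypothetical, it assumes the provider chain falls through to the next provider when one throws a client exception, and the exact 4xx you get back depends on the credential type (an expired or invalid session token is what produces the 400 in this issue).

import java.util.concurrent.atomic.AtomicLong;

import com.amazonaws.SdkClientException;
import com.amazonaws.auth.AWSCredentials;
import com.amazonaws.auth.AWSCredentialsProvider;
import com.amazonaws.auth.BasicSessionCredentials;

/**
 * Hypothetical test-only provider: every Nth request it hands out session
 * credentials that the store should reject; the rest of the time it declines,
 * so the next provider in the chain supplies real credentials.
 */
public class FaultInjectingCredentialsProvider implements AWSCredentialsProvider {

  private final int everyN;
  private final AtomicLong counter = new AtomicLong();

  public FaultInjectingCredentialsProvider(int everyN) {
    this.everyN = everyN;
  }

  @Override
  public AWSCredentials getCredentials() {
    if (counter.incrementAndGet() % everyN == 0) {
      // bogus session token: the signed request is expected to be rejected
      return new BasicSessionCredentials("invalid-key", "invalid-secret",
          "invalid-session-token");
    }
    // decline, so the provider list falls through to the next provider
    throw new SdkClientException("fault injector: no credentials this time");
  }

  @Override
  public void refresh() {
    // nothing cached, nothing to do
  }
}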
policyMap.put(AWSBadRequestException.class, fail);
RetryPolicy awsBadRequestExceptionRetryPolicy =
    configuration.getBoolean(FAIL_ON_AWS_BAD_REQUEST, DEFAULT_FAIL_ON_AWS_BAD_REQUEST) ?
        fail : retryIdempotentCalls;
- should retry on all calls, rather than just idempotent ones, as long as we are confident that the request is never executed before the failure
- I don't believe the normal exponential backoff strategy is the right one, as the initial delays are very short-lived (500ms), whereas if you are hoping that credential providers will fetch new credentials, an initial delay of a few seconds would seem better. I wouldn't even bother with exponential growth here, just, say, 6 attempts at 10 seconds (see the sketch after this comment).
I think we would also want to log at warn that this is happening, assuming this is rare.
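For illustration only (not code from this PR): Hadoop's existing retry utilities can already express the flat schedule suggested above. The class name is made up, and the 6-attempt/10-second figures are just the reviewer's suggested defaults.

import java.util.concurrent.TimeUnit;

import org.apache.hadoop.io.retry.RetryPolicies;
import org.apache.hadoop.io.retry.RetryPolicy;

/** Sketch only: a flat 6-attempts-at-10-seconds policy for 400 responses. */
public final class Retry400PolicySketch {

  private Retry400PolicySketch() {
  }

  static RetryPolicy retryOn400() {
    // fixed sleep between attempts: no exponential growth, no jitter
    return RetryPolicies.retryUpToMaximumCountWithFixedSleep(
        6, 10, TimeUnit.SECONDS);
  }
}

The policy map would then map AWSBadRequestException to a policy like this when the option is enabled, logging at WARN on each attempt.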
@Test
public void testRetryBadRequestIdempotent() throws Throwable {
test looks ok.
 * If it's disabled and set to false, the failure is treated as retryable.
 * Value {@value}.
 */
public static final String FAIL_ON_AWS_BAD_REQUEST = "fs.s3a.fail.on.aws.bad.request";
I now think "fs.s3a.retry.on.400.response.enabled" would be better, with default flipped. docs would say "experimental"
and assuming we do have a custom policy, adjacent
fs.s3a.retry.on.400.response.delay // delay between attempts, default "10s"
fs.s3a.retry.on.400.response.attempts // number of attempts, default 6
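Sketched in code, those constants might look like the following; the names and defaults simply mirror the suggestion above and are not part of the PR.

/** Experimental: should a 400 "Bad Request" response be retried? Value {@value}. */
public static final String RETRY_ON_400_RESPONSE_ENABLED =
    "fs.s3a.retry.on.400.response.enabled";

/** Default for {@link #RETRY_ON_400_RESPONSE_ENABLED}: {@value}. */
public static final boolean DEFAULT_RETRY_ON_400_RESPONSE_ENABLED = false;

/** Delay between attempts. Value {@value}. */
public static final String RETRY_ON_400_RESPONSE_DELAY =
    "fs.s3a.retry.on.400.response.delay";

/** Default delay between attempts: {@value}. */
public static final String DEFAULT_RETRY_ON_400_RESPONSE_DELAY = "10s";

/** Number of attempts. Value {@value}. */
public static final String RETRY_ON_400_RESPONSE_ATTEMPTS =
    "fs.s3a.retry.on.400.response.attempts";

/** Default number of attempts: {@value}. */
public static final int DEFAULT_RETRY_ON_400_RESPONSE_ATTEMPTS = 6;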
Description of PR
Add option and make 400 bad request retryable; added fs.s3a.fail.on.aws.bad.request with a default of true, so that the behavior is unchanged unless the new option is used.
How was this patch tested?
Add a new unit test.
For code changes:
- If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?
Integration Tests executed
I configured auth-keys.xml using AssumedRoleCredentialProvider with a TemporaryAWSCredentialsProvider. The region is us-west-2 and the endpoint is s3.us-west-2.amazonaws.com. The default integration tests failed for the following reasons, which look fine to me:
- com.amazonaws.services.securitytoken.model.AWSSecurityTokenServiceException: Cannot call GetSessionToken with session credentials
- java.io.IOException: org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider instantiation exception: java.lang.IllegalArgumentException: Proxy error: fs.s3a.proxy.username or fs.s3a.proxy.password set without the other.
- java.nio.file.AccessDeniedException: s3a://abc/fork-0006/test: getFileStatus on s3a://abc/fork-0006/test: com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied;
- org.apache.hadoop.service.ServiceStateException: java.io.IOException: Unset property fs.s3a.assumed.role.arn (this one is strange, because I have set fs.s3a.assumed.role.arn in auth-keys.xml)
For the sequential integration tests, the errors are:
- org.junit.runners.model.TestTimedOutException: test timed out after 180000 milliseconds; getFileStatus cannot be done
- ERROR contract.ContractTestUtils (ContractTestUtils.java:cleanup(383)) - Error deleting in TEARDOWN - /test: java.io.InterruptedIOException: getFileStatus on s3a://taklwu-thunderhead-dev/test: com.amazonaws.AbortedException
- [ERROR] test_100_audit_root_noauth(org.apache.hadoop.fs.s3a.tools.ITestMarkerToolRootOperations) Time elapsed: 7.774 s <<< ERROR! 46: Marker count 2 out of range [0 - 0]
I wonder whether, with a clean simple credential setup not using assumed roles, all of the above tests would pass and we would not run into the permission and the special GetSessionToken problems.