@@ -1203,4 +1203,18 @@ private Constants() {
* Default maximum read size in bytes during vectored reads : {@value}.
*/
public static final int DEFAULT_AWS_S3_VECTOR_READS_MAX_MERGED_READ_SIZE = 1253376; //1M

/**
* Flag for immediate failure when observing a {@link AWSBadRequestException}.
* If set to false, the failure is treated as retryable.
* Value {@value}.
*/
public static final String FAIL_ON_AWS_BAD_REQUEST = "fs.s3a.fail.on.aws.bad.request";
@steveloughran (Contributor) commented on Jun 24, 2022:
I now think `fs.s3a.retry.on.400.response.enabled` would be better, with the default flipped. The docs would say "experimental".

and assuming we do have a custom policy, adjacent

fs.s3a.retry.on.400.response.delay  // delay between attempts, default "10s"
fs.s3a.retry.on.400.response.attempts // number of attempts, default 6
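For illustration only, here is a minimal sketch of how those proposed keys might be read and turned into a fixed-interval retry policy using Hadoop's `RetryPolicies`. The key names, defaults, and the class itself are hypothetical and not part of this patch:

```java
// Sketch only, not part of this patch: the keys, defaults and the
// fixed-sleep policy follow the suggestion in the comment above.
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.retry.RetryPolicies;
import org.apache.hadoop.io.retry.RetryPolicy;

final class BadRequestRetrySketch {

  // Hypothetical keys proposed in the review comment.
  static final String RETRY_ON_400_ENABLED =
      "fs.s3a.retry.on.400.response.enabled";
  static final String RETRY_ON_400_DELAY =
      "fs.s3a.retry.on.400.response.delay";
  static final String RETRY_ON_400_ATTEMPTS =
      "fs.s3a.retry.on.400.response.attempts";

  private BadRequestRetrySketch() {
  }

  /** Build the policy for AWSBadRequestException from configuration. */
  static RetryPolicy badRequestPolicy(Configuration conf) {
    if (!conf.getBoolean(RETRY_ON_400_ENABLED, false)) {
      // Default: keep the current behaviour of failing immediately.
      return RetryPolicies.TRY_ONCE_THEN_FAIL;
    }
    // Fixed delay between attempts, default 10s; no exponential growth.
    long delayMs = conf.getTimeDuration(RETRY_ON_400_DELAY,
        10_000L, TimeUnit.MILLISECONDS);
    int attempts = conf.getInt(RETRY_ON_400_ATTEMPTS, 6);
    return RetryPolicies.retryUpToMaximumCountWithFixedSleep(
        attempts, delayMs, TimeUnit.MILLISECONDS);
  }
}
```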


/**
* Default value for immediate failure when observing a
* {@link AWSBadRequestException}: {@value}.
*/
public static final boolean DEFAULT_FAIL_ON_AWS_BAD_REQUEST = true;

}
@@ -214,7 +214,10 @@ protected Map<Class<? extends Exception>, RetryPolicy> createExceptionMap() {

// policy on a 400/bad request still ambiguous.
// Treated as an immediate failure
policyMap.put(AWSBadRequestException.class, fail);
RetryPolicy awsBadRequestExceptionRetryPolicy =
Contributor comment:

Should the normal retry policy (which is expected to handle network errors) be applied here, or something else?

Contributor (author) reply:

Correct me if I'm wrong, but before our change the AWSBadRequestException response in fact comes back with an HTTP 400 error code. That is different from the other network failures to which the fail/RetryPolicies.TRY_ONCE_THEN_FAIL policy has been applied.

configuration.getBoolean(FAIL_ON_AWS_BAD_REQUEST, DEFAULT_FAIL_ON_AWS_BAD_REQUEST) ?
fail : retryIdempotentCalls;
Contributor comment:

  1. Should retry on all calls, rather than just idempotent ones, as long as we are confident that the request is never executed before the failure.
  2. I don't believe the normal exponential backoff strategy is the right one, as the initial delays are very short-lived (500ms); if you are hoping that credential providers will fetch new credentials, an initial delay of a few seconds would seem better. I wouldn't even bother with exponential growth here, just say 6 times at 10 seconds.

I think we would also want to log at warn that this is happening, assuming this is rare.
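As a rough sketch of that suggestion (retry on all calls, a fixed 10-second interval for 6 attempts, and a warn-level log on each retry), one could wrap the fixed-sleep policy from `org.apache.hadoop.io.retry.RetryPolicies`. The wrapper class and factory method below are hypothetical, not part of this patch:

```java
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.io.retry.RetryPolicies;
import org.apache.hadoop.io.retry.RetryPolicy;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/** Sketch only: logs at WARN whenever the wrapped policy decides to retry. */
final class WarnOnRetryPolicy implements RetryPolicy {

  private static final Logger LOG =
      LoggerFactory.getLogger(WarnOnRetryPolicy.class);

  private final RetryPolicy inner;

  WarnOnRetryPolicy(RetryPolicy inner) {
    this.inner = inner;
  }

  /** The "6 times at 10 seconds" policy from the comment above. */
  static RetryPolicy sixAttemptsTenSecondsApart() {
    return new WarnOnRetryPolicy(
        RetryPolicies.retryUpToMaximumCountWithFixedSleep(
            6, 10, TimeUnit.SECONDS));
  }

  @Override
  public RetryAction shouldRetry(Exception e, int retries, int failovers,
      boolean idempotent) throws Exception {
    RetryAction action = inner.shouldRetry(e, retries, failovers, idempotent);
    if (action.action == RetryAction.RetryDecision.RETRY) {
      // Warn on each retry, on the assumption that 400 retries are rare.
      LOG.warn("Retrying after 400/Bad Request (attempt {}): {}",
          retries + 1, e.toString());
    }
    return action;
  }
}
```

The policy map entry would then become `policyMap.put(AWSBadRequestException.class, WarnOnRetryPolicy.sixAttemptsTenSecondsApart());`, on the assumption that all calls, not only idempotent ones, go through it.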

policyMap.put(AWSBadRequestException.class, awsBadRequestExceptionRetryPolicy);

// Status 500 error code is also treated as a connectivity problem
policyMap.put(AWSStatus500Exception.class, connectivityFailure);
@@ -1118,12 +1118,15 @@ from them.

* Connection timeout: `ConnectTimeoutException`. Timeout before
setting up a connection to the S3 endpoint (or proxy).
* HTTP response status code 400, "Bad Request"
* HTTP response status code 400, "Bad Request" aka `AWSBadRequestException`

The status code 400, Bad Request usually means that the request
is unrecoverable; it's the generic "No" response. Very rarely it
does recover, which is why it is in this category, rather than that
of unrecoverable failures.
of unrecoverable failures. The default behavior is to fail immediately
without retrying. If your workload is sensitive to such failures, you can
set `fs.s3a.fail.on.aws.bad.request` to `false` to allow retries when
a Bad Request with status code 400 is observed.

These failures will be retried with an exponential sleep interval set in
`fs.s3a.retry.interval`, up to the limit set in `fs.s3a.retry.limit`.
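As an illustrative sketch only (the bucket name is a placeholder, and the property can just as well be set in `core-site.xml`), a client that prefers retries over an immediate failure might set the flag programmatically before creating the filesystem:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RetryOn400Example {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Treat 400/Bad Request responses as retryable instead of failing fast.
    conf.setBoolean("fs.s3a.fail.on.aws.bad.request", false);
    try (FileSystem fs = FileSystem.get(new URI("s3a://example-bucket/"), conf)) {
      fs.listStatus(new Path("/"));
    }
  }
}
```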
@@ -311,12 +311,25 @@ public void testRetryAWSConnectivity() throws Throwable {
*/
@Test(expected = AWSBadRequestException.class)
public void testRetryBadRequestNotIdempotent() throws Throwable {
invoker.retry("test", null, false,

invoker.retry("test", null, true,
() -> {
throw BAD_REQUEST;
});
}

@Test
public void testRetryBadRequestIdempotent() throws Throwable {
Contributor comment:
test looks ok.

Configuration conf = new Configuration(FAST_RETRY_CONF);
conf.setBoolean(FAIL_ON_AWS_BAD_REQUEST, false);
S3ARetryPolicy retryPolicy = new S3ARetryPolicy(conf);

IOException ex = translateException("GET", "/", BAD_REQUEST);
assertRetryAction("Expected retry on aws bad request",
retryPolicy, RetryPolicy.RetryAction.RETRY,
ex, 1, true);
}

@Test
public void testConnectionRetryPolicyIdempotent() throws Throwable {
assertRetryAction("Expected retry on connection timeout",