-
Notifications
You must be signed in to change notification settings - Fork 9.1k
HADOOP-19044. AWS SDK V2 - Update S3A region logic #6479
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Testing against |
|
Several tests failing, cause is being discussed on HADOOP-18975 |
|
🎊 +1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
|
@HarshitGupta11 has a PR for this too. Harshit, can you share yours? also I'd like @ahmarsuhail to review this as when it comes to region stuff and v2 sdk I'm still permanently confused about the corner cases. Could make it a job interview question for anyone claiming deep knowledge of AWS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @virajjasani, left some comments.
Prefer this to @HarshitGupta11's PR as I think that sets the region to US_EAST_2 whenever endpoint=s3.amazonaws.com, regardless of what is in fs.s3a.endpoint.region.
| origin = "SDK region chain"; | ||
| } | ||
|
|
||
| if (endpointStr != null && endpointStr.endsWith(CENTRAL_ENDPOINT)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of this, I think what we should do is on line 307, where it says:
if (region != null) {
builder.region(region);
}
do
if (region != null) {
builder.region(region);
if (region == US_EAST_1 && endpointStr != null && endpointStr.endsWith(CENTRAL_ENDPOINT)) {
builder.crossRegionAccessEnabled(true);
LOG.debug("Enabling cross region access for endpoint {}", endpointStr);
}
}
you will only get into that if block if you haven't set fs.s3a.endpoint.region and a region could be determined from your fs.s3a.endpoint.region , which will happen in this case. As getS3RegionFromEndpoint will return US_EAST_1 for s3.amazonaws.com.
Currently we are setting cross region enabled whenever fs.s3a.endpoint = s3.amazonaws.com, even if you know your region and you've set it correctly in fs.s3a.endpoint.region.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's also set the region to US_EAST_2 when enabling cross region. When I was doing this work a few months ago, I found that cross region with US_EAST_1 was showing some weird behaviours..I didn't dive into them at the time. Everything worked as expected with US_EAST_2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting!
You mean to say, with US_EAST_1 cross region, we have some intermittent issues while accessing bucket from other region? Or we always have issues?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, I saw issues..403's with the landstat buckets iirc. Could be related to this. I don't know if those issues still exist, it could be worth testing with US_EAST_1 to see. It shouldn't make a difference what we set though, if you haven't got the correct region SDK will make a call to figure out the correct region and then cache it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this look good for source changes 05225bb ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
landsat bucket seems to have been closed off; we will need to move off it -but the replacement must be something other than us-east so it stresses more of the system
| public class ITestS3ACrossRegionAccess extends AbstractS3ATestBase { | ||
|
|
||
| @Test | ||
| public void testCentralEndpointCrossRegionAccess() throws Throwable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might be better to move ITestsS3AEndpoint instead of creating a new test class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that test, we are not trying to create/write/read file using fs, so i thought of keeping this separate. This test fails during mkdir with 400 without the source change.
|
|
||
| newConf.set(ENDPOINT, CENTRAL_ENDPOINT); | ||
|
|
||
| try (S3AFileSystem newFs = new S3AFileSystem()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you look at the tests in ITestS3AEndpointRegion you can see how we do it there, instead of creating anything, we just intercept the request and check if things are being set correctly.
For this what we want to see is that if the central point is configured, and no region is configured .. region gets set to US_EAST_2 for cross region.
But if central endpoint is configured, and region is configured to US_EAST_1 , then region is US_EAST_1. that is region config takes precedence. See testCentralEndpoint in that class, does something similar.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked that test but since there was no real fs operation involved, i thought of keeping this as separate test. Does that work?
I can keep this test class as is and maybe write one more test in ITestS3AEndpointRegion as well without much of file operation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to do a mdkir to verify behaviour though? won't just doing a headBucket also work? I think that would also fail without your change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we have our custom RegionInterceptor, we will anyways not be able to make real request right? So regardless of whether we enable cross region endpoint, ITestS3AEndpointRegion is not going to help make real head request, that's why i thought of creating new class where regardless of the endpoint/region used to create bucket, new fs with central endpoint is able to perform file operations on the bucket.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keeping the stacktrace here for reference:
org.apache.hadoop.fs.s3a.AWSBadRequestException: getFileStatus on s3a://${bucket}/user/vjasani/basePath-testCentralEndpointCrossRegionAccess/srcdir: software.amazon.awssdk.services.s3.model.S3Exception: The authorization header is malformed; the region 'us-east-2' is wrong; expecting 'us-west-2' (Service: S3, Status Code: 400, Request ID: G85CNFC579T4MJ76, Extended Request ID: xrYGGqXdYtr72cYyFN3v4yemDxBCYkdt8mYd8cGItNhdx1EmZMLxMhwJTwzmWZT6ershid/WT4w=):AuthorizationHeaderMalformed: The authorization header is malformed; the region 'us-east-2' is wrong; expecting 'us-west-2' (Service: S3, Status Code: 400, Request ID: G85CNFC579T4MJ76, Extended Request ID: xrYGGqXdYtr72cYyFN3v4yemDxBCYkdt8mYd8cGItNhdx1EmZMLxMhwJTwzmWZT6ershid/WT4w=)
at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:259)
at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:154)
at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:4075)
at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:3934)
at org.apache.hadoop.fs.s3a.S3AFileSystem$MkdirOperationCallbacksImpl.probePathStatus(S3AFileSystem.java:3806)
at org.apache.hadoop.fs.s3a.impl.MkdirOperation.probePathStatusOrNull(MkdirOperation.java:173)
at org.apache.hadoop.fs.s3a.impl.MkdirOperation.getPathStatusExpectingDir(MkdirOperation.java:194)
at org.apache.hadoop.fs.s3a.impl.MkdirOperation.execute(MkdirOperation.java:108)
at org.apache.hadoop.fs.s3a.impl.MkdirOperation.execute(MkdirOperation.java:57)
at org.apache.hadoop.fs.s3a.impl.ExecutingStoreOperation.apply(ExecutingStoreOperation.java:76)
at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.invokeTrackingDuration(IOStatisticsBinding.java:547)
at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.lambda$trackDurationOfOperation$5(IOStatisticsBinding.java:528)
at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDuration(IOStatisticsBinding.java:449)
at org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2719)
at org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2738)
at org.apache.hadoop.fs.s3a.S3AFileSystem.mkdirs(S3AFileSystem.java:3778)
at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:2494)
at org.apache.hadoop.fs.s3a.ITestS3ACrossRegionAccess.testCentralEndpointCrossRegionAccess(ITestS3ACrossRegionAccess.java:54)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry to nitpick here! but in ITestS3AEndpointRegion, there is also a test currently that uses the FS, see testWithoutRegionConfig.
You can add another test there that does something like:
Configuration conf = getConfiguration();
removeBaseAndBucketOverrides(conf, ENDPOINT, AWS_REGION);
conf.set(ENDPOINT, CENTRAL_ENDPOINT);
newFS = new S3AFileSystem();
newFS.initialize(getFileSystem().getUri(), conf);
newFS.create(methodPath()).close();
This will fail without your source changes, but passes with them.
|
🎊 +1 overall
This message was automatically generated. |
| // endpoint is for US_EAST_1; | ||
| return Region.US_EAST_1; | ||
| // endpoint is for US_EAST_2; | ||
| return Region.US_EAST_2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changing this causes confusion. Maybe its better to the use the variable present in Constants.
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/DefaultS3ClientFactory.java
Outdated
Show resolved
Hide resolved
lol I was going to comment on something similar. This is the most confusing code in whole of hadoop-aws module. |
|
🎊 +1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
| * The special S3 region which can be used to talk to any bucket. | ||
| * Value {@value}. | ||
| */ | ||
| public static final String AWS_S3_CENTRAL_REGION = "us-east-1"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is already a default region below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
kept both constants, with AWS_S3_CENTRAL_REGION = AWS_S3_DEFAULT_REGION
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's not change these constants. AWS_S3_CENTRAL_REGION should be us-east-1 as that is the central/global region.
|
🎊 +1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nearly there, suggested some more minor changes.
| * The special S3 region which can be used to talk to any bucket. | ||
| * Value {@value}. | ||
| */ | ||
| public static final String AWS_S3_CENTRAL_REGION = "us-east-1"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's not change these constants. AWS_S3_CENTRAL_REGION should be us-east-1 as that is the central/global region.
| // endpoint is for US_EAST_1; | ||
| return Region.US_EAST_1; | ||
| // endpoint for central region | ||
| return Region.of(AWS_S3_CENTRAL_REGION); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rather than changing constants, return the AWS_S3_DEFAULT_REGION here.
|
|
||
| // endpoint is for US_EAST_1; | ||
| return Region.US_EAST_1; | ||
| // endpoint for central region |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's add a detailed comment here now why we have to do this with a link to the SPARK issue. Something along the lines of "If no region or endpoint is set, Spark will set the endpoint to s3.amazonaws.com. Since we do not know the region at this point, use the default region and enable cross region access"
|
|
||
| newConf.set(ENDPOINT, CENTRAL_ENDPOINT); | ||
|
|
||
| try (S3AFileSystem newFs = new S3AFileSystem()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry to nitpick here! but in ITestS3AEndpointRegion, there is also a test currently that uses the FS, see testWithoutRegionConfig.
You can add another test there that does something like:
Configuration conf = getConfiguration();
removeBaseAndBucketOverrides(conf, ENDPOINT, AWS_REGION);
conf.set(ENDPOINT, CENTRAL_ENDPOINT);
newFS = new S3AFileSystem();
newFS.initialize(getFileSystem().getUri(), conf);
newFS.create(methodPath()).close();
This will fail without your source changes, but passes with them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some more comments based on the response on the SDK issue.
hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/ITestS3AEndpointRegion.java
Show resolved
Hide resolved
| // No region was configured, try to determine it from the endpoint. | ||
| if (region == null) { | ||
| region = getS3RegionFromEndpoint(parameters.getEndpoint()); | ||
| boolean endpointEndsWithCentral = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so from the response on SDK issue, it looks like we don't want to override if endpoint is s3.amazonaws.com and region is not US_EAST_1.
Currently, if fs.s3a.endpoint is s3.amazonaws.com and fs.s3a.endpoint.region is eu-west-1 for example, we will end up overriding and end up with the same 400 errors.
What if we never override if endpoint is s3.amazonaws.com?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if we never override if endpoint is s3.amazonaws.com?
That sounds right, let me test with various combinations of endpoint and region before making the changes:
e.g.
- endpoint central and region null
- endpoint central and region anything other than us-east-2
- endpoint central and region us-east-2
- endpoint null and region null
- endpoint s3-us-east-2.amazonaws.com and region us-east-2 (and null)
- endpoint s3-us-east-1.amazonaws.com and region us-east-1 (and null)
hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/ITestS3AEndpointRegion.java
Show resolved
Hide resolved
|
Tests with bucket created on us-west-2, and endpoint/region configured:
|
|
🎊 +1 overall
This message was automatically generated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks a lot for the effort on this so far @virajjasani, really appreciated. sorry this is having to go through so many revisions and testing phases. just suggested a change to the logic, let me know what you think
| describe("Create a client with the central endpoint but also specify region"); | ||
| Configuration conf = getConfiguration(); | ||
|
|
||
| S3Client client = createS3Client(conf, CENTRAL_ENDPOINT, US_WEST_2, US_EAST_2, false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for example here, if configured region is US_WEST_2, expected region should also be US_WEST_2, not US_EAST_2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as in #6466 I'm going to propose we make the static methods accessible and unit tests to validate them, because
- this stuff is so important and complicated we need it running on every pr
- everyone's ITest setup is different, so may miss things.
| endpointStr.endsWith(CENTRAL_ENDPOINT); | ||
| // No region was configured or the endpoint is central, | ||
| // determine the region from the endpoint. | ||
| if (region == null || endpointEndsWithCentral) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, not sure about this. now we're parsing region if region is null or endpoint = s3.amazonaws.com.
So if you set s3.amazonaws.com and region to eu-west-2, you still end up with us setting the region to us-east-2 and cross region enabled. My thinking here is that a lot of people may have endpoint set to s3.amazonaws.com (as atleast with SDK V1 it was harmless to do that I think) .
we only want to get into this parsing if region == null. so let's revert to the previous condition here. And then we never don't want to override if the endpoint is s3.amazonaws.com. Suggested:
if (endpoint != null) {
checkArgument(!fipsEnabled,
"%s : %s", ERROR_ENDPOINT_WITH_FIPS, endpoint);
boolean endpointEndsWithCentral =
endpointStr.endsWith(CENTRAL_ENDPOINT);
// No region was configured or the endpoint is central,
// determine the region from the endpoint.
if (region == null) {
region = getS3RegionFromEndpoint(endpointStr,
endpointEndsWithCentral);
if (region != null) {
origin = "endpoint";
if (endpointEndsWithCentral) {
builder.crossRegionAccessEnabled(true);
origin = "origin with cross region access";
LOG.debug("Enabling cross region access for endpoint {}",
endpointStr);
}
}
}
// No need to override endpoint with "s3.amazonaws.com".
// Let the client take care of endpoint resolution. Overriding
// the endpoint with "s3.amazonaws.com" causes 400 Bad Request
// errors for non-existent buckets and objects.
// ref: https://github.com/aws/aws-sdk-java-v2/issues/4846
if (!endpointEndsWithCentral) {
builder.endpointOverride(endpoint);
LOG.debug("Setting endpoint to {}", endpoint);
}
}
So now:
- if endpoint = s3.amazonaws.com and region is null, set to US_EAST_2 and enable cross region, and don't override endpoint.
- if endpoint = s3.amazonaw.com and region is set (eg to eu-west-1), set region but do not override endpoint..let SDK figure it out
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even for user configured region, for sdk to figure out, we still need to enable cross region access right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Otherwise we will have same problem i suppose e.g. bucket on us-west-2 won't be accessible by central endpoint and us-west-1 combination. It will only be accessible by central endpoint and null region combination.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, SDK fill figure out the endpoint even if cross region is not enabled. cross region is only if you don't know the region, so we set a random region and enable it. it doesn't effect endpoint resolution behaviour afaik
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, will test this out today. Thanks a lot for the reviews!!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think anyone should set region=us-west-2 and endpoint = us-west-1 unless they like debugging things.
all we want is to handle situations where things are not set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant "central endpoint" with "us-west-1" region, to access bucket created on us-west-2. I will test out the combination. Thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So now:
- if endpoint = s3.amazonaws.com and region is null, set to US_EAST_2 and enable cross region, and don't override endpoint.
- if endpoint = s3.amazonaw.com and region is set (eg to eu-west-1), set region but do not override endpoint..let SDK figure it out
For case#2, if user sets region to eu-west-1, is it possible that bucket might be on different region? If the answer is yes and if we want to cover that case, we will have to enable cross-region access for endpoint = s3.amazonaws.com, regardless of the region set (null or any particular one).
We should cover this, otherwise we get redirect errors, i will update the PR.
|
for no. 5, > endpoint s3-us-east-2.amazonaws.com and region us-east-2 (and null) you should be able to perform all operations right? |
It's not central endpoint, and cross region access is also not enabled. Bucket is on us-west-2. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is getting complicated enough that I'm getting a headache. And I am already worrying about how I would field a support call related to this.
See #6466 for some pending VPCE changes too.
- the resolution algorithm needs to be written down in the markdown files for people who are not looking at the code to work out.
- we need unit tests of all the possible combinations. they don't need create the client, just given an endpoint region combo create the client configuration builder which we can make assertions on.
| endpointStr.endsWith(CENTRAL_ENDPOINT); | ||
| // No region was configured or the endpoint is central, | ||
| // determine the region from the endpoint. | ||
| if (region == null || endpointEndsWithCentral) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think anyone should set region=us-west-2 and endpoint = us-west-1 unless they like debugging things.
all we want is to handle situations where things are not set.
| describe("Create a client with the central endpoint but also specify region"); | ||
| Configuration conf = getConfiguration(); | ||
|
|
||
| S3Client client = createS3Client(conf, CENTRAL_ENDPOINT, US_WEST_2, US_EAST_2, false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as in #6466 I'm going to propose we make the static methods accessible and unit tests to validate them, because
- this stuff is so important and complicated we need it running on every pr
- everyone's ITest setup is different, so may miss things.
| public void testCentralEndpointCrossRegionAccess() throws Throwable { | ||
| describe("Create bucket on different region and access it using central endpoint"); | ||
| final Configuration conf = getConfiguration(); | ||
| removeBaseAndBucketOverrides(conf, ENDPOINT); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what should region be set to here? either unset it or explicitly set it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure will set region here because null region is anyways covered below.
Repeated for several tests. FS ops outputs as expected. |
|
Latest revision involves doc updates + tests that cover wider combination of central endpoint cases, that are expected to run for each PR without worrying about endpoint/region settings. |
|
🎊 +1 overall
This message was automatically generated. |
|
Ran full test suite against endpoints:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @virajjasani , source looks good to me. just some minor comments about documentation and tests.
| describe("Create a client with the central endpoint but also specify region"); | ||
| Configuration conf = getConfiguration(); | ||
|
|
||
| S3Client client = createS3Client(conf, CENTRAL_ENDPOINT, US_WEST_2, US_WEST_2, false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's not clear to me what these different test cases are doing. looks like they call check if you set endpoint to central and also configure a region, it's always the configured region that gets set. do we really need all of them?
| using this client. If the bucket is also in eu-west-2, then this will return a successful | ||
| response. Otherwise it will throw an error with status code 301 permanently moved. This error | ||
| contains the region of the bucket in its header, which we can then use to configure the client. | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A lot of what is in this doc is outdated, it was written when we raised the initial PR for the upgrade to help with reviewing. For example, we don't do the HEAD call to get the region of the bucket. We had to do this when crossRegion was not supported in SDK V2.
Let's create a new section in index.md which explains region handling, something like:
fs.s3a.endpointandfs.s3a.endpoint.regioncan be used to set values for endpoint and region.- If a value is set in
fs.s3a.endpoint.region, S3A will configure the S3 Client to use this value. If this is set to a region that does not match your bucket, you will receive a 301 redirect response. - If no region is set, but an endpoint is set, S3A will attempt to parse your region from the endpoint.
- If your endpoint is set to the central
s3.amazonaws.com, S3A will enable cross region access. This means that even if the regionfs.s3a.endpoint.regionis incorrect, the SDK will determine the region. This is done by making the request, and if the SDK receives a 301 redirect, it determines the region at the cost of a HEAD request, and caches it. - If no region is no set, and none could be determined from the endpoint, S3A will use US_EAST_2 as the default region and enable cross region access. Again, this means that requests will be not fail, but an initial HEAD request will be made to get the region of your bucket.
- If the configured region is an empty string, we fallback to using the SDK region resolution chain.
I don't think we need to explain specifically what the code is doing in too much detail, rather just the end behaviour the user will see.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- put it in
connecting.md, link in index.md as "how to connect" - Highlight that it is complex and improving, so may be considered unstable.
- Link to
third_party_stores.htmlto make clear if you are working with third party stores to look there.
| } | ||
|
|
||
| @Test | ||
| public void testCentralEndpointWithEUWest2Region() throws Throwable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the value in having both testCentralEndpointWithUSWest2Region and testCentralEndpointWithEUWest2Region only comes in when your test bucket is in us-west-2, and then with the eu-west-2 test you check if cross region access behaves as expected. since we can't guarantee where users test against, I don't think we need both tests. Unless we end up using one of the public buckets, as for that we know the region.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's go with a public bucket
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public bucket wouldn't work for full CRUD operation? maybe only fs#exists and fs#open followed by some reads would work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah that's enough to know if the request is ending up in the right place. is this logic doesn't work properly, even a HEAD will fail. all we really need is a HEAD object I think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still wonder, is this not good coverage? basically regardless of where the bucket really is, since we have two tests with overriding region with us-west-2 and eu-west-2, we will ensure that at least once, we have covered the cross region access case with central endpoint at least once. That's the main considerations for central endpoint tests, wdyt? maybe i can still use public bucket as an additional test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated tests: accessing public bucket with central endpoint + same region, and different region, for guaranteed outcome.
| LOG.debug("Setting endpoint to {}", endpoint); | ||
| } else { | ||
| builder.crossRegionAccessEnabled(true); | ||
| origin = "origin with cross region access"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"central endpoint with cross region access"
| using this client. If the bucket is also in eu-west-2, then this will return a successful | ||
| response. Otherwise it will throw an error with status code 301 permanently moved. This error | ||
| contains the region of the bucket in its header, which we can then use to configure the client. | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- put it in
connecting.md, link in index.md as "how to connect" - Highlight that it is complex and improving, so may be considered unstable.
- Link to
third_party_stores.htmlto make clear if you are working with third party stores to look there.
| } | ||
|
|
||
| @Test | ||
| public void testCentralEndpointWithEUWest2Region() throws Throwable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's go with a public bucket
|
💔 -1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some minor changes.
the test suite is going to blow up when testing with third party store credentials; not done for that a while but I think the policy there is "don't worry"
|
|
||
| ## Connecting to Amazon S3 or a third-party store | ||
|
|
||
| See [Connecting to an Amazon S3 Bucket through the S3A Connector](connecting.md). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that should end with .html (I know, not your change. mine)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
|
|
||
| try (FSDataInputStream in = newFS.open(srcFilePath)) { | ||
| Assertions | ||
| .assertThat(in.read(buffer, 0, 3)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use readFully() and skip the assert. read() can return 0-3 so brittle on slow networks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
|
🎊 +1 overall
This message was automatically generated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @virajjasani . +1, LGTM.
|
merged to trunk; if someone can do a cherrypick and retest to branch-3.4 I'll merge that PR too. For now I want both branches up to date |
|
branch-3.4 backport PR: #6524 |
Jira: HADOOP-19044