HADOOP-19044: AWS SDK V2 - Update S3A region logic #6482
Conversation
If both fs.s3a.endpoint and fs.s3a.endpoint.region are empty, Spark sets fs.s3a.endpoint to s3.amazonaws.com here: https://github.com/apache/spark/blob/9a2f39318e3af8b3817dc5e4baf52e548d82063c/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L540

HADOOP-18908 updated the region logic so that if fs.s3a.endpoint.region is set, or if a region can be parsed from fs.s3a.endpoint (which happens in this case, with the region resolving to US_EAST_1), cross-region access is not enabled. This causes 400 errors if the bucket is not in US_EAST_1.

Proposed: update the logic so that if the endpoint is the global s3.amazonaws.com, cross-region access is enabled.
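A minimal sketch of the proposed check, assuming CENTRAL_ENDPOINT is the "s3.amazonaws.com" constant referenced in the diff below and wiring it directly onto the AWS SDK V2 S3ClientBuilder; the class and helper names are illustrative, not the actual S3AFileSystem code:

    import software.amazon.awssdk.services.s3.S3ClientBuilder;

    public final class CentralEndpointSketch {
      // Assumed value of the CENTRAL_ENDPOINT constant from the diff.
      static final String CENTRAL_ENDPOINT = "s3.amazonaws.com";

      static void maybeEnableCrossRegion(S3ClientBuilder builder, String endpoint) {
        if (CENTRAL_ENDPOINT.equals(endpoint)) {
          // Global endpoint: don't pin the client to US_EAST_1; let the SDK
          // follow the bucket's actual region to avoid 400 errors.
          builder.crossRegionAccessEnabled(true);
        }
      }
    }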
🎊 +1 overall

This message was automatically generated.
Thanks @HarshitGupta11, I think we should go with @virajjasani's implementation in #6479. Let me know what you think.
    region = getS3RegionFromEndpoint(parameters.getEndpoint());
    if (region != null) {
      origin = "endpoint";
      if (parameters.getEndpoint().equals(CENTRAL_ENDPOINT)) {
If someone configures fs.s3a.endpoint to s3.amazonaws.com and sets the region in fs.s3a.endpoint.region to eu-west-1, this code will start ignoring what we have in fs.s3a.endpoint.region and enable cross-region access for everything. I think this could be risky, as people could have fs.s3a.endpoint set to s3.amazonaws.com in their core-site.xml, since that makes no difference as long as the region is set correctly.
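A hedged sketch of the ordering this comment implies, where an explicitly configured fs.s3a.endpoint.region always wins and the global-endpoint check only applies when no region is set; configuredRegion, builder, and the class name are illustrative, not the code under review:

    import software.amazon.awssdk.regions.Region;
    import software.amazon.awssdk.services.s3.S3ClientBuilder;

    public final class RegionPrecedenceSketch {
      static final String CENTRAL_ENDPOINT = "s3.amazonaws.com";

      static void resolveRegion(S3ClientBuilder builder, String endpoint,
          String configuredRegion) {
        if (configuredRegion != null && !configuredRegion.isEmpty()) {
          // fs.s3a.endpoint.region is set: honour it, never override it.
          builder.region(Region.of(configuredRegion));
        } else if (CENTRAL_ENDPOINT.equals(endpoint)) {
          // Only when no explicit region is configured does the global
          // endpoint trigger cross-region access.
          builder.crossRegionAccessEnabled(true);
        }
      }
    }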
Cool, I will close this pull request.
Description of PR

If fs.s3a.endpoint is the global endpoint s3.amazonaws.com and fs.s3a.endpoint.region is not set, enable cross-region access so that buckets outside US_EAST_1 resolve correctly (full summary above).
How was this patch tested?
It is currently being tested against us-west-2 by explicitly setting the endpoint to s3.amazonaws.com.
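Illustrative wiring for such a test, assuming a test bucket that lives in us-west-2; the class and method names are placeholders, not the actual ITest code:

    import org.apache.hadoop.conf.Configuration;

    public final class CentralEndpointTestSetup {
      static Configuration newConf() {
        Configuration conf = new Configuration();
        // Force the global endpoint while the bucket itself is in us-west-2.
        conf.set("fs.s3a.endpoint", "s3.amazonaws.com");
        // Leave the region unset so the new cross-region logic kicks in.
        conf.unset("fs.s3a.endpoint.region");
        return conf;
      }
    }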
For code changes:
If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?