-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-31935][SQL][FOLLOWUP] Hadoop file system config should be effective in data source options #28948
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Oh, thanks for pinging me, @cloud-fan . |
| class FakeFileSystemRequiringDSOption extends LocalFileSystem { | ||
| override def initialize(name: URI, conf: Configuration): Unit = { | ||
| super.initialize(name, conf) | ||
| require(conf.get("ds_option", "") == "value") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice check!
| * Tries to re-cache all the cache entries that contain `path` in one or more | ||
| * `HadoopFsRelation` node(s) as part of its logical plan. | ||
| */ | ||
| def recacheByPath(spark: SparkSession, path: Path, fs: FileSystem): Unit = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just wondering if we can have a consistent parameter name for all recacheByPaths. Currently, resourcePath and path are used in a mixed way. Is it for distinguish Path class from String class? Even in that case, resourcePath: Path also looks reasonable. Or, we can use path for both if we can.
| * Tries to re-cache all the cache entries that contain `resourcePath` in one or more | ||
| * `HadoopFsRelation` node(s) as part of its logical plan. | ||
| */ | ||
| def recacheByPath(spark: SparkSession, resourcePath: String): Unit = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about we change the method as
def recacheByPath(spark: SparkSession, resourcePath: String, options: Map[String, String]=Map.empty)
so that we can avoid the new method below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That also works, but it feels a little bit weird to couple the data source options concept with the cache manager...
|
@cloud-fan, thanks for fixing this! We should probably backport this one to 2.4 as well. |
|
Yea, I'll open a new PR for 2.4 after this one gets merged. |
|
Test build #124651 has started for PR 28948 at commit |
|
Hi, @liancheng and @cloud-fan . Please note that I'm not against backporting of SPARK-31935 and this follow-up in Also, cc @gatorsmile |
|
@gengliangwang Can you open a new PR to backport #28760 to 2.4? |
@cloud-fan it is already back ported in #28771 |
|
retest this please. |
|
@dongjoon-hyun, thanks for reminding. Didn't see your comment at first, but I've already updated the "Affects Version/s" field in the Jira ticket. Also saw that it's already marked as a bug. |
|
Test build #124742 has finished for PR 28948 at commit
|
|
retest this please |
gengliangwang
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for the fix!
|
Thank you all! |
|
Test build #124776 has finished for PR 28948 at commit
|
|
Thanks, merging to master/3.0 |
…ctive in data source options ### What changes were proposed in this pull request? This is a followup of #28760 to fix the remaining issues: 1. should consider data source options when refreshing cache by path at the end of `InsertIntoHadoopFsRelationCommand` 2. should consider data source options when inferring schema for file source 3. should consider data source options when getting the qualified path in file source v2. ### Why are the changes needed? We didn't catch these issues in #28760, because the test case is to check error when initializing the file system. If we initialize the file system multiple times during a simple read/write action, the test case actually only test the first time. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? rewrite the test to make sure the entire data source read/write action can succeed. Closes #28948 from cloud-fan/fix. Authored-by: Wenchen Fan <[email protected]> Signed-off-by: Gengliang Wang <[email protected]> (cherry picked from commit 6edb20d) Signed-off-by: Gengliang Wang <[email protected]>
|
@cloud-fan there are a lot of conflicts when merging this one to branch-2.4. We will have to do it manually. |
…ctive in data source options This is a followup of apache#28760 to fix the remaining issues: 1. should consider data source options when refreshing cache by path at the end of `InsertIntoHadoopFsRelationCommand` 2. should consider data source options when inferring schema for file source 3. should consider data source options when getting the qualified path in file source v2. We didn't catch these issues in apache#28760, because the test case is to check error when initializing the file system. If we initialize the file system multiple times during a simple read/write action, the test case actually only test the first time. No rewrite the test to make sure the entire data source read/write action can succeed. Closes apache#28948 from cloud-fan/fix. Authored-by: Wenchen Fan <[email protected]> Signed-off-by: Gengliang Wang <[email protected]> (cherry picked from commit 6edb20d) Signed-off-by: Gengliang Wang <[email protected]>
… effective in data source options ### What changes were proposed in this pull request? backport #28948 This is a followup of #28760 to fix the remaining issues: 1. should consider data source options when refreshing cache by path at the end of `InsertIntoHadoopFsRelationCommand` 2. should consider data source options when inferring schema for file source 3. should consider data source options when getting the qualified path in file source v2. ### Why are the changes needed? We didn't catch these issues in #28760, because the test case is to check error when initializing the file system. If we initialize the file system multiple times during a simple read/write action, the test case actually only test the first time. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? rewrite the test to make sure the entire data source read/write action can succeed. Closes #28973 from cloud-fan/pick. Authored-by: Wenchen Fan <[email protected]> Signed-off-by: HyukjinKwon <[email protected]>
What changes were proposed in this pull request?
This is a followup of #28760 to fix the remaining issues:
InsertIntoHadoopFsRelationCommandWhy are the changes needed?
We didn't catch these issues in #28760, because the test case is to check error when initializing the file system. If we initialize the file system multiple times during a simple read/write action, the test case actually only test the first time.
Does this PR introduce any user-facing change?
No
How was this patch tested?
rewrite the test to make sure the entire data source read/write action can succeed.