[SPARK-31935][SQL] Hadoop file system config should be effective in data source options #28760
Conversation
Review comment thread on `checkAndGlobPathIfNecessary` in `DataSource.scala`:

```diff
     checkFilesExist: Boolean): Seq[Path] = {
   val allPaths = caseInsensitiveOptions.get("path") ++ paths
-  val hadoopConf = sparkSession.sessionState.newHadoopConf()
+  val hadoopConf = sparkSession.sessionState.newHadoopConfWithOptions(options)
```
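The changed line replaces the session-wide Hadoop configuration with one that also folds in the per-read data source options. A minimal sketch of the merge semantics, using plain `Map`s in place of Hadoop's `Configuration` (the class name and key values below are illustrative, not the real `SessionState` implementation):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: per-source options are overlaid on the session Hadoop config,
// so a per-read option shadows the session-level value for the same key.
class HadoopConfMergeSketch {
    static Map<String, String> newHadoopConfWithOptions(
            Map<String, String> sessionConf, Map<String, String> options) {
        Map<String, String> merged = new HashMap<>(sessionConf);
        merged.putAll(options); // per-read options win on key collisions
        return merged;
    }
}
```

In user code this surfaces as e.g. `spark.read.option("fs.s3a.endpoint", "...").parquet("s3a://bucket/path")`, where the option lands in the Hadoop configuration used to resolve the `s3a` file system for that read.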
There are multiple `newHadoopConf`s in this file. Should we also fix them?
Yes, we should. I am adding more test cases.
Test build #123652 has finished for PR 28760 at commit

Test build #123654 has finished for PR 28760 at commit

LGTM, thanks!

Test build #123659 has finished for PR 28760 at commit

Test build #123655 has finished for PR 28760 at commit
HyukjinKwon left a comment:
Main code changes look good. LGTM if tests are fixed and pass.
Test build #123657 has finished for PR 28760 at commit
cloud-fan left a comment:
LGTM. How far shall we backport it?
@cloud-fan I think we can backport it to branch-2.4. Once branch-3.0 rc3 is cut, we can backport it to branch-3.0 as well.

Test build #123664 has finished for PR 28760 at commit

retest this please

Test build #123677 has finished for PR 28760 at commit
Merging to master.
…ata source options
### What changes were proposed in this pull request?
Make Hadoop file system config effective in data source options.
From `org.apache.hadoop.fs.FileSystem.java`:
```java
public static FileSystem get(URI uri, Configuration conf) throws IOException {
String scheme = uri.getScheme();
String authority = uri.getAuthority();
if (scheme == null && authority == null) { // use default FS
return get(conf);
}
if (scheme != null && authority == null) { // no authority
URI defaultUri = getDefaultUri(conf);
if (scheme.equals(defaultUri.getScheme()) // if scheme matches default
&& defaultUri.getAuthority() != null) { // & default has authority
return get(defaultUri, conf); // return default
}
}
String disableCacheName = String.format("fs.%s.impl.disable.cache", scheme);
if (conf.getBoolean(disableCacheName, false)) {
return createFileSystem(uri, conf);
}
return CACHE.get(uri, conf);
}
```
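The branches above decide between the default file system, Hadoop's internal FS cache, and a fresh instance, which is why a per-source config must already be in `conf` when `FileSystem.get` runs. A self-contained sketch of that dispatch, returning a label instead of a `FileSystem` (the `Map`-based config and label names are illustrative, not Hadoop API):

```java
import java.util.Map;
import java.util.Optional;

// Mirrors FileSystem.get's dispatch from the Java snippet above.
class FsResolveSketch {
    static String resolveFs(Optional<String> scheme, Optional<String> authority,
                            String defaultScheme, Optional<String> defaultAuthority,
                            Map<String, String> conf) {
        if (!scheme.isPresent() && !authority.isPresent()) {
            return "default"; // no scheme, no authority: use the default FS
        }
        if (scheme.isPresent() && !authority.isPresent()
                && scheme.get().equals(defaultScheme) && defaultAuthority.isPresent()) {
            return "default"; // scheme matches default FS, which has an authority
        }
        String disableCacheKey =
            String.format("fs.%s.impl.disable.cache", scheme.orElse(""));
        if ("true".equals(conf.getOrDefault(disableCacheKey, "false"))) {
            return "fresh"; // per-scheme cache disabled: create a new FileSystem
        }
        return "cached"; // otherwise serve from Hadoop's FS cache
    }
}
```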
Before this change, the file system configurations in data source options were not propagated in `DataSource.scala`.
After this change, we can specify authority- and URI-scheme-related configurations for scanning file systems.
This problem only exists in data source V1. In V2, we already use `sparkSession.sessionState.newHadoopConfWithOptions(options)` in `FileTable`.
### Why are the changes needed?
Allow users to specify authority- and URI-scheme-related Hadoop configurations for file source reading.
### Does this PR introduce _any_ user-facing change?
Yes, the file system related Hadoop configuration in data source option will be effective on reading.
### How was this patch tested?
Unit test
Closes apache#28760 from gengliangwang/ds_conf.
Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Gengliang Wang <[email protected]>
Hi, All.

I'll make a follow-up PR.
### What changes were proposed in this pull request?
This PR updates the test case to accept Hadoop 2/3 error messages correctly.

### Why are the changes needed?
SPARK-31935 (#28760) breaks the Hadoop 3.2 UT because Hadoop 2 and Hadoop 3 have different exception messages.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass the Jenkins with both Hadoop 2/3 or do the following manually.

**Hadoop 2.7**
```
$ build/sbt "sql/testOnly *.FileBasedDataSourceSuite -- -z SPARK-31935"
...
[info] All tests passed.
```

**Hadoop 3.2**
```
$ build/sbt "sql/testOnly *.FileBasedDataSourceSuite -- -z SPARK-31935" -Phadoop-3.2
...
[info] All tests passed.
```

Closes #28791 from dongjoon-hyun/SPARK-31935.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
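The Hadoop 2/3 incompatibility is purely in the exception text raised for an unknown file system scheme. A hedged sketch of a version-tolerant check (the exact message wording is an assumption, as is the helper name; the real fix lives in the Spark test suites):

```java
// Hadoop 2 is assumed to report:  No FileSystem for scheme: fake
// Hadoop 3 is assumed to report:  No FileSystem for scheme "fake"
// Matching on the common prefix lets one assertion accept both.
class UnknownSchemeMessage {
    static boolean isUnknownSchemeError(String msg, String scheme) {
        return msg.contains("No FileSystem for scheme") && msg.contains(scheme);
    }
}
```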
### What changes were proposed in this pull request?
This PR updates the test case to accept Hadoop 2/3 error messages correctly.

### Why are the changes needed?
SPARK-31935 (#28760) breaks the Hadoop 3.2 UT because Hadoop 2 and Hadoop 3 have different exception messages. In #28791, two test suites missed the fix.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit test

Closes #28796 from gengliangwang/SPARK-31926-followup.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…ctive in data source options

### What changes were proposed in this pull request?
This is a followup of #28760 to fix the remaining issues:
1. should consider data source options when refreshing cache by path at the end of `InsertIntoHadoopFsRelationCommand`
2. should consider data source options when inferring schema for file source
3. should consider data source options when getting the qualified path in file source v2.

### Why are the changes needed?
We didn't catch these issues in #28760, because the test case only checks the error when initializing the file system. If the file system is initialized multiple times during a simple read/write action, the test case actually only tests the first time.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Rewrite the test to make sure the entire data source read/write action can succeed.

Closes #28948 from cloud-fan/fix.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Gengliang Wang <[email protected]>
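Point 3 above concerns turning a scheme-less path into a fully qualified one against the (now option-aware) default file system. A rough sketch of that qualification step using `java.net.URI` (this only approximates Hadoop's `Path.makeQualified`; the helper name is hypothetical):

```java
import java.net.URI;

// If the path already carries a scheme, keep it as-is; otherwise qualify
// it against the default FS URI derived from the merged Hadoop config.
class QualifyPathSketch {
    static String qualifiedPath(String path, String defaultFsUri) {
        URI uri = URI.create(path);
        if (uri.getScheme() != null) {
            return path;
        }
        String base = defaultFsUri.endsWith("/")
                ? defaultFsUri.substring(0, defaultFsUri.length() - 1)
                : defaultFsUri;
        String rel = path.startsWith("/") ? path : "/" + path;
        return base + rel;
    }
}
```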
… effective in data source options

### What changes were proposed in this pull request?
Backport of #28948.

This is a followup of #28760 to fix the remaining issues:
1. should consider data source options when refreshing cache by path at the end of `InsertIntoHadoopFsRelationCommand`
2. should consider data source options when inferring schema for file source
3. should consider data source options when getting the qualified path in file source v2.

### Why are the changes needed?
We didn't catch these issues in #28760, because the test case only checks the error when initializing the file system. If the file system is initialized multiple times during a simple read/write action, the test case actually only tests the first time.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Rewrite the test to make sure the entire data source read/write action can succeed.

Closes #28973 from cloud-fan/pick.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>