[SPARK-31935][SQL][3.0] Hadoop file system config should be effective in data source options #28776
Conversation
### What changes were proposed in this pull request?
Make Hadoop file system config effective in data source options.
From `org.apache.hadoop.fs.FileSystem.java`:
```
public static FileSystem get(URI uri, Configuration conf) throws IOException {
  String scheme = uri.getScheme();
  String authority = uri.getAuthority();

  if (scheme == null && authority == null) {     // use default FS
    return get(conf);
  }

  if (scheme != null && authority == null) {     // no authority
    URI defaultUri = getDefaultUri(conf);
    if (scheme.equals(defaultUri.getScheme())    // if scheme matches default
        && defaultUri.getAuthority() != null) {  // & default has authority
      return get(defaultUri, conf);              // return default
    }
  }

  String disableCacheName = String.format("fs.%s.impl.disable.cache", scheme);
  if (conf.getBoolean(disableCacheName, false)) {
    return createFileSystem(uri, conf);
  }

  return CACHE.get(uri, conf);
}
```
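To make these resolution rules concrete, here is a small hedged example (the local file system stands in for a real default FS; the URI and path are placeholders):

```
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

// A URI with no scheme and no authority falls through to the default FS,
// i.e. whatever fs.defaultFS is set to in the Configuration.
val conf = new Configuration()
conf.set("fs.defaultFS", "file:///")  // placeholder default FS
val fs = FileSystem.get(new URI("/tmp/data"), conf)
println(fs.getUri)  // file:///
```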
Before this change, the file system configurations in data source options were not propagated in `DataSource.scala`.
After this change, we can specify authority- and URI-scheme-related configurations for scanning file systems.
This problem only exists in data source V1. In V2, we already use `sparkSession.sessionState.newHadoopConfWithOptions(options)` in `FileTable`.
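Conceptually, overlaying the options on the session's Hadoop configuration amounts to something like the following (a minimal sketch for illustration, not Spark's actual `newHadoopConfWithOptions` implementation; the helper name is hypothetical):

```
import org.apache.hadoop.conf.Configuration

// Sketch only: copy the base Hadoop configuration and overlay the data source
// options, so fs.* keys supplied as options win when a FileSystem is resolved.
def hadoopConfWithOptions(
    base: Configuration,
    options: Map[String, String]): Configuration = {
  val conf = new Configuration(base)  // copy constructor leaves `base` untouched
  options.foreach { case (k, v) => conf.set(k, v) }
  conf
}
```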
### Why are the changes needed?
Allow users to specify authority- and URI-scheme-related Hadoop configurations for file source reading.
### Does this PR introduce _any_ user-facing change?
Yes, file-system-related Hadoop configurations in the data source options will be effective on reading.
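For instance (a hedged sketch; the s3a keys, bucket, and path are illustrative placeholders, not taken from this PR):

```
// File-system settings passed as data source options now take effect when
// Spark resolves the FileSystem for the read path (DSv1 included).
// Credentials, bucket, and path below are hypothetical placeholders.
val df = spark.read
  .option("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
  .option("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))
  .parquet("s3a://example-bucket/events")
```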
### How was this patch tested?
Unit test.
Closes apache#28760 from gengliangwang/ds_conf.
Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Gengliang Wang <[email protected]>
This PR backports #28760 to branch-3.0.
Test build #123723 has finished for PR 28776 at commit
retest this please
HyukjinKwon left a comment:
LGTM if tests pass
Test build #123734 has finished for PR 28776 at commit
retest this please
Test build #123754 has finished for PR 28776 at commit
Please hold on this PR because this will break Hadoop 3.2.
Retest this please.
@gengliangwang Please include my follow-up PR here.
### What changes were proposed in this pull request?
This PR updates the test case to accept the Hadoop 2/3 error messages correctly.

### Why are the changes needed?
SPARK-31935 (apache#28760) breaks the Hadoop 3.2 UT because Hadoop 2 and Hadoop 3 have different exception messages.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass the Jenkins with both Hadoop 2/3, or do the following manually.

**Hadoop 2.7**
```
$ build/sbt "sql/testOnly *.FileBasedDataSourceSuite -- -z SPARK-31935"
...
[info] All tests passed.
```

**Hadoop 3.2**
```
$ build/sbt "sql/testOnly *.FileBasedDataSourceSuite -- -z SPARK-31935" -Phadoop-3.2
...
[info] All tests passed.
```

Closes apache#28791 from dongjoon-hyun/SPARK-31935.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@dongjoon-hyun sure, I have included it in this PR. Thanks.
Thanks~
Test build #123794 has finished for PR 28776 at commit
```
val message = intercept[java.io.IOException] {
  spark.readStream.option("fs.defaultFS", defaultFs).text(path)
}.getMessage
assert(message == expectMessage)
```
Oops. It seems that I missed here.
Could you fix this place and forward-port to master, too?
ok, let me do it now
Thanks again!
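For reference, a hedged sketch of the kind of relaxed assertion discussed above (the two expected-message names are hypothetical stand-ins for the Hadoop 2 and Hadoop 3 texts, and `spark`, `defaultFs`, and `path` come from the surrounding test context):

```
// Compare against the set of acceptable messages instead of one exact string,
// since Hadoop 2 and Hadoop 3 word this IOException differently.
// `intercept` comes from org.scalatest.Assertions;
// expectMessageHadoop2 / expectMessageHadoop3 are hypothetical names.
val message = intercept[java.io.IOException] {
  spark.readStream.option("fs.defaultFS", defaultFs).text(path)
}.getMessage
assert(Set(expectMessageHadoop2, expectMessageHadoop3).contains(message))
```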
Test build #123802 has finished for PR 28776 at commit
### What changes were proposed in this pull request?
This PR updates the test case to accept the Hadoop 2/3 error messages correctly. SPARK-31935 (apache#28760) breaks the Hadoop 3.2 UT because Hadoop 2 and Hadoop 3 have different exception messages, and two test suites were missed by the fix in apache#28791.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Unit test.

Closes apache#28796 from gengliangwang/SPARK-31926-followup.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
Test build #123821 has finished for PR 28776 at commit
retest this please
Test build #123839 has finished for PR 28776 at commit
retest this please
I reverted dad163f from branch-3.0.
Test build #123849 has finished for PR 28776 at commit
retest this please
Test build #123857 has finished for PR 28776 at commit
Thank you, @gengliangwang and all.
Closes #28776 from gengliangwang/SPARK-31935-3.0.

Lead-authored-by: Gengliang Wang <[email protected]>
Co-authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>