
Conversation

@gengliangwang
Member

What changes were proposed in this pull request?

Make Hadoop file system configurations effective in data source options.

From `org.apache.hadoop.fs.FileSystem.java`:
```
  public static FileSystem get(URI uri, Configuration conf) throws IOException {
    String scheme = uri.getScheme();
    String authority = uri.getAuthority();

    if (scheme == null && authority == null) {     // use default FS
      return get(conf);
    }

    if (scheme != null && authority == null) {     // no authority
      URI defaultUri = getDefaultUri(conf);
      if (scheme.equals(defaultUri.getScheme())    // if scheme matches default
          && defaultUri.getAuthority() != null) {  // & default has authority
        return get(defaultUri, conf);              // return default
      }
    }

    String disableCacheName = String.format("fs.%s.impl.disable.cache", scheme);
    if (conf.getBoolean(disableCacheName, false)) {
      return createFileSystem(uri, conf);
    }

    return CACHE.get(uri, conf);
  }
```
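The dispatch above can be sketched without Hadoop on the classpath. The following is a simplified, hypothetical model: the class and the `decide` helper are illustrative only, and it returns a label for the branch taken rather than a `FileSystem` instance.

```java
import java.net.URI;
import java.util.Map;

// Hypothetical, simplified model of the FileSystem.get dispatch quoted above.
public class FsCacheDecision {
    static String decide(URI uri, Map<String, String> conf) {
        String scheme = uri.getScheme();
        String authority = uri.getAuthority();
        if (scheme == null && authority == null) {
            return "default";  // use the default file system
        }
        String disableCacheKey = String.format("fs.%s.impl.disable.cache", scheme);
        if (Boolean.parseBoolean(conf.getOrDefault(disableCacheKey, "false"))) {
            return "create";   // cache bypassed: a fresh instance per call
        }
        return "cache";        // shared instance from the static CACHE
    }

    public static void main(String[] args) {
        Map<String, String> conf = Map.of("fs.s3a.impl.disable.cache", "true");
        System.out.println(decide(URI.create("s3a://bucket/path"), conf));  // create
        System.out.println(decide(URI.create("hdfs://nn/path"), conf));     // cache
    }
}
```

Note that the real `CACHE` is keyed by scheme and authority, not by the `Configuration`, so a per-read configuration only reliably takes effect when the cache is disabled for that scheme.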

Before this change, the file system configurations in data source options were not propagated in `DataSource.scala`.
After this change, we can specify authority- and URI scheme-related configurations for scanning file systems.

This problem only exists in data source V1. In V2, we already use `sparkSession.sessionState.newHadoopConfWithOptions(options)` in `FileTable`.
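The V2 behavior referenced above can be modeled as a plain map overlay. This is a hypothetical simplification of `newHadoopConfWithOptions` (the real method produces a Hadoop `Configuration`, not a `Map`):

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model: start from the session-level Hadoop configuration and
// overlay the per-read data source options, options winning on conflicts.
public class ConfWithOptions {
    static Map<String, String> newHadoopConfWithOptions(
            Map<String, String> sessionConf, Map<String, String> options) {
        Map<String, String> merged = new HashMap<>(sessionConf);
        merged.putAll(options);  // data source options take precedence
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> session = Map.of("fs.defaultFS", "hdfs://clusterA");
        Map<String, String> options = Map.of("fs.defaultFS", "hdfs://clusterB");
        // The per-read option overrides the session-level default.
        System.out.println(newHadoopConfWithOptions(session, options).get("fs.defaultFS"));
    }
}
```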

Why are the changes needed?

Allow users to specify authority- and URI scheme-related Hadoop configurations for file source reading.

Does this PR introduce any user-facing change?

Yes, the file system related Hadoop configurations in data source options now take effect on reading.

How was this patch tested?

Unit test

Closes apache#28760 from gengliangwang/ds_conf.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Gengliang Wang <[email protected]>
@gengliangwang
Member Author

This PR backports #28760 to branch-3.0

@gengliangwang gengliangwang changed the title [SPARK-31935][SQL] Hadoop file system config should be effective in data source options [3.0][SPARK-31935][SQL] Hadoop file system config should be effective in data source options Jun 10, 2020
@gengliangwang gengliangwang requested a review from cloud-fan June 10, 2020 04:49
@SparkQA

SparkQA commented Jun 10, 2020

Test build #123723 has finished for PR 28776 at commit f6cca6b.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Member Author

retest this please

@HyukjinKwon
Member

LGTM if tests pass

@HyukjinKwon HyukjinKwon changed the title [3.0][SPARK-31935][SQL] Hadoop file system config should be effective in data source options [SPARK-31935][SQL][3.0] Hadoop file system config should be effective in data source options Jun 10, 2020
@SparkQA

SparkQA commented Jun 10, 2020

Test build #123734 has finished for PR 28776 at commit f6cca6b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Jun 10, 2020

Test build #123754 has finished for PR 28776 at commit f6cca6b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Please hold on this PR because this will break Hadoop 3.2.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-31935][SQL][3.0] Hadoop file system config should be effective in data source options [SPARK-31935][SQL][3.0][test-hadoop3.2] Hadoop file system config should be effective in data source options Jun 10, 2020
@dongjoon-hyun
Member

Retest this please.

@dongjoon-hyun
Member

@gengliangwang, please include my follow-up PR here.

### What changes were proposed in this pull request?

This PR updates the test case to accept Hadoop 2/3 error message correctly.

### Why are the changes needed?

SPARK-31935 (apache#28760) breaks the Hadoop 3.2 unit tests because Hadoop 2 and Hadoop 3 have different exception messages.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the Jenkins with both Hadoop 2/3 or do the following manually.

**Hadoop 2.7**
```
$ build/sbt "sql/testOnly *.FileBasedDataSourceSuite -- -z SPARK-31935"
...
[info] All tests passed.
```

**Hadoop 3.2**
```
$ build/sbt "sql/testOnly *.FileBasedDataSourceSuite -- -z SPARK-31935" -Phadoop-3.2
...
[info] All tests passed.
```

Closes apache#28791 from dongjoon-hyun/SPARK-31935.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@gengliangwang
Member Author

@dongjoon-hyun sure, I have included it in this PR. Thanks.

@dongjoon-hyun
Member

Thanks~

@SparkQA

SparkQA commented Jun 11, 2020

Test build #123794 has finished for PR 28776 at commit f6cca6b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```
val message = intercept[java.io.IOException] {
  spark.readStream.option("fs.defaultFS", defaultFs).text(path)
}.getMessage
assert(message == expectMessage)
```
Member

Oops. It seems that I missed this one.

Member

Could you fix this place and forward-port to master, too?

Member Author

ok, let me do it now

Member

Thanks again!

@SparkQA

SparkQA commented Jun 11, 2020

Test build #123802 has finished for PR 28776 at commit 5623228.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

This PR updates the test cases to accept the Hadoop 2/3 error messages correctly.

SPARK-31935 (apache#28760) breaks the Hadoop 3.2 unit tests because Hadoop 2 and Hadoop 3 have different exception messages.
In apache#28791, two test suites were missed by the fix.

No

Unit test

Closes apache#28796 from gengliangwang/SPARK-31926-followup.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@SparkQA

SparkQA commented Jun 11, 2020

Test build #123821 has finished for PR 28776 at commit da8d48d.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

retest this please

@SparkQA

SparkQA commented Jun 11, 2020

Test build #123839 has finished for PR 28776 at commit da8d48d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

retest this please

@HyukjinKwon
Member

I reverted dad163f from branch-3.0.

@SparkQA

SparkQA commented Jun 11, 2020

Test build #123849 has finished for PR 28776 at commit da8d48d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Jun 11, 2020

Test build #123857 has finished for PR 28776 at commit da8d48d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Thank you, @gengliangwang and all.
Merged to branch-3.0.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-31935][SQL][3.0][test-hadoop3.2] Hadoop file system config should be effective in data source options [SPARK-31935][SQL][3.0] Hadoop file system config should be effective in data source options Jun 11, 2020
dongjoon-hyun added a commit that referenced this pull request Jun 11, 2020
Closes #28776 from gengliangwang/SPARK-31935-3.0.

Lead-authored-by: Gengliang Wang <[email protected]>
Co-authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
holdenk pushed a commit to holdenk/spark that referenced this pull request Jun 25, 2020
Closes apache#28776 from gengliangwang/SPARK-31935-3.0.

Lead-authored-by: Gengliang Wang <[email protected]>
Co-authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
