[2.4][SPARK-31935][SQL] Hadoop file system config should be effective in data source options #28771

gengliangwang · 2020-06-09T19:44:47Z

What changes were proposed in this pull request?

Make Hadoop file system config effective in data source options.

From org.apache.hadoop.fs.FileSystem.java:

  public static FileSystem get(URI uri, Configuration conf) throws IOException {
    String scheme = uri.getScheme();
    String authority = uri.getAuthority();

    if (scheme == null && authority == null) {     // use default FS
      return get(conf);
    }

    if (scheme != null && authority == null) {     // no authority
      URI defaultUri = getDefaultUri(conf);
      if (scheme.equals(defaultUri.getScheme())    // if scheme matches default
          && defaultUri.getAuthority() != null) {  // & default has authority
        return get(defaultUri, conf);              // return default
      }
    }
    
    String disableCacheName = String.format("fs.%s.impl.disable.cache", scheme);
    if (conf.getBoolean(disableCacheName, false)) {
      return createFileSystem(uri, conf);
    }

    return CACHE.get(uri, conf);
  }

Before this change, file system configurations passed as data source options were not propagated to the Hadoop configuration used in DataSource.scala.
After this change, users can specify authority-related and URI-scheme-related configurations for scanning file systems.
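
To make the dependency on configuration concrete, here is a minimal, stdlib-only sketch of the branching in the FileSystem.get excerpt quoted above. It is not Hadoop code: the class FsLookupSketch, the method branch, and the returned labels are hypothetical, and a plain Map stands in for Hadoop's Configuration. The key fs.defaultFS and its file:/// fallback are what getDefaultUri is assumed to read.

```java
import java.net.URI;
import java.util.Map;

// Hypothetical model of the decision made in FileSystem.get(uri, conf).
// A Map<String, String> stands in for org.apache.hadoop.conf.Configuration.
final class FsLookupSketch {

    // Returns a label naming the branch the lookup would take.
    static String branch(URI uri, Map<String, String> conf) {
        String scheme = uri.getScheme();
        String authority = uri.getAuthority();

        // No scheme and no authority: fall back to the default file system.
        if (scheme == null && authority == null) {
            return "default-fs";
        }

        // Scheme without authority: borrow the authority from the default URI
        // when the schemes match (assumed key: fs.defaultFS).
        if (scheme != null && authority == null) {
            URI defaultUri = URI.create(conf.getOrDefault("fs.defaultFS", "file:///"));
            if (scheme.equals(defaultUri.getScheme()) && defaultUri.getAuthority() != null) {
                return "default-fs-with-authority:" + defaultUri.getAuthority();
            }
        }

        // Per-scheme switch that bypasses the file system cache.
        String disableCacheName = String.format("fs.%s.impl.disable.cache", scheme);
        if (Boolean.parseBoolean(conf.getOrDefault(disableCacheName, "false"))) {
            return "new-instance";
        }
        return "cached";
    }

    public static void main(String[] args) {
        Map<String, String> conf = Map.of("fs.defaultFS", "hdfs://nn1:8020");
        System.out.println(branch(URI.create("hdfs:///data/t"), conf));
    }
}
```

Every branch except the first reads a value from the configuration, so a per-read setting that never reaches this configuration cannot influence which file system is returned.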

This problem only exists in data source V1. In V2, we already use sparkSession.sessionState.newHadoopConfWithOptions(options) in FileTable.
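
As a rough illustration of what overlaying data source options onto a base configuration involves, here is a hedged, stdlib-only sketch. It is not the Spark implementation of newHadoopConfWithOptions: the class ConfOverlaySketch and the method withOptions are hypothetical, plain maps stand in for the Hadoop configuration, and skipping null values and the "path"/"paths" keys is an assumption about how location options would be kept out of the configuration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of layering per-read options over a session-level configuration.
final class ConfOverlaySketch {

    // Returns a copy of sessionConf with the options applied on top.
    // The session-level map itself is left unchanged.
    static Map<String, String> withOptions(Map<String, String> sessionConf,
                                           Map<String, String> options) {
        Map<String, String> merged = new HashMap<>(sessionConf);
        for (Map.Entry<String, String> e : options.entrySet()) {
            String key = e.getKey();
            String value = e.getValue();
            // Assumption: location options and null values are not configuration entries.
            if (value != null && !key.equals("path") && !key.equals("paths")) {
                merged.put(key, value);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> session = Map.of("fs.defaultFS", "hdfs://nn1:8020");
        Map<String, String> options = Map.of("fs.defaultFS", "hdfs://nn2:8020", "path", "/data/t");
        System.out.println(withOptions(session, options));
    }
}
```

Copying before overlaying keeps one read's options from leaking into other reads that share the same session.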

Why are the changes needed?

Allow users to specify authority-related and URI-scheme-related Hadoop configurations when reading from file sources.

Does this PR introduce any user-facing change?

Yes. File-system-related Hadoop configurations given as data source options now take effect when reading.

How was this patch tested?

Unit test

gengliangwang · 2020-06-09T19:45:12Z

This PR backports #28760 to branch-2.4.

gengliangwang · 2020-06-09T19:45:48Z

cc @liancheng

SparkQA · 2020-06-09T23:03:34Z

Test build #123703 has finished for PR 28771 at commit 6984453.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gengliangwang · 2020-06-10T01:48:44Z

Merging to branch-2.4.

… in data source options

Closes #28771 from gengliangwang/SPARK-31935-2.4.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Gengliang Wang <[email protected]>

dongjoon-hyun · 2020-07-01T06:52:50Z

Hi, @gengliangwang. Did you merge this without an LGTM from any other committer?

cc @gatorsmile, @cloud-fan, @HyukjinKwon

dongjoon-hyun · 2020-07-01T06:55:43Z

Note that I'm fine with this backport, since it seems you were confident in it and it looked urgent to you.

HyukjinKwon · 2020-07-01T06:59:31Z

Yeah, it would be best to have an LGTM or at least some positive comments.

back port SPARK-31935

6984453

probot-autolabeler bot added the SQL and STRUCTURED STREAMING labels Jun 9, 2020

gengliangwang closed this Jun 10, 2020

gengliangwang mentioned this pull request Jul 1, 2020

[SPARK-31935][SQL][FOLLOWUP] Hadoop file system config should be effective in data source options #28948

Closed
