
Conversation

@gengliangwang (Member) commented Jun 8, 2020

What changes were proposed in this pull request?

Make Hadoop file system configs effective in data source options.

From `org.apache.hadoop.fs.FileSystem.java`:

```
  public static FileSystem get(URI uri, Configuration conf) throws IOException {
    String scheme = uri.getScheme();
    String authority = uri.getAuthority();

    if (scheme == null && authority == null) {     // use default FS
      return get(conf);
    }

    if (scheme != null && authority == null) {     // no authority
      URI defaultUri = getDefaultUri(conf);
      if (scheme.equals(defaultUri.getScheme())    // if scheme matches default
          && defaultUri.getAuthority() != null) {  // & default has authority
        return get(defaultUri, conf);              // return default
      }
    }

    String disableCacheName = String.format("fs.%s.impl.disable.cache", scheme);
    if (conf.getBoolean(disableCacheName, false)) {
      return createFileSystem(uri, conf);
    }

    return CACHE.get(uri, conf);
  }
```

Before this change, the file system configurations given in data source options were not propagated in `DataSource.scala`.
After this change, we can specify authority- and URI-scheme-related configurations for scanning file systems.

This problem only exists in data source V1. In V2, we already use `sparkSession.sessionState.newHadoopConfWithOptions(options)` in `FileTable`.
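
To illustrate the user-facing behavior, a minimal sketch (the S3A bucket and credential values are placeholders; `fs.s3a.access.key` and `fs.s3a.secret.key` are standard Hadoop S3A keys):

```
// With this change, Hadoop FS settings passed as data source options reach the
// Configuration used for this read on the V1 path, instead of being dropped.
val df = spark.read
  .option("fs.s3a.access.key", "<access-key>")  // per-read Hadoop FS config
  .option("fs.s3a.secret.key", "<secret-key>")
  .parquet("s3a://some-bucket/data")
```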

Why are the changes needed?

Allow users to specify authority- and URI-scheme-related Hadoop configurations for file source reading.

Does this PR introduce any user-facing change?

Yes, file-system-related Hadoop configurations provided in data source options will take effect on reading.

How was this patch tested?

Unit test

@gengliangwang (Member Author)

cc @liancheng @gatorsmile

@gengliangwang gengliangwang changed the title [PARK-31935][SQL]Data source options should be propagated in method checkAndGlobPathIfNecessary [SPARK-31935][SQL]Data source options should be propagated in method checkAndGlobPathIfNecessary Jun 8, 2020
```
      checkFilesExist: Boolean): Seq[Path] = {
    val allPaths = caseInsensitiveOptions.get("path") ++ paths
-   val hadoopConf = sparkSession.sessionState.newHadoopConf()
+   val hadoopConf = sparkSession.sessionState.newHadoopConfWithOptions(options)
```
Member

There are multiple `newHadoopConf`s in this file. Should we also fix them?

Member Author

Yes, we should. I am adding more test cases.
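
For context, `SessionState.newHadoopConfWithOptions` (used in the diff above) works roughly as in the following simplified sketch (from memory, not the exact Spark source): it copies the session Hadoop configuration and overlays the data source options, skipping the `path`/`paths` entries.

```
import org.apache.hadoop.conf.Configuration

// Simplified sketch: overlay the data source options onto the session Hadoop
// conf so keys like "fs.defaultFS" take effect for this scan only.
def newHadoopConfWithOptions(options: Map[String, String]): Configuration = {
  val hadoopConf = newHadoopConf()  // the session-level Hadoop configuration
  options.foreach { case (k, v) =>
    if (v != null && k != "path" && k != "paths") hadoopConf.set(k, v)
  }
  hadoopConf
}
```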

@gengliangwang gengliangwang changed the title [SPARK-31935][SQL]Data source options should be propagated in method checkAndGlobPathIfNecessary [SPARK-31935][SQL] Hadoop file system config should be effective in data source options Jun 9, 2020
@SparkQA commented Jun 9, 2020

Test build #123652 has finished for PR 28760 at commit e5b5751.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 9, 2020

Test build #123654 has finished for PR 28760 at commit 0dada86.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng (Contributor)

LGTM, thanks!

@SparkQA commented Jun 9, 2020

Test build #123659 has finished for PR 28760 at commit 922efd5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 9, 2020

Test build #123655 has finished for PR 28760 at commit 0513869.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member) left a comment

Main code changes look good. LGTM if tests are fixed and pass.

@SparkQA commented Jun 9, 2020

Test build #123657 has finished for PR 28760 at commit 8aa11cc.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor) left a comment

LGTM. How far shall we backport it?

@gengliangwang (Member Author)

@cloud-fan I think we can backport it to branch-2.4. Once branch-3.0 rc3 is cut, we can backport it to branch-3.0 as well.
WDYT?

@SparkQA commented Jun 9, 2020

Test build #123664 has finished for PR 28760 at commit 4d88604.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang (Member Author)

retest this please

@SparkQA commented Jun 9, 2020

Test build #123677 has finished for PR 28760 at commit 4d88604.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang (Member Author)

Merging to master.
I will backport it to branch-2.4 and branch-3.0 later.

gengliangwang added a commit to gengliangwang/spark that referenced this pull request Jun 10, 2020
…ata source options


Closes apache#28760 from gengliangwang/ds_conf.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Gengliang Wang <[email protected]>
@dongjoon-hyun (Member)

I'll make a follow-up PR.

dongjoon-hyun added a commit that referenced this pull request Jun 11, 2020
### What changes were proposed in this pull request?

This PR updates the test case to accept both the Hadoop 2 and Hadoop 3 error messages correctly.
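
For context, a hedged sketch of the kind of normalization involved (illustrative; `message` stands for the caught `IOException` message, and the exact suite code may differ). Hadoop 2 reports `No FileSystem for scheme: fake` while Hadoop 3 reports `No FileSystem for scheme "fake"`, so the assertion strips the punctuation that differs:

```
// Illustrative: drop the ':' and '"' characters whose presence differs between
// Hadoop 2 and Hadoop 3 before comparing the exception message.
val normalized = message.filterNot(Set(':', '"').contains)
assert(normalized.contains("No FileSystem for scheme fake"))
```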

### Why are the changes needed?

SPARK-31935 (#28760) breaks the Hadoop 3.2 UT because Hadoop 2 and Hadoop 3 have different exception messages.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass Jenkins with both Hadoop 2 and 3, or run the following manually.

**Hadoop 2.7**
```
$ build/sbt "sql/testOnly *.FileBasedDataSourceSuite -- -z SPARK-31935"
...
[info] All tests passed.
```

**Hadoop 3.2**
```
$ build/sbt "sql/testOnly *.FileBasedDataSourceSuite -- -z SPARK-31935" -Phadoop-3.2
...
[info] All tests passed.
```

Closes #28791 from dongjoon-hyun/SPARK-31935.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
gengliangwang pushed a commit to gengliangwang/spark that referenced this pull request Jun 11, 2020

Closes apache#28791 from dongjoon-hyun/SPARK-31935.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun pushed a commit that referenced this pull request Jun 11, 2020
### What changes were proposed in this pull request?

This PR updates the test case to accept both the Hadoop 2 and Hadoop 3 error messages correctly.

### Why are the changes needed?

SPARK-31935 (#28760) breaks the Hadoop 3.2 UT because Hadoop 2 and Hadoop 3 have different exception messages.
In #28791, two test suites missed the fix.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test

Closes #28796 from gengliangwang/SPARK-31926-followup.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
gengliangwang added a commit to gengliangwang/spark that referenced this pull request Jun 11, 2020

Closes apache#28796 from gengliangwang/SPARK-31926-followup.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
gengliangwang pushed a commit that referenced this pull request Jul 1, 2020
…ctive in data source options

### What changes were proposed in this pull request?

This is a followup of #28760 to fix the remaining issues:
1. should consider data source options when refreshing cache by path at the end of `InsertIntoHadoopFsRelationCommand`
2. should consider data source options when inferring schema for file source
3. should consider data source options when getting the qualified path in file source v2 (a sketch of the common pattern follows this list).
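
A hedged sketch of the recurring pattern behind all three fixes (`sparkSession`, `options`, and `path` are assumed to be in scope; not the exact Spark source):

```
// Derive the Hadoop conf from the data source options rather than the bare
// session conf, then use it for any FileSystem interaction on the path.
val hadoopConf = sparkSession.sessionState.newHadoopConfWithOptions(options)
val fs = path.getFileSystem(hadoopConf)
val qualifiedPath = path.makeQualified(fs.getUri, fs.getWorkingDirectory)
```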

### Why are the changes needed?

We didn't catch these issues in #28760 because the test case only checks for an error when initializing the file system. If we initialize the file system multiple times during a simple read/write action, the test case effectively only tests the first initialization.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Rewrote the test to make sure the entire data source read/write action can succeed.
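
A hedged sketch of what such an end-to-end check could look like (`withTempPath` is the usual Spark test helper; the FS class and option keys here are illustrative, not the exact suite code):

```
// Run a full write-then-read with the Hadoop FS config supplied via options;
// disabling the FS cache forces Hadoop to construct the FileSystem from this
// very Configuration, so the test fails if the options were dropped.
withTempPath { dir =>
  val path = dir.getCanonicalPath
  val fsOptions = Map(
    "fs.file.impl" -> classOf[org.apache.hadoop.fs.LocalFileSystem].getName,
    "fs.file.impl.disable.cache" -> "true")
  spark.range(10).write.options(fsOptions).parquet(path)
  assert(spark.read.options(fsOptions).parquet(path).count() == 10)
}
```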

Closes #28948 from cloud-fan/fix.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Gengliang Wang <[email protected]>
gengliangwang pushed a commit that referenced this pull request Jul 1, 2020
…ctive in data source options


Closes #28948 from cloud-fan/fix.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Gengliang Wang <[email protected]>
(cherry picked from commit 6edb20d)
Signed-off-by: Gengliang Wang <[email protected]>
cloud-fan added a commit to cloud-fan/spark that referenced this pull request Jul 2, 2020
…ctive in data source options


Closes apache#28948 from cloud-fan/fix.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Gengliang Wang <[email protected]>
(cherry picked from commit 6edb20d)
Signed-off-by: Gengliang Wang <[email protected]>
HyukjinKwon pushed a commit that referenced this pull request Jul 2, 2020
… effective in data source options

### What changes were proposed in this pull request?

backport #28948


Closes #28973 from cloud-fan/pick.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>