
Conversation

@gengliangwang (Member) commented Jun 8, 2020

What changes were proposed in this pull request?

Make Hadoop file system configs effective in data source options.

From `org.apache.hadoop.fs.FileSystem.java`:

```
  public static FileSystem get(URI uri, Configuration conf) throws IOException {
    String scheme = uri.getScheme();
    String authority = uri.getAuthority();

    if (scheme == null && authority == null) {     // use default FS
      return get(conf);
    }

    if (scheme != null && authority == null) {     // no authority
      URI defaultUri = getDefaultUri(conf);
      if (scheme.equals(defaultUri.getScheme())    // if scheme matches default
          && defaultUri.getAuthority() != null) {  // & default has authority
        return get(defaultUri, conf);              // return default
      }
    }

    String disableCacheName = String.format("fs.%s.impl.disable.cache", scheme);
    if (conf.getBoolean(disableCacheName, false)) {
      return createFileSystem(uri, conf);
    }

    return CACHE.get(uri, conf);
  }
```

Before this change, the file system configurations given in data source options were not propagated in `DataSource.scala`.
After this change, we can specify authority- and URI-scheme-related configurations for scanning file systems.

This problem only exists in data source V1. In V2, we already use `sparkSession.sessionState.newHadoopConfWithOptions(options)` in `FileTable`.
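
To illustrate the user-facing behavior, a minimal sketch (the S3A bucket and credential values are placeholders; `fs.s3a.access.key` and `fs.s3a.secret.key` are standard Hadoop S3A keys):

```
// With this change, Hadoop FS settings passed as data source options reach the
// Configuration used for this read on the V1 path, instead of being dropped.
val df = spark.read
  .option("fs.s3a.access.key", "<access-key>")  // per-read Hadoop FS config
  .option("fs.s3a.secret.key", "<secret-key>")
  .parquet("s3a://some-bucket/data")
```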

Why are the changes needed?

Allow users to specify authority- and URI-scheme-related Hadoop configurations for file source reading.

Does this PR introduce any user-facing change?

Yes, file-system-related Hadoop configurations provided in data source options will take effect on reading.

How was this patch tested?

Unit test

@gengliangwang (Member Author)

cc @liancheng @gatorsmile

@gengliangwang gengliangwang changed the title [PARK-31935][SQL]Data source options should be propagated in method checkAndGlobPathIfNecessary [SPARK-31935][SQL]Data source options should be propagated in method checkAndGlobPathIfNecessary Jun 8, 2020
```
      checkFilesExist: Boolean): Seq[Path] = {
    val allPaths = caseInsensitiveOptions.get("path") ++ paths
-   val hadoopConf = sparkSession.sessionState.newHadoopConf()
+   val hadoopConf = sparkSession.sessionState.newHadoopConfWithOptions(options)
```
Member

There are multiple `newHadoopConf`s in this file. Should we also fix them?

Member Author

Yes, we should. I am adding more test cases.
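
For context, `SessionState.newHadoopConfWithOptions` (used in the diff above) works roughly as in the following simplified sketch (from memory, not the exact Spark source): it copies the session Hadoop configuration and overlays the data source options, skipping the `path`/`paths` entries.

```
import org.apache.hadoop.conf.Configuration

// Simplified sketch: overlay the data source options onto the session Hadoop
// conf so keys like "fs.defaultFS" take effect for this scan only.
def newHadoopConfWithOptions(options: Map[String, String]): Configuration = {
  val hadoopConf = newHadoopConf()  // the session-level Hadoop configuration
  options.foreach { case (k, v) =>
    if (v != null && k != "path" && k != "paths") hadoopConf.set(k, v)
  }
  hadoopConf
}
```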

@gengliangwang gengliangwang changed the title [SPARK-31935][SQL]Data source options should be propagated in method checkAndGlobPathIfNecessary [SPARK-31935][SQL] Hadoop file system config should be effective in data source options Jun 9, 2020
@SparkQA commented Jun 9, 2020

Test build #123652 has finished for PR 28760 at commit e5b5751.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 9, 2020

Test build #123654 has finished for PR 28760 at commit 0dada86.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng (Contributor)

LGTM, thanks!

@SparkQA commented Jun 9, 2020

Test build #123659 has finished for PR 28760 at commit 922efd5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 9, 2020

Test build #123655 has finished for PR 28760 at commit 0513869.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member) left a comment

Main code changes look good. LGTM if tests are fixed and pass.

@SparkQA commented Jun 9, 2020

Test build #123657 has finished for PR 28760 at commit 8aa11cc.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor) left a comment

LGTM. How far shall we backport it?

@gengliangwang (Member Author)

@cloud-fan I think we can backport it to branch-2.4. Once branch-3.0 rc3 is cut, we can backport it to branch-3.0 as well.
WDYT?

@SparkQA commented Jun 9, 2020

Test build #123664 has finished for PR 28760 at commit 4d88604.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang (Member Author)

retest this please

@SparkQA commented Jun 9, 2020

Test build #123677 has finished for PR 28760 at commit 4d88604.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang (Member Author)

Merging to master.
I will backport it to branch-2.4 and branch-3.0 later.

gengliangwang added a commit to gengliangwang/spark that referenced this pull request Jun 10, 2020
…ata source options


Closes apache#28760 from gengliangwang/ds_conf.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Gengliang Wang <[email protected]>
@dongjoon-hyun (Member)

I'll make a follow-up PR.

dongjoon-hyun added a commit that referenced this pull request Jun 11, 2020
### What changes were proposed in this pull request?

This PR updates the test case to accept both the Hadoop 2 and Hadoop 3 error messages correctly.
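
For context, a hedged sketch of the kind of normalization involved (illustrative; `message` stands for the caught `IOException` message, and the exact suite code may differ). Hadoop 2 reports `No FileSystem for scheme: fake` while Hadoop 3 reports `No FileSystem for scheme "fake"`, so the assertion strips the punctuation that differs:

```
// Illustrative: drop the ':' and '"' characters whose presence differs between
// Hadoop 2 and Hadoop 3 before comparing the exception message.
val normalized = message.filterNot(Set(':', '"').contains)
assert(normalized.contains("No FileSystem for scheme fake"))
```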

### Why are the changes needed?

SPARK-31935 (#28760) breaks the Hadoop 3.2 UT because Hadoop 2 and Hadoop 3 have different exception messages.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass Jenkins with both Hadoop 2 and 3, or run the following manually.

**Hadoop 2.7**
```
$ build/sbt "sql/testOnly *.FileBasedDataSourceSuite -- -z SPARK-31935"
...
[info] All tests passed.
```

**Hadoop 3.2**
```
$ build/sbt "sql/testOnly *.FileBasedDataSourceSuite -- -z SPARK-31935" -Phadoop-3.2
...
[info] All tests passed.
```

Closes #28791 from dongjoon-hyun/SPARK-31935.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
gengliangwang pushed a commit to gengliangwang/spark that referenced this pull request Jun 11, 2020

Closes apache#28791 from dongjoon-hyun/SPARK-31935.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun pushed a commit that referenced this pull request Jun 11, 2020
### What changes were proposed in this pull request?

This PR updates the test case to accept both the Hadoop 2 and Hadoop 3 error messages correctly.

### Why are the changes needed?

SPARK-31935 (#28760) breaks the Hadoop 3.2 UT because Hadoop 2 and Hadoop 3 have different exception messages.
In #28791, two test suites missed the fix.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test

Closes #28796 from gengliangwang/SPARK-31926-followup.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
gengliangwang added a commit to gengliangwang/spark that referenced this pull request Jun 11, 2020

Closes apache#28796 from gengliangwang/SPARK-31926-followup.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
gengliangwang pushed a commit that referenced this pull request Jul 1, 2020
…ctive in data source options

### What changes were proposed in this pull request?

This is a followup of #28760 to fix the remaining issues:
1. should consider data source options when refreshing cache by path at the end of `InsertIntoHadoopFsRelationCommand`
2. should consider data source options when inferring schema for file source
3. should consider data source options when getting the qualified path in file source v2 (a sketch of the common pattern follows this list).
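
A hedged sketch of the recurring pattern behind all three fixes (`sparkSession`, `options`, and `path` are assumed to be in scope; not the exact Spark source):

```
// Derive the Hadoop conf from the data source options rather than the bare
// session conf, then use it for any FileSystem interaction on the path.
val hadoopConf = sparkSession.sessionState.newHadoopConfWithOptions(options)
val fs = path.getFileSystem(hadoopConf)
val qualifiedPath = path.makeQualified(fs.getUri, fs.getWorkingDirectory)
```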

### Why are the changes needed?

We didn't catch these issues in #28760 because the test case only checks for an error when initializing the file system. If we initialize the file system multiple times during a simple read/write action, the test case effectively only tests the first initialization.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Rewrote the test to make sure the entire data source read/write action can succeed.
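
A hedged sketch of what such an end-to-end check could look like (`withTempPath` is the usual Spark test helper; the FS class and option keys here are illustrative, not the exact suite code):

```
// Run a full write-then-read with the Hadoop FS config supplied via options;
// disabling the FS cache forces Hadoop to construct the FileSystem from this
// very Configuration, so the test fails if the options were dropped.
withTempPath { dir =>
  val path = dir.getCanonicalPath
  val fsOptions = Map(
    "fs.file.impl" -> classOf[org.apache.hadoop.fs.LocalFileSystem].getName,
    "fs.file.impl.disable.cache" -> "true")
  spark.range(10).write.options(fsOptions).parquet(path)
  assert(spark.read.options(fsOptions).parquet(path).count() == 10)
}
```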

Closes #28948 from cloud-fan/fix.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Gengliang Wang <[email protected]>
gengliangwang pushed a commit that referenced this pull request Jul 1, 2020
…ctive in data source options


Closes #28948 from cloud-fan/fix.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Gengliang Wang <[email protected]>
(cherry picked from commit 6edb20d)
Signed-off-by: Gengliang Wang <[email protected]>
cloud-fan added a commit to cloud-fan/spark that referenced this pull request Jul 2, 2020
…ctive in data source options


Closes apache#28948 from cloud-fan/fix.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Gengliang Wang <[email protected]>
(cherry picked from commit 6edb20d)
Signed-off-by: Gengliang Wang <[email protected]>
HyukjinKwon pushed a commit that referenced this pull request Jul 2, 2020
… effective in data source options

### What changes were proposed in this pull request?

backport #28948


Closes #28973 from cloud-fan/pick.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>