Skip to content

Conversation

@cloud-fan
Copy link
Contributor

What changes were proposed in this pull request?

This is a followup of #28760 to fix the remaining issues:

  1. should consider data source options when refreshing cache by path at the end of InsertIntoHadoopFsRelationCommand
  2. should consider data source options when inferring schema for file source
  3. should consider data source options when getting the qualified path in file source v2.

Why are the changes needed?

We didn't catch these issues in #28760, because the test case is to check error when initializing the file system. If we initialize the file system multiple times during a simple read/write action, the test case actually only test the first time.

Does this PR introduce any user-facing change?

No

How was this patch tested?

rewrite the test to make sure the entire data source read/write action can succeed.

@cloud-fan
Copy link
Contributor Author

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Jun 29, 2020

Oh, thanks for pinging me, @cloud-fan .

class FakeFileSystemRequiringDSOption extends LocalFileSystem {
override def initialize(name: URI, conf: Configuration): Unit = {
super.initialize(name, conf)
require(conf.get("ds_option", "") == "value")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice check!

* Tries to re-cache all the cache entries that contain `path` in one or more
* `HadoopFsRelation` node(s) as part of its logical plan.
*/
def recacheByPath(spark: SparkSession, path: Path, fs: FileSystem): Unit = {
Copy link
Member

@dongjoon-hyun dongjoon-hyun Jun 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wondering if we can have a consistent parameter name for all recacheByPaths. Currently, resourcePath and path are used in a mixed way. Is it for distinguish Path class from String class? Even in that case, resourcePath: Path also looks reasonable. Or, we can use path for both if we can.

* Tries to re-cache all the cache entries that contain `resourcePath` in one or more
* `HadoopFsRelation` node(s) as part of its logical plan.
*/
def recacheByPath(spark: SparkSession, resourcePath: String): Unit = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about we change the method as

def recacheByPath(spark: SparkSession, resourcePath: String, options: Map[String, String]=Map.empty)

so that we can avoid the new method below?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That also works, but it feels a little bit weird to couple the data source options concept with the cache manager...

@liancheng
Copy link
Contributor

@cloud-fan, thanks for fixing this! We should probably backport this one to 2.4 as well.

@cloud-fan
Copy link
Contributor Author

Yea, I'll open a new PR for 2.4 after this one gets merged.

@SparkQA
Copy link

SparkQA commented Jun 30, 2020

Test build #124651 has started for PR 28948 at commit 2d7975e.

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Jun 30, 2020

Hi, @liancheng and @cloud-fan .
@gengliangwang 's original SPARK-31935 was Improvement and landed only at branch-3.0.
To have this in branch-2.4, could you adjust the JIRA issue into Bug and add 2.4.x into Affected Versions first?

Please note that I'm not against backporting of SPARK-31935 and this follow-up in branch-2.4.

Also, cc @gatorsmile

@cloud-fan
Copy link
Contributor Author

@gengliangwang Can you open a new PR to backport #28760 to 2.4?

@gengliangwang
Copy link
Member

gengliangwang commented Jul 1, 2020

@gengliangwang Can you open a new PR to backport #28760 to 2.4?

@cloud-fan it is already back ported in #28771

@gengliangwang
Copy link
Member

retest this please.

@liancheng
Copy link
Contributor

liancheng commented Jul 1, 2020

@dongjoon-hyun, thanks for reminding. Didn't see your comment at first, but I've already updated the "Affects Version/s" field in the Jira ticket. Also saw that it's already marked as a bug.

@SparkQA
Copy link

SparkQA commented Jul 1, 2020

Test build #124742 has finished for PR 28948 at commit 2d7975e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor Author

retest this please

Copy link
Member

@gengliangwang gengliangwang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for the fix!

@dongjoon-hyun
Copy link
Member

Thank you all!

@SparkQA
Copy link

SparkQA commented Jul 1, 2020

Test build #124776 has finished for PR 28948 at commit 2d7975e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Copy link
Member

gengliangwang commented Jul 1, 2020

Thanks, merging to master/3.0

gengliangwang pushed a commit that referenced this pull request Jul 1, 2020
…ctive in data source options

### What changes were proposed in this pull request?

This is a followup of #28760 to fix the remaining issues:
1. should consider data source options when refreshing cache by path at the end of `InsertIntoHadoopFsRelationCommand`
2. should consider data source options when inferring schema for file source
3. should consider data source options when getting the qualified path in file source v2.

### Why are the changes needed?

We didn't catch these issues in #28760, because the test case is to check error when initializing the file system. If we initialize the file system multiple times during a simple read/write action, the test case actually only test the first time.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

rewrite the test to make sure the entire data source read/write action can succeed.

Closes #28948 from cloud-fan/fix.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Gengliang Wang <[email protected]>
(cherry picked from commit 6edb20d)
Signed-off-by: Gengliang Wang <[email protected]>
@gengliangwang
Copy link
Member

@cloud-fan there are a lot of conflicts when merging this one to branch-2.4. We will have to do it manually.

cloud-fan added a commit to cloud-fan/spark that referenced this pull request Jul 2, 2020
…ctive in data source options

This is a followup of apache#28760 to fix the remaining issues:
1. should consider data source options when refreshing cache by path at the end of `InsertIntoHadoopFsRelationCommand`
2. should consider data source options when inferring schema for file source
3. should consider data source options when getting the qualified path in file source v2.

We didn't catch these issues in apache#28760, because the test case is to check error when initializing the file system. If we initialize the file system multiple times during a simple read/write action, the test case actually only test the first time.

No

rewrite the test to make sure the entire data source read/write action can succeed.

Closes apache#28948 from cloud-fan/fix.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Gengliang Wang <[email protected]>
(cherry picked from commit 6edb20d)
Signed-off-by: Gengliang Wang <[email protected]>
HyukjinKwon pushed a commit that referenced this pull request Jul 2, 2020
… effective in data source options

### What changes were proposed in this pull request?

backport #28948

This is a followup of #28760 to fix the remaining issues:
1. should consider data source options when refreshing cache by path at the end of `InsertIntoHadoopFsRelationCommand`
2. should consider data source options when inferring schema for file source
3. should consider data source options when getting the qualified path in file source v2.

### Why are the changes needed?

We didn't catch these issues in #28760, because the test case is to check error when initializing the file system. If we initialize the file system multiple times during a simple read/write action, the test case actually only test the first time.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

rewrite the test to make sure the entire data source read/write action can succeed.

Closes #28973 from cloud-fan/pick.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants