
Conversation

@gatorsmile (Member) commented May 2, 2016

What changes were proposed in this pull request?

When we load a dataset, if we set the path to `/path/a=1`, we will not take `a` as the partitioning column. However, if we set the path to `/path/a=1/file.parquet`, we take `a` as the partitioning column and it shows up in the schema.

This PR is to fix the behavior inconsistency issue.

The base path contains a set of paths that are considered as the base dirs of the input datasets. The partitioning discovery logic will make sure it will stop when it reaches any base path.

By default, the paths of the dataset provided by users will be base paths. Below are three typical cases:
**Case 1** `sqlContext.read.parquet("/path/something=true/")`: the base path will be
`/path/something=true/`, and the returned DataFrame will not contain a column of `something`.
**Case 2** `sqlContext.read.parquet("/path/something=true/a.parquet")`: the base path will
still be `/path/something=true/`, and the returned DataFrame will also not contain a column of
`something`.
**Case 3** `sqlContext.read.parquet("/path/")`: the base path will be `/path/`, and the returned
DataFrame will have the column of `something`.

Users can also override the base path by setting `basePath` in the options to pass the new base
path to the data source. For example, with
`sqlContext.read.option("basePath", "/path/").parquet("/path/something=true/")`,
the returned DataFrame will have the column of `something`.
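The three cases above can be sketched as pure path logic. This is a hypothetical Python rendering of the partition-discovery idea (Spark's actual implementation lives in Scala's PartitioningUtils); the function name and its behavior here are illustrative only.

```python
from pathlib import PurePosixPath

def discover_partition_columns(leaf_dir: str, base_path: str) -> dict:
    """Walk up from the leaf directory toward the base path, collecting
    key=value directory names as partition columns. Discovery stops as
    soon as the base path is reached."""
    columns = {}
    current = PurePosixPath(leaf_dir)
    base = PurePosixPath(base_path)
    while current != base and current != current.parent:
        name = current.name
        if "=" in name:
            key, _, value = name.partition("=")
            columns[key] = value
        current = current.parent
    return columns

# Cases 1 and 2: the base path is /path/something=true, so discovery
# stops immediately and no partition column is found.
print(discover_partition_columns("/path/something=true", "/path/something=true"))  # {}

# Case 3: the base path is /path, so "something=true" is consumed as a
# partition column on the way up.
print(discover_partition_columns("/path/something=true", "/path"))  # {'something': 'true'}
```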

The related PRs:
- #9651
- #10211

How was this patch tested?

Added a couple of test cases

SparkQA commented May 2, 2016

Test build #57500 has finished for PR 12828 at commit 461441c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member Author)

cc @yhuai

@gatorsmile gatorsmile changed the title [SPARK-14993] Fix Partition Discovery Inconsistency when Input is a Path to Parquet File [SPARK-14993] [SQL] Fix Partition Discovery Inconsistency when Input is a Path to Parquet File May 2, 2016

```diff
-if (basePaths.contains(currentPath)) {
+if (basePaths.contains(currentPath) ||
+    basePaths.exists(_.toString.startsWith(currentPath.toString))) {
```
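The effect of the added `startsWith` clause can be illustrated outside Spark. Below is a hypothetical Python rendering of the old and patched stop conditions (the real code is the Scala predicate above); it shows that when a base path is a file, only the prefix check stops discovery at that file's parent directory.

```python
def should_stop_old(current_path: str, base_paths: set) -> bool:
    # Old condition: stop only on an exact match against a base path.
    return current_path in base_paths

def should_stop_patched(current_path: str, base_paths: set) -> bool:
    # Patched condition: also stop if some base path starts with the
    # current path, i.e. the current dir contains a base-path entry.
    return current_path in base_paths or any(
        p.startswith(current_path) for p in base_paths)

# A file was registered as the base path:
base_paths = {"file://path/a=10/p.parquet"}
print(should_stop_old("file://path/a=10", base_paths))      # False: keeps walking up
print(should_stop_patched("file://path/a=10", base_paths))  # True: stops here
```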
Contributor:

Can you explain this and provide an example?

Contributor:

I see. We are trying to check if there is a basePath that starts with the currentPath.

So, the actual problem is that basePaths in HDFSFileCatalog contains files, right? I discussed it with @tdas. He will have a PR to change basePaths. Let's review his fix together. What do you think?

Member Author:

Sure, please include the test case in
https://github.com/apache/spark/pull/12828/files#diff-cf57fe1c329fb21ac00a8528f049da4aR435

This test case checks three typical cases.


yhuai commented May 3, 2016

@gatorsmile When we call PartitioningUtils.parsePartitions, we should provide a Seq[Path] representing leaf dirs, right? Is this problem caused by the fact that we actually pass leaf files in?


gatorsmile commented May 3, 2016

@yhuai We passed leaf dirs to `path`, but `basePaths` contains a path to a Parquet file. For example,

    parsePartition(
      path = new Path("file://path/a=10"),
      defaultPartitionName = defaultPartitionName,
      typeInference = true,
      basePaths = Set(new Path("file://path/a=10/p.parquet")))

In this case, we need to follow what we did in #9651.

The current behavior is shown in the test case:
https://github.com/apache/spark/pull/12828/files#diff-cf57fe1c329fb21ac00a8528f049da4aR435


tdas commented May 3, 2016

@yhuai and I discussed this, and the substring-match solution seems very hacky.

The real problem is that basePaths should never have files as it does not make sense to have a basePath that is not a directory. So, our strategy in HDFSFileCatalog of making the set of input files as the default basePath is incorrect. The correct fix is to set the default base path based on the [dirs in input paths] UNION [parent dirs of files in input paths].

Here is the fix: fbef90f.
Please update your PR with this. You don't have to change parsePartition in that case.

Consider updating the Scaladocs to make this implicit assumption about basePath clear in the code.
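The strategy described above ([dirs in input paths] UNION [parent dirs of files in input paths]) can be sketched as follows. This is a hypothetical Python rendering, not the Scala code from fbef90f; `leaf_files` stands in for the catalog's set of known files.

```python
from pathlib import PurePosixPath

def default_base_paths(input_paths: list, leaf_files: set) -> set:
    """Directories are kept as-is; files are replaced by their parent
    directory, so the resulting base paths never contain files."""
    return {
        str(PurePosixPath(p).parent) if p in leaf_files else p
        for p in input_paths
    }

paths = ["/path/something=true/a.parquet", "/path/other=1"]
leaf_files = {"/path/something=true/a.parquet"}
print(default_base_paths(paths, leaf_files))
# {'/path/something=true', '/path/other=1'} (a set; order may vary)
```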

@gatorsmile (Member Author)

@tdas Thank you very much! Will do it soon.


SparkQA commented May 3, 2016

Test build #57599 has finished for PR 12828 at commit bf98150.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member Author)

@tdas @yhuai Based on the fix fbef90f, I updated the Scaladocs, test cases, and PR description.

Please let me know if anything is not appropriate. Thanks again!

```scala
Set(userDefinedBasePath.makeQualified(fs.getUri, fs.getWorkingDirectory))

case None =>
  paths.map { path => if (leafFiles.contains(path)) path.getParent else path }.toSet
```
Contributor:

Do we need to make this path qualified?

Contributor:

leafFiles only contain qualified paths, right?

Contributor:

I believe leaf files contain only qualified paths. There were comments elsewhere in the file that said so.

Here it is - https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala#L468

Member Author:

Will make path qualified before comparison. Thanks!


tdas commented May 4, 2016

Just a heads up: I have a PR that refactors FileCatalog significantly, #12879. I want to merge that first, as it will cause conflicts with several PRs, including this one and my PR #12856.

@gatorsmile (Member Author)

Sure, I will wait for it. Thanks for letting me know!


SparkQA commented May 4, 2016

Test build #57711 has finished for PR 12828 at commit 252065c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented May 4, 2016

Test build #57787 has finished for PR 12828 at commit e92e9b2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


yhuai commented May 5, 2016

LGTM. Merging to master and branch 2.0

@asfgit asfgit closed this in ef55e46 May 5, 2016
asfgit pushed a commit that referenced this pull request May 5, 2016
[SPARK-14993] [SQL] Fix Partition Discovery Inconsistency when Input is a Path to Parquet File

#### What changes were proposed in this pull request?
When we load a dataset, if we set the path to ```/path/a=1```, we will not take `a` as the partitioning column. However, if we set the path to ```/path/a=1/file.parquet```, we take `a` as the partitioning column and it shows up in the schema.

This PR is to fix the behavior inconsistency issue.

The base path contains a set of paths that are considered as the base dirs of the input datasets. The partitioning discovery logic will make sure it will stop when it reaches any base path.

By default, the paths of the dataset provided by users will be base paths. Below are three typical cases,
**Case 1**```sqlContext.read.parquet("/path/something=true/")```: the base path will be
`/path/something=true/`, and the returned DataFrame will not contain a column of `something`.
**Case 2**```sqlContext.read.parquet("/path/something=true/a.parquet")```: the base path will be
still `/path/something=true/`, and the returned DataFrame will also not contain a column of
`something`.
**Case 3**```sqlContext.read.parquet("/path/")```: the base path will be `/path/`, and the returned
DataFrame will have the column of `something`.

Users also can override the basePath by setting `basePath` in the options to pass the new base
path to the data source. For example,
```sqlContext.read.option("basePath", "/path/").parquet("/path/something=true/")```,
and the returned DataFrame will have the column of `something`.

The related PRs:
- #9651
- #10211

#### How was this patch tested?
Added a couple of test cases

Author: gatorsmile <[email protected]>
Author: xiaoli <[email protected]>
Author: Xiao Li <[email protected]>

Closes #12828 from gatorsmile/readPartitionedTable.

(cherry picked from commit ef55e46)
Signed-off-by: Yin Huai <[email protected]>