
Conversation

@gatorsmile (Member) commented May 2, 2016

What changes were proposed in this pull request?

When we load a dataset, if we set the path to `/path/a=1`, we will not take `a` as the partitioning column. However, if we set the path to `/path/a=1/file.parquet`, we take `a` as the partitioning column and it shows up in the schema.

This PR is to fix the behavior inconsistency issue.

The base path contains a set of paths that are considered as the base dirs of the input datasets. The partitioning discovery logic will make sure it will stop when it reaches any base path.

By default, the paths of the dataset provided by users will be base paths. Below are three typical cases:
**Case 1** `sqlContext.read.parquet("/path/something=true/")`: the base path will be
`/path/something=true/`, and the returned DataFrame will not contain a column of `something`.
**Case 2** `sqlContext.read.parquet("/path/something=true/a.parquet")`: the base path will
still be `/path/something=true/`, and the returned DataFrame will also not contain a column of
`something`.
**Case 3** `sqlContext.read.parquet("/path/")`: the base path will be `/path/`, and the returned
DataFrame will have the column of `something`.

Users can also override the base path by setting `basePath` in the options to pass the new base
path to the data source. For example, with
`sqlContext.read.option("basePath", "/path/").parquet("/path/something=true/")`,
the returned DataFrame will have the column of `something`.
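The three cases above can be sketched as pure path logic. This is a hypothetical Python rendering of the partition-discovery idea (Spark's actual implementation lives in Scala's PartitioningUtils); the function name and its behavior here are illustrative only.

```python
from pathlib import PurePosixPath

def discover_partition_columns(leaf_dir: str, base_path: str) -> dict:
    """Walk up from the leaf directory toward the base path, collecting
    key=value directory names as partition columns. Discovery stops as
    soon as the base path is reached."""
    columns = {}
    current = PurePosixPath(leaf_dir)
    base = PurePosixPath(base_path)
    while current != base and current != current.parent:
        name = current.name
        if "=" in name:
            key, _, value = name.partition("=")
            columns[key] = value
        current = current.parent
    return columns

# Cases 1 and 2: the base path is /path/something=true, so discovery
# stops immediately and no partition column is found.
print(discover_partition_columns("/path/something=true", "/path/something=true"))  # {}

# Case 3: the base path is /path, so "something=true" is consumed as a
# partition column on the way up.
print(discover_partition_columns("/path/something=true", "/path"))  # {'something': 'true'}
```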

The related PRs:
- #9651
- #10211

How was this patch tested?

Added a couple of test cases

SparkQA commented May 2, 2016

Test build #57500 has finished for PR 12828 at commit 461441c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member Author)

cc @yhuai

@gatorsmile gatorsmile changed the title [SPARK-14993] Fix Partition Discovery Inconsistency when Input is a Path to Parquet File [SPARK-14993] [SQL] Fix Partition Discovery Inconsistency when Input is a Path to Parquet File May 2, 2016

```diff
-if (basePaths.contains(currentPath)) {
+if (basePaths.contains(currentPath) ||
+    basePaths.exists(_.toString.startsWith(currentPath.toString))) {
```
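The effect of the added `startsWith` clause can be illustrated outside Spark. Below is a hypothetical Python rendering of the old and patched stop conditions (the real code is the Scala predicate above); it shows that when a base path is a file, only the prefix check stops discovery at that file's parent directory.

```python
def should_stop_old(current_path: str, base_paths: set) -> bool:
    # Old condition: stop only on an exact match against a base path.
    return current_path in base_paths

def should_stop_patched(current_path: str, base_paths: set) -> bool:
    # Patched condition: also stop if some base path starts with the
    # current path, i.e. the current dir contains a base-path entry.
    return current_path in base_paths or any(
        p.startswith(current_path) for p in base_paths)

# A file was registered as the base path:
base_paths = {"file://path/a=10/p.parquet"}
print(should_stop_old("file://path/a=10", base_paths))      # False: keeps walking up
print(should_stop_patched("file://path/a=10", base_paths))  # True: stops here
```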
Contributor:

Can you explain this and provide an example?

Contributor:

I see. We are trying to check if there is a basePath that starts with the currentPath.

So, the actual problem is that basePaths in HDFSFileCatalog contains files, right? I discussed it with @tdas. He will have a PR to change basePaths. Let's review his fix together. What do you think?

Member Author:

Sure, please include the test case in
https://github.com/apache/spark/pull/12828/files#diff-cf57fe1c329fb21ac00a8528f049da4aR435

This test case checks three typical cases.


yhuai commented May 3, 2016

@gatorsmile When we call PartitioningUtils.parsePartitions, we should provide a Seq[Path] representing leaf dirs, right? Is this problem caused by the fact that we actually pass leaf files in?


gatorsmile commented May 3, 2016

@yhuai We passed leaf dirs to `path`, but `basePaths` contains a path to a Parquet file. For example,

    parsePartition(
      path = new Path("file://path/a=10"),
      defaultPartitionName = defaultPartitionName,
      typeInference = true,
      basePaths = Set(new Path("file://path/a=10/p.parquet")))

In this case, we need to follow what we did in #9651.

The current behavior is shown in the test case:
https://github.com/apache/spark/pull/12828/files#diff-cf57fe1c329fb21ac00a8528f049da4aR435


tdas commented May 3, 2016

@yhuai and I discussed this, and the substring-match solution seems very hacky.

The real problem is that basePaths should never have files as it does not make sense to have a basePath that is not a directory. So, our strategy in HDFSFileCatalog of making the set of input files as the default basePath is incorrect. The correct fix is to set the default base path based on the [dirs in input paths] UNION [parent dirs of files in input paths].

Here is the fix: fbef90f.
Please update your PR with this. You don't have to change parsePartition in that case.

Consider updating the Scaladocs to make this implicit assumption about basePath clear in the code.
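The strategy described above ([dirs in input paths] UNION [parent dirs of files in input paths]) can be sketched as follows. This is a hypothetical Python rendering, not the Scala code from fbef90f; `leaf_files` stands in for the catalog's set of known files.

```python
from pathlib import PurePosixPath

def default_base_paths(input_paths: list, leaf_files: set) -> set:
    """Directories are kept as-is; files are replaced by their parent
    directory, so the resulting base paths never contain files."""
    return {
        str(PurePosixPath(p).parent) if p in leaf_files else p
        for p in input_paths
    }

paths = ["/path/something=true/a.parquet", "/path/other=1"]
leaf_files = {"/path/something=true/a.parquet"}
print(default_base_paths(paths, leaf_files))
# {'/path/something=true', '/path/other=1'} (a set; order may vary)
```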

@gatorsmile (Member Author)

@tdas Thank you very much! Will do it soon.


SparkQA commented May 3, 2016

Test build #57599 has finished for PR 12828 at commit bf98150.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member Author)

@tdas @yhuai Based on the fix fbef90f, I updated the Scaladocs, test cases, and PR description.

Please let me know if anything is not appropriate. Thanks again!

```scala
Set(userDefinedBasePath.makeQualified(fs.getUri, fs.getWorkingDirectory))

case None =>
  paths.map { path => if (leafFiles.contains(path)) path.getParent else path }.toSet
```
Contributor:

Do we need to make this path qualified?

Contributor:

leafFiles only contain qualified paths, right?

Contributor:

I believe leaf files contain only qualified paths. There were comments elsewhere in the file that said so.

Here it is - https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala#L468

Member Author:

Will make path qualified before comparison. Thanks!


tdas commented May 4, 2016

Just a heads up: I have a PR that refactors FileCatalog significantly, #12879. I want to merge that first, as it will cause conflicts with several PRs, including this one and my PR #12856.

@gatorsmile (Member Author)

Sure, I will wait for it. Thanks for letting me know!


SparkQA commented May 4, 2016

Test build #57711 has finished for PR 12828 at commit 252065c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented May 4, 2016

Test build #57787 has finished for PR 12828 at commit e92e9b2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


yhuai commented May 5, 2016

LGTM. Merging to master and branch 2.0

@asfgit asfgit closed this in ef55e46 May 5, 2016
asfgit pushed a commit that referenced this pull request May 5, 2016
[SPARK-14993] [SQL] Fix Partition Discovery Inconsistency when Input is a Path to Parquet File

#### What changes were proposed in this pull request?
When we load a dataset, if we set the path to ```/path/a=1```, we will not take `a` as the partitioning column. However, if we set the path to ```/path/a=1/file.parquet```, we take `a` as the partitioning column and it shows up in the schema.

This PR is to fix the behavior inconsistency issue.

The base path contains a set of paths that are considered as the base dirs of the input datasets. The partitioning discovery logic will make sure it will stop when it reaches any base path.

By default, the paths of the dataset provided by users will be base paths. Below are three typical cases,
**Case 1**```sqlContext.read.parquet("/path/something=true/")```: the base path will be
`/path/something=true/`, and the returned DataFrame will not contain a column of `something`.
**Case 2**```sqlContext.read.parquet("/path/something=true/a.parquet")```: the base path will be
still `/path/something=true/`, and the returned DataFrame will also not contain a column of
`something`.
**Case 3**```sqlContext.read.parquet("/path/")```: the base path will be `/path/`, and the returned
DataFrame will have the column of `something`.

Users also can override the basePath by setting `basePath` in the options to pass the new base
path to the data source. For example,
```sqlContext.read.option("basePath", "/path/").parquet("/path/something=true/")```,
and the returned DataFrame will have the column of `something`.

The related PRs:
- #9651
- #10211

#### How was this patch tested?
Added a couple of test cases

Author: gatorsmile <[email protected]>
Author: xiaoli <[email protected]>
Author: Xiao Li <[email protected]>

Closes #12828 from gatorsmile/readPartitionedTable.

(cherry picked from commit ef55e46)
Signed-off-by: Yin Huai <[email protected]>