Conversation

@tdas tdas commented May 3, 2016

What changes were proposed in this pull request?

Let's say there are JSON files in the following directory structure:

```
xyz/file0.json
xyz/subdir1/file1.json
xyz/subdir2/file2.json
xyz/subdir1/subsubdir1/file3.json
```

`sqlContext.read.json("xyz")` should read only file0.json, per the behavior in Spark 1.6.1. However, in the current master all four files are read.

The fix is to make FileCatalog return only the immediate child files of the given path when no partitioning is detected, instead of the full recursive list of files.
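For illustration, the two listing semantics can be sketched with plain Python on a local directory (a hedged sketch of the intended behavior only; the actual fix lives in Scala's FileCatalog):

```python
import os
import tempfile

# Build the directory layout from the description.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "subdir1", "subsubdir1"))
os.makedirs(os.path.join(root, "subdir2"))
for rel in ("file0.json",
            "subdir1/file1.json",
            "subdir2/file2.json",
            "subdir1/subsubdir1/file3.json"):
    open(os.path.join(root, rel), "w").close()

# Recursive listing -- what current master does: all four files.
recursive = [f for _, _, files in os.walk(root) for f in files]

# Children-only listing -- the Spark 1.6.1 behavior the fix restores:
# only files directly under the given path.
children = [e for e in os.listdir(root)
            if os.path.isfile(os.path.join(root, e))]
```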

Closes #12774

How was this patch tested?

unit tests

```scala
// before:
Partition(InternalRow.empty, allFiles().filterNot(_.getPath.getName startsWith "_")) :: Nil

// after:
Partition(
  InternalRow.empty,
  unpartitionedDataFiles().filterNot(_.getPath.getName startsWith "_")
```
Contributor

Can we call allFiles here?

Contributor Author

@tdas tdas May 3, 2016

I don't know for sure. If there is a partitioning scheme, listFiles applies an additional filter on files that start with "_", which does not seem to be present in allFiles. So I am not sure where it's best to merge.

Also, I think this way is slightly cleaner than having listFiles conditionally depend on allFiles.
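For reference, the `_`-prefix filter being discussed skips Spark metadata files such as `_SUCCESS`. A minimal Python sketch of that predicate (file names here are illustrative, not taken from the patch):

```python
import posixpath

statuses = ["xyz/file0.json", "xyz/_SUCCESS", "xyz/_metadata"]

# Equivalent of .filterNot(_.getPath.getName startsWith "_"):
# drop any file whose base name starts with an underscore.
data_files = [p for p in statuses
              if not posixpath.basename(p).startswith("_")]
```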

@SparkQA

SparkQA commented May 3, 2016

Test build #57580 has finished for PR 12856 at commit b1f82ce.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```scala
// before:
def allFiles(): Seq[FileStatus] = leafFiles.values.toSeq

// after:
def allFiles(): Seq[FileStatus] = {
  if (partitionSpec().partitionColumns.isEmpty) {
    unpartitionedDataFiles()
```
Contributor

Maybe add some comments here?

@SparkQA

SparkQA commented May 3, 2016

Test build #57585 has finished for PR 12856 at commit 3bb42bd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented May 3, 2016

@HyukjinKwon
Member

HyukjinKwon commented May 3, 2016

(Maybe add "Closes #12774" to the description?)

@tdas
Contributor Author

tdas commented May 3, 2016

@rxin Nope, this does not fix that. I tested the code from the JIRA you gave on this branch, and it did not solve the issue.

@SparkQA

SparkQA commented May 3, 2016

Test build #57604 has finished for PR 12856 at commit 2f7c523.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tdas tdas changed the title [SPARK-14997][SQL] Fixed FileCatalog to return correct set of files when there is not partitioning scheme in the given paths [SPARK-14997][SQL] Fixed FileCatalog to return correct set of files when there is no partitioning scheme in the given paths May 3, 2016
@SparkQA

SparkQA commented May 3, 2016

Test build #57618 has finished for PR 12856 at commit 4198d56.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 3, 2016

Test build #57650 has finished for PR 12856 at commit f1b793a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```scala
// 3. The path is a directory, but has no children files. Do not include this path.

leafDirToChildrenFiles.get(qualifiedPath)
  .orElse { leafFiles.get(path).map(Array(_)) }
```
Contributor

When we reach this block, the path is a file that was explicitly passed in by the user, right?

Contributor Author

Yes, though it should probably be qualifiedPath instead of path.

Contributor

I am wondering if leafFiles only contains qualified paths?

@SparkQA

SparkQA commented May 4, 2016

Test build #57795 has finished for PR 12856 at commit efd261f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


```scala
leafDirToChildrenFiles.get(qualifiedPath)
  .orElse {
    leafFiles.get(path).map(Array(_))
```
Contributor

Should we use qualifiedPath instead of path?

Contributor

leafFiles contains all qualified paths, right?

Contributor Author

Oh right. I forgot to address that comment. Sorry!
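The mismatch under discussion is easy to see as a plain map lookup: leafFiles is keyed by fully qualified paths (scheme included), so a lookup with the raw user-supplied string misses. A hedged Python sketch (keys and values here are hypothetical, for illustration only):

```python
# leafFiles keyed by qualified paths, as in the catalog under review.
leaf_files = {"file:/tmp/data/a.json": "<FileStatus for a.json>"}

raw_path = "/tmp/data/a.json"        # what the user passed in
qualified_path = "file:" + raw_path  # sketch of a qualified form

unqualified_hit = leaf_files.get(raw_path)      # misses: None
qualified_hit = leaf_files.get(qualified_path)  # finds the entry
```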

Contributor

How about we add a test to make sure all paths in leafFiles contain the scheme?

Contributor Author

Added a new FileCatalogSuite

Contributor

Seems we still need to update this.

Contributor

How about we add a test that will not pass if we use path here?


```scala
e match {
  case _: AnalysisException =>
    assert(e.getMessage.contains("infer"))
```
Contributor

@yhuai yhuai May 5, 2016

Where will this error be thrown (the place where we complain that there is no file)?

Contributor Author

@SparkQA

SparkQA commented May 5, 2016

Test build #57827 has finished for PR 12856 at commit f15ee32.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 6, 2016

Test build #57941 has finished for PR 12856 at commit 9a64496.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class FileCatalogSuite extends SharedSQLContext

@SparkQA

SparkQA commented May 6, 2016

Test build #57951 has finished for PR 12856 at commit 8c06d4e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tdas
Contributor Author

tdas commented May 6, 2016

@yhuai The path-to-qualifiedPath fix got lost in an incorrect merge. You are right that this is error-prone, so I added a unit test in FileCatalogSuite that fails if qualified paths are not used.

@SparkQA

SparkQA commented May 6, 2016

Test build #57977 has finished for PR 12856 at commit 7c5d7ba.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```scala
}
}

test("ListingFileCatalog: input paths are converted to qualified paths") {
```
Contributor

Nice!

@yhuai
Contributor

yhuai commented May 6, 2016

LGTM! Those tests are awesome!

@SparkQA

SparkQA commented May 6, 2016

Test build #58017 has finished for PR 12856 at commit 33a1345.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 6, 2016

Test build #58027 has finished for PR 12856 at commit 8abc999.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai
Contributor

yhuai commented May 6, 2016

Thanks! Merging to master and 2.0.

@asfgit asfgit closed this in f7b7ef4 May 6, 2016
asfgit pushed a commit that referenced this pull request May 6, 2016
…hen there is no partitioning scheme in the given paths

## What changes were proposed in this pull request?
Let's say there are JSON files in the following directory structure:
```
xyz/file0.json
xyz/subdir1/file1.json
xyz/subdir2/file2.json
xyz/subdir1/subsubdir1/file3.json
```
`sqlContext.read.json("xyz")` should read only file0.json, per the behavior in Spark 1.6.1. However, in the current master all four files are read.

The fix is to make FileCatalog return only the immediate child files of the given path when no partitioning is detected, instead of the full recursive list of files.

Closes #12774

## How was this patch tested?

unit tests

Author: Tathagata Das <[email protected]>

Closes #12856 from tdas/SPARK-14997.

(cherry picked from commit f7b7ef4)
Signed-off-by: Yin Huai <[email protected]>
@tdas
Contributor Author

tdas commented May 6, 2016

@yhuai Thanks!
