-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-14997][SQL] Fixed FileCatalog to return correct set of files when there is no partitioning scheme in the given paths #12856
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| Partition(InternalRow.empty, allFiles().filterNot(_.getPath.getName startsWith "_")) :: Nil | ||
| Partition( | ||
| InternalRow.empty, | ||
| unpartitionedDataFiles().filterNot(_.getPath.getName startsWith "_") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we call allFiles at here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dont know for sure. if there is a partitioning scheme, there is an additional filter on files that start with "_" in listFiles, that does not seem to be present in allFiles. So I am not sure where its best to merge.
Also, I think this way is slightly cleaner than listFiles conditionally depending on allFiles
|
Test build #57580 has finished for PR 12856 at commit
|
| def allFiles(): Seq[FileStatus] = leafFiles.values.toSeq | ||
| def allFiles(): Seq[FileStatus] = { | ||
| if (partitionSpec().partitionColumns.isEmpty) { | ||
| unpartitionedDataFiles() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add some comments at here?
|
Test build #57585 has finished for PR 12856 at commit
|
|
Would this also resolve https://issues.apache.org/jira/browse/SPARK-14463?filter=12335640 ? |
|
(Maybe adding "Closes #12774" in the description?) |
|
@rxin nope this does not fix that. I tested the code on the JIRA you gave on this branch, and did not solve the issue. |
|
Test build #57604 has finished for PR 12856 at commit
|
|
Test build #57618 has finished for PR 12856 at commit
|
|
Test build #57650 has finished for PR 12856 at commit
|
| // 3. The path is a directory, but has no children files. Do not include this path. | ||
|
|
||
| leafDirToChildrenFiles.get(qualifiedPath) | ||
| .orElse { leafFiles.get(path).map(Array(_)) } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When we reach this block, the path is a file that explicitly passed in by the user, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes. though it should probably be qualifiedPath instead of path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am wondering if paths in leafFiles only contain qualified paths?
|
Test build #57795 has finished for PR 12856 at commit
|
|
|
||
| leafDirToChildrenFiles.get(qualifiedPath) | ||
| .orElse { | ||
| leafFiles.get(path).map(Array(_)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we use qualifiedPath instead of path?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
leafFiles contain all qualified paths, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh right. I forgot to address that comment. Sorry!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about we add a test to make sure all paths in leafFiles contain the scheme?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a new FileCatalogSuite
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems we still need to update this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about we add a test that will not pass if we use path at here?
|
|
||
| e match { | ||
| case _: AnalysisException => | ||
| assert(e.getMessage.contains("infer")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where will this error be thrown (the place where we complain that there is no file)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
Test build #57827 has finished for PR 12856 at commit
|
|
Test build #57941 has finished for PR 12856 at commit
|
|
Test build #57951 has finished for PR 12856 at commit
|
|
@yhuai the path to qualifiedPath fix got lost in an incorrect merge. you are right that this is prone to error, so I added a unit test in FileCatalogSuite that fails in if qualified paths are not used. |
|
Test build #57977 has finished for PR 12856 at commit
|
| } | ||
| } | ||
|
|
||
| test("ListingFileCatalog: input paths are converted to qualified paths") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
|
LGTM! Those tests are awesome! |
|
Test build #58017 has finished for PR 12856 at commit
|
|
Test build #58027 has finished for PR 12856 at commit
|
|
Thanks! Merging to master and 2.0. |
…hen there is no partitioning scheme in the given paths
## What changes were proposed in this pull request?
Lets says there are json files in the following directories structure
```
xyz/file0.json
xyz/subdir1/file1.json
xyz/subdir2/file2.json
xyz/subdir1/subsubdir1/file3.json
```
`sqlContext.read.json("xyz")` should read only file0.json according to behavior in Spark 1.6.1. However in current master, all the 4 files are read.
The fix is to make FileCatalog return only the children files of the given path if there is not partitioning detected (instead of all the recursive list of files).
Closes #12774
## How was this patch tested?
unit tests
Author: Tathagata Das <[email protected]>
Closes #12856 from tdas/SPARK-14997.
(cherry picked from commit f7b7ef4)
Signed-off-by: Yin Huai <[email protected]>
|
@yhuai Thanks! |
What changes were proposed in this pull request?
Lets says there are json files in the following directories structure
sqlContext.read.json("xyz")should read only file0.json according to behavior in Spark 1.6.1. However in current master, all the 4 files are read.The fix is to make FileCatalog return only the children files of the given path if there is not partitioning detected (instead of all the recursive list of files).
Closes #12774
How was this patch tested?
unit tests