[SPARK-14997] Files in subdirectories are incorrectly considered in sqlContext.read.json() #12774
Conversation
Can one of the admins verify this patch?
```scala
} else {
  mutable.LinkedHashSet(files: _*) ++ listLeafFiles(dirs.map(_.getPath))
}
mutable.LinkedHashSet(files: _*)
```
Are you sure about the difference between 1.6.1 and master? I see this logic is unchanged compared to interfaces.scala#L467-L472 in branch-1.6.
Also, does this still support reading partitioned tables?
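For context, partition discovery depends on files found below the given root, so a layout like this (hypothetical, for illustration) still has to keep working after any fix:
```
table/year=2015/file1.json
table/year=2016/file2.json
```
Reading `table` should surface `year` as a partition column, which requires listing files in the subdirectories rather than only direct children.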
Also, I believe there is another method in the HadoopFsRelation companion object that lists files in parallel; it is used once the number of paths crosses a threshold. I think that should also be corrected if this is really problematic, and there should be tests for it as well.
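For reference, a minimal sketch of the kind of threshold-based listing being described; the method name, signature, and threshold value here are assumptions for illustration, not Spark's actual internals:
```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, Path}

// Hypothetical sketch: list leaf files under the given paths, switching
// to a parallelized listing once the path count crosses a threshold.
def listLeafFiles(paths: Seq[Path], conf: Configuration, threshold: Int = 32): Seq[FileStatus] = {
  def listOne(p: Path): Seq[FileStatus] = {
    val fs = p.getFileSystem(conf)
    fs.listStatus(p).toSeq.flatMap { s =>
      if (s.isDirectory) listOne(s.getPath) else Seq(s)  // recurse into directories
    }
  }
  if (paths.size < threshold) {
    paths.flatMap(listOne)            // few paths: plain driver-side listing
  } else {
    paths.par.flatMap(listOne).seq    // many paths (Spark distributes this as a job instead)
  }
}
```
Whatever the exact mechanism, both the serial and the parallel listing paths would need the same children-only behavior, which is why tests for both matter.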
|
(I think "(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)" can be removed in the PR description) |
Hello @HyukjinKwon, I am able to reproduce the same issue even in Spark 1.6.1. With two files, one at the top level and one in a subdirectory, the code snippet I executed in Spark 1.6.1 produced output covering both files, so both files are considered. The issue requires further discussion on what approach to follow to solve it.
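A minimal reproduction along those lines might look like the following; the paths and layout are assumptions, since the original snippet is not preserved in this thread:
```scala
// Hypothetical layout (assumed, not the reporter's exact files):
//   xyz/file0.json
//   xyz/subdir1/file1.json
val df = sqlContext.read.json("xyz")  // Spark 1.6.1, sqlContext from the shell
df.show()  // reportedly shows rows from both files, i.e. the subdirectory is read too
```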
IMO, the current behavior is expected. If the documentation is not clear, we should correct the documentation. If we need to change the behavior, we might need to introduce a conf parameter or an external API change to support both.
Hi @gatorsmile,

cc @yhuai

@sbcd90 I don't get your example; it actually shows that only one file is read. In current master, after the refactoring, there is only one code path, which uses FileCatalog and HDFSFileCatalog; it always returns all the files recursively, even when there is no partitioning scheme in the directory structure.
Here is my version of the fix - https://github.com/apache/spark/pull/12856/files |
…hen there is no partitioning scheme in the given paths
## What changes were proposed in this pull request?
Let's say there are JSON files in the following directory structure:
```
xyz/file0.json
xyz/subdir1/file1.json
xyz/subdir2/file2.json
xyz/subdir1/subsubdir1/file3.json
```
`sqlContext.read.json("xyz")` should read only file0.json, per the behavior in Spark 1.6.1. However, in current master, all 4 files are read.
The fix is to make FileCatalog return only the child files of the given path when no partitioning is detected, instead of the full recursive file listing.
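A rough sketch of that filtering idea, assuming fully qualified paths; the names here are illustrative only, not the actual change in #12856:
```scala
import org.apache.hadoop.fs.{FileStatus, Path}

// Illustrative sketch: when no partitioning scheme is detected, keep only
// files whose parent directory is one of the paths the user asked for.
// Assumes all paths are fully qualified so Path equality is meaningful.
def filesToRead(
    userPaths: Seq[Path],
    allLeafFiles: Seq[FileStatus],
    partitionSpecDetected: Boolean): Seq[FileStatus] = {
  if (partitionSpecDetected) {
    allLeafFiles  // partitioned table: the recursive listing is still needed
  } else {
    val roots = userPaths.toSet
    allLeafFiles.filter(f => roots.contains(f.getPath.getParent))  // direct children only
  }
}
```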
Closes #12774
## How was this patch tested?
unit tests
Author: Tathagata Das <[email protected]>
Closes #12856 from tdas/SPARK-14997.
(cherry picked from commit f7b7ef4)
Signed-off-by: Yin Huai <[email protected]>
## What changes were proposed in this pull request?
This PR fixes the issue of "Files in subdirectories are incorrectly considered in sqlContext.read.json()".
An example:
## How was this patch tested?
unit tests
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)