
Conversation

@sbcd90 (Contributor) commented Apr 29, 2016

## What changes were proposed in this pull request?

This PR fixes the issue of "Files in subdirectories are incorrectly considered in sqlContext.read.json()".

An example:

```
xyz/file0.json
xyz/subdir1/file1.json
xyz/subdir2/file2.json
xyz/subdir1/subsubdir1/file3.json
```

`sqlContext.read.json("xyz")` should read only file0.json, matching the behavior in Spark 1.6.1. However, in the current master all 4 files are read.

## How was this patch tested?

unit tests

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

@AmplabJenkins

Can one of the admins verify this patch?

```diff
-      } else {
-        mutable.LinkedHashSet(files: _*) ++ listLeafFiles(dirs.map(_.getPath))
-      }
+      mutable.LinkedHashSet(files: _*)
```
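For context, a minimal self-contained sketch of the two listing strategies this hunk switches between (this uses the plain Hadoop FileSystem API rather than the actual Spark internals; `ListingSketch` and its method names are made up for illustration):

```scala
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}
import scala.collection.mutable

object ListingSketch {
  // Recursive variant: descends into subdirectories, so files such as
  // xyz/subdir1/file1.json end up in the result. This mirrors the
  // pre-fix master behavior (the removed "else" branch above).
  def listLeafFilesRecursively(fs: FileSystem, path: Path): mutable.LinkedHashSet[FileStatus] = {
    val (dirs, files) = fs.listStatus(path).partition(_.isDirectory)
    mutable.LinkedHashSet(files: _*) ++
      dirs.flatMap(d => listLeafFilesRecursively(fs, d.getPath))
  }

  // Non-recursive variant: only the direct children of `path`, which is
  // the behavior this PR proposes for the unpartitioned case.
  def listChildFiles(fs: FileSystem, path: Path): mutable.LinkedHashSet[FileStatus] = {
    mutable.LinkedHashSet(fs.listStatus(path).filterNot(_.isDirectory): _*)
  }
}
```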
@HyukjinKwon (Member) commented Apr 29, 2016

Are you sure about the difference between 1.6.1 and master? I see this logic is unchanged compared to interfaces.scala#L467-L472 in branch-1.6.
Also, does this still support reading partitioned tables?
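To make the partitioning concern concrete, here is a hypothetical partitioned layout (paths invented for illustration) that only works if listing descends into subdirectories:

```scala
// Hypothetical partitioned layout:
//   xyz/date=2016-04-29/part-0.json
//   xyz/date=2016-04-30/part-1.json
// Partition discovery turns the directory names into a "date" column;
// the leaf files live one level below the given path, so they are only
// found if the listing recurses:
sqlContext.read.json("xyz").filter("date = '2016-04-29'").show()
```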

@HyukjinKwon (Member) commented Apr 29, 2016

Also, I believe there is another method in the HadoopFsRelation companion object that lists files in parallel; it is used instead of this one once the number of paths crosses a threshold. If this is really problematic, that method should be corrected as well, and there should be tests for both.
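A rough sketch of the dispatch shape being described, using a local parallel collection as a stand-in for Spark's distributed listing job (the real threshold is governed by `spark.sql.sources.parallelPartitionDiscovery.threshold`; the function name here is invented):

```scala
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

// Illustrative only: below the threshold, list serially on the driver;
// above it, fan the paths out (here via a parallel collection, standing
// in for Spark's cluster-side listing job).
def listLeafFilesWithThreshold(
    fs: FileSystem, paths: Seq[Path], threshold: Int): Seq[FileStatus] = {
  def listOne(p: Path): Seq[FileStatus] = {
    val (dirs, files) = fs.listStatus(p).partition(_.isDirectory)
    files.toSeq ++ dirs.flatMap(d => listOne(d.getPath))
  }
  if (paths.length < threshold) paths.flatMap(listOne)
  else paths.par.flatMap(listOne).seq
}
```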

@HyukjinKwon (Member)

(I think "(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)" can be removed in the PR description)

@sbcd90 (Contributor, Author) commented Apr 30, 2016

Hello @HyukjinKwon, I am able to reproduce the same issue even in Spark 1.6.1. I had two files like this:

/test_spark/join1.json:

```
{"a": 1, "b": 2}
{"a": 2, "b": 4}
{"a": 4, "b": 8}
{"a": 8, "b": 16}
```

/test_spark/subdir/join2.json:

```
{"a": 1, "c": 1}
{"a": 2, "c": 2}
{"a": 3, "c": 3}
{"a": 4, "c": 4}
```

I execute the following code snippet in Spark 1.6.1:

```scala
package org.apache.spark

import org.apache.spark.sql.SQLContext

object TestApp9 extends App {
  val conf = new SparkConf().setAppName("TestApp9").setMaster("local")
  val sc = new SparkContext(conf)
  val sqlContext = new SQLContext(sc)

  sqlContext.read.json("/test_spark").show()
}
```

and the output is:

```
+---+---+----+
|  a|  b|   c|
+---+---+----+
|  1|  2|null|
|  2|  4|null|
|  4|  8|null|
|  8| 16|null|
+---+---+----+
```

So, both files are considered. The issue requires further discussion on what approach to take to solve it.
The cause of the issue is the piece of code I have changed, but I'm unsure what approach to follow to also support partitioned tables.

@gatorsmile (Member) commented May 2, 2016

IMO, the current behavior is expected. If the documentation is not clear, we should correct the documentation.

If we need to change the behavior, we might need to introduce a conf parameter or an external API change to support both.

@HyukjinKwon (Member) commented May 2, 2016

Hi @gatorsmile,
Does that imply we should close this for now and file a JIRA, or send an email to the dev mailing list to discuss this further?

@gatorsmile (Member)

cc @yhuai

@tdas (Contributor) commented May 3, 2016

@sbcd90 I don't get your example. Your example actually shows that only /test_spark/join1.json is considered in Spark 1.6.1. In Spark master this is broken, as both files will be considered. The reason for this bug is that in Spark 1.6.1 there were two code paths, one when partitioning is detected and another when it is not. This meant the non-partitioning case did not consider directories recursively, which is what the behavior should be.

In the current master, after refactoring, there is only one code path, which uses FileCatalog and HDFSFileCatalog; it always returns all the files recursively, even when there is no partitioning scheme in the directory structure.
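A self-contained sketch of that split (all names here are invented; the real logic lives in Spark's partition discovery and FileCatalog): recurse only when a direct child directory looks like a `col=value` partition directory:

```scala
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

def filesToRead(fs: FileSystem, root: Path): Seq[FileStatus] = {
  def recurse(p: Path): Seq[FileStatus] = {
    val (dirs, files) = fs.listStatus(p).partition(_.isDirectory)
    files.toSeq ++ dirs.flatMap(d => recurse(d.getPath))
  }
  val (dirs, files) = fs.listStatus(root).partition(_.isDirectory)
  // Crude stand-in for real partition discovery: a "col=value" child dir.
  val looksPartitioned = dirs.exists(_.getPath.getName.contains("="))
  if (looksPartitioned) recurse(root) // partitioned: descend into value dirs
  else files.toSeq                    // unpartitioned: direct children only
}
```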

@tdas (Contributor) commented May 3, 2016

Here is my version of the fix - https://github.com/apache/spark/pull/12856/files

@asfgit closed this in f7b7ef4 on May 6, 2016
asfgit pushed a commit that referenced this pull request on May 6, 2016:
…hen there is no partitioning scheme in the given paths

## What changes were proposed in this pull request?
Let's say there are JSON files in the following directory structure:
```
xyz/file0.json
xyz/subdir1/file1.json
xyz/subdir2/file2.json
xyz/subdir1/subsubdir1/file3.json
```
`sqlContext.read.json("xyz")` should read only file0.json, matching the behavior in Spark 1.6.1. However, in the current master all 4 files are read.

The fix is to make FileCatalog return only the direct child files of the given path if no partitioning is detected (instead of the full recursive list of files).

Closes #12774

## How was this patch tested?

unit tests

Author: Tathagata Das <[email protected]>

Closes #12856 from tdas/SPARK-14997.

(cherry picked from commit f7b7ef4)
Signed-off-by: Yin Huai <[email protected]>
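For reference, a hedged sketch of the shape of that fix (class and field names here are illustrative, not the actual Spark 2.0 source): the catalog still builds its recursive leaf-file index, but `allFiles()` exposes only the direct children of the user-given paths unless partition columns were discovered:

```scala
import org.apache.hadoop.fs.{FileStatus, Path}

class FileCatalogSketch(
    paths: Seq[Path],
    leafDirToChildrenFiles: Map[Path, Array[FileStatus]],
    partitionColumns: Seq[String]) {

  def allFiles(): Seq[FileStatus] = {
    if (partitionColumns.isEmpty) {
      // No partitioning scheme detected: only direct children of each
      // user-given path are visible to the read.
      paths.flatMap(p => leafDirToChildrenFiles.getOrElse(p, Array.empty[FileStatus]))
    } else {
      // Partitioned layout: every leaf file found by the recursive listing.
      leafDirToChildrenFiles.values.flatten.toSeq
    }
  }
}
```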