
Conversation

@sbcd90 (Contributor) commented Apr 29, 2016

## What changes were proposed in this pull request?

This PR fixes the issue of "Files in subdirectories are incorrectly considered in sqlContext.read.json()".

An example:

```
xyz/file0.json
xyz/subdir1/file1.json
xyz/subdir2/file2.json
xyz/subdir1/subsubdir1/file3.json
```

`sqlContext.read.json("xyz")` should read only file0.json, matching the behavior in Spark 1.6.1. However, in the current master all 4 files are read.

## How was this patch tested?

unit tests

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

@AmplabJenkins

Can one of the admins verify this patch?

```diff
-      } else {
-        mutable.LinkedHashSet(files: _*) ++ listLeafFiles(dirs.map(_.getPath))
-      }
+      mutable.LinkedHashSet(files: _*)
```
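For context, a minimal self-contained sketch of the two listing strategies this hunk switches between (this uses the plain Hadoop FileSystem API rather than the actual Spark internals; `ListingSketch` and its method names are made up for illustration):

```scala
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}
import scala.collection.mutable

object ListingSketch {
  // Recursive variant: descends into subdirectories, so files such as
  // xyz/subdir1/file1.json end up in the result. This mirrors the
  // pre-fix master behavior (the removed "else" branch above).
  def listLeafFilesRecursively(fs: FileSystem, path: Path): mutable.LinkedHashSet[FileStatus] = {
    val (dirs, files) = fs.listStatus(path).partition(_.isDirectory)
    mutable.LinkedHashSet(files: _*) ++
      dirs.flatMap(d => listLeafFilesRecursively(fs, d.getPath))
  }

  // Non-recursive variant: only the direct children of `path`, which is
  // the behavior this PR proposes for the unpartitioned case.
  def listChildFiles(fs: FileSystem, path: Path): mutable.LinkedHashSet[FileStatus] = {
    mutable.LinkedHashSet(fs.listStatus(path).filterNot(_.isDirectory): _*)
  }
}
```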
@HyukjinKwon (Member) commented Apr 29, 2016

Are you sure about the difference between 1.6.1 and master? I see this logic is unchanged compared to interfaces.scala#L467-L472 in branch-1.6.
Also, does this still support reading partitioned tables?
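To make the partitioning concern concrete, here is a hypothetical partitioned layout (paths invented for illustration) that only works if listing descends into subdirectories:

```scala
// Hypothetical partitioned layout:
//   xyz/date=2016-04-29/part-0.json
//   xyz/date=2016-04-30/part-1.json
// Partition discovery turns the directory names into a "date" column;
// the leaf files live one level below the given path, so they are only
// found if the listing recurses:
sqlContext.read.json("xyz").filter("date = '2016-04-29'").show()
```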

@HyukjinKwon (Member) commented Apr 29, 2016

Also, I believe there is another method in the HadoopFsRelation companion object that lists files in parallel; it is used instead of this one once the number of paths crosses a threshold. If this is really problematic, that method should be corrected as well, and there should be tests for both.
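A rough sketch of the dispatch shape being described, using a local parallel collection as a stand-in for Spark's distributed listing job (the real threshold is governed by `spark.sql.sources.parallelPartitionDiscovery.threshold`; the function name here is invented):

```scala
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

// Illustrative only: below the threshold, list serially on the driver;
// above it, fan the paths out (here via a parallel collection, standing
// in for Spark's cluster-side listing job).
def listLeafFilesWithThreshold(
    fs: FileSystem, paths: Seq[Path], threshold: Int): Seq[FileStatus] = {
  def listOne(p: Path): Seq[FileStatus] = {
    val (dirs, files) = fs.listStatus(p).partition(_.isDirectory)
    files.toSeq ++ dirs.flatMap(d => listOne(d.getPath))
  }
  if (paths.length < threshold) paths.flatMap(listOne)
  else paths.par.flatMap(listOne).seq
}
```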

@HyukjinKwon (Member)

(I think "(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)" can be removed in the PR description)

@sbcd90 (Contributor, Author) commented Apr 30, 2016

Hello @HyukjinKwon, I am able to reproduce the same issue even in Spark 1.6.1. I had two files like this:

/test_spark/join1.json:

```
{"a": 1, "b": 2}
{"a": 2, "b": 4}
{"a": 4, "b": 8}
{"a": 8, "b": 16}
```

/test_spark/subdir/join2.json:

```
{"a": 1, "c": 1}
{"a": 2, "c": 2}
{"a": 3, "c": 3}
{"a": 4, "c": 4}
```

I execute the following code snippet in Spark 1.6.1:

```scala
package org.apache.spark

import org.apache.spark.sql.SQLContext

object TestApp9 extends App {
  val conf = new SparkConf().setAppName("TestApp9").setMaster("local")
  val sc = new SparkContext(conf)
  val sqlContext = new SQLContext(sc)

  sqlContext.read.json("/test_spark").show()
}
```

and the output is:

```
+---+---+----+
|  a|  b|   c|
+---+---+----+
|  1|  2|null|
|  2|  4|null|
|  4|  8|null|
|  8| 16|null|
+---+---+----+
```

So, both files are considered. The issue requires further discussion on what approach to take to solve it.
The cause of the issue is the piece of code I have changed, but I'm unsure what approach to follow to also support partitioned tables.

@gatorsmile (Member) commented May 2, 2016

IMO, the current behavior is expected. If the documentation is not clear, we should correct the documentation.

If we need to change the behavior, we might need to introduce a conf parameter or an external API change to support both.

@HyukjinKwon (Member) commented May 2, 2016

Hi @gatorsmile,
Does that imply we should close this for now and file a JIRA, or send an email to the dev mailing list to discuss this further?

@gatorsmile (Member)

cc @yhuai

@tdas (Contributor) commented May 3, 2016

@sbcd90 I don't get your example. Your example actually shows that only /test_spark/join1.json is considered in Spark 1.6.1. In Spark master this is broken, as both files will be considered. The reason for this bug is that in Spark 1.6.1 there were two code paths, one when partitioning is detected and another when it is not. This meant the non-partitioning case did not consider directories recursively, which is what the behavior should be.

In the current master, after refactoring, there is only one code path, which uses FileCatalog and HDFSFileCatalog; it always returns all the files recursively, even when there is no partitioning scheme in the directory structure.
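A self-contained sketch of that split (all names here are invented; the real logic lives in Spark's partition discovery and FileCatalog): recurse only when a direct child directory looks like a `col=value` partition directory:

```scala
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

def filesToRead(fs: FileSystem, root: Path): Seq[FileStatus] = {
  def recurse(p: Path): Seq[FileStatus] = {
    val (dirs, files) = fs.listStatus(p).partition(_.isDirectory)
    files.toSeq ++ dirs.flatMap(d => recurse(d.getPath))
  }
  val (dirs, files) = fs.listStatus(root).partition(_.isDirectory)
  // Crude stand-in for real partition discovery: a "col=value" child dir.
  val looksPartitioned = dirs.exists(_.getPath.getName.contains("="))
  if (looksPartitioned) recurse(root) // partitioned: descend into value dirs
  else files.toSeq                    // unpartitioned: direct children only
}
```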

@tdas (Contributor) commented May 3, 2016

Here is my version of the fix - https://github.com/apache/spark/pull/12856/files

@asfgit closed this in f7b7ef4 on May 6, 2016
asfgit pushed a commit that referenced this pull request on May 6, 2016:
…hen there is no partitioning scheme in the given paths

## What changes were proposed in this pull request?
Let's say there are JSON files in the following directory structure:
```
xyz/file0.json
xyz/subdir1/file1.json
xyz/subdir2/file2.json
xyz/subdir1/subsubdir1/file3.json
```
`sqlContext.read.json("xyz")` should read only file0.json, matching the behavior in Spark 1.6.1. However, in the current master all 4 files are read.

The fix is to make FileCatalog return only the direct child files of the given path if no partitioning is detected (instead of the full recursive list of files).

Closes #12774

## How was this patch tested?

unit tests

Author: Tathagata Das <[email protected]>

Closes #12856 from tdas/SPARK-14997.

(cherry picked from commit f7b7ef4)
Signed-off-by: Yin Huai <[email protected]>
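For reference, a hedged sketch of the shape of that fix (class and field names here are illustrative, not the actual Spark 2.0 source): the catalog still builds its recursive leaf-file index, but `allFiles()` exposes only the direct children of the user-given paths unless partition columns were discovered:

```scala
import org.apache.hadoop.fs.{FileStatus, Path}

class FileCatalogSketch(
    paths: Seq[Path],
    leafDirToChildrenFiles: Map[Path, Array[FileStatus]],
    partitionColumns: Seq[String]) {

  def allFiles(): Seq[FileStatus] = {
    if (partitionColumns.isEmpty) {
      // No partitioning scheme detected: only direct children of each
      // user-given path are visible to the read.
      paths.flatMap(p => leafDirToChildrenFiles.getOrElse(p, Array.empty[FileStatus]))
    } else {
      // Partitioned layout: every leaf file found by the recursive listing.
      leafDirToChildrenFiles.values.flatten.toSeq
    }
  }
}
```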