Conversation

@tdas tdas commented May 3, 2016

What changes were proposed in this pull request?

Let's say there are JSON files in the following directory structure:

```
xyz/file0.json
xyz/subdir1/file1.json
xyz/subdir2/file2.json
xyz/subdir1/subsubdir1/file3.json
```

`sqlContext.read.json("xyz")` should read only file0.json, per the behavior in Spark 1.6.1. However, in the current master all four files are read.

The fix is to make FileCatalog return only the immediate child files of the given path when no partitioning is detected, instead of the full recursive list of files.
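For illustration, the two listing semantics can be sketched with plain Python on a local directory (a hedged sketch of the intended behavior only; the actual fix lives in Scala's FileCatalog):

```python
import os
import tempfile

# Build the directory layout from the description.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "subdir1", "subsubdir1"))
os.makedirs(os.path.join(root, "subdir2"))
for rel in ("file0.json",
            "subdir1/file1.json",
            "subdir2/file2.json",
            "subdir1/subsubdir1/file3.json"):
    open(os.path.join(root, rel), "w").close()

# Recursive listing -- what current master does: all four files.
recursive = [f for _, _, files in os.walk(root) for f in files]

# Children-only listing -- the Spark 1.6.1 behavior the fix restores:
# only files directly under the given path.
children = [e for e in os.listdir(root)
            if os.path.isfile(os.path.join(root, e))]
```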

Closes #12774

How was this patch tested?

unit tests

```scala
// before:
Partition(InternalRow.empty, allFiles().filterNot(_.getPath.getName startsWith "_")) :: Nil

// after:
Partition(
  InternalRow.empty,
  unpartitionedDataFiles().filterNot(_.getPath.getName startsWith "_")
```
Contributor

Can we call allFiles here?

Contributor Author

@tdas tdas May 3, 2016

I don't know for sure. If there is a partitioning scheme, listFiles applies an additional filter on files that start with "_", which does not seem to be present in allFiles. So I am not sure where it's best to merge.

Also, I think this way is slightly cleaner than having listFiles conditionally depend on allFiles.
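For reference, the `_`-prefix filter being discussed skips Spark metadata files such as `_SUCCESS`. A minimal Python sketch of that predicate (file names here are illustrative, not taken from the patch):

```python
import posixpath

statuses = ["xyz/file0.json", "xyz/_SUCCESS", "xyz/_metadata"]

# Equivalent of .filterNot(_.getPath.getName startsWith "_"):
# drop any file whose base name starts with an underscore.
data_files = [p for p in statuses
              if not posixpath.basename(p).startswith("_")]
```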

@SparkQA

SparkQA commented May 3, 2016

Test build #57580 has finished for PR 12856 at commit b1f82ce.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```scala
// before:
def allFiles(): Seq[FileStatus] = leafFiles.values.toSeq

// after:
def allFiles(): Seq[FileStatus] = {
  if (partitionSpec().partitionColumns.isEmpty) {
    unpartitionedDataFiles()
```
Contributor

Maybe add some comments here?

@SparkQA

SparkQA commented May 3, 2016

Test build #57585 has finished for PR 12856 at commit 3bb42bd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented May 3, 2016

@HyukjinKwon
Member

HyukjinKwon commented May 3, 2016

(Maybe add "Closes #12774" to the description?)

@tdas
Contributor Author

tdas commented May 3, 2016

@rxin Nope, this does not fix that. I tested the code from the JIRA you gave on this branch, and it did not solve the issue.

@SparkQA

SparkQA commented May 3, 2016

Test build #57604 has finished for PR 12856 at commit 2f7c523.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tdas tdas changed the title [SPARK-14997][SQL] Fixed FileCatalog to return correct set of files when there is not partitioning scheme in the given paths [SPARK-14997][SQL] Fixed FileCatalog to return correct set of files when there is no partitioning scheme in the given paths May 3, 2016
@SparkQA

SparkQA commented May 3, 2016

Test build #57618 has finished for PR 12856 at commit 4198d56.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 3, 2016

Test build #57650 has finished for PR 12856 at commit f1b793a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```scala
// 3. The path is a directory, but has no children files. Do not include this path.

leafDirToChildrenFiles.get(qualifiedPath)
  .orElse { leafFiles.get(path).map(Array(_)) }
```
Contributor

When we reach this block, the path is a file that was explicitly passed in by the user, right?

Contributor Author

Yes, though it should probably be qualifiedPath instead of path.

Contributor

I am wondering if leafFiles only contains qualified paths?

@SparkQA

SparkQA commented May 4, 2016

Test build #57795 has finished for PR 12856 at commit efd261f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


```scala
leafDirToChildrenFiles.get(qualifiedPath)
  .orElse {
    leafFiles.get(path).map(Array(_))
```
Contributor

Should we use qualifiedPath instead of path?

Contributor

leafFiles contains all qualified paths, right?

Contributor Author

Oh right. I forgot to address that comment. Sorry!
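The mismatch under discussion is easy to see as a plain map lookup: leafFiles is keyed by fully qualified paths (scheme included), so a lookup with the raw user-supplied string misses. A hedged Python sketch (keys and values here are hypothetical, for illustration only):

```python
# leafFiles keyed by qualified paths, as in the catalog under review.
leaf_files = {"file:/tmp/data/a.json": "<FileStatus for a.json>"}

raw_path = "/tmp/data/a.json"        # what the user passed in
qualified_path = "file:" + raw_path  # sketch of a qualified form

unqualified_hit = leaf_files.get(raw_path)      # misses: None
qualified_hit = leaf_files.get(qualified_path)  # finds the entry
```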

Contributor

How about we add a test to make sure all paths in leafFiles contain the scheme?

Contributor Author

Added a new FileCatalogSuite

Contributor

Seems we still need to update this.

Contributor

How about we add a test that will not pass if we use path here?


```scala
e match {
  case _: AnalysisException =>
    assert(e.getMessage.contains("infer"))
```
Contributor

@yhuai yhuai May 5, 2016

Where will this error be thrown (the place where we complain that there is no file)?

Contributor Author

@SparkQA

SparkQA commented May 5, 2016

Test build #57827 has finished for PR 12856 at commit f15ee32.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 6, 2016

Test build #57941 has finished for PR 12856 at commit 9a64496.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class FileCatalogSuite extends SharedSQLContext

@SparkQA

SparkQA commented May 6, 2016

Test build #57951 has finished for PR 12856 at commit 8c06d4e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tdas
Contributor Author

tdas commented May 6, 2016

@yhuai The path-to-qualifiedPath fix got lost in an incorrect merge. You are right that this is error-prone, so I added a unit test in FileCatalogSuite that fails if qualified paths are not used.

@SparkQA

SparkQA commented May 6, 2016

Test build #57977 has finished for PR 12856 at commit 7c5d7ba.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```scala
}
}

test("ListingFileCatalog: input paths are converted to qualified paths") {
```
Contributor

Nice!

@yhuai
Contributor

yhuai commented May 6, 2016

LGTM! Those tests are awesome!

@SparkQA

SparkQA commented May 6, 2016

Test build #58017 has finished for PR 12856 at commit 33a1345.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 6, 2016

Test build #58027 has finished for PR 12856 at commit 8abc999.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai
Contributor

yhuai commented May 6, 2016

Thanks! Merging to master and 2.0.

@asfgit asfgit closed this in f7b7ef4 May 6, 2016
asfgit pushed a commit that referenced this pull request May 6, 2016
…hen there is no partitioning scheme in the given paths

## What changes were proposed in this pull request?
Let's say there are JSON files in the following directory structure:
```
xyz/file0.json
xyz/subdir1/file1.json
xyz/subdir2/file2.json
xyz/subdir1/subsubdir1/file3.json
```
`sqlContext.read.json("xyz")` should read only file0.json, per the behavior in Spark 1.6.1. However, in the current master all four files are read.

The fix is to make FileCatalog return only the immediate child files of the given path when no partitioning is detected, instead of the full recursive list of files.

Closes #12774

## How was this patch tested?

unit tests

Author: Tathagata Das <[email protected]>

Closes #12856 from tdas/SPARK-14997.

(cherry picked from commit f7b7ef4)
Signed-off-by: Yin Huai <[email protected]>
@tdas
Contributor Author

tdas commented May 6, 2016

@yhuai Thanks!
