[SPARK-15248][SQL] Make MetastoreFileCatalog consider directories from partition specs of a partitioned metastore table #13022
Conversation
Test build #58224 has finished for PR 13022 at commit

@yhuai Please take a look.
```scala
/** Get the list of non-overlapping paths to list files for a metastore table */
def getPaths(tableBasePath: Path, partitionSpec: PartitionSpec): Seq[Path] = {
  val basePathStr = tableBasePath.toUri.toString
  val partitionsOutsideBasePath = partitionSpec
```
@yin this is the hacky thing. I want to make sure that the seq of paths does not contain both `tableBasePath` as well as `tableBasePath/partition=1`. Having both the parent dir and a sub dir might cause the listing to do duplicate work.

Option 1: Use a substring match to eliminate partition paths that are already covered by the base path. This is what I have done.

Option 2: (just occurred to me, I think this is cleaner)

```scala
if (partitionSpec.partitions.nonEmpty) {
  partitionSpec.partitions.map(_.path)
} else {
  Seq(tableBasePath)
}
```

Which do you think is better? Note that the list of paths determines the match with cached HadoopFsRelation objects.
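For comparison, here is a minimal sketch of how Option 1's substring match could look; the helper name and exact filtering are illustrative, not the PR's final code:

```scala
import org.apache.hadoop.fs.Path

// Hypothetical sketch of Option 1: keep the table base path and add only
// those partition paths that are not already under it (substring match),
// so the listing never covers the same directory twice.
def getPathsBySubstringMatch(tableBasePath: Path, partitionPaths: Seq[Path]): Seq[Path] = {
  val basePathStr = tableBasePath.toUri.toString
  // Partitions located outside the base path must be listed separately.
  val partitionsOutsideBasePath =
    partitionPaths.filterNot(_.toUri.toString.startsWith(basePathStr))
  tableBasePath +: partitionsOutsideBasePath
}
```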
I spent some time analyzing your problem and I agree with your second option. Why am I included in this discussion, anyway? I never contributed to Spark.
Oh, sorry. @tdas was trying to cc me.
Test build #58284 has finished for PR 13022 at commit
```scala
 * @param tableBasePath The default base path of the Hive metastore table
 * @param partitionSpec The partition specifications from Hive metastore
 */
private[hive] class MetaStoreFileCatalog(
```
@yin I am inclined to rename this class to MetastorePartitionedTableFileCatalog, as it is only used for partitioned tables and therefore does not need to worry about what files to return when the table is not partitioned.
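Under that renaming, the path computation could collapse to Option 2 from the earlier comment; a rough sketch (the object name follows the proposed rename, everything else is illustrative):

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.execution.datasources.PartitionSpec

// Hypothetical sketch: for a partitioned metastore table, list exactly the
// directories named in the partition spec; fall back to the table's base
// path only when no partitions have been registered yet.
private[hive] object MetastorePartitionedTableFileCatalog {
  def getPaths(tableBasePath: Path, partitionSpec: PartitionSpec): Seq[Path] = {
    if (partitionSpec.partitions.nonEmpty) {
      partitionSpec.partitions.map(_.path)
    } else {
      Seq(tableBasePath)
    }
  }
}
```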
Test build #58297 has finished for PR 13022 at commit
```scala
case None =>
  // Directory does not exist, or has no child files
  Nil
```
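The stack trace in the description shows an unconditional `Map.apply` (`MapLike$class.apply` → `default` → `NoSuchElementException`) on a directory the catalog never listed; the `case None => Nil` arm above is the defensive half of a lookup that might be sketched like this (names assumed, not the PR's exact code):

```scala
import org.apache.hadoop.fs.{FileStatus, Path}

// Hypothetical sketch: resolve a partition directory to its files without
// throwing when the directory was never listed, e.g. because it does not
// exist or has no child files.
def filesUnder(leafDirToChildrenFiles: Map[Path, Array[FileStatus]], dir: Path): Seq[FileStatus] = {
  leafDirToChildrenFiles.get(dir) match {
    case Some(files) => files.toSeq
    case None =>
      // Directory does not exist, or has no child files
      Nil
  }
}
```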
For the stacktrace in the description, how did we create the path of file:/Users/tdas/Projects/Spark/spark2/target/tmp/spark-b39ad224-c5d1-4966-8981-fb45a2066d61/partition?
I ran the test in this PR without the source changes.
What is your concern?
LGTM. Merging to master and branch 2.0
[SPARK-15248][SQL] Make MetastoreFileCatalog consider directories from partition specs of a partitioned metastore table

Table partitions can be added with locations different from the default warehouse location of a Hive table.

```sql
CREATE TABLE parquetTable (a int) PARTITIONED BY (b int) STORED AS parquet
ALTER TABLE parquetTable ADD PARTITION (b=1) LOCATION '/partition'
```

Querying such a table throws an error because the MetastoreFileCatalog does not list the added partition directory; it only lists the default base location.

```
[info] - SPARK-15248: explicitly added partitions should be readable *** FAILED *** (1 second, 8 milliseconds)
[info]   java.util.NoSuchElementException: key not found: file:/Users/tdas/Projects/Spark/spark2/target/tmp/spark-b39ad224-c5d1-4966-8981-fb45a2066d61/partition
[info]   at scala.collection.MapLike$class.default(MapLike.scala:228)
[info]   at scala.collection.AbstractMap.default(Map.scala:59)
[info]   at scala.collection.MapLike$class.apply(MapLike.scala:141)
[info]   at scala.collection.AbstractMap.apply(Map.scala:59)
[info]   at org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog$$anonfun$listFiles$1.apply(PartitioningAwareFileCatalog.scala:59)
[info]   at org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog$$anonfun$listFiles$1.apply(PartitioningAwareFileCatalog.scala:55)
[info]   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
[info]   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
[info]   at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
[info]   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
[info]   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
[info]   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
[info]   at org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog.listFiles(PartitioningAwareFileCatalog.scala:55)
[info]   at org.apache.spark.sql.execution.datasources.FileSourceStrategy$.apply(FileSourceStrategy.scala:93)
[info]   at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:59)
[info]   at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:59)
[info]   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
[info]   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
[info]   at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:60)
[info]   at org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:55)
[info]   at org.apache.spark.sql.execution.SparkStrategies$SpecialLimits$.apply(SparkStrategies.scala:55)
[info]   at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:59)
[info]   at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:59)
[info]   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
[info]   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
[info]   at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:60)
[info]   at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:77)
[info]   at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:75)
[info]   at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:82)
[info]   at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:82)
[info]   at org.apache.spark.sql.QueryTest.assertEmptyMissingInput(QueryTest.scala:330)
[info]   at org.apache.spark.sql.QueryTest.checkAnswer(QueryTest.scala:146)
[info]   at org.apache.spark.sql.QueryTest.checkAnswer(QueryTest.scala:159)
[info]   at org.apache.spark.sql.hive.ParquetMetastoreSuite$$anonfun$12$$anonfun$apply$mcV$sp$7$$anonfun$apply$mcV$sp$25.apply(parquetSuites.scala:554)
[info]   at org.apache.spark.sql.hive.ParquetMetastoreSuite$$anonfun$12$$anonfun$apply$mcV$sp$7$$anonfun$apply$mcV$sp$25.apply(parquetSuites.scala:535)
[info]   at org.apache.spark.sql.test.SQLTestUtils$class.withTempDir(SQLTestUtils.scala:125)
[info]   at org.apache.spark.sql.hive.ParquetPartitioningTest.withTempDir(parquetSuites.scala:726)
[info]   at org.apache.spark.sql.hive.ParquetMetastoreSuite$$anonfun$12$$anonfun$apply$mcV$sp$7.apply$mcV$sp(parquetSuites.scala:535)
[info]   at org.apache.spark.sql.test.SQLTestUtils$class.withTable(SQLTestUtils.scala:166)
[info]   at org.apache.spark.sql.hive.ParquetPartitioningTest.withTable(parquetSuites.scala:726)
[info]   at org.apache.spark.sql.hive.ParquetMetastoreSuite$$anonfun$12.apply$mcV$sp(parquetSuites.scala:534)
[info]   at org.apache.spark.sql.hive.ParquetMetastoreSuite$$anonfun$12.apply(parquetSuites.scala:534)
[info]   at org.apache.spark.sql.hive.ParquetMetastoreSuite$$anonfun$12.apply(parquetSuites.scala:534)
```

The solution in this PR is to get the paths to list from the partition spec and not rely on the default table path alone.

unit tests.

Author: Tathagata Das <[email protected]>

Closes #13022 from tdas/SPARK-15248.

(cherry picked from commit 81c68ec)

Signed-off-by: Yin Huai <[email protected]>
What changes were proposed in this pull request?
Table partitions can be added with locations different from the default warehouse location of a Hive table.

```sql
CREATE TABLE parquetTable (a int) PARTITIONED BY (b int) STORED AS parquet
ALTER TABLE parquetTable ADD PARTITION (b=1) LOCATION '/path/1'
```

Querying such a table throws an error because the MetastoreFileCatalog does not list the added partition directory; it only lists the default base location.

The solution in this PR is to get the paths to list from the partition spec and not rely on the default table path alone.
How was this patch tested?
unit tests.
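A sketch of the kind of regression test this describes, assuming the `withTable`/`withTempDir`/`checkAnswer` helpers visible in the stack trace and a Hive-enabled session; data values and file layout are illustrative:

```scala
import org.apache.spark.sql.Row

// Hypothetical sketch of the test scenario: a partition registered at a
// location outside the table's default base directory must be readable.
withTable("parquetTable") {
  withTempDir { dir =>
    sql("CREATE TABLE parquetTable (a int) PARTITIONED BY (b int) STORED AS parquet")
    // Write parquet data directly into the external partition directory
    spark.range(3).selectExpr("CAST(id AS INT) AS a")
      .write.mode("overwrite").parquet(dir.getCanonicalPath)
    // Register that directory as partition b=1
    sql(s"ALTER TABLE parquetTable ADD PARTITION (b=1) LOCATION '${dir.toURI}'")
    // Before this fix, the query failed with NoSuchElementException because
    // the file catalog listed only the table's default base location
    checkAnswer(sql("SELECT a FROM parquetTable WHERE b = 1"), (0 until 3).map(Row(_)))
  }
}
```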