
Conversation

@rdblue (Contributor) commented Apr 7, 2016

What changes were proposed in this pull request?

This detects a relation's partitioning and adds checks to the analyzer.
If an InsertIntoTable node has no partitioning, it is replaced by the
relation's partition scheme and input columns are correctly adjusted,
placing the partition columns at the end in partition order. If an
InsertIntoTable node has partitioning, it is checked against the table's
reported partitions.

These changes required adding a PartitionedRelation trait to the catalog
interface because Hive's MetastoreRelation doesn't extend
CatalogRelation.

This commit also includes a fix to InsertIntoTable's resolved logic,
which now detects that all expected columns are present, including
dynamic partition columns. Previously, the number of expected columns
was not checked and resolved was true if there were missing columns.
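As a rough illustration of the behavior described above, here is a minimal, self-contained Scala sketch. The types Column, Table, and Insert are placeholders for this sketch only, not the classes touched by this patch:

// Placeholder types for illustration only.
case class Column(name: String)
case class Table(output: Seq[Column], partitionColumnNames: Seq[String])
case class Insert(
    table: Table,
    partition: Map[String, Option[String]],  // partition name -> static value; None means dynamic
    query: Seq[Column])

def adjustPartitioning(insert: Insert): Insert = {
  if (insert.partition.isEmpty && insert.table.partitionColumnNames.nonEmpty) {
    // No partitioning on the insert: adopt the table's partition scheme as dynamic
    // partitions and move the partition columns to the end, in partition order.
    val partNames = insert.table.partitionColumnNames
    val (partCols, dataCols) = insert.query.partition(c => partNames.contains(c.name))
    val inPartitionOrder = partNames.flatMap(name => partCols.find(_.name == name))
    insert.copy(
      partition = partNames.map(name => name -> (None: Option[String])).toMap,
      query = dataCols ++ inPartitionOrder)
  } else {
    // Partitioning was specified: the analyzer checks it against the table's
    // reported partitions instead (check elided in this sketch).
    insert
  }
}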

How was this patch tested?

This adds new tests to the InsertIntoTableSuite that are fixed by this PR.

@rdblue (Contributor, Author) commented Apr 7, 2016

@liancheng, can you look at this? It looks like you're familiar with the SQL/Hive code.

rdblue changed the title from "SPARK-14459: Detect relation partitioning and adjust the logical plan" to "[SPARK-14459] [SQL] Detect relation partitioning and adjust the logical plan" on Apr 7, 2016
@marmbrus (Contributor) commented Apr 7, 2016

ok to test

@SparkQA commented Apr 7, 2016

Test build #55231 has finished for PR 12239 at commit 5e10c7c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue (Contributor, Author) commented Apr 8, 2016

@marmbrus, I fixed the failing test. The problem was a query that didn't supply a value for one of the partitions. This commit actually gives a better error message, but it is an AnalysisException rather than a SparkException. Thanks for taking a look at this.
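For illustration, with a hypothetical table rather than the actual failing test: if t is partitioned by (p1, p2) and the query supplies neither a static value nor a trailing column for p2, the problem is now reported during analysis.

// Hypothetical example (assumes a SparkSession named spark): table t is
// partitioned by (p1, p2) and src has a single column, so nothing is supplied
// for p2 and the insert is rejected with an AnalysisException up front instead
// of failing later with a SparkException.
spark.sql("INSERT INTO TABLE t PARTITION (p1 = 'a') SELECT id FROM src")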

@SparkQA commented Apr 8, 2016

Test build #55399 has finished for PR 12239 at commit 7c31b73.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

rdblue force-pushed the SPARK-14459-detect-hive-partitioning branch from 7c31b73 to 2a807a9 on April 8, 2016 at 22:45
@rdblue (Contributor, Author) commented Apr 8, 2016

Looks like I accidentally pushed a commit I didn't intend to. I've fixed that, but the test failed. This is ok to test (again).

@SparkQA commented Apr 9, 2016

Test build #55403 has finished for PR 12239 at commit 2a807a9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue (Contributor, Author) commented Apr 11, 2016

I just pushed a fix for the last test failure. Should be ready to test again.

@SparkQA commented Apr 11, 2016

Test build #55533 has finished for PR 12239 at commit 6491788.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class OneRowRelation(outputAttrs: Seq[Attribute]) extends LeafNode

rdblue force-pushed the SPARK-14459-detect-hive-partitioning branch 2 times, most recently from d166453 to aad3eb5, on April 12, 2016 at 01:00
@SparkQA commented Apr 12, 2016

Test build #55563 has finished for PR 12239 at commit aad3eb5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin (Contributor) commented Apr 12, 2016

cc @liancheng and @cloud-fan

@SparkQA commented Apr 20, 2016

Test build #56387 has finished for PR 12239 at commit 2247648.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor) commented:

Hi @rxin, are we going to make MetastoreRelation also a CatalogTable?

@rxin (Contributor) commented Apr 21, 2016

I think so - the current MetastoreRelation doesn't make a lot of sense.

@rdblue (Contributor, Author) commented Apr 21, 2016

@rxin, @cloud-fan, this PR works for both cases when the table is resolved. I think making MetastoreRelation a CatalogTable would certainly improve things, but it looks like that is a long-term plan rather than something that should be solved before this goes in.

By the way, my other PR to align writes more with user expectations, #12313, also works toward making Hive behave like other relations. It consolidates the pre-write checks and casts and makes it much more user-friendly to write into any relation.

Review comment (Contributor):
why do we only check size here?

Reply (Contributor, Author):
In this case, the user has called partitionBy(...) to set the partition columns by hand. Those don't have to match by name and we can insert casts during the pre-insert checks because we trust that the user gets it right. The only thing valuable to check is that the user got the number of partitions right as a sanity check.
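Roughly, that sanity check amounts to something like the following sketch (placeholder names, not the code in this PR):

// Placeholder sketch: only the number of user-supplied partition columns is
// validated here; matching names and inserting casts is left to the pre-insert
// checks that run later.
def checkPartitionColumnCount(
    userPartitionColumns: Seq[String],
    tablePartitionColumns: Seq[String]): Unit = {
  require(
    userPartitionColumns.size == tablePartitionColumns.size,
    s"Expected ${tablePartitionColumns.size} partition columns but got " +
      s"${userPartitionColumns.size}: ${userPartitionColumns.mkString(", ")}")
}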

Reply (Contributor):
I checked the PreWriteCheck rule; it does check that the user-specified partition columns match the real ones. So I think we don't need to check here, just add a comment saying that PreWriteCheck will do this check.

Reply (Contributor, Author):
Works for me.

Reply (Contributor, Author):
Done.

Reply (Contributor, Author):
Removing this check broke 2 tests because the pre-insert code isn't correctly checking it. My follow-up PR fixes and merges the pre-inserts, so I've added it back in this PR with a TODO to remove it. I'll remove it when I rebase the follow-up on this.

@cloud-fan (Contributor) commented:
cc @liancheng to take another look

rdblue force-pushed the SPARK-14459-detect-hive-partitioning branch from 2247648 to d8aab8a on April 22, 2016 at 17:15
@rdblue (Contributor, Author) commented Apr 22, 2016

@cloud-fan, I've rebased on master and fixed the two things you pointed out. Let me know if there's anything else. Thanks for reviewing!

@SparkQA commented Apr 22, 2016

Test build #56710 has finished for PR 12239 at commit d8aab8a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

rdblue force-pushed the SPARK-14459-detect-hive-partitioning branch 2 times, most recently from 023d656 to d87f887, on April 22, 2016 at 22:27
@rdblue (Contributor, Author) commented Apr 25, 2016

I noticed that there was a conflict so I rebased on master. Tests are still passing.

rdblue force-pushed the SPARK-14459-detect-hive-partitioning branch 2 times, most recently from 39fe90c to 43a6e8b, on April 27, 2016 at 16:33
@SparkQA commented Apr 27, 2016

Test build #57137 has finished for PR 12239 at commit 39fe90c.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 27, 2016

Test build #57138 has finished for PR 12239 at commit 43a6e8b.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

rdblue force-pushed the SPARK-14459-detect-hive-partitioning branch from 43a6e8b to f3c3cdb on April 27, 2016 at 17:12
@SparkQA commented Apr 27, 2016

Test build #57144 has finished for PR 12239 at commit f3c3cdb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

rdblue added 4 commits on April 29, 2016 at 10:34. Commit messages:
… to match.

(This commit's message repeats the PR description above.)

This test expected to fail the strict partition check, but with support
for table partitioning in the analyzer the problem is caught sooner and
has a better error message. The message now complains that the
partitioning doesn't match rather than about strict mode, which wouldn't
help.

OneRowRelation doesn't expose its output columns because it is a
singleton. The stricter checks in InsertIntoTable's resolution were
causing this to fail. This commit fixes the problem by catching the case
where a table doesn't define its outputs and considering the table
resolved.
rdblue force-pushed the SPARK-14459-detect-hive-partitioning branch from f3c3cdb to 2130db8 on April 29, 2016 at 17:53
@SparkQA commented Apr 29, 2016

Test build #57350 has finished for PR 12239 at commit 2130db8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue (Contributor, Author) commented Apr 29, 2016

@cloud-fan, I rebased on master to avoid the conflicts and tests are all passing. If you have a chance to take another look I'd appreciate it! I think this is about ready.

@rdblue (Contributor, Author) commented May 3, 2016

@rxin, could you take a look at this? I think it's close to being ready and I have a couple of follow-up improvements to Hive/Parquet support that depend on it. Thank you!

@rxin (Contributor) commented May 5, 2016

Ryan - I've added this to the review list for the team.

// Split the table's output into partition columns and data columns; the
// expected columns are the data columns followed by the dynamic partition columns.
val (partitionColumns, dataColumns) = table.output
  .partition(a => partition.keySet.contains(a.name))
Some(dataColumns ++ partitionColumns.takeRight(numDynamicPartitions))
}
Review comment (Contributor):
Seems that we can omit the if condition and just use a Seq[Attribute] instead of Option[Seq[Attribute]] here?

Reply (Contributor, Author):
@liancheng, I added a note about this below. The problem is that some relations return Nil as their output. I think this approach is the least intrusive solution because changing those relations is a large change. Please let me know what you think, and thanks for reviewing!

@liancheng (Contributor) commented:
Hey @rdblue, glad to see you here :) Sorry for the late reply, I somehow missed the initial message.

This mostly LGTM except for two minor comments. Thanks!

@SparkQA commented May 5, 2016

Test build #57909 has finished for PR 12239 at commit 6aa17f4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

rdblue force-pushed the SPARK-14459-detect-hive-partitioning branch from 6aa17f4 to 2130db8 on May 5, 2016 at 20:42
@rdblue (Contributor, Author) commented May 5, 2016

@liancheng, I originally had the logic you suggest for the expected columns calculation. But, there were test failures because OneRowRelation reports its output columns as Nil and will never match the input. Changing that introduced a huge number of test failures where other code depended on that Nil, so I think the best way forward is to catch the case where the output table has no columns and consider the node resolved.

The downside to this approach is that tables that actually have no columns would be considered resolved, but we will catch this case in later processing. For example, the OneRowRelation test is actually looking for a helpful error message that you can't write to an RDD-based relation.
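Roughly, the resolution check amounts to something like the following sketch (simplified placeholder signature, not the actual resolved implementation):

// When the target table declares no output columns (e.g. OneRowRelation reports
// Nil), there is nothing to match against, so treat the insert as resolved and
// let the later pre-write checks reject it with a useful message. Otherwise,
// every expected column, including dynamic partition columns, must be supplied.
def insertResolved(
    expectedColumns: Option[Seq[String]],  // None when the table exposes no columns
    queryColumns: Seq[String]): Boolean = {
  expectedColumns.forall(expected => expected.size == queryColumns.size)
}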

@rdblue (Contributor, Author) commented May 5, 2016

@liancheng and @rxin, thank you for looking at this!

@SparkQA commented May 5, 2016

Test build #57919 has finished for PR 12239 at commit 2130db8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng (Contributor) commented:
I'm merging this to master and branch-2.0. Thanks for fixing this!

asfgit closed this in 652bbb1 on May 9, 2016.
asfgit pushed a commit that referenced this pull request on May 9, 2016:
…l plan

(The commit message repeats the "What changes were proposed" and "How was this patch tested?" sections above.)

Author: Ryan Blue <[email protected]>

Closes #12239 from rdblue/SPARK-14459-detect-hive-partitioning.

(cherry picked from commit 652bbb1)
Signed-off-by: Cheng Lian <[email protected]>
@rdblue (Contributor, Author) commented May 9, 2016

Thanks for reviewing this, @cloud-fan and @liancheng!
