
Conversation

@rdblue (Contributor) commented Apr 7, 2016

What changes were proposed in this pull request?

This detects a relation's partitioning and adds checks to the analyzer.
If an InsertIntoTable node has no partitioning, it is replaced by the
relation's partition scheme and input columns are correctly adjusted,
placing the partition columns at the end in partition order. If an
InsertIntoTable node has partitioning, it is checked against the table's
reported partitions.

These changes required adding a PartitionedRelation trait to the catalog
interface because Hive's MetastoreRelation doesn't extend
CatalogRelation.

This commit also includes a fix to InsertIntoTable's resolved logic,
which now detects that all expected columns are present, including
dynamic partition columns. Previously, the number of expected columns
was not checked and resolved was true if there were missing columns.
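As a rough illustration of the behavior described above, here is a minimal, self-contained Scala sketch. The types Column, Table, and Insert are placeholders for this sketch only, not the classes touched by this patch:

// Placeholder types for illustration only.
case class Column(name: String)
case class Table(output: Seq[Column], partitionColumnNames: Seq[String])
case class Insert(
    table: Table,
    partition: Map[String, Option[String]],  // partition name -> static value; None means dynamic
    query: Seq[Column])

def adjustPartitioning(insert: Insert): Insert = {
  if (insert.partition.isEmpty && insert.table.partitionColumnNames.nonEmpty) {
    // No partitioning on the insert: adopt the table's partition scheme as dynamic
    // partitions and move the partition columns to the end, in partition order.
    val partNames = insert.table.partitionColumnNames
    val (partCols, dataCols) = insert.query.partition(c => partNames.contains(c.name))
    val inPartitionOrder = partNames.flatMap(name => partCols.find(_.name == name))
    insert.copy(
      partition = partNames.map(name => name -> (None: Option[String])).toMap,
      query = dataCols ++ inPartitionOrder)
  } else {
    // Partitioning was specified: the analyzer checks it against the table's
    // reported partitions instead (check elided in this sketch).
    insert
  }
}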

How was this patch tested?

This adds new tests to the InsertIntoTableSuite that are fixed by this PR.

@rdblue (Contributor, Author) commented Apr 7, 2016

@liancheng, can you look at this? It looks like you're familiar with the SQL/Hive code.

rdblue changed the title from "SPARK-14459: Detect relation partitioning and adjust the logical plan" to "[SPARK-14459] [SQL] Detect relation partitioning and adjust the logical plan" on Apr 7, 2016
@marmbrus (Contributor) commented Apr 7, 2016

ok to test

@SparkQA commented Apr 7, 2016

Test build #55231 has finished for PR 12239 at commit 5e10c7c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue (Contributor, Author) commented Apr 8, 2016

@marmbrus, I fixed the failing test. The problem was a query that didn't supply a value for one of the partitions. This commit actually gives a better error message, but it is an AnalysisException rather than a SparkException. Thanks for taking a look at this.
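For illustration, with a hypothetical table rather than the actual failing test: if t is partitioned by (p1, p2) and the query supplies neither a static value nor a trailing column for p2, the problem is now reported during analysis.

// Hypothetical example (assumes a SparkSession named spark): table t is
// partitioned by (p1, p2) and src has a single column, so nothing is supplied
// for p2 and the insert is rejected with an AnalysisException up front instead
// of failing later with a SparkException.
spark.sql("INSERT INTO TABLE t PARTITION (p1 = 'a') SELECT id FROM src")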

@SparkQA commented Apr 8, 2016

Test build #55399 has finished for PR 12239 at commit 7c31b73.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

rdblue force-pushed the SPARK-14459-detect-hive-partitioning branch from 7c31b73 to 2a807a9 on April 8, 2016 at 22:45
@rdblue (Contributor, Author) commented Apr 8, 2016

Looks like I accidentally pushed a commit I didn't intend to. I've fixed that, but the test failed. This is ok to test (again).

@SparkQA commented Apr 9, 2016

Test build #55403 has finished for PR 12239 at commit 2a807a9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue (Contributor, Author) commented Apr 11, 2016

I just pushed a fix for the last test failure. Should be ready to test again.

@SparkQA commented Apr 11, 2016

Test build #55533 has finished for PR 12239 at commit 6491788.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class OneRowRelation(outputAttrs: Seq[Attribute]) extends LeafNode

rdblue force-pushed the SPARK-14459-detect-hive-partitioning branch 2 times, most recently from d166453 to aad3eb5, on April 12, 2016 at 01:00
@SparkQA commented Apr 12, 2016

Test build #55563 has finished for PR 12239 at commit aad3eb5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin (Contributor) commented Apr 12, 2016

cc @liancheng and @cloud-fan

@SparkQA commented Apr 20, 2016

Test build #56387 has finished for PR 12239 at commit 2247648.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor) commented:

Hi @rxin, are we going to make MetastoreRelation also a CatalogTable?

@rxin (Contributor) commented Apr 21, 2016

I think so - the current MetastoreRelation doesn't make a lot of sense.

@rdblue (Contributor, Author) commented Apr 21, 2016

@rxin, @cloud-fan, this PR works for both cases when the table is resolved. I think making MetastoreRelation a CatalogTable would certainly improve things, but it looks like that is a long-term plan rather than something that should be solved before this goes in.

By the way, my other PR to align writes more with user expectations, #12313, also works toward making Hive behave like other relations. It consolidates the pre-write checks and casts and makes it much more user-friendly to write into any relation.

Review comment (Contributor):
why do we only check size here?

Reply (Contributor, Author):
In this case, the user has called partitionBy(...) to set the partition columns by hand. Those don't have to match by name and we can insert casts during the pre-insert checks because we trust that the user gets it right. The only thing valuable to check is that the user got the number of partitions right as a sanity check.
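Roughly, that sanity check amounts to something like the following sketch (placeholder names, not the code in this PR):

// Placeholder sketch: only the number of user-supplied partition columns is
// validated here; matching names and inserting casts is left to the pre-insert
// checks that run later.
def checkPartitionColumnCount(
    userPartitionColumns: Seq[String],
    tablePartitionColumns: Seq[String]): Unit = {
  require(
    userPartitionColumns.size == tablePartitionColumns.size,
    s"Expected ${tablePartitionColumns.size} partition columns but got " +
      s"${userPartitionColumns.size}: ${userPartitionColumns.mkString(", ")}")
}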

Reply (Contributor):
I checked the PreWriteCheck rule; it does check that the user-specified partition columns match the real ones. So I think we don't need to check here, just add a comment saying that PreWriteCheck will do this check.

Reply (Contributor, Author):
Works for me.

Reply (Contributor, Author):
Done.

Reply (Contributor, Author):
Removing this check broke 2 tests because the pre-insert code isn't correctly checking it. My follow-up PR fixes and merges the pre-inserts, so I've added it back in this PR with a TODO to remove it. I'll remove it when I rebase the follow-up on this.

@cloud-fan (Contributor) commented:
cc @liancheng to take another look

rdblue force-pushed the SPARK-14459-detect-hive-partitioning branch from 2247648 to d8aab8a on April 22, 2016 at 17:15
@rdblue (Contributor, Author) commented Apr 22, 2016

@cloud-fan, I've rebased on master and fixed the two things you pointed out. Let me know if there's anything else. Thanks for reviewing!

@SparkQA commented Apr 22, 2016

Test build #56710 has finished for PR 12239 at commit d8aab8a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

rdblue force-pushed the SPARK-14459-detect-hive-partitioning branch 2 times, most recently from 023d656 to d87f887, on April 22, 2016 at 22:27
@rdblue (Contributor, Author) commented Apr 25, 2016

I noticed that there was a conflict so I rebased on master. Tests are still passing.

rdblue force-pushed the SPARK-14459-detect-hive-partitioning branch 2 times, most recently from 39fe90c to 43a6e8b, on April 27, 2016 at 16:33
@SparkQA commented Apr 27, 2016

Test build #57137 has finished for PR 12239 at commit 39fe90c.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 27, 2016

Test build #57138 has finished for PR 12239 at commit 43a6e8b.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

rdblue force-pushed the SPARK-14459-detect-hive-partitioning branch from 43a6e8b to f3c3cdb on April 27, 2016 at 17:12
@SparkQA commented Apr 27, 2016

Test build #57144 has finished for PR 12239 at commit f3c3cdb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

rdblue added 4 commits on April 29, 2016 at 10:34. Commit messages:
… to match.

(This commit's message repeats the PR description above.)

This test expected to fail the strict partition check, but with support
for table partitioning in the analyzer the problem is caught sooner and
has a better error message. The message now complains that the
partitioning doesn't match rather than about strict mode, which wouldn't
help.

OneRowRelation doesn't expose its output columns because it is a
singleton. The stricter checks in InsertIntoTable's resolution were
causing this to fail. This commit fixes the problem by catching the case
where a table doesn't define its outputs and considering the table
resolved.
rdblue force-pushed the SPARK-14459-detect-hive-partitioning branch from f3c3cdb to 2130db8 on April 29, 2016 at 17:53
@SparkQA commented Apr 29, 2016

Test build #57350 has finished for PR 12239 at commit 2130db8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue (Contributor, Author) commented Apr 29, 2016

@cloud-fan, I rebased on master to avoid the conflicts and tests are all passing. If you have a chance to take another look I'd appreciate it! I think this is about ready.

@rdblue (Contributor, Author) commented May 3, 2016

@rxin, could you take a look at this? I think it's close to being ready and I have a couple of follow-up improvements to Hive/Parquet support that depend on it. Thank you!

@rxin (Contributor) commented May 5, 2016

Ryan - I've added this to the review list for the team.

// Split the table's output into partition columns and data columns; the
// expected columns are the data columns followed by the dynamic partition columns.
val (partitionColumns, dataColumns) = table.output
  .partition(a => partition.keySet.contains(a.name))
Some(dataColumns ++ partitionColumns.takeRight(numDynamicPartitions))
}
Review comment (Contributor):
Seems that we can omit the if condition and just use a Seq[Attribute] instead of Option[Seq[Attribute]] here?

Reply (Contributor, Author):
@liancheng, I added a note about this below. The problem is that some relations return Nil as their output. I think this approach is the least intrusive solution because changing those relations is a large change. Please let me know what you think, and thanks for reviewing!

@liancheng (Contributor) commented:
Hey @rdblue, glad to see you here :) Sorry for the late reply, I somehow missed the initial message.

This mostly LGTM except for two minor comments. Thanks!

@SparkQA commented May 5, 2016

Test build #57909 has finished for PR 12239 at commit 6aa17f4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

rdblue force-pushed the SPARK-14459-detect-hive-partitioning branch from 6aa17f4 to 2130db8 on May 5, 2016 at 20:42
@rdblue (Contributor, Author) commented May 5, 2016

@liancheng, I originally had the logic you suggest for the expected columns calculation. But, there were test failures because OneRowRelation reports its output columns as Nil and will never match the input. Changing that introduced a huge number of test failures where other code depended on that Nil, so I think the best way forward is to catch the case where the output table has no columns and consider the node resolved.

The downside to this approach is that tables that actually have no columns would be considered resolved, but we will catch this case in later processing. For example, the OneRowRelation test is actually looking for a helpful error message that you can't write to an RDD-based relation.
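Roughly, the resolution check amounts to something like the following sketch (simplified placeholder signature, not the actual resolved implementation):

// When the target table declares no output columns (e.g. OneRowRelation reports
// Nil), there is nothing to match against, so treat the insert as resolved and
// let the later pre-write checks reject it with a useful message. Otherwise,
// every expected column, including dynamic partition columns, must be supplied.
def insertResolved(
    expectedColumns: Option[Seq[String]],  // None when the table exposes no columns
    queryColumns: Seq[String]): Boolean = {
  expectedColumns.forall(expected => expected.size == queryColumns.size)
}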

@rdblue (Contributor, Author) commented May 5, 2016

@liancheng and @rxin, thank you for looking at this!

@SparkQA commented May 5, 2016

Test build #57919 has finished for PR 12239 at commit 2130db8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng (Contributor) commented:
I'm merging this to master and branch-2.0. Thanks for fixing this!

asfgit closed this in 652bbb1 on May 9, 2016.
asfgit pushed a commit that referenced this pull request on May 9, 2016:
…l plan

(The commit message repeats the "What changes were proposed" and "How was this patch tested?" sections above.)

Author: Ryan Blue <[email protected]>

Closes #12239 from rdblue/SPARK-14459-detect-hive-partitioning.

(cherry picked from commit 652bbb1)
Signed-off-by: Cheng Lian <[email protected]>
@rdblue (Contributor, Author) commented May 9, 2016

Thanks for reviewing this, @cloud-fan and @liancheng!
