[SPARK-14459] [SQL] Detect relation partitioning and adjust the logical plan #12239
Conversation
@liancheng, can you look at this? It looks like you're familiar with the SQL/Hive code.

ok to test
Test build #55231 has finished for PR 12239 at commit
@marmbrus, I fixed the failing test. The problem was a query that didn't supply a value for one of the partitions. This commit actually gives a better error message, but it is an AnalysisException rather than a SparkException. Thanks for taking a look at this.
Test build #55399 has finished for PR 12239 at commit
Force-pushed from 7c31b73 to 2a807a9
Looks like I accidentally pushed a commit I didn't intend to. I've fixed that, but the test failed. This is ok to test (again).
Test build #55403 has finished for PR 12239 at commit
I just pushed a fix for the last test failure. Should be ready to test again.
Test build #55533 has finished for PR 12239 at commit
Force-pushed from d166453 to aad3eb5
Test build #55563 has finished for PR 12239 at commit
cc @liancheng and @cloud-fan
Test build #56387 has finished for PR 12239 at commit
hi @rxin, are we going to make …
I think so - the current MetastoreRelation doesn't make a lot of sense.
@rxin, @cloud-fan, this PR works for both cases when the table is resolved. I think making MetastoreRelation a CatalogTable would certainly improve things, but it looks like that is a long-term plan rather than something that should be solved before this goes in. By the way, my other PR to align writes more with user expectations, #12313, also works toward making Hive behave like other relations. It consolidates the pre-write checks and casts and makes it much more user-friendly to write into any relation.
why do we only check size here?
In this case, the user has called partitionBy(...) to set the partition columns by hand. Those don't have to match by name and we can insert casts during the pre-insert checks because we trust that the user gets it right. The only thing valuable to check is that the user got the number of partitions right as a sanity check.
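The sanity check described here could be sketched roughly as follows. This is a hypothetical illustration, not the actual Spark code; `TableDesc` and `checkPartitionCount` are invented names for the sketch.

```scala
// Hypothetical sketch of the count-only sanity check described above.
// When the user supplies partition columns via partitionBy(...), names
// need not match and casts are inserted later by the pre-insert checks,
// so only the number of partition columns is validated here.
case class TableDesc(partitionColumnNames: Seq[String])

def checkPartitionCount(table: TableDesc, userPartitions: Seq[String]): Unit = {
  if (userPartitions.size != table.partitionColumnNames.size) {
    throw new IllegalArgumentException(
      s"Expected ${table.partitionColumnNames.size} partition columns, " +
        s"got ${userPartitions.size}")
  }
}
```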
I checked the PreWriteCheck rule; it does verify that the user-specified partition columns match the real ones. So I think we don't need the check here, just a comment saying that PreWriteCheck will do it.
Works for me.
Done.
Removing this check broke 2 tests because the pre-insert code isn't correctly checking it. My follow-up PR fixes and merges the pre-inserts, so I've added it back in this PR with a TODO to remove it. I'll remove it when I rebase the follow-up on this.
cc @liancheng to take another look
Force-pushed from 2247648 to d8aab8a
@cloud-fan, I've rebased on master and fixed the two things you pointed out. Let me know if there's anything else. Thanks for reviewing!
Test build #56710 has finished for PR 12239 at commit
Force-pushed from 023d656 to d87f887
I noticed that there was a conflict so I rebased on master. Tests are still passing. |
Force-pushed from 39fe90c to 43a6e8b
Test build #57137 has finished for PR 12239 at commit
Test build #57138 has finished for PR 12239 at commit
Force-pushed from 43a6e8b to f3c3cdb
Test build #57144 has finished for PR 12239 at commit
… to match. This detects a relation's partitioning and adds checks to the analyzer. If an InsertIntoTable node has no partitioning, it is replaced by the relation's partition scheme and input columns are correctly adjusted, placing the partition columns at the end in partition order. If an InsertIntoTable node has partitioning, it is checked against the table's reported partitions. These changes required adding a PartitionedRelation trait to the catalog interface because Hive's MetastoreRelation doesn't extend CatalogRelation. This commit also includes a fix to InsertIntoTable's resolved logic, which now detects that all expected columns are present, including dynamic partition columns. Previously, the number of expected columns was not checked and resolved was true if there were missing columns.
This test was expected to fail the strict partition check, but with support for table partitioning in the analyzer the problem is caught sooner and has a better error message. The message now complains that the partitioning doesn't match rather than about strict mode, which wouldn't help.
OneRowTable doesn't expose its output columns because it is a singleton. The stricter checks in InsertIntoTable's resolution cause this to fail. This commit fixes the problem by catching the case where a table doesn't define its outputs and considering the table resolved.
Force-pushed from f3c3cdb to 2130db8
Test build #57350 has finished for PR 12239 at commit
@cloud-fan, I rebased on master to avoid the conflicts and tests are all passing. If you have a chance to take another look I'd appreciate it! I think this is about ready.
@rxin, could you take a look at this? I think it's close to being ready and I have a couple of follow-up improvements to Hive/Parquet support that depend on it. Thank you!
Ryan - I've added this to the review list for the team.
val (partitionColumns, dataColumns) = table.output
  .partition(a => partition.keySet.contains(a.name))
Some(dataColumns ++ partitionColumns.takeRight(numDynamicPartitions))
}
Seems that we can omit the if condition and just use a Seq[Attribute] instead of Option[Seq[Attribute]] here?
@liancheng, I added a note about this below. The problem is that some relations return Nil as their output. I think this approach is the least intrusive solution because changing those relations is a large change. Please let me know what you think, and thanks for reviewing!
Hey @rdblue, glad to see you here :) Sorry for the late reply, I somehow missed the initial message. This mostly LGTM except for two minor comments. Thanks!
Test build #57909 has finished for PR 12239 at commit
Force-pushed from 6aa17f4 to 2130db8
@liancheng, I originally had the logic you suggest for the expected columns calculation. But there were test failures because … The down side to this approach is that tables that actually have no columns would be considered resolved, but we will catch this case in later processing. For example, the …
@liancheng and @rxin, thank you for looking at this!
Test build #57919 has finished for PR 12239 at commit
I'm merging this to master and branch-2.0. Thanks for fixing this!
…l plan

## What changes were proposed in this pull request?

This detects a relation's partitioning and adds checks to the analyzer. If an InsertIntoTable node has no partitioning, it is replaced by the relation's partition scheme and input columns are correctly adjusted, placing the partition columns at the end in partition order. If an InsertIntoTable node has partitioning, it is checked against the table's reported partitions.

These changes required adding a PartitionedRelation trait to the catalog interface because Hive's MetastoreRelation doesn't extend CatalogRelation.

This commit also includes a fix to InsertIntoTable's resolved logic, which now detects that all expected columns are present, including dynamic partition columns. Previously, the number of expected columns was not checked and resolved was true if there were missing columns.

## How was this patch tested?

This adds new tests to the InsertIntoTableSuite that are fixed by this PR.

Author: Ryan Blue <[email protected]>

Closes #12239 from rdblue/SPARK-14459-detect-hive-partitioning.

(cherry picked from commit 652bbb1)
Signed-off-by: Cheng Lian <[email protected]>
Thanks for reviewing this, @cloud-fan and @liancheng!
What changes were proposed in this pull request?
This detects a relation's partitioning and adds checks to the analyzer.
If an InsertIntoTable node has no partitioning, it is replaced by the
relation's partition scheme and input columns are correctly adjusted,
placing the partition columns at the end in partition order. If an
InsertIntoTable node has partitioning, it is checked against the table's
reported partitions.
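As a rough illustration of the column adjustment described above (the names here are assumptions for the sketch, not the actual analyzer code), the adjustment amounts to moving partition columns to the end, in the table's partition order:

```scala
// Sketch of the input-column adjustment: data columns come first, then
// partition columns are appended in the table's declared partition order.
def adjustColumns(tableOutput: Seq[String], partitionCols: Seq[String]): Seq[String] = {
  val dataCols = tableOutput.filterNot(partitionCols.contains)
  dataCols ++ partitionCols
}
```

For example, `adjustColumns(Seq("p1", "a", "p2", "b"), Seq("p1", "p2"))` yields `Seq("a", "b", "p1", "p2")`.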
These changes required adding a PartitionedRelation trait to the catalog
interface because Hive's MetastoreRelation doesn't extend
CatalogRelation.
This commit also includes a fix to InsertIntoTable's resolved logic,
which now detects that all expected columns are present, including
dynamic partition columns. Previously, the number of expected columns
was not checked and resolved was true if there were missing columns.
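A minimal sketch of the fixed resolved logic, under assumed names (this is not the real InsertIntoTable code): resolution requires the child to supply exactly as many columns as expected, while tables that don't expose their output at all are still treated as resolved.

```scala
// Hedged sketch: expectedColumns is None for tables that don't declare
// their output (e.g. the singleton case noted earlier); otherwise the
// child must supply one value per expected column, including dynamic
// partition columns.
def isResolved(expectedColumns: Option[Seq[String]], childOutput: Seq[String]): Boolean =
  expectedColumns.forall(_.size == childOutput.size)
```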
How was this patch tested?
This adds new tests to the InsertIntoTableSuite that are fixed by this PR.