[SPARK-14543] [SQL] Improve InsertIntoTable column resolution. #12313
Conversation
Test build #55558 has finished for PR 12313 at commit
Test build #55616 has finished for PR 12313 at commit
Test build #56239 has finished for PR 12313 at commit
Test build #56386 has finished for PR 12313 at commit
Test build #56411 has finished for PR 12313 at commit
Test build #56586 has finished for PR 12313 at commit
Test build #56605 has finished for PR 12313 at commit
Test build #56708 has finished for PR 12313 at commit
Test build #56757 has finished for PR 12313 at commit
Retest this please.
Test build #58156 has finished for PR 12313 at commit
@liancheng, @cloud-fan, this commit is a follow-up to #12239 that fixes column resolution when writing to both Hive `MetastoreRelation`s and `HadoopFsRelation`s. Could you review it? I think it would be a good addition to 2.0.0 as well.
We can simply use `object ResolveOutputColumns` here.
I'll fix this and the protected method. I think I was instantiating this rule multiple times at one point, and this is left over.
@cloud-fan, @liancheng, thanks for reviewing! I've rebased on master and fixed your comments so far.
Test build #59894 has finished for PR 12313 at commit
Should use `plan.resolveOperators` here.
@rdblue Thank you for updating the patch. I was out of town late last week and was busy with Spark Summit early this week, so sorry for my late reply. Having name-based resolution is very useful! Since this is a major improvement and we are pretty late in the 2.0 QA period, how about we re-target the work to 2.1? Also, I feel it would be good to have a short design doc to discuss the following:

Based on the design, we can break the work into multiple pieces and get them merged for the 2.1 release. What do you think?
This combines Hive's pre-insertion casts (without renames) that handle partitioning with the pre-insertion casts/renames in core. The combined rule, ResolveOutputColumns, will resolve columns by name or by position. Resolving by position will detect cases where the number of columns is incorrect or where the input columns are a permutation of the output columns and fail. When resolving by name, each output column is located by name in the child plan. This handles cases where a subset of a data frame is written out.
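The two resolution strategies described above might be sketched roughly as follows. This is illustrative Python, not Spark's actual `ResolveOutputColumns` implementation; the function names and error messages are hypothetical:

```python
# Hypothetical sketch of the two column-resolution strategies described
# above; not Spark's actual implementation.

def resolve_by_position(input_cols, output_cols):
    """Match input columns to output columns by position.

    Fails when the column counts differ, or when the input names look
    like a permutation of the output names (likely caller error).
    """
    if len(input_cols) != len(output_cols):
        raise ValueError(
            f"Cannot write: {len(input_cols)} input columns for "
            f"{len(output_cols)} output columns")
    if input_cols != output_cols and sorted(input_cols) == sorted(output_cols):
        raise ValueError(
            "Input columns appear to be a permutation of the output "
            "columns; resolve by name instead")
    return list(zip(input_cols, output_cols))


def resolve_by_name(input_cols, output_cols):
    """Locate each output column by name in the child plan's columns.

    This handles writing out a subset of a data frame's columns.
    """
    pairs = []
    for out in output_cols:
        if out not in input_cols:
            raise ValueError(f"Cannot resolve output column '{out}' by name")
        pairs.append((out, out))
    return pairs
```

For instance, by-name resolution accepts a child plan with extra columns, while by-position resolution rejects any count mismatch outright.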
This PR now catches this problem during analysis and has a better error message. This commit updates the test for the new message and exception type.
Adding new arguments to `InsertIntoTable` requires changes to several files. Instead of adding a long list of optional args, this adds an options map, like the one passed to `DataSource`. Future options can be added and used only where they are needed.
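The options-map pattern described here might look like the following sketch. This is illustrative Python, not `InsertIntoTable`'s real Scala signature, and the `byName` key is a hypothetical example:

```python
# Illustrative sketch of carrying an options map instead of a growing
# list of optional constructor arguments; the real InsertIntoTable
# signature differs.

class InsertIntoTable:
    def __init__(self, table, child, options=None):
        self.table = table
        self.child = child
        # Options are carried opaquely; each consumer reads only the
        # keys it understands, so adding a new option requires no
        # signature change across the files that construct this node.
        self.options = dict(options or {})

    def by_name(self):
        # A consumer reads its option with a sensible default.
        return self.options.get("byName", "false").lower() == "true"
```

A caller that wants by-name resolution would pass `{"byName": "true"}`; all other call sites remain unchanged.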
@yhuai, whatever release you want to target is fine with me, but I don't think we should block this on a design doc for cleaning up the

Since this PR doesn't change the public API, I don't think we should wait until we have a plan for the public API (design doc goal 1) to commit it. Similarly for the second goal: missing columns, extra columns, and metastore updates are beyond the scope here, when this can simply require that the number of columns matches, since that's the most conservative strategy (that's what is now implemented).

The remaining issue is whether it is a good idea to include the by-name code. I think the concern is that it may change based on the API design doc, but that's really unlikely. For example, adding the option to

I'm just trying to avoid dragging this out for a lot longer, or the work of splitting it up needlessly and having to keep rebasing the changes on master. I could be wrong, so if I'm missing something here, then please let me know (and thanks for being patient).

Also, I addressed @cloud-fan's comments and rebased on master.
Test build #60327 has finished for PR 12313 at commit
Really like this PR! It removes one of the Hive-specific rules! :)
If this is too large for merging into 2.0, could @rdblue deliver a small fix for catching the illegal user inputs? Thanks!
…ases for by position resolution

## What changes were proposed in this pull request?

This PR migrates some test cases introduced in apache#12313 as a follow-up of apache#13754 and apache#13766. These test cases cover `DataFrameWriter.insertInto()`, while the former two only cover SQL `INSERT` statements.

Note that the `testPartitionedTable` utility method tests both Hive SerDe tables and data source tables.

## How was this patch tested?

N/A

Author: Cheng Lian <[email protected]>

Closes apache#13810 from liancheng/spark-16037-follow-up-tests.
…ases for by position resolution

## What changes were proposed in this pull request?

This PR migrates some test cases introduced in #12313 as a follow-up of #13754 and #13766. These test cases cover `DataFrameWriter.insertInto()`, while the former two only cover SQL `INSERT` statements.

Note that the `testPartitionedTable` utility method tests both Hive SerDe tables and data source tables.

## How was this patch tested?

N/A

Author: Cheng Lian <[email protected]>

Closes #13810 from liancheng/spark-16037-follow-up-tests.

(cherry picked from commit f4a3d45)
Signed-off-by: Yin Huai <[email protected]>
Test build #75047 has finished for PR 12313 at commit
Hi all, I was just wondering where we are on this.
We were trying to get this in just before the 2.0 release, which was a bad time. We've just been maintaining it in our version, but I'm going to be rebasing it onto 2.1 soon, so I'll see what needs to be done to get it in upstream.
I see. Thank you.
Test build #82245 has finished for PR 12313 at commit
This is addressed by #21305. |
What changes are proposed in this pull request?
- `LogicalPlan`. It catches cases where there are too many data columns and throws an `AnalysisException` rather than silently dropping the extra data. It also improves the error message when there are too few columns and warns when the output columns appear to be out of order.
- `MetastoreRelation` with the pre-insert cast and rename for `LogicalRelation`s. Both are now handled as a single `ResolveOutputColumns` step in the analyzer that implements the above improvements.
- Casts are now `UpCast`s to avoid silently adding incorrect casts when columns are misaligned. Casts are still somewhat unsafe and should be fixed in a follow-up PR.
- This is exposed on the `DataFrameWriter`.

How was this patch tested?
This patch includes unit tests that exercise the cases outlined above.
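For intuition, the analysis-time checks this PR describes (failing on too many data columns, a clearer error on too few, and allowing only safe "up" casts between matched columns) might be sketched as follows. The helper name and the cast table are hypothetical, not Spark's actual code:

```python
# Hypothetical sketch of the analysis-time output-column checks:
# fail on too many data columns, give a clear error on too few,
# and permit only widening ("up") casts between matched columns.

SAFE_UPCASTS = {  # widening conversions only; illustrative subset
    ("int", "long"),
    ("int", "double"),
    ("long", "double"),
    ("float", "double"),
}


def check_output_columns(data_types, table_types):
    if len(data_types) > len(table_types):
        raise ValueError(
            f"Cannot write: too many data columns "
            f"({len(data_types)} > {len(table_types)})")
    if len(data_types) < len(table_types):
        raise ValueError(
            f"Cannot write: not enough data columns "
            f"({len(data_types)} < {len(table_types)})")
    for i, (src, dst) in enumerate(zip(data_types, table_types)):
        if src != dst and (src, dst) not in SAFE_UPCASTS:
            raise ValueError(
                f"Cannot safely cast column {i}: {src} to {dst}")
```

The point of restricting to up-casts is that a misaligned column (e.g. a `double` landing in an `int` slot) fails analysis instead of being silently narrowed.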