-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[WIP][SPARK-24051][SQL] Replace Aliases with the same exprId #21184
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@hvanhovell I created this PR to discuss all together the problem in details with a basic solution which can help understanding better the problem. Honestly I am not happy with this change. So if we can find better options, any input would be very appreciated. Thanks. |
|
Test build #89937 has finished for PR 21184 at commit
|
|
@gatorsmile what do you think about this approach for the problem we hit in #21737? Currently, this change doesn't solve all the conflicts, but what do you think about the approach? Thanks. |
| checkAnswer(df, df.collect()) | ||
| } | ||
|
|
||
| test("SPARK-24051: using the same alias can produce incorrect result") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test case failed without your changes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, without the change the result is:
+---+---+---+
| a| b| n|
+---+---+---+
| 1| 0| 2|
| 2| 0| 2|
| 3| 0| 1|
+---+---+---+
| * Replaces [[Alias]] with the same exprId but different references with [[Alias]] having | ||
| * different exprIds. This is a rare situation which can cause incorrect results. | ||
| */ | ||
| object DeduplicateAliases extends Rule[LogicalPlan] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what problem does it try to resolve?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main problem which causes the added UT to fail is that FoldablePropagation replaces all foldable aliases which are considered to be the same. If the same alias with same exprId is located in different part of the plan (referencing actually different things, despite they have the same id...) this can cause wrong replacement to happen. So in the added UT, the plan is:
== Analyzed Logical Plan ==
a: int, b: int, n: bigint
Union
:- Project [a#5, b#17, n#19L]
: +- Project [a#5, b#17, n#19L, n#19L]
: +- Window [count(1) windowspecdefinition(specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS n#19L]
: +- Project [a#5, b#6 AS b#17]
: +- Project [_1#2 AS a#5, _2#3 AS b#6]
: +- LocalRelation [_1#2, _2#3]
+- Project [a#12, b#17, n#19L]
+- Project [a#12, b#17, n#19L, n#19L]
+- Window [count(1) windowspecdefinition(specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS n#19L]
+- Project [a#12, b#14 AS b#17]
+- Project [a#12, 0 AS b#14]
+- Project [value#10 AS a#12]
+- LocalRelation [value#10]
Please note that in both the branches of the union we have the same b#17 attribute, but they are referencing different things. As the lower one is a foldable value which evaluates to 0, all the b#17 are replace with a literal 0, causing a wrong result.
Despite we might fix this specific problem in the related Optimizer rule, I think that in general we assume that items with the same id are the same. So I proposed this solution in order to fix all the possible issues which may arise due to this situation which is not expected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel the root cause is in FoldablePropagation. We should only replace attribute with literal from children, not siblings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that is also true. But in many places in the codebase we just compare attributes using semanticEquals or in some other cases, even equals. Well, if we admit that different attributes can have the same exprId, all these places should be checked in order to be sure that the same problem cannot happen there too. Moreover (this is more a nit), the semanticEquals or sameRef method itself would be wrong according to its semantic, as it may return true even when two attributes don't have the same reference. This is the reason why I opted for this solution, which seems to me cleaner as it solves the root cause of the problem. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
kindly ping @cloud-fan
|
Test build #93037 has finished for PR 21184 at commit
|
What changes were proposed in this pull request?
If a user reuses the same column in different selects, it can happen that we have
Aliaswith the sameexprId. These aliases can of course reference different columns/expressions (as in the use case presented in the JIRA). If any of them is a foldable, all of them are replaced with the foldable value in the Optimizer.The PR proposes to replace the conflicting aliases generating new ids for the conflicting ones.
How was this patch tested?
added UT