-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-19981][SQL] Respect aliases in output partitioning of projects and aggregates #17400
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #75104 has finished for PR 17400 at commit
|
|
Just curious, how come the fix is not in this code? So anywhere we compare expressions for semantic equality, we can say |
|
Here is a sort example with 1 partition. I believe the extra sort on |
|
ISTM the solution you suggested does not work because the planner actually compares references (that is, |
|
I suggest the following code for |
|
@allengeorge yea, we could there. But, I think we should first make sure about how to fix this issue. I'm not sure that the approach of this pr is the best. cc: @gatorsmile |
91a412e to
0492c0f
Compare
|
Test build #76738 has finished for PR 17400 at commit
|
|
Test build #76740 has finished for PR 17400 at commit
|
|
ping |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we fix the issue in EnsureRequirements? Aggregate operators can also introduce alias.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aha, ok. I'll consider again.
|
Test build #77030 has finished for PR 17400 at commit
|
|
Jenkins, retest this please. |
|
Test build #77053 has finished for PR 17400 at commit
|
|
@gatorsmile ping |
|
ping |
|
Test build #79520 has finished for PR 17400 at commit
|
|
Test build #85034 has finished for PR 17400 at commit
|
|
@maropu , any reason why this is on hold for so long? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this might do more than you'd like it to (at least if it behaves the way I understand collect first), i.e.
df.select($"x" as "x1, struct($"a" as "a1", $"b" as "b1") as "s1")
x1 and s1 are aliases, a1 and b1 are not. it could even get more complicated if there was an a1 alias in the top level projections list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This pr only focuses on aliases, so the point you described above is out-of-scope in this pr. IMO more complicated cases should be fixed in follow-ups.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maropu , I didn't aim for supporting complex partitioning expressions (which deserves its own separate PR), I meant that this code could introduce regressions by 'over-capturing' nested aliases.
- my specific example is wrong since struct is transformed into a named struct (alias is replaced by an explicit name).
|
I think that's because the priority is not much high. This issue causes any problem in your query? |
|
@maropu , yes it does :-) |
|
If possible, could you describe that problem in your case to encourage this work? |
|
in my use case, I aggregate a dataset, the use select to align columns with a case-class. I later try to join the resulting dataset based on the same columns used for aggregattion. |
|
Test build #94999 has finished for PR 17400 at commit
|
|
Test build #95009 has finished for PR 17400 at commit
|
e288288 to
ec3e6d9
Compare
|
Test build #95079 has finished for PR 17400 at commit
|
|
Test build #95078 has finished for PR 17400 at commit
|
|
Test build #95080 has finished for PR 17400 at commit
|
|
Test build #95097 has finished for PR 17400 at commit
|
|
Test build #95102 has finished for PR 17400 at commit
|
|
retest this please |
|
Test build #96061 has finished for PR 17400 at commit
|
|
retest this please |
| import org.apache.spark.sql.catalyst.expressions.{Alias, Expression, NamedExpression} | ||
| import org.apache.spark.sql.catalyst.plans.physical._ | ||
|
|
||
| trait AliasAwareOutputPartitioning extends UnaryExecNode { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might need a general utility class for this. cc @maryannxue She did the similar things for the other projects in the past. Maybe @maryannxue can help deliver such a utility class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, I'll wait for @maryannxue suggestion.
|
Test build #96161 has finished for PR 17400 at commit
|
|
retest this please |
|
Test build #96164 has finished for PR 17400 at commit
|
|
Test build #97746 has finished for PR 17400 at commit
|
|
Test build #97753 has finished for PR 17400 at commit
|
|
Test build #97788 has finished for PR 17400 at commit
|
What changes were proposed in this pull request?
The current master might wrongly add shuffle operations when projects and aggregates in physical plans have aliases in output expressions. A concrete example is as follows;
In the query, the second
Exchangeis not necessary. The root cause is that the planner wrongly regardskeyandkas different attributes because they have differentexprId. Then, it fails distribution requirement checks inEnsureRequirements. This pr proposes to handle these aliases inEnsureRequirementsso as to check if the operators satisfy their output distribution requirements.How was this patch tested?
Added tests in
SQLQueryTestSuite.