[SPARK-25951][SQL] Ignore aliases for distributions and orderings #22957

mgaido91 · 2018-11-06T16:07:10Z

What changes were proposed in this pull request?

When we canonicalize an Expression, we do not remove Alias, so two expressions that are identical apart from a rename are considered semantically different. Since we rely on semantic equality to check whether two expressions produce the same result, some optimizations, such as removing redundant shuffles, may fail to apply when a column is renamed (as shown in the JIRA).

This PR proposes to ignore Aliases when checking whether distributions and orderings are satisfied, by introducing a new method sameResult which ignores Aliases.
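
To make the problem concrete, here is a minimal sketch in plain Scala. It is a toy model, not Spark's Catalyst classes: `Attr`, `Alias`, `canonicalize` and `semanticEquals` below are simplified stand-ins, and the assumption that canonicalization only erases names is taken from the description above.

```scala
// Toy model of the issue; these are NOT Spark's real classes.
object AliasCanonicalizeSketch {
  sealed trait Expr
  // An attribute reference: the name is cosmetic, exprId identifies the column.
  final case class Attr(name: String, exprId: Long) extends Expr
  // A rename of `child`; the Alias node carries its own exprId.
  final case class Alias(child: Expr, name: String, exprId: Long) extends Expr

  // Canonicalization erases cosmetic names but keeps the Alias node itself.
  def canonicalize(e: Expr): Expr = e match {
    case Attr(_, id)         => Attr("none", id)
    case Alias(child, _, id) => Alias(canonicalize(child), "none", id)
  }

  // Two expressions are semantically equal when their canonical forms match.
  def semanticEquals(x: Expr, y: Expr): Boolean =
    canonicalize(x) == canonicalize(y)

  val a: Expr = Attr("a", 1L)
  val aAsB: Expr = Alias(a, "b", 2L)
}
```

In this model `semanticEquals(a, aAsB)` is false even though both expressions evaluate to the same data, which is the situation that prevents the redundant shuffle from being removed.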

Credit should be given to @maropu for the approach suggestion which follows #17400.

Closes #17400.

How was this patch tested?

added UT

SparkQA · 2018-11-06T18:07:02Z

Test build #98521 has finished for PR 22957 at commit b566818.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-11-07T18:32:19Z

Test build #98552 has finished for PR 22957 at commit 2b00f35.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2018-11-07T19:09:30Z

i didn't look at your new code, but is your old code safe? e.g. a project that depends on the new alias.

mgaido91 · 2018-11-08T13:26:44Z

Thanks for your comment @rxin. It was safe for comparisons (that is, for deciding whether two expressions return the same data), because every AttributeReference carries the exprId it refers to, so a removed Alias would still have its exprId in all the AttributeReferences pointing to it. But if the result was used to decide which expressions to replace when modifying the plan (as is done in PhysicalAggregation), then it is not safe, because some Alias can go missing from the resulting plan, leaving it invalid. So in general it was not safe if we consider all the usages of semanticEquals, but it is safe if we only want to know whether the returned data is the same. Hope this answer is clear enough.

mgaido91 · 2018-11-28T09:32:37Z

cc @cloud-fan @gatorsmile

cloud-fan · 2018-11-28T11:45:57Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala

+   * different output expressions can evaluate to the same result as well (eg. when an expression
+   * is aliased).
+   */
+  def sameResult(other: Expression): Boolean = other match {


I know it's always safer to introduce a new API, but is it really necessary? In Canonicalize, we erase the name for attributes; I think it's reasonable to erase the name of Alias too, as it doesn't affect the output.

That is reasonable, but it doesn't solve the problem stated in the JIRA. The goal here is to avoid something like a as b being considered different from a in terms of ordering/distribution. If we just erase the name of the alias, the 2 expressions would still be different, because the presence of the Alias itself makes them different.

"erase the name" can also mean remove Alias. If we can't clearly tell the difference between semanticEquals and sameResult, and give a guideline about using which one in which case, I think we should just update semanticEquals(i.e. Canonicalize).

Removing Alias is not possible, for the reason explained in #22957 (comment). In general, semanticEquals should be used when we want to replace an expression with another, while sameResult should be used to check that 2 expressions return the same output.

can you put it in the method doc(both semanticEquals and sameResult)? This makes sense to me.

Sure, thanks.

cloud-fan · 2018-11-28T12:52:33Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala

+   */
+  def sameResult(other: Expression): Boolean = other match {
+    case a: Alias => sameResult(a.child)
+    case _ => this.semanticEquals(other)


can we also strip the alias of this here? so that we can mark sameResult as final.

I think it is doable, but I didn't want to put too many match where it was not needed. But if you prefer that way I can try and do that.

Well, it needs to be overridden by HashPartitioning too, so since I am not able to make it final anyway, I don't think it is a good idea. I could add a match on HashPartitioning too, but that doesn't seem a clean solution to me.

we can do

CleanupAliases.trimAliases(this) semanticEquals CleanupAliases.trimAliases(other)
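
The difference between the one-sided match in the diff and the symmetric form suggested here can be sketched with a toy model. This is an illustration only: `Attr`, `Alias`, `semanticEquals` and `trimAliases` below are simplified stand-ins written for this example, not the Catalyst implementations.

```scala
// Toy model; NOT Spark's real Expression hierarchy.
object SameResultSketch {
  sealed trait Expr
  final case class Attr(exprId: Long) extends Expr
  final case class Alias(child: Expr, name: String) extends Expr

  // Stand-in for semanticEquals: plain structural equality in this model.
  def semanticEquals(x: Expr, y: Expr): Boolean = x == y

  // Shape of the first version in the diff: only `other` is unwrapped.
  def sameResultOneSided(self: Expr, other: Expr): Boolean = other match {
    case Alias(child, _) => sameResultOneSided(self, child)
    case _               => semanticEquals(self, other)
  }

  // Simplified counterpart of trimAliases: unwrap aliases at the top.
  def trimAliases(e: Expr): Expr = e match {
    case Alias(child, _) => trimAliases(child)
    case other           => other
  }

  // Symmetric form: strip aliases on both sides before comparing.
  def sameResultSymmetric(self: Expr, other: Expr): Boolean =
    semanticEquals(trimAliases(self), trimAliases(other))

  val a: Expr = Attr(1L)
  val aAsB: Expr = Alias(a, "b")
}
```

With the one-sided version the answer depends on which expression is the receiver: comparing `a` against `aAsB` succeeds, while comparing `aAsB` against `a` does not. The symmetric version gives the same answer in both directions.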

SparkQA · 2018-11-28T15:25:02Z

Test build #99377 has finished for PR 22957 at commit 6c93e70.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-11-28T16:16:52Z

Test build #99376 has finished for PR 22957 at commit 3831be0.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-11-28T16:44:15Z

Test build #99375 has finished for PR 22957 at commit 0491249.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2018-11-29T03:01:15Z

retest this please

SparkQA · 2018-11-29T05:29:32Z

Test build #99424 has finished for PR 22957 at commit 6c93e70.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-11-29T12:52:20Z

Test build #99446 has finished for PR 22957 at commit a306465.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

mgaido91 · 2018-11-29T12:56:17Z

retest this please

SparkQA · 2018-11-29T14:22:21Z

Test build #99448 has finished for PR 22957 at commit a306465.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2018-11-29T14:58:31Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

    e.transformDown {
-      case Alias(child, _) => child
-      case MultiAlias(child, _) => child
+      case Alias(child, _) => trimAliases(child)


what's going on here?

The point is that this method currently removes only the first Alias it finds (it doesn't continue recursively), which is the reason for the UT failure. Also, judging from the comment on the method, that does not seem to be its expected behavior.

it's transformDown, why doesn't it work?

Ah, I did a stupid thing here. The problem is that, since the rule returns child for this node, transformDown then applies the rule to the children of child instead of to child itself. So the problem arises with 2 consecutive Aliases. Let me find a better fix.

Just using transformUp solves the issue.
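
The traversal-order issue discussed in this thread can be reproduced with a small self-contained sketch. The `transformDown` and `transformUp` functions below are simplified re-implementations written for illustration (pre-order and post-order rewriting over a toy tree), not Catalyst's TreeNode methods.

```scala
// Toy tree with hand-written pre-order / post-order rewriting.
object TransformOrderSketch {
  sealed trait Expr
  final case class Attr(name: String) extends Expr
  final case class Alias(child: Expr, name: String) extends Expr

  type Rule = PartialFunction[Expr, Expr]

  private def mapChildren(e: Expr)(f: Expr => Expr): Expr = e match {
    case Alias(child, name) => Alias(f(child), name)
    case attr: Attr         => attr
  }

  // Pre-order: rewrite the node, then recurse into the children of the RESULT.
  def transformDown(e: Expr)(rule: Rule): Expr = {
    val afterRule = rule.applyOrElse(e, (x: Expr) => x)
    mapChildren(afterRule)(c => transformDown(c)(rule))
  }

  // Post-order: rewrite the children first, then the node itself.
  def transformUp(e: Expr)(rule: Rule): Expr = {
    val withNewChildren = mapChildren(e)(c => transformUp(c)(rule))
    rule.applyOrElse(withNewChildren, (x: Expr) => x)
  }

  // The rule from the original trimAliases: replace an Alias by its child.
  val dropAlias: Rule = { case Alias(child, _) => child }

  // Two consecutive aliases: (a AS b) AS c
  val nested: Expr = Alias(Alias(Attr("a"), "b"), "c")
}
```

In this model `transformDown(nested)(dropAlias)` leaves the inner alias in place, because after the outer Alias is replaced by its child the traversal moves on to that child's children. `transformUp(nested)(dropAlias)` removes both, since the inner alias is rewritten before the outer one is visited.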

SparkQA · 2018-11-29T16:17:03Z

Test build #99449 has finished for PR 22957 at commit a306465.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-11-29T19:15:25Z

Test build #99462 has finished for PR 22957 at commit 13aef71.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2018-11-30T04:37:20Z

sql/core/src/test/scala/org/apache/spark/sql/execution/PlannerSuite.scala

    }
  }
+
+  test("SPARK-25951: avoid redundant shuffle on rename") {


can we have an end-to-end test as well?

+1 if possible.

Ah, good point, and indeed very useful. In my previous tests I always used a very simple query to verify this, never the one reported in the JIRA. Now that I tried that one, I realized that this fix is not very useful as it stands: with a rename like that, the HashPartitioning contains the AttributeReference to the Alias rather than the Alias itself. Since that is the common case, the PR as it is now is not very useful. If I can't figure out a good way to handle that, I am going to close this. Thanks and sorry for the trouble.

@cloud-fan @viirya I added the test, but as I mentioned I had to make another change in order to make it work. Sorry for the mistake. I'd really appreciate it if you could review it again. Thanks.

cloud-fan · 2018-11-30T04:37:48Z

LGTM, cc @viirya as well

viirya · 2018-11-30T05:05:03Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala


+  /**
+   * Returns true when two expressions will always compute the same result, even if the output may
+   * be different, because of different names or similar differences.


I think here output is a bit confusing. Do we mean the output names?

So sameResult returns if the evaluated results between two expressions are exactly the same?

Yes, I mean: sameResult returns true if 2 expressions return the same data even though from the plan perspective they are not the same (eg. the output names/exprIds are different, as in this case), while semanticEquals ensures they are the same from the plan perspective too. If you have better suggestions on how to rephrase this, I am happy to improve it. Thanks.

How about replacing output with output from plan perspective?

Returns true when two expressions will always compute the same result, even if the output from plan perspective may be different, because of different names or similar differences.

viirya · 2018-11-30T05:17:01Z

This looks good to me. Just a comment about wording.

HeartSaVioR · 2019-02-06T23:18:15Z

Sorry for jumping in, but it looks like we missed #17400 (SPARK-19981), which seems to contain the fix @cloud-fan suggests; reviewers just lost focus on it. To me #17400 looks very concise, so I'm curious how this patch differs from it.

mgaido91 · 2019-02-07T08:32:24Z

@HeartSaVioR thanks for pointing that out! Yes, that fix is similar to this one. The main difference between the 2 PRs is that this one also handles the case where we have select a as b, a, ... and the partitioning is on a, while #17400 doesn't (and hence may introduce a regression). Let me cc @maropu here too, since he worked on this and may have comments or suggestions, or he can update his PR to handle that case and we can focus on that PR.

maropu · 2019-02-09T06:04:23Z

@mgaido91 Thanks for letting me know! Could you update this PR based on my fix? If you can, it's OK to close my PR #17400.

mgaido91 · 2019-02-10T11:16:35Z

@maropu thanks for checking this. Do you mean using the trait approach? If so, sure, I'll do that. If not, please let me know. Thanks.

SparkQA · 2019-02-10T11:51:36Z

Test build #102147 has finished for PR 22957 at commit 09b9981.

This patch fails to build.
This patch merges cleanly.
This patch adds the following public classes (experimental):
trait AliasAwareOutputPartitioning extends UnaryExecNode

SparkQA · 2019-02-10T15:46:43Z

Test build #102150 has finished for PR 22957 at commit 69f9d5e.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-02-10T20:41:05Z

Test build #102154 has finished for PR 22957 at commit 75ef545.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HeartSaVioR · 2019-02-10T22:35:40Z

Just my 2 cents: personally I understood @maropu's comment as a request to take up his PR, in other words to rebase this branch onto his branch to retain his commits, and add @mgaido91's commits on top to fix the remaining issue. This would make it easier to give authorship of this PR to both @maropu and @mgaido91 (which looks like the correct way to credit the actual work).

maropu · 2019-02-11T04:25:18Z

Thanks, @mgaido91 and @HeartSaVioR! The fix @mgaido91 did looks OK to me. But I haven't reviewed the latest version yet; I'll do so in a few days.

cloud-fan · 2019-02-11T07:57:51Z

sql/core/src/main/scala/org/apache/spark/sql/execution/AliasAwareOutputPartitioning.scala

+ * caused by the rename of an attribute among the partitioning ones, eg.
+ *
+ * spark.range(10).selectExpr("id AS key", "0").repartition($"key").write.saveAsTable("df1")
+ * spark.range(10).selectExpr("id AS key", "0").repartition($"key").write.saveAsTable("df2")


do you mean view here? otherwise the physical plan doesn't match.

mgaido91 · 2019-02-11T08:12:21Z

@HeartSaVioR @maropu rebasing this PR on @maropu's change and working on top of it is non-trivial; I can instead close this and create a new one based on @maropu's branch if you prefer. Otherwise committers can add @maropu as author when this is merged, or @maropu can take this over and create a new PR based on this. Just let me know how you prefer to proceed, thanks.

cloud-fan · 2019-02-11T08:13:07Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala

+
+  override private[spark] def pruneInvalidAttribute(invalidAttr: Attribute): Partitioning = {
+    if (this.references.contains(invalidAttr)) {
+      UnknownPartitioning(numPartitions)


Let's add comments to explain it.

HashPartitioning('a, 'b) with output expressions 'a as 'a1, should produce UnknownPartitioning instead of HashPartitioning('a1), which is wrong.
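
Why a layout hashed on (a, b) cannot be reported as hashed on the renamed a alone can be seen in a small sketch. The hash function `partitionOf` below is a hypothetical stand-in chosen for illustration; it is not the hash Spark uses.

```scala
// Toy illustration: rows are (a, b) pairs, spread over 2 partitions.
object HashPruneSketch {
  type Row = (Int, Int) // (a, b)
  val numPartitions = 2

  // Hypothetical stand-in for hashing the key columns; not Spark's hash.
  def partitionOf(keys: Seq[Int]): Int =
    Math.floorMod(keys.foldLeft(17)((h, k) => 31 * h + k), numPartitions)

  val rows: Seq[Row] = Seq((1, 0), (1, 1), (2, 0), (2, 1))

  // Under hashing on (a, b), do all rows sharing the same `a` end up in a
  // single partition? That is what hashing on `a` alone would guarantee.
  val clusteredByA: Boolean = rows.groupBy(_._1).values.forall { group =>
    group.map { case (a, b) => partitionOf(Seq(a, b)) }.distinct.size == 1
  }
}
```

Here rows with the same a land in different partitions, so once b is projected away nothing can be claimed about a by itself, which is consistent with falling back to UnknownPartitioning.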

cloud-fan · 2019-02-11T08:15:50Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala

+
+  override private[spark] def pruneInvalidAttribute(invalidAttr: Attribute): Partitioning = {
+    if (this.references.contains(invalidAttr)) {
+      val validExprs = this.children.takeWhile(!_.references.contains(invalidAttr))


this.children -> ordering?

cloud-fan · 2019-02-11T08:19:56Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala

+      if (validExprs.isEmpty) {
+        UnknownPartitioning(numPartitions)
+      } else {
+        RangePartitioning(validExprs, numPartitions)


think about RangePartitioning('a.ASC, 'b.ASC) with output expression 'a as 'a1.

It cannot satisfy ClusteredDistribution('a1), but can still satisfy OrderedDistribution('a1.ASC). I think the expected result should be RangePartitioning('a1.ASC, 'b.ASC) instead of RangePartitioning('a1.ASC), which is wrong.

Why doesn't it satisfy ClusteredDistribution('a1)? I don't agree with what you stated. If b is not in the output, it is useless to have it there. Moreover, when ordering we always order by the first attribute, then by the second, and so on. So if something is partitioned by RangePartitioning('a1.ASC, 'b.ASC), it is also true that its partitioning is RangePartitioning('a1.ASC). So I think that in that case RangePartitioning('a1.ASC) is the right one.

according to RangePartitioning.satisfies0, and the classdoc of RangePartitioning and ClusteredDistribution, RangePartitioning('a.ASC, 'b.ASC) does not satisfy ClusteredDistribution('a).

Hmm, I am not sure what you mean by referring to the classdoc of these 2 classes; I see nothing there about this. Anyway, I see that the implementation behaves as you state, but I do believe that it is wrong (or at least suboptimal, if you prefer). If the data is partitioned by sorting it with a.ASC, b.ASC, it is definitely partitioned by sorting it with a.ASC. I think that forall should be an exists. There is also a (very minor) bug in the current implementation; try running this test (it fails):

test("partitioning test") {
  val attr1 = AttributeReference("attr1", IntegerType)()
  val attr2 = AttributeReference("attr2", IntegerType)()
  val partitioning = RangePartitioning(Seq.empty, 10)
  val requiredDistribution = ClusteredDistribution(Seq(attr2, attr1), Some(10))
  assert(!partitioning.satisfies(requiredDistribution))
}

please carefully read the classdoc of RangePartitioning and ClusteredDistribution, and see what RangePartitioning guarantees and what ClusteredDistribution requires.

partition 1: (a=1, b=2), (a=1, b=3)
partition 2: (a=1, b=4), (a=1, b=5)

This data set is range partitioned by a,b, but not clustered by a.

ah I see now, thanks. Let me update it then accordingly, thanks.
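
The counter-example in this thread can be checked mechanically. The sketch below uses plain Scala collections; `isRangePartitionedBy` and `isClusteredBy` are hypothetical helpers written for this illustration, encoding "adjacent partitions are ordered" and "equal keys share one partition" respectively, and are not Spark APIs.

```scala
// Toy check of the data set from the discussion: rows are (a, b) pairs.
object RangeVsClusteredSketch {
  type Row = (Int, Int) // (a, b)

  val partitions: Seq[Seq[Row]] = Seq(
    Seq((1, 2), (1, 3)), // partition 1
    Seq((1, 4), (1, 5))  // partition 2
  )

  // Range partitioned by `key`: every row in a partition sorts before or
  // equal to every row in the next partition.
  def isRangePartitionedBy[K: Ordering](parts: Seq[Seq[Row]])(key: Row => K): Boolean = {
    val ord = implicitly[Ordering[K]]
    val nonEmpty = parts.filter(_.nonEmpty)
    nonEmpty.zip(nonEmpty.drop(1)).forall { case (lower, upper) =>
      ord.lteq(lower.map(key).max, upper.map(key).min)
    }
  }

  // Clustered by `key`: all rows sharing a key value sit in one partition.
  def isClusteredBy[K](parts: Seq[Seq[Row]])(key: Row => K): Boolean = {
    val partitionsPerKey = parts.zipWithIndex
      .flatMap { case (rows, idx) => rows.map(r => key(r) -> idx) }
      .distinct
      .groupBy(_._1)
    partitionsPerKey.values.forall(_.size == 1)
  }
}
```

For this data the layout is range partitioned by (a, b), yet it is not clustered by a, because the value a=1 is spread over both partitions.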

cloud-fan · 2019-02-11T08:29:20Z

sql/core/src/main/scala/org/apache/spark/sql/execution/AliasAwareOutputPartitioning.scala

+  final override def outputPartitioning: Partitioning = {
+    child.outputPartitioning match {
+      case partitioning: Expression =>
+        val exprToEquiv = partitioning.references.map { attr =>


can you explain what's going on here? The code is a little hard to follow.

sure, let me add some comments. Thanks.
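
As a rough idea of what alias-aware output partitioning has to do, here is a simplified sketch. It is not the code of this PR: the types and the `outputPartitioning` function below are hypothetical, partitioning keys are reduced to column names, and only hash partitioning is modelled.

```scala
// Simplified, hypothetical model of rewriting a child's partitioning
// through a projection; NOT the implementation under review.
object AliasAwarePartitioningSketch {
  // Output column of a projection: a pass-through attribute or a rename.
  sealed trait NamedExpr { def name: String }
  final case class Attr(name: String) extends NamedExpr
  final case class Alias(child: Attr, name: String) extends NamedExpr

  sealed trait Partitioning
  final case class HashPartitioning(keys: Seq[String]) extends Partitioning
  case object UnknownPartitioning extends Partitioning

  def outputPartitioning(child: Partitioning, projectList: Seq[NamedExpr]): Partitioning =
    child match {
      case HashPartitioning(keys) =>
        val passedThrough = projectList.collect { case Attr(n) => n }.toSet
        val renamed = projectList.collect { case Alias(Attr(from), to) => from -> to }.toMap
        // Keep the original attribute if it is still output, else use its alias.
        val rewritten = keys.map { k =>
          if (passedThrough(k)) Some(k) else renamed.get(k)
        }
        // If any key is neither output nor aliased, nothing can be claimed.
        if (rewritten.forall(_.isDefined)) HashPartitioning(rewritten.flatten)
        else UnknownPartitioning
      case other => other
    }
}
```

In this model a projection `a AS b` over data hashed on a is reported as hashed on b, a projection `a AS b, a` keeps the partitioning on a, and dropping one of several keys yields UnknownPartitioning.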

SparkQA · 2019-02-11T13:39:59Z

Test build #102193 has finished for PR 22957 at commit df3394c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-02-11T20:21:45Z

Test build #102201 has finished for PR 22957 at commit 78d92bc.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-04-10T18:10:30Z

Test build #104484 has finished for PR 22957 at commit 47a8f71.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

mgaido91 · 2019-04-12T08:39:40Z

@cloud-fan @maropu sorry for pinging you again. I think I have addressed all your comments on this; could you please check it again? Thanks.

github-actions · 2020-01-05T00:07:30Z

We're closing this PR because it hasn't been updated in a while.
This isn't a judgement on the merit of the PR in any way. It's just
a way of keeping the PR queue manageable.

If you'd like to revive this PR, please reopen it!

[SPARK-25951][SQL] Remove Alias when canonicalize

b566818

introduce sameResult

2b00f35

mgaido91 changed the title ~~[SPARK-25951][SQL] Remove Alias when canonicalize~~ [SPARK-25951][SQL] Ignore aliases for distributions and orderings Nov 7, 2018

cloud-fan reviewed Nov 28, 2018

mgaido91 added 3 commits November 28, 2018 14:14

address comment: add comments

0491249

address comment

3831be0

improve comments

6c93e70

fix recursive aliases

a306465

cloud-fan reviewed Nov 29, 2018

fix trimAliases

13aef71

cloud-fan reviewed Nov 30, 2018

viirya reviewed Nov 30, 2018

use maropu's approach

09b9981

fix

69f9d5e

fix ut failures

75ef545

cloud-fan reviewed Feb 11, 2019

adress comments

df3394c

fix rangepartitioning

78d92bc

Merge branch 'master' into SPARK-25951

47a8f71

dongjoon-hyun added the SQL label Jun 14, 2019

maropu mentioned this pull request Dec 20, 2019

[SPARK-30298][SQL] Respect aliases in output partitioning of projects and aggregates #26943

Closed

github-actions bot added the Stale label Jan 5, 2020

github-actions bot closed this Jan 6, 2020
