
Conversation

@mgaido91
Contributor

@mgaido91 mgaido91 commented Nov 6, 2018

What changes were proposed in this pull request?

When we canonicalize an Expression, we do not remove Alias. So two expressions which are the same apart from a rename are considered semantically different. As we rely on semantic equality to check whether 2 expressions return the same result, some optimizations - as shown in the JIRA - may fail to apply, e.g. removing a redundant shuffle when a column is renamed.

The PR proposes to ignore Aliases when checking whether distributions and orderings are satisfied, by introducing a new method sameResult which performs the comparison ignoring Aliases.

Credit should be given to @maropu for suggesting the approach, which follows #17400.

Closes #17400.

How was this patch tested?

Added a unit test.

@SparkQA

SparkQA commented Nov 6, 2018

Test build #98521 has finished for PR 22957 at commit b566818.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91 mgaido91 changed the title [SPARK-25951][SQL] Remove Alias when canonicalize [SPARK-25951][SQL] Ignore aliases for distributions and orderings Nov 7, 2018
@SparkQA

SparkQA commented Nov 7, 2018

Test build #98552 has finished for PR 22957 at commit 2b00f35.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Nov 7, 2018

I didn't look at your new code, but is your old code safe? E.g. with a project that depends on the new alias.

@mgaido91
Contributor Author

mgaido91 commented Nov 8, 2018

Thanks for your comment @rxin. It was safe for comparisons (i.e. answering: do these 2 expressions return the same data?), because all the AttributeReferences contain the exprId they refer to, so a removed Alias would still have its exprId present in all the AttributeReferences pointing to it. But if this was used to decide which expressions to replace and then to modify the plan (as is done in PhysicalAggregation), then it is not safe, because it can cause some Alias to go missing from the resulting plan, leading to an invalid one. So in general it was not safe considering all the usages of semanticEquals, but it is safe if we only want to know whether the returned data is the same. Hope this answer is clear enough.
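To make the point concrete, here is a toy sketch (the model and names are hypothetical, not Spark's actual Catalyst API) of why dropping an Alias during a plan rewrite can leave a dangling reference:

```scala
// Toy sketch (hypothetical model, not Spark's Catalyst API) of why dropping
// an Alias during a plan rewrite is unsafe: downstream AttributeReferences
// point at the Alias's exprId, so removing the Alias leaves them dangling.
case class Attr(name: String, exprId: Long)

// A projection producing "a AS b": input attribute has exprId 1,
// the Alias's output has exprId 2.
val projectOutputWithAlias = Set(1L, 2L)   // Alias kept: both ids resolvable
val projectOutputWithoutAlias = Set(1L)    // Alias dropped by the rewrite

val downstreamRef = Attr("b", exprId = 2L) // a reference to the Alias's output

assert(projectOutputWithAlias.contains(downstreamRef.exprId))
// After dropping the Alias, exprId 2 no longer resolves: the plan is invalid.
assert(!projectOutputWithoutAlias.contains(downstreamRef.exprId))
```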

@mgaido91
Contributor Author

cc @cloud-fan @gatorsmile

* different output expressions can evaluate to the same result as well (eg. when an expression
* is aliased).
*/
def sameResult(other: Expression): Boolean = other match {
Contributor

@cloud-fan cloud-fan Nov 28, 2018

I know it's always safer to introduce a new API, but is it really necessary? In Canonicalize, we erase the name for attributes; I think it's reasonable to erase the name of Alias too, as it doesn't affect the output.

Contributor Author

That is reasonable, but it doesn't solve the problem stated in the JIRA. The goal here is to avoid something like `a as b` being considered different from `a` in terms of ordering/distribution. If we just erase the name of the Alias, the 2 expressions would still be different, because the presence of the Alias node itself makes them different.

Contributor

"erase the name" can also mean remove Alias. If we can't clearly tell the difference between semanticEquals and sameResult, and give a guideline about using which one in which case, I think we should just update semanticEquals(i.e. Canonicalize).

Contributor Author

Removing Alias is not possible for the reason explained in #22957 (comment). In general, semanticEquals should be used when we want to replace an expression with another, while sameResult should be used to check that 2 expressions return the same output.
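As a sketch of this guideline, on a minimal toy expression model (hypothetical names, not Spark's actual Catalyst API): semanticEquals compares expressions as they stand, while sameResult first strips Aliases from both sides:

```scala
// A minimal toy model of expressions (hypothetical names, not Spark's actual
// API) illustrating the proposed distinction: semanticEquals compares plans
// as-is, while sameResult ignores Aliases on both sides.
sealed trait Expr
case class Attr(name: String, exprId: Long) extends Expr
case class Alias(child: Expr, name: String, exprId: Long) extends Expr

def trimAliases(e: Expr): Expr = e match {
  case Alias(child, _, _) => trimAliases(child)
  case other => other
}

// Stand-in for canonicalized comparison: plain structural equality here.
def semanticEquals(a: Expr, b: Expr): Boolean = a == b

// sameResult: strip Aliases on both sides, then compare semantically.
def sameResult(a: Expr, b: Expr): Boolean =
  semanticEquals(trimAliases(a), trimAliases(b))

val a = Attr("a", 1L)
val aAsB = Alias(a, "b", 2L)     // "a AS b"
assert(!semanticEquals(a, aAsB)) // different from a plan perspective
assert(sameResult(a, aAsB))      // but they evaluate to the same data
```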

Contributor

Can you put it in the method doc (both semanticEquals and sameResult)? This makes sense to me.

Contributor Author

Sure, thanks.

*/
def sameResult(other: Expression): Boolean = other match {
  case a: Alias => sameResult(a.child)
  case _ => this.semanticEquals(other)
Contributor

Can we also strip the alias of `this` here, so that we can mark sameResult as final?

Contributor Author

I think it is doable, but I didn't want to add matches where they were not needed. But if you prefer it that way, I can try and do that.

Contributor Author

Well, it needs to be overridden by HashPartitioning too, so since I am not able to make it final anyway, I don't think it is a good idea. I could add a match on HashPartitioning too, but that doesn't seem a clean solution to me.

Contributor

we can do

CleanupAliases.trimAliases(this) semanticEquals CleanupAliases.trimAliases(other)

@SparkQA

SparkQA commented Nov 28, 2018

Test build #99377 has finished for PR 22957 at commit 6c93e70.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 28, 2018

Test build #99376 has finished for PR 22957 at commit 3831be0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 28, 2018

Test build #99375 has finished for PR 22957 at commit 0491249.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Nov 29, 2018

Test build #99424 has finished for PR 22957 at commit 6c93e70.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 29, 2018

Test build #99446 has finished for PR 22957 at commit a306465.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Contributor Author

retest this please

@SparkQA

SparkQA commented Nov 29, 2018

Test build #99448 has finished for PR 22957 at commit a306465.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

e.transformDown {
case Alias(child, _) => child
case MultiAlias(child, _) => child
case Alias(child, _) => trimAliases(child)
Contributor

what's going on here?

Contributor Author

The point is that this method currently removes only the first Alias it finds (it doesn't recurse into the result), which is the reason for the UT failure. Also, judging by the comment on the method, this does not seem to be its expected behavior.

Contributor

it's transformDown, why doesn't it work?

Contributor Author

Ah, I did a stupid thing here. The problem is that, since the rule returns child, transformDown then applies the rule to child's children instead of to child itself. So the issue shows up with 2 consecutive Aliases. Let me find a better fix.

Contributor Author

Just using transformUp solves the issue.
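To illustrate why transformDown misses consecutive Aliases while transformUp does not, here is a self-contained toy tree transform (a hypothetical model, not Spark's actual TreeNode API): when a top-down rule replaces a node, the transform recurses into the new node's children without re-applying the rule to the new node itself:

```scala
// Toy tree transform (hypothetical model, not Spark's TreeNode API) showing
// the transformDown pitfall with consecutive Aliases.
sealed trait Expr {
  def children: Seq[Expr]
  def withChildren(cs: Seq[Expr]): Expr
}
case class Attr(name: String) extends Expr {
  def children: Seq[Expr] = Nil
  def withChildren(cs: Seq[Expr]): Expr = this
}
case class Alias(child: Expr, name: String) extends Expr {
  def children: Seq[Expr] = Seq(child)
  def withChildren(cs: Seq[Expr]): Expr = copy(child = cs.head)
}

// Apply the rule to the node, then recurse into the *result's* children.
def transformDown(e: Expr)(rule: PartialFunction[Expr, Expr]): Expr = {
  val applied = rule.applyOrElse(e, identity[Expr])
  applied.withChildren(applied.children.map(transformDown(_)(rule)))
}

// Recurse into the children first, then apply the rule to the node.
def transformUp(e: Expr)(rule: PartialFunction[Expr, Expr]): Expr = {
  val newChildren = e.withChildren(e.children.map(transformUp(_)(rule)))
  rule.applyOrElse(newChildren, identity[Expr])
}

val strip: PartialFunction[Expr, Expr] = { case Alias(child, _) => child }
val nested = Alias(Alias(Attr("a"), "b"), "c")

// transformDown replaces Alias "c" with Alias "b", then recurses into
// Alias "b"'s child Attr "a" -- the inner Alias "b" itself is skipped.
assert(transformDown(nested)(strip) == Alias(Attr("a"), "b"))
// transformUp strips from the leaves upward, so both Aliases go away.
assert(transformUp(nested)(strip) == Attr("a"))
```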

@SparkQA

SparkQA commented Nov 29, 2018

Test build #99449 has finished for PR 22957 at commit a306465.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 29, 2018

Test build #99462 has finished for PR 22957 at commit 13aef71.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}
}

test("SPARK-25951: avoid redundant shuffle on rename") {
Contributor

can we have an end-to-end test as well?

Member

+1 if possible.

Contributor Author

Ah, good point, and indeed very useful. In my previous tests I always used a very simple query to verify this, never the one reported in the JIRA. Now I tried that one, and I realized that this fix is not very useful as of now, because with a rename like that, the HashPartitioning contains the AttributeReference to the Alias rather than the Alias itself. Since that is the common case, the PR as it stands is not very useful. If I can't figure out a good way to handle that, I am going to close this. Thanks and sorry for the trouble.

Contributor Author

@cloud-fan @viirya I added the test, but as I mentioned I had to make another change in order to make it work. Sorry for the mistake. I'd really appreciate it if you could review it again. Thanks.

@cloud-fan
Contributor

LGTM, cc @viirya as well


/**
* Returns true when two expressions will always compute the same result, even if the output may
* be different, because of different names or similar differences.
Member

I think here output is a bit confusing. Do we mean the output names?

Member

So sameResult returns if the evaluated results between two expressions are exactly the same?

Contributor Author

Yes, I mean: sameResult returns true if 2 expressions return the same data even though from a plan perspective they are not the same (e.g. the output names/exprIds are different, as in this case), while semanticEquals ensures they are the same from a plan perspective too. If you have better suggestions on how to rephrase this, I am happy to improve it. Thanks.

Member

How about replace output with output from plan perspective?

Returns true when two expressions will always compute the same result, even if the output
from plan perspective may be different, because of different names or similar differences.

@viirya
Member

viirya commented Nov 30, 2018

This looks good to me. Just a comment about wording.

@HeartSaVioR
Contributor

Sorry about jumping in, but it looks like we missed #17400 (SPARK-19981), which seems to contain the fix @cloud-fan suggests - reviewers just lost focus on it. To me #17400 looks very concise, so I'm curious how this patch differs from #17400.

@mgaido91
Contributor Author

mgaido91 commented Feb 7, 2019

@HeartSaVioR thanks for pointing that out! Yes, that fix is similar to this one. The main difference between the 2 PRs is that this one also handles the case where we have `select a as b, a, ...` with the partitioning on a, while #17400 doesn't (and hence may introduce a regression). Let me cc @maropu here too, since he worked on this and may have comments/suggestions, or he can update his PR to handle that case and we can focus on that PR.

@maropu
Member

maropu commented Feb 9, 2019

@mgaido91 Thanks for letting me know! Could you update this PR based on my fix? If you can, it's ok to close my PR #17400.

@mgaido91
Contributor Author

@maropu thanks for checking this. Do you mean using the trait approach? If so, sure, I am doing that. If not, please let me know. Thanks.

@SparkQA

SparkQA commented Feb 10, 2019

Test build #102147 has finished for PR 22957 at commit 09b9981.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait AliasAwareOutputPartitioning extends UnaryExecNode

@SparkQA

SparkQA commented Feb 10, 2019

Test build #102150 has finished for PR 22957 at commit 69f9d5e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 10, 2019

Test build #102154 has finished for PR 22957 at commit 75ef545.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor

HeartSaVioR commented Feb 10, 2019

Just my 2 cents: personally I understood @maropu's comment as taking over his PR, in other words, rebasing this branch onto his branch to retain his commits and adding @mgaido91's commit on top to fix the remaining issue. This would make it easier to give authorship of this PR to both @maropu and @mgaido91 (which looks like the correct way to give credit for the actual work).

@maropu
Member

maropu commented Feb 11, 2019

Thanks, @mgaido91 and @HeartSaVioR! The fix @mgaido91 did looks ok to me, but I haven't reviewed the latest version yet; I'll do so in a few days.

* caused by the rename of an attribute among the partitioning ones, eg.
*
* spark.range(10).selectExpr("id AS key", "0").repartition($"key").write.saveAsTable("df1")
* spark.range(10).selectExpr("id AS key", "0").repartition($"key").write.saveAsTable("df2")
Contributor

Do you mean a view here? Otherwise the physical plan doesn't match.

@mgaido91
Contributor Author

@HeartSaVioR @maropu rebasing this PR onto @maropu's change and working on it is non-trivial; I can instead close this and create a new one based on @maropu's branch if you prefer. Otherwise committers can add @maropu as an author when this is merged, or @maropu can take this over and create a new PR based on it. Just let me know how you prefer to proceed, thanks.


override private[spark] def pruneInvalidAttribute(invalidAttr: Attribute): Partitioning = {
  if (this.references.contains(invalidAttr)) {
    UnknownPartitioning(numPartitions)
Contributor

@cloud-fan cloud-fan Feb 11, 2019

Let's add comments to explain it.

HashPartitioning('a, 'b) with output expression 'a as 'a1 should produce UnknownPartitioning instead of HashPartitioning('a1), which would be wrong.
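A minimal sketch of that rule, on a toy model with hypothetical names (not Spark's API): any reference to a pruned attribute invalidates the whole hash partitioning, since hashing on a subset of the original keys yields a different row-to-partition mapping:

```scala
// Toy sketch (hypothetical names, not Spark's API) of the pruning rule:
// HashPartitioning on ('a, 'b) must degrade to UnknownPartitioning when 'b
// disappears from the output -- keeping HashPartitioning('a) would claim a
// clustering the data does not actually have.
sealed trait Partitioning
case class HashPartitioning(keys: Seq[String], numPartitions: Int) extends Partitioning
case class UnknownPartitioning(numPartitions: Int) extends Partitioning

def pruneInvalidAttribute(p: HashPartitioning, invalidAttr: String): Partitioning =
  if (p.keys.contains(invalidAttr)) UnknownPartitioning(p.numPartitions)
  else p

assert(pruneInvalidAttribute(HashPartitioning(Seq("a", "b"), 10), "b") ==
  UnknownPartitioning(10))
assert(pruneInvalidAttribute(HashPartitioning(Seq("a"), 10), "b") ==
  HashPartitioning(Seq("a"), 10))
```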


override private[spark] def pruneInvalidAttribute(invalidAttr: Attribute): Partitioning = {
  if (this.references.contains(invalidAttr)) {
    val validExprs = this.children.takeWhile(!_.references.contains(invalidAttr))
Contributor

this.children -> ordering?

    if (validExprs.isEmpty) {
      UnknownPartitioning(numPartitions)
    } else {
      RangePartitioning(validExprs, numPartitions)
Contributor

@cloud-fan cloud-fan Feb 11, 2019

think about RangePartitioning('a.ASC, 'b.ASC) with output expression 'a as 'a1.

It cannot satisfy ClusteredDistribution('a1), but can still satisfy OrderedDistribution('a1.ASC). I think the expected result should be RangePartitioning('a1.ASC, 'b.ASC) instead of RangePartitioning('a1.ASC), which is wrong.

Contributor Author

Why doesn't it satisfy ClusteredDistribution('a1)? I don't agree with what you stated. If b is not in the output, it is useless to have it there. Moreover, when ordering, we always order by the first attribute, then by the second, and so on. So if something is partitioned by RangePartitioning('a1.ASC, 'b.ASC), it is also true that it is partitioned by RangePartitioning('a1.ASC). So I think that in this case RangePartitioning('a1.ASC) is the right one.

Contributor

@cloud-fan cloud-fan Feb 11, 2019

According to RangePartitioning.satisfies0 and the classdoc of RangePartitioning and ClusteredDistribution, RangePartitioning('a.ASC, 'b.ASC) does not satisfy ClusteredDistribution('a).

Contributor Author

Mmmh, I am not sure what you mean by referring to the classdoc of these 2 classes; I see nothing there about this. Anyway, I see that the implementation is done according to what you state, but I believe it is wrong (or at least suboptimal, if you prefer). If the data is partitioned by sorting it with a.ASC, b.ASC, it is definitely partitioned by sorting it with a.ASC. I think that forall should be an exists. There is also a (very minor) bug in the current implementation; try running this test (it fails...):

  test("partitioning test") {
    val attr1 = AttributeReference("attr1", IntegerType)()
    val attr2 = AttributeReference("attr2", IntegerType)()
    val partitioning = RangePartitioning(Seq.empty, 10)
    val requiredDistribution = ClusteredDistribution(Seq(attr2, attr1), Some(10))
    assert(!partitioning.satisfies(requiredDistribution))
  }

Contributor

please carefully read the classdoc of RangePartitioning and ClusteredDistribution, and see what RangePartitioning guarantees and what ClusteredDistribution requires.

partition 1: (a=1, b=2), (a=1, b=3)
partition 2: (a=1, b=4), (a=1, b=5)

This data set is range partitioned by (a, b), but not clustered by a.
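The counterexample can be checked mechanically (a toy sketch, not Spark code): the rows are globally sorted by (a, b), so the data is range partitioned by (a, b), yet rows with a == 1 span both partitions, so it is not clustered by a:

```scala
// Mechanical check of the counterexample above: range partitioned by (a, b)
// does not imply clustered by a.
val partition1 = Seq((1, 2), (1, 3))
val partition2 = Seq((1, 4), (1, 5))
val all = partition1 ++ partition2

// Concatenating the partitions in order yields a globally sorted sequence,
// i.e. the data is range partitioned by (a, b).
assert(all == all.sortBy(identity))

// But the value a == 1 appears in both partitions, so rows sharing the same
// a are NOT all in the same partition: not clustered by a.
val aValuesSpanningPartitions =
  partition1.map(_._1).toSet intersect partition2.map(_._1).toSet
assert(aValuesSpanningPartitions == Set(1))
```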

Contributor Author

Ah, I see now. Let me update it accordingly, thanks.

final override def outputPartitioning: Partitioning = {
  child.outputPartitioning match {
    case partitioning: Expression =>
      val exprToEquiv = partitioning.references.map { attr =>
Contributor

can you explain what's going on here? The code is a little hard to follow.

Contributor Author

sure, let me add some comments. Thanks.

@SparkQA

SparkQA commented Feb 11, 2019

Test build #102193 has finished for PR 22957 at commit df3394c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 11, 2019

Test build #102201 has finished for PR 22957 at commit 78d92bc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 10, 2019

Test build #104484 has finished for PR 22957 at commit 47a8f71.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Contributor Author

@cloud-fan @maropu sorry for pinging you again; I think I addressed all your comments on this. Could you please check it again? Thanks.

@github-actions

github-actions bot commented Jan 5, 2020

We're closing this PR because it hasn't been updated in a while.
This isn't a judgement on the merit of the PR in any way. It's just
a way of keeping the PR queue manageable.

If you'd like to revive this PR, please reopen it!

@github-actions github-actions bot added the Stale label Jan 5, 2020
@github-actions github-actions bot closed this Jan 6, 2020
9 participants