Skip to content

Conversation

@cloud-fan
Copy link
Contributor

@cloud-fan cloud-fan commented Feb 23, 2023

What changes were proposed in this pull request?

This is a follow-up of #37525 . When the project list has aliases, we go to the projectExpression branch which filters away invalid partitioning/ordering that reference non-existing attributes in the current plan node. However, this filtering is missing when the project list has no alias, where we directly return the child partitioning/ordering.

This PR fixes it.

Why are the changes needed?

to make sure we always return valid output partitioning/ordering.

Does this PR introduce any user-facing change?

no

How was this patch tested?

new tests

@cloud-fan
Copy link
Contributor Author

cc @peter-toth

@github-actions github-actions bot added the SQL label Feb 23, 2023
Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @sunchao , @huaxingao , @viirya , too.

Copy link
Contributor

@peter-toth peter-toth Feb 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Thanks for this follow-up!

But, I'm not sure this part is correct as sortOrder.children should be filtered with _.references.subsetOf(outputSet) separately.

E.g. this test (is a bit artifical though) fails now:

  test("SPARK-42049: Improve AliasAwareOutputExpression - no alias but still prune expressions 2") {
    withSQLConf(SQLConf.OPTIMIZER_EXCLUDED_RULES.key ->
      Seq(CollapseProject.ruleName, ColumnPruning.ruleName).mkString(",")) {
      val df = spark.range(2).select($"id" as "a", $"id" as "b").select($"a")
      val outputOrdering = df.queryExecution.optimizedPlan.outputOrdering

      assert(outputOrdering.size == 1)
      assert(outputOrdering.head.child.asInstanceOf[Attribute].name == "a")
      assert(outputOrdering.head.sameOrderExpressions.size == 0)
    }
  }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point!

@viirya
Copy link
Member

viirya commented Feb 23, 2023

Hmm, the failed test seems a related one.

@dongjoon-hyun
Copy link
Member

Oh is it still failing?

@viirya
Copy link
Member

viirya commented Feb 23, 2023

I think it was failing due to latest commit.

@cloud-fan
Copy link
Contributor Author

The failed test checks invalid ordering and I've updated it.

dongjoon-hyun pushed a commit that referenced this pull request Feb 24, 2023
…itioning

### What changes were proposed in this pull request?

This is a follow-up of #37525 . When the project list has aliases, we go to the `projectExpression` branch which filters away invalid partitioning/ordering that reference non-existing attributes in the current plan node. However, this filtering is missing when the project list has no alias, where we directly return the child partitioning/ordering.

This PR fixes it.

### Why are the changes needed?

to make sure we always return valid output partitioning/ordering.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

new tests

Closes #40137 from cloud-fan/alias.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 72922ad)
Signed-off-by: Dongjoon Hyun <[email protected]>
@dongjoon-hyun
Copy link
Member

Merged to master/3.4.
Thank you, @cloud-fan , @peter-toth , @viirya .

snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023
…itioning

### What changes were proposed in this pull request?

This is a follow-up of apache#37525 . When the project list has aliases, we go to the `projectExpression` branch which filters away invalid partitioning/ordering that reference non-existing attributes in the current plan node. However, this filtering is missing when the project list has no alias, where we directly return the child partitioning/ordering.

This PR fixes it.

### Why are the changes needed?

to make sure we always return valid output partitioning/ordering.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

new tests

Closes apache#40137 from cloud-fan/alias.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 72922ad)
Signed-off-by: Dongjoon Hyun <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants