[SPARK-33071][SPARK-33536][SQL] Avoid changing dataset_id of LogicalPlan in join() to not break DetectAmbiguousSelfJoin #30488
Conversation
cc @cloud-fan @xuanyuanking Could you take a look? Thanks!
Test build #131668 has finished for PR 30488 at commit
Test build #131728 has finished for PR 30488 at commit
Seems the failed UT is related.
Yeah...I fixed it just now.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala
Test build #131778 has finished for PR 30488 at commit
Force-pushed: 5cff25f to 85f6f12
Test build #131994 has finished for PR 30488 at commit
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala
sql/catalyst/src/main/scala/org/apache/spark/sql/types/Metadata.scala
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala
Test build #132041 has finished for PR 30488 at commit
retest this please
Test build #132054 has finished for PR 30488 at commit
GA passed, merging to master, thanks!
Test build #132059 has finished for PR 30488 at commit
…to nonInheritableMetadataKeys in Alias

### What changes were proposed in this pull request?
This PR is a followup of #30488. This PR proposes to rename `Alias.deniedMetadataKeys` to `Alias.nonInheritableMetadataKeys` to make it less confusing.

### Why are the changes needed?
To make it easier to maintain and read.

### Does this PR introduce _any_ user-facing change?
No. This is rather a code cleanup.

### How was this patch tested?
Ran the unittests written in the previous PR manually. Jenkins and GitHub Actions in this PR should also test them.

Closes #30682 from HyukjinKwon/SPARK-33071-SPARK-33536.
Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit b5399d4)
Signed-off-by: HyukjinKwon <[email protected]>
…lan in join() to not break DetectAmbiguousSelfJoin
Currently, `join()` uses `withPlan(logicalPlan)` as a convenient way to call some Dataset functions. However, this makes the `dataset_id` inconsistent between the `logicalPlan` and the original `Dataset`, because `withPlan(logicalPlan)` creates a new Dataset with a new id and resets the `dataset_id` of the `logicalPlan` to that new id. As a result, it breaks the rule `DetectAmbiguousSelfJoin`.
In this PR, we propose to stop using `withPlan` and to use the `logicalPlan` directly, so that its `dataset_id` doesn't change.
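The id mismatch described above can be pictured with a small toy model. This is only a sketch under stated assumptions: `ToyPlan`, `ToyDataset`, `datasetIdTag` and `colRefId` are hypothetical names invented for illustration and are not Spark classes or APIs. The only behaviour modelled is the one the description states: constructing a new Dataset around a plan stamps a fresh id onto that plan.

```scala
import java.util.concurrent.atomic.AtomicLong

// Toy stand-in, NOT a Spark class: a plan carrying a mutable dataset-id tag.
final class ToyPlan(val name: String) {
  var datasetIdTag: Option[Long] = None
}

object ToyDataset {
  // Global counter that hands out a fresh id to every new ToyDataset.
  private val nextId = new AtomicLong(0L)
}

// Toy stand-in, NOT a Spark class: construction allocates a fresh id
// and stamps it onto the wrapped plan.
final class ToyDataset(val plan: ToyPlan) {
  val id: Long = ToyDataset.nextId.getAndIncrement()
  plan.datasetIdTag = Some(id)

  // A column reference remembers the id of the dataset that produced it.
  def colRefId: Long = id

  // Models the old pattern: wrapping the same plan again re-stamps its tag.
  def withPlan(p: ToyPlan): ToyDataset = new ToyDataset(p)
}

object DatasetIdDemo {
  def main(args: Array[String]): Unit = {
    val original = new ToyDataset(new ToyPlan("emp1"))
    val ref = original.colRefId

    // Before wrapping, the column reference and the plan agree on the id.
    assert(original.plan.datasetIdTag.contains(ref))

    // Wrapping the same plan overwrites the tag with a new id ...
    original.withPlan(original.plan)

    // ... so a rule matching column references to plans by id no longer
    // finds the original reference.
    assert(!original.plan.datasetIdTag.contains(ref))
  }
}
```

In this toy model, using `original.plan` directly instead of calling `withPlan` leaves the tag untouched, which is the effect the PR aims for.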
Besides, this PR also removes the related metadata (`DATASET_ID_KEY`, `COL_POS_KEY`) when an `Alias` constructs its own metadata, because the `Alias` is no longer a reference column after it is converted to an `Attribute`. To achieve that, we add a new field, `deniedMetadataKeys`, to indicate the metadata that needs to be removed.
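The metadata filtering can be sketched in plain Scala as follows. This is a minimal illustration, not the code in `namedExpressions.scala`: metadata is modelled as an immutable `Map`, `aliasMetadata` is a hypothetical helper, and the key strings are placeholders assumed for the example rather than Spark's real constant values.

```scala
object AliasMetadataSketch {
  // Toy metadata: a plain immutable map instead of Spark's Metadata class.
  type ToyMetadata = Map[String, Any]

  // Placeholder key strings, assumed for illustration only.
  val DatasetIdKey = "dataset_id_key"
  val ColPosKey = "col_pos_key"

  // Hypothetical helper: the alias inherits the child's metadata
  // except for the entries whose keys are denied.
  def aliasMetadata(child: ToyMetadata, deniedKeys: Seq[String]): ToyMetadata =
    child.filter { case (key, _) => !deniedKeys.contains(key) }

  def main(args: Array[String]): Unit = {
    val childMeta: ToyMetadata =
      Map(DatasetIdKey -> 7L, ColPosKey -> 0, "comment" -> "employee key")

    val inherited = aliasMetadata(childMeta, Seq(DatasetIdKey, ColPosKey))

    // Only the unrelated entry survives; the reference-tracking keys are gone.
    assert(inherited == Map("comment" -> "employee key"))
    println(inherited)
  }
}
```

Passing the denied keys as a parameter, rather than hard-coding them, mirrors the idea of a `deniedMetadataKeys` field: the caller decides which entries must not be inherited.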
The query below returns a wrong result, when it should throw an ambiguous self-join exception instead:
```scala
val emp1 = Seq[TestData](
  TestData(1, "sales"),
  TestData(2, "personnel"),
  TestData(3, "develop"),
  TestData(4, "IT")).toDS()
val emp2 = Seq[TestData](
  TestData(1, "sales"),
  TestData(2, "personnel"),
  TestData(3, "develop")).toDS()
// emp3 is derived from emp1, so joining emp1 with emp3 is a self join.
val emp3 = emp1.join(emp2, emp1("key") === emp2("key")).select(emp1("*"))
emp1.join(emp3, emp1.col("key") === emp3.col("key"), "left_outer")
  .select(emp1.col("*"), emp3.col("key").as("e2")).show()

// wrong result
// +---+---------+---+
// |key|    value| e2|
// +---+---------+---+
// |  1|    sales|  1|
// |  2|personnel|  2|
// |  3|  develop|  3|
// |  4|       IT|  4|
// +---+---------+---+
```
This PR fixes the wrong behaviour.
User-facing change: yes, after this PR users hit the exception instead of getting the wrong result.
Testing: added a new unit test.
Closes apache#30488 from Ngone51/fix-self-join.
Authored-by: yi.wu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request?
Currently, `join()` uses `withPlan(logicalPlan)` as a convenient way to call some Dataset functions. However, this makes the `dataset_id` inconsistent between the `logicalPlan` and the original `Dataset`, because `withPlan(logicalPlan)` creates a new Dataset with a new id and resets the `dataset_id` of the `logicalPlan` to that new id. As a result, it breaks the rule `DetectAmbiguousSelfJoin`.

In this PR, we propose to stop using `withPlan` and to use the `logicalPlan` directly, so that its `dataset_id` doesn't change.

Besides, this PR also removes the related metadata (`DATASET_ID_KEY`, `COL_POS_KEY`) when an `Alias` constructs its own metadata, because the `Alias` is no longer a reference column after it is converted to an `Attribute`. To achieve that, we add a new field, `deniedMetadataKeys`, to indicate the metadata that needs to be removed.

### Why are the changes needed?
The query shown in the commit message above returns a wrong result, when it should throw an ambiguous self-join exception instead. This PR fixes the wrong behaviour.

### Does this PR introduce _any_ user-facing change?
Yes, users hit the exception instead of the wrong result after this PR.

### How was this patch tested?
Added a new unit test.