@Ngone51
Member

@Ngone51 Ngone51 commented Nov 24, 2020

What changes were proposed in this pull request?

Currently, join() calls withPlan(logicalPlan) as a convenient way to invoke some Dataset functions. However, this makes the dataset_id of the logicalPlan inconsistent with that of the original Dataset, because withPlan(logicalPlan) creates a new Dataset with a new id and resets the dataset_id of the logicalPlan to that new id. As a result, the rule DetectAmbiguousSelfJoin breaks.

In this PR, we propose to stop using withPlan and to use the logicalPlan directly, so that its dataset_id doesn't change.
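To illustrate the id drift described above, here is a minimal toy model in plain Scala. It is not Spark code: `Plan`, `Ds`, `datasetIdTag` and `colRefId` are hypothetical stand-ins invented for this sketch, and it only assumes that creating a dataset stamps its id onto the plan it wraps.

```scala
// Toy model only: these classes are hypothetical stand-ins,
// not Spark's Dataset or LogicalPlan.
import java.util.concurrent.atomic.AtomicLong

object DatasetIdSketch {
  private val nextId = new AtomicLong(0)

  // A "plan" that remembers which dataset last tagged it.
  final class Plan(val name: String) {
    var datasetIdTag: Option[Long] = None
  }

  // Every dataset gets a fresh id and stamps that id onto its plan on creation.
  final class Ds(val plan: Plan) {
    val id: Long = nextId.getAndIncrement()
    plan.datasetIdTag = Some(id)

    // Old approach: wrap the same plan in a new dataset, which re-stamps the plan.
    def withPlan(p: Plan): Ds = new Ds(p)

    // Column references carry the id of the dataset they were taken from.
    def colRefId: Long = id
  }

  def main(args: Array[String]): Unit = {
    val emp = new Ds(new Plan("emp"))
    val refId = emp.colRefId // reference created from the original dataset

    // Before the wrap, the plan tag and the reference agree.
    assert(emp.plan.datasetIdTag.contains(refId))

    // Roughly what join() did before the fix: same plan, new dataset.
    emp.withPlan(emp.plan)

    // The plan's tag has drifted away from the id stored in the reference.
    assert(!emp.plan.datasetIdTag.contains(refId))

    println(s"reference id = $refId, plan tag = ${emp.plan.datasetIdTag}")
  }
}
```

In this model, a check that matches column references against plan tags by id can no longer pair the reference with its plan after the wrap, which is the kind of mismatch the description attributes to withPlan.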

Besides, this PR also removes the related metadata (DATASET_ID_KEY, COL_POS_KEY) when an Alias constructs its own metadata, because the Alias is no longer a reference column after it is converted to an Attribute. To achieve that, we add a new field, deniedMetadataKeys, which lists the metadata keys to remove.
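The key-filtering idea can be sketched in plain Scala as follows. This is a toy model, not the actual Alias code: a `Map[String, String]` stands in for the metadata object, `inheritedMetadata` is a hypothetical helper, and the two key strings are placeholders assumed for illustration.

```scala
// Toy model only: Map[String, String] stands in for the real metadata type.
object AliasMetadataSketch {
  // Placeholder key names, assumed for illustration.
  val DatasetIdKey = "__dataset_id"
  val ColPosKey = "__col_position"

  // Copy the child's metadata, dropping the keys that must not be inherited.
  def inheritedMetadata(
      child: Map[String, String],
      deniedMetadataKeys: Seq[String]): Map[String, String] =
    child.filter { case (key, _) => !deniedMetadataKeys.contains(key) }

  def main(args: Array[String]): Unit = {
    val childMeta = Map(
      DatasetIdKey -> "7",
      ColPosKey -> "0",
      "comment" -> "employee key")

    // The alias keeps ordinary metadata but loses the self-join bookkeeping keys.
    val aliasMeta = inheritedMetadata(childMeta, Seq(DatasetIdKey, ColPosKey))
    assert(aliasMeta == Map("comment" -> "employee key"))
    println(aliasMeta)
  }
}
```

Passing an empty `deniedMetadataKeys` in this sketch returns the child's metadata unchanged, so the filtering is opt-in per alias.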

Why are the changes needed?

The query below returns a wrong result, whereas it should throw an ambiguous self-join exception instead:

val emp1 = Seq[TestData](
  TestData(1, "sales"),
  TestData(2, "personnel"),
  TestData(3, "develop"),
  TestData(4, "IT")).toDS()
val emp2 = Seq[TestData](
  TestData(1, "sales"),
  TestData(2, "personnel"),
  TestData(3, "develop")).toDS()
val emp3 = emp1.join(emp2, emp1("key") === emp2("key")).select(emp1("*"))
emp1.join(emp3, emp1.col("key") === emp3.col("key"), "left_outer")
  .select(emp1.col("*"), emp3.col("key").as("e2")).show()

// wrong result
+---+---------+---+
|key|    value| e2|
+---+---------+---+
|  1|    sales|  1|
|  2|personnel|  2|
|  3|  develop|  3|
|  4|       IT|  4|
+---+---------+---+

This PR fixes the wrong behaviour.

Does this PR introduce any user-facing change?

Yes. After this PR, users hit the ambiguous self-join exception instead of getting the wrong result.

How was this patch tested?

Added a new unit test.

@Ngone51
Member Author

Ngone51 commented Nov 24, 2020

cc @cloud-fan @xuanyuanking Could you take a look? Thanks!

@github-actions github-actions bot added the SQL label Nov 24, 2020
@SparkQA

SparkQA commented Nov 24, 2020

Test build #131668 has finished for PR 30488 at commit a617f94.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51 Ngone51 changed the title from "[SPARK-33536][SQL] Avoid changing dataset_id of LogicalPlan in join() to not break DetectAmbiguousSelfJoin" to "[SPARK-33071][SPARK-33536][SQL] Avoid changing dataset_id of LogicalPlan in join() to not break DetectAmbiguousSelfJoin" Nov 25, 2020
@SparkQA

SparkQA commented Nov 25, 2020

Test build #131728 has finished for PR 30488 at commit 05bca19.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xuanyuanking
Member

Seems the failed UT is related.

org.apache.spark.sql.DataFrameSelfJoinSuite.SPARK-28344: don't fail if there is no ambiguous self join

@Ngone51
Member Author

Ngone51 commented Nov 25, 2020

Yeah...I fixed it just now.

@SparkQA

SparkQA commented Nov 25, 2020

Test build #131778 has finished for PR 30488 at commit 5cff25f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 30, 2020

Test build #131994 has finished for PR 30488 at commit 85f6f12.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51 Ngone51 closed this Dec 1, 2020
@Ngone51 Ngone51 reopened this Dec 1, 2020
@SparkQA

SparkQA commented Dec 2, 2020

Test build #132041 has finished for PR 30488 at commit df04549.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Dec 2, 2020

Test build #132054 has finished for PR 30488 at commit e06b223.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

GA passed, merging to master, thanks!

@cloud-fan cloud-fan closed this in a082f46 Dec 2, 2020
@SparkQA

SparkQA commented Dec 2, 2020

Test build #132059 has finished for PR 30488 at commit d13b88c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

HyukjinKwon added a commit that referenced this pull request Dec 9, 2020
…to nonInheritableMetadataKeys in Alias

### What changes were proposed in this pull request?

This PR is a followup of #30488. This PR proposes to rename `Alias.deniedMetadataKeys` to `Alias.nonInheritableMetadataKeys` to make it less confusing.

### Why are the changes needed?

To make it easier to maintain and read.

### Does this PR introduce _any_ user-facing change?

No. This is rather a code cleanup.

### How was this patch tested?

Ran the unit tests written in the previous PR manually. Jenkins and GitHub Actions in this PR should also run them.

Closes #30682 from HyukjinKwon/SPARK-33071-SPARK-33536.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
HyukjinKwon added a commit that referenced this pull request Dec 9, 2020
…to nonInheritableMetadataKeys in Alias
(cherry picked from commit b5399d4)
Signed-off-by: HyukjinKwon <[email protected]>
a0x8o added a commit to a0x8o/spark that referenced this pull request Dec 9, 2020
…to nonInheritableMetadataKeys in Alias
rshkv pushed a commit to palantir/spark that referenced this pull request Jan 28, 2021
…lan in join() to not break DetectAmbiguousSelfJoin

Currently, `join()` calls `withPlan(logicalPlan)` as a convenient way to invoke some Dataset functions. However, this makes the `dataset_id` of the `logicalPlan` inconsistent with that of the original `Dataset`, because `withPlan(logicalPlan)` creates a new Dataset with a new id and resets the `dataset_id` of the `logicalPlan` to that new id. As a result, the rule `DetectAmbiguousSelfJoin` breaks.

In this PR, we propose to stop using `withPlan` and to use the `logicalPlan` directly, so that its `dataset_id` doesn't change.

Besides, this PR also removes the related metadata (`DATASET_ID_KEY`, `COL_POS_KEY`) when an `Alias` constructs its own metadata, because the `Alias` is no longer a reference column after it is converted to an `Attribute`. To achieve that, we add a new field, `deniedMetadataKeys`, which lists the metadata keys to remove.

The query below returns a wrong result, whereas it should throw an ambiguous self-join exception instead:

```scala
val emp1 = Seq[TestData](
  TestData(1, "sales"),
  TestData(2, "personnel"),
  TestData(3, "develop"),
  TestData(4, "IT")).toDS()
val emp2 = Seq[TestData](
  TestData(1, "sales"),
  TestData(2, "personnel"),
  TestData(3, "develop")).toDS()
val emp3 = emp1.join(emp2, emp1("key") === emp2("key")).select(emp1("*"))
emp1.join(emp3, emp1.col("key") === emp3.col("key"), "left_outer")
  .select(emp1.col("*"), emp3.col("key").as("e2")).show()

// wrong result
+---+---------+---+
|key|    value| e2|
+---+---------+---+
|  1|    sales|  1|
|  2|personnel|  2|
|  3|  develop|  3|
|  4|       IT|  4|
+---+---------+---+
```
This PR fixes the wrong behaviour.

Yes. After this PR, users hit the ambiguous self-join exception instead of getting the wrong result.

Added a new unit test.

Closes apache#30488 from Ngone51/fix-self-join.

Authored-by: yi.wu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
rshkv pushed a commit to palantir/spark that referenced this pull request Jan 28, 2021
…to nonInheritableMetadataKeys in Alias
laflechejonathan pushed a commit to palantir/spark that referenced this pull request Sep 27, 2021
…lan in join() to not break DetectAmbiguousSelfJoin
laflechejonathan pushed a commit to palantir/spark that referenced this pull request Sep 27, 2021
…to nonInheritableMetadataKeys in Alias