Conversation

kings129 commented Apr 19, 2023

### What changes were proposed in this pull request?

This is a pull request to port the fix from the master branch to version 3.3 ([PR](#40755)).

When doing an outer join with `joinWith` on DataFrames, unmatched rows return Row objects with null fields instead of a single null value. This is not the expected behavior, and it is a regression introduced in [this commit](apache@cd92f25). This pull request fixes the regression. Note that this is not a full rollback of that commit; it does not add back the `schema` variable.

```
case class ClassData(a: String, b: Int)
val left = Seq(ClassData("a", 1), ClassData("b", 2)).toDF
val right = Seq(ClassData("x", 2), ClassData("y", 3)).toDF

left.joinWith(right, left("b") === right("b"), "left_outer").collect
```

```
Wrong results (current behavior):    Array(([a,1],[null,null]), ([b,2],[x,2]))
Correct results:                     Array(([a,1],null), ([b,2],[x,2]))
```
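For illustration only (not part of the patch), here is a minimal sketch of user code that the regression silently breaks. The session setup and variable names are assumptions for the sketch:

```
// Minimal sketch, assuming a local Spark session; run e.g. in spark-shell.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("joinWith-demo").master("local[*]").getOrCreate()
import spark.implicits._

case class ClassData(a: String, b: Int)
val left  = Seq(ClassData("a", 1), ClassData("b", 2)).toDF
val right = Seq(ClassData("x", 2), ClassData("y", 3)).toDF

val joined = left.joinWith(right, left("b") === right("b"), "left_outer").collect()

// Count left rows that found no match on the right. With the regression the
// unmatched partner arrives as Row(null, null) rather than null, so this
// null check matches nothing and the count is silently 0 instead of 1.
val unmatched = joined.count { case (_, r) => r == null }
```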

### Why are the changes needed?

We need to address the regression mentioned above. It changes the behavior of the DataFrame `joinWith` API between versions 2.4.8 and 3.0.0+, which could cause data correctness issues for users who expect the old behavior on Spark 3.0.0+.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added a unit test (reusing the test from the previous [closed pull request](#35140); credit to Clément de Groc). Ran the sql-core and sql-catalyst submodules locally with `./build/mvn clean package -pl sql/core,sql/catalyst`.
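As a sketch only (the actual test reuses the one from the closed pull request above and may differ in name and setup), assuming a ScalaTest-style Spark test suite with a SparkSession and its implicits in scope, the added test asserts along these lines:

```
// Hypothetical sketch of the regression test, not the literal patch.
case class ClassData(a: String, b: Int)

test("SPARK-37829: joinWith outer join returns null for unmatched rows") {
  val left  = Seq(ClassData("a", 1), ClassData("b", 2)).toDF
  val right = Seq(ClassData("x", 2), ClassData("y", 3)).toDF
  val result = left.joinWith(right, left("b") === right("b"), "left_outer").collect()
  // The unmatched left row ("a", 1) must pair with a plain null,
  // not with a Row whose fields are all null.
  assert(result.exists { case (_, r) => r == null })
}
```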

Closes #40755 from kings129/encoder_bug_fix.

Authored-by: --global [email protected]

[SPARK-37829][SQL] Dataframe.joinWith outer-join should return a null value for unmatched row

### What changes were proposed in this pull request?

When doing an outer join with `joinWith` on DataFrames, unmatched rows return Row objects with null fields instead of a single null value. This is not the expected behavior, and it is a regression introduced in [this commit](apache@cd92f25).
This pull request fixes the regression. Note that this is not a full rollback of that commit; it does not add back the `schema` variable.

```
case class ClassData(a: String, b: Int)
val left = Seq(ClassData("a", 1), ClassData("b", 2)).toDF
val right = Seq(ClassData("x", 2), ClassData("y", 3)).toDF

left.joinWith(right, left("b") === right("b"), "left_outer").collect
```

```
Wrong results (current behavior):    Array(([a,1],[null,null]), ([b,2],[x,2]))
Correct results:                     Array(([a,1],null), ([b,2],[x,2]))
```

### Why are the changes needed?

We need to address the regression mentioned above. It changes the behavior of the DataFrame `joinWith` API between versions 2.4.8 and 3.0.0+, which could cause data correctness issues for users who expect the old behavior on Spark 3.0.0+.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added a unit test (reusing the test from the previous [closed pull request](apache#35140); credit to Clément de Groc).
Ran the sql-core and sql-catalyst submodules locally with `./build/mvn clean package -pl sql/core,sql/catalyst`.

Closes apache#40755 from kings129/encoder_bug_fix.

Authored-by: --global <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
viirya commented Apr 19, 2023

The PR title is broken.

@kings129 kings129 changed the title [SPARK-37829][SQL] Dataframe.joinWith outer-join should return a null… [SPARK-37829][SQL] Dataframe.joinWith outer-join should return a null value for unmatched row Apr 19, 2023
@zhengruifeng zhengruifeng changed the title [SPARK-37829][SQL] Dataframe.joinWith outer-join should return a null value for unmatched row [SPARK-37829][SQL][3.3] Dataframe.joinWith outer-join should return a null value for unmatched row Apr 20, 2023
dongjoon-hyun left a comment

+1, LGTM. Merged to branch-3.3.

dongjoon-hyun pushed a commit that referenced this pull request May 7, 2023
[SPARK-37829][SQL][3.3] Dataframe.joinWith outer-join should return a null value for unmatched row

### What changes were proposed in this pull request?

This is a pull request to port the fix from the master branch to version 3.3 ([PR](#40755)).

When doing an outer join with `joinWith` on DataFrames, unmatched rows return Row objects with null fields instead of a single null value. This is not the expected behavior, and it is a regression introduced in [this commit](cd92f25). This pull request fixes the regression. Note that this is not a full rollback of that commit; it does not add back the `schema` variable.

```
case class ClassData(a: String, b: Int)
val left = Seq(ClassData("a", 1), ClassData("b", 2)).toDF
val right = Seq(ClassData("x", 2), ClassData("y", 3)).toDF

left.joinWith(right, left("b") === right("b"), "left_outer").collect
```

```
Wrong results (current behavior):    Array(([a,1],[null,null]), ([b,2],[x,2]))
Correct results:                     Array(([a,1],null), ([b,2],[x,2]))
```

### Why are the changes needed?

We need to address the regression mentioned above. It changes the behavior of the DataFrame `joinWith` API between versions 2.4.8 and 3.0.0+, which could cause data correctness issues for users who expect the old behavior on Spark 3.0.0+.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added a unit test (reusing the test from the previous [closed pull request](#35140); credit to Clément de Groc). Ran the sql-core and sql-catalyst submodules locally with `./build/mvn clean package -pl sql/core,sql/catalyst`.

Closes #40755 from kings129/encoder_bug_fix.

Authored-by: --global <xuqiang129@gmail.com>

Closes #40858 from kings129/fix_encoder_branch_33.

Authored-by: --global <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>