[SPARK-37829][SQL] DataFrame.joinWith should return null rows for missing values #35140
Conversation
…sion" This reverts commit cd92f25.
Can one of the admins verify this patch?
cc @cloud-fan @viirya FYI
@gengliangwang both PRs solve SPARK-37829 in the same way; only one of them should be picked (that's up to the maintainers). The difference is that this one fixes the bug by completely reverting the commit that introduced it, while the other keeps the current version of the code and patches it to resolve the bug.
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
… value for unmatched row

### What changes were proposed in this pull request?

When doing an outer join with joinWith on DataFrames, unmatched rows return Row objects with null fields instead of a single null value. This is not the expected behavior; it is a regression introduced in [this commit](cd92f25). This pull request fixes the regression. Note that this is not a full rollback of that commit: it does not add back the "schema" variable.

```
case class ClassData(a: String, b: Int)

val left = Seq(ClassData("a", 1), ClassData("b", 2)).toDF
val right = Seq(ClassData("x", 2), ClassData("y", 3)).toDF

left.joinWith(right, left("b") === right("b"), "left_outer").collect
```

```
Wrong results (current behavior): Array(([a,1],[null,null]), ([b,2],[x,2]))
Correct results:                  Array(([a,1],null), ([b,2],[x,2]))
```

### Why are the changes needed?

We need to address the regression mentioned above. It results in unexpected behavior changes in the DataFrame joinWith API between versions 2.4.8 and 3.0.0+, and could cause data correctness issues for users who expect the old behavior when using Spark 3.0.0+.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added a unit test (the same test as in the previous [closed pull request](#35140), credit to Clément de Groc).

Ran the sql-core and sql-catalyst submodules locally with `./build/mvn clean package -pl sql/core,sql/catalyst`.

Closes #40755 from kings129/encoder_bug_fix.

Authored-by: --global <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
… null value for unmatched row

### What changes were proposed in this pull request?

This pull request ports the fix from the master branch ([PR](#40755)) to version 3.3. The description, reproduction, and test are the same as in #40755 above.

Closes #40755 from kings129/encoder_bug_fix.

Authored-by: --global <xuqiang129gmail.com>

Closes #40858 from kings129/fix_encoder_branch_33.

Authored-by: --global <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
What changes were proposed in this pull request?
Add a unit test demonstrating the regression on DataFrame.joinWith. Revert commit cd92f25be5a221e0d4618925f7bc9dfd3bb8cb59, making the test pass.
Why are the changes needed?
Doing an outer join using joinWith on DataFrames used to return missing values as null in Spark 2.4.8, but returns them as Rows with null values in Spark 3.0.0+.
The regression was introduced in commit cd92f25be5a221e0d4618925f7bc9dfd3bb8cb59.
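For reference, the minimal reproduction (the same one shown in the merged commit messages above), with the wrong and expected results side by side:

```scala
case class ClassData(a: String, b: Int)

val left = Seq(ClassData("a", 1), ClassData("b", 2)).toDF
val right = Seq(ClassData("x", 2), ClassData("y", 3)).toDF

left.joinWith(right, left("b") === right("b"), "left_outer").collect
// Spark 3.0.0+ (wrong):   Array(([a,1],[null,null]), ([b,2],[x,2]))
// Spark 2.4.8 (expected): Array(([a,1],null),        ([b,2],[x,2]))
```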
Does this PR introduce any user-facing change?
No
How was this patch tested?
A unit test was added.
Ran unit tests for the sql-core and sql-catalyst submodules with `./build/mvn clean package -pl sql/core,sql/catalyst`.
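For illustration, here is a sketch of what the added test might look like, written in the style of Spark's SQL test suites. The suite name and exact assertions are assumptions for this sketch, not the actual test from the patch:

```scala
import org.apache.spark.sql.{QueryTest, Row}
import org.apache.spark.sql.test.SharedSparkSession

// Case class must be top-level so Spark can derive an encoder for toDF.
case class ClassData(a: String, b: Int)

// Hypothetical suite name; the real patch may place the test elsewhere.
class JoinWithRegressionSuite extends QueryTest with SharedSparkSession {
  import testImplicits._

  test("SPARK-37829: left_outer joinWith returns null for unmatched rows") {
    val left = Seq(ClassData("a", 1), ClassData("b", 2)).toDF()
    val right = Seq(ClassData("x", 2), ClassData("y", 3)).toDF()

    // joinWith on DataFrames yields a Dataset[(Row, Row)].
    val result = left.joinWith(right, left("b") === right("b"), "left_outer")
      .collect()
      .sortBy(_._1.getInt(1)) // order by the left-side "b" column for determinism

    // The unmatched left row ("a", 1) must pair with a single null,
    // not with Row(null, null).
    assert(result.map(_._2).toSeq === Seq(null, Row("x", 2)))
  }
}
```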