Conversation

kings129 commented Apr 19, 2023

### What changes were proposed in this pull request?

This is a pull request to port the fix from the master branch to version 3.3 ([PR](#40755)).

When doing an outer join with `joinWith` on DataFrames, unmatched rows return Row objects with null fields instead of a single null value. This is not the expected behavior, and it is a regression introduced in [this commit](apache@cd92f25). This pull request fixes the regression. Note that this is not a full rollback of that commit; it does not add back the `schema` variable.

```
case class ClassData(a: String, b: Int)
val left = Seq(ClassData("a", 1), ClassData("b", 2)).toDF
val right = Seq(ClassData("x", 2), ClassData("y", 3)).toDF

left.joinWith(right, left("b") === right("b"), "left_outer").collect
```

```
Wrong results (current behavior):    Array(([a,1],[null,null]), ([b,2],[x,2]))
Correct results:                     Array(([a,1],null), ([b,2],[x,2]))
```
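For illustration only (not part of the patch), here is a minimal sketch of user code that the regression silently breaks. The session setup and variable names are assumptions for the sketch:

```
// Minimal sketch, assuming a local Spark session; run e.g. in spark-shell.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("joinWith-demo").master("local[*]").getOrCreate()
import spark.implicits._

case class ClassData(a: String, b: Int)
val left  = Seq(ClassData("a", 1), ClassData("b", 2)).toDF
val right = Seq(ClassData("x", 2), ClassData("y", 3)).toDF

val joined = left.joinWith(right, left("b") === right("b"), "left_outer").collect()

// Count left rows that found no match on the right. With the regression the
// unmatched partner arrives as Row(null, null) rather than null, so this
// null check matches nothing and the count is silently 0 instead of 1.
val unmatched = joined.count { case (_, r) => r == null }
```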

### Why are the changes needed?

We need to address the regression mentioned above. It changes the behavior of the DataFrame `joinWith` API between versions 2.4.8 and 3.0.0+, which could cause data correctness issues for users who expect the old behavior on Spark 3.0.0+.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added a unit test (reusing the test from the previous [closed pull request](#35140); credit to Clément de Groc). Ran the sql-core and sql-catalyst submodules locally with `./build/mvn clean package -pl sql/core,sql/catalyst`.
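As a sketch only (the actual test reuses the one from the closed pull request above and may differ in name and setup), assuming a ScalaTest-style Spark test suite with a SparkSession and its implicits in scope, the added test asserts along these lines:

```
// Hypothetical sketch of the regression test, not the literal patch.
case class ClassData(a: String, b: Int)

test("SPARK-37829: joinWith outer join returns null for unmatched rows") {
  val left  = Seq(ClassData("a", 1), ClassData("b", 2)).toDF
  val right = Seq(ClassData("x", 2), ClassData("y", 3)).toDF
  val result = left.joinWith(right, left("b") === right("b"), "left_outer").collect()
  // The unmatched left row ("a", 1) must pair with a plain null,
  // not with a Row whose fields are all null.
  assert(result.exists { case (_, r) => r == null })
}
```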

Closes #40755 from kings129/encoder_bug_fix.

Authored-by: --global [email protected]

[SPARK-37829][SQL] Dataframe.joinWith outer-join should return a null value for unmatched row

### What changes were proposed in this pull request?

When doing an outer join with `joinWith` on DataFrames, unmatched rows return Row objects with null fields instead of a single null value. This is not the expected behavior, and it is a regression introduced in [this commit](apache@cd92f25).
This pull request fixes the regression. Note that this is not a full rollback of that commit; it does not add back the `schema` variable.

```
case class ClassData(a: String, b: Int)
val left = Seq(ClassData("a", 1), ClassData("b", 2)).toDF
val right = Seq(ClassData("x", 2), ClassData("y", 3)).toDF

left.joinWith(right, left("b") === right("b"), "left_outer").collect
```

```
Wrong results (current behavior):    Array(([a,1],[null,null]), ([b,2],[x,2]))
Correct results:                     Array(([a,1],null), ([b,2],[x,2]))
```

### Why are the changes needed?

We need to address the regression mentioned above. It changes the behavior of the DataFrame `joinWith` API between versions 2.4.8 and 3.0.0+, which could cause data correctness issues for users who expect the old behavior on Spark 3.0.0+.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added a unit test (reusing the test from the previous [closed pull request](apache#35140); credit to Clément de Groc).
Ran the sql-core and sql-catalyst submodules locally with `./build/mvn clean package -pl sql/core,sql/catalyst`.

Closes apache#40755 from kings129/encoder_bug_fix.

Authored-by: --global <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
viirya commented Apr 19, 2023

The PR title is broken.

@kings129 kings129 changed the title [SPARK-37829][SQL] Dataframe.joinWith outer-join should return a null… [SPARK-37829][SQL] Dataframe.joinWith outer-join should return a null value for unmatched row Apr 19, 2023
@zhengruifeng zhengruifeng changed the title [SPARK-37829][SQL] Dataframe.joinWith outer-join should return a null value for unmatched row [SPARK-37829][SQL][3.3] Dataframe.joinWith outer-join should return a null value for unmatched row Apr 20, 2023
dongjoon-hyun left a comment

+1, LGTM. Merged to branch-3.3.

dongjoon-hyun pushed a commit that referenced this pull request May 7, 2023
[SPARK-37829][SQL][3.3] Dataframe.joinWith outer-join should return a null value for unmatched row

### What changes were proposed in this pull request?

This is a pull request to port the fix from the master branch to version 3.3 ([PR](#40755)).

When doing an outer join with `joinWith` on DataFrames, unmatched rows return Row objects with null fields instead of a single null value. This is not the expected behavior, and it is a regression introduced in [this commit](cd92f25). This pull request fixes the regression. Note that this is not a full rollback of that commit; it does not add back the `schema` variable.

```
case class ClassData(a: String, b: Int)
val left = Seq(ClassData("a", 1), ClassData("b", 2)).toDF
val right = Seq(ClassData("x", 2), ClassData("y", 3)).toDF

left.joinWith(right, left("b") === right("b"), "left_outer").collect
```

```
Wrong results (current behavior):    Array(([a,1],[null,null]), ([b,2],[x,2]))
Correct results:                     Array(([a,1],null), ([b,2],[x,2]))
```

### Why are the changes needed?

We need to address the regression mentioned above. It changes the behavior of the DataFrame `joinWith` API between versions 2.4.8 and 3.0.0+, which could cause data correctness issues for users who expect the old behavior on Spark 3.0.0+.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added a unit test (reusing the test from the previous [closed pull request](#35140); credit to Clément de Groc). Ran the sql-core and sql-catalyst submodules locally with `./build/mvn clean package -pl sql/core,sql/catalyst`.

Closes #40755 from kings129/encoder_bug_fix.

Authored-by: --global <xuqiang129@gmail.com>

Closes #40858 from kings129/fix_encoder_branch_33.

Authored-by: --global <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>