[SPARK-37829][SQL] DataFrame.joinWith should return null rows for missing values #35139
Conversation
Can one of the admins verify this patch?

cc @cloud-fan @viirya FYI
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala
sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala
Let's be more clear about "What changes were proposed in this pull request?". I thought this is a test-only PR but it actually fixed a bug. Please describe the test in the "How was this patch tested?" section.
Do you mean the `objDeserializer` can be something that doesn't propagate nulls, so we need to manually check for null input and create a null literal? If so, please add some code comments to explain it.
It seems so. I'm new to this part of the code so you'll certainly have a better idea of what's going on here.
Could it happen because joinWith creates a tuple from what were top-level Rows?
Can you run the test locally and add a print here to see what the `objDeserializer` that causes the bug looks like?
Oh sure! It's a `CreateExternalRow` `Expression`. If I `println(enc.objDeserializer)`, it prints:
createexternalrow(getcolumnbyordinal(0, StructField(a,StringType,true), StructField(b,IntegerType,false))._0.toString, getcolumnbyordinal(0, StructField(a,StringType,true), StructField(b,IntegerType,false))._1, StructField(a,StringType,true), StructField(b,IntegerType,false))
Unless there's a broader change to make, we could reduce the blast radius by:
- limiting the change to `CreateExternalRow` (i.e. check `enc.objDeserializer.isInstanceOf[CreateExternalRow]`)?
- having a dedicated tuple `ExpressionEncoder` for `Dataset.joinWith` (i.e. update `ExpressionEncoder.tuple` to add a `nullSafe: Boolean = false` flag, set it to true for `Dataset.joinWith` and manually propagate nulls if true)?
The difference comes from the RowEncoder deserializer:
- When using a `Dataset[T]`, the `ExpressionEncoder` is used and calls `ScalaReflection.deserializerForType` to get a deserializer for class `T`, which automatically wraps the expression in a null-safe expression.
- When using a `DataFrame`, the `RowEncoder` is used and returns a `CreateExternalRow` (not wrapped in a null-safe expression).
I'm not sure there's an easy way to solve this, as the RowEncoder should guarantee (afaik) that top-level Rows aren't null.
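For illustration only, a small snippet comparing the two deserializers. This is a sketch assuming Spark 3.x internals (`RowEncoder.apply` and `ExpressionEncoder.objDeserializer`); the schema and case class are made up for the example.

```scala
import org.apache.spark.sql.catalyst.encoders.{ExpressionEncoder, RowEncoder}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

case class ClassData(a: String, b: Int)

val schema = StructType(Seq(
  StructField("a", StringType),
  StructField("b", IntegerType, nullable = false)))

// DataFrame path: the top-level deserializer is a bare CreateExternalRow,
// with no null check around it.
println(RowEncoder(schema).objDeserializer)

// Dataset[T] path: the deserializer produced via ScalaReflection is wrapped
// so that a null input row deserializes to null instead of being dereferenced.
println(ExpressionEncoder[ClassData]().objDeserializer)
```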
Actually I think everything is already summarized in your initial PR that patched the tuple encoder to wrap deserializers in a null-safe way: #13425
> as the RowEncoder should guarantee (afaik) that top-level Rows aren't null.

This is not true for outer join. Shall we also add a null check to wrap `CreateExternalRow`?
Breaking the assumption that top-level rows can't be null would represent a huge amount of work afaiu. I've tried simply wrapping CreateExternalRow with a null check and a number of tests started failing as they were assuming top-level rows couldn't be null.
Instead, updating joinWith seems more practical as we'd just want to handle what looks like a corner-case?
OK let's update joinWith then.
Pushed a new commit. There are multiple ways to implement this. Please let me know what you think.
shouldn't we check input.nullable?
input is a GetStructField(GetColumnByOrdinal) and calling nullable on it will throw org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to nullable on unresolved object.
I've used enc.objSerializer.nullable because that's what was used before the regression.
Wrap tuple fields deserializers in null checks when calling on DataFrames as top-level rows are not nullable and won't propagate null values.
       */
      private[sql] def nullSafe(exprEnc: ExpressionEncoder[Row]): ExpressionEncoder[Row] = {
        val newDeserializerInput = GetColumnByOrdinal(0, exprEnc.objSerializer.dataType)
        val newDeserializer: Expression = if (exprEnc.objSerializer.nullable) {
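For readers without the full diff, here is a minimal sketch of what such a null-safe wrapping can look like with Catalyst expressions. This is a hypothetical reconstruction based on the discussion above (the `If`/`IsNull`/`Literal` combination and the `wrapNullSafe` name are assumptions), not the PR's exact code.

```scala
import org.apache.spark.sql.catalyst.analysis.GetColumnByOrdinal
import org.apache.spark.sql.catalyst.expressions.{Expression, If, IsNull, Literal}

// If the input column is null (e.g. the missing side of an outer join),
// deserialize to a null literal instead of evaluating the original deserializer.
def wrapNullSafe(deserializer: Expression, input: GetColumnByOrdinal): Expression =
  If(IsNull(input), Literal.create(null, deserializer.dataType), deserializer)
```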
Sorry the code here is a bit confusing. We check exprEnc.objSerializer.nullable and then we construct IsNull(newDeserializerInput)? What's their connection?
      // As we might be running on DataFrames, we need a custom encoder that will properly
      // handle null top-level Rows.
      def nullSafe[V](exprEnc: ExpressionEncoder[V]): ExpressionEncoder[V] = {
        if (exprEnc.clsTag.runtimeClass != classOf[Row]) {
This looks a bit ugly.
> I've tried simply wrapping CreateExternalRow with a null check and a number of tests started failing as they were assuming top-level rows couldn't be null.
Are they UT or end-to-end tests? If they are UT, we can simply update the tests because we have changed the assumption.
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
We add a unit test demonstrating a regression on `DataFrame.joinWith` and fix the regression by updating `ExpressionEncoder`. The fix is equivalent to reverting this commit.

Why are the changes needed?
Doing an outer join using `joinWith` on DataFrames used to return missing values as `null` in Spark 2.4.8, but returns them as Rows with `null` values in Spark 3.0.0+.
The regression was introduced in this commit.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Added a unit test. This unit test succeeds with Spark 2.4.8 but fails with Spark 3.0.0+.
The new unit test does a left outer join on two DataFrames using the `joinWith` method. The join is performed on the `b` field of `ClassData` (`Int`s). The row `ClassData("a", 1)` on the left side of the join has no matching row on the right side of the join, as there is no row with value `1` for field `b`. The missing value (of `Row` type) is represented as a `GenericRowWithSchema(Array(null, null), rightFieldSchema)` instead of a `null` value, making the test fail.
This new test is identical to this one and only differs in that it uses DataFrames instead of Datasets.
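As a rough illustration of the scenario (not the PR's actual test code; the variable names, data, and use of `toDF` are assumptions for the example):

```scala
import org.apache.spark.sql.SparkSession

case class ClassData(a: String, b: Int)

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val left = Seq(ClassData("a", 1), ClassData("b", 2)).toDF()
val right = Seq(ClassData("x", 2), ClassData("y", 3)).toDF()

// Left outer join on field `b`: ClassData("a", 1) has no match on the right side.
val joined = left.joinWith(right, left("b") === right("b"), "left_outer")

// Spark 2.4.8: the unmatched right side is null, e.g. ([a,1], null)
// Spark 3.0.0+: the unmatched right side is a Row of nulls, e.g. ([a,1], [null,null])
joined.collect().foreach(println)
```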
I've run unit tests for the `sql-core` and `sql-catalyst` submodules locally with `./build/mvn clean package -pl sql/core,sql/catalyst`