Conversation


@cdegroc cdegroc commented Jan 7, 2022

What changes were proposed in this pull request?

We add a unit test demonstrating a regression in DataFrame.joinWith and fix the regression by updating ExpressionEncoder. The fix is equivalent to reverting this commit.

Why are the changes needed?

Doing an outer join using joinWith on DataFrames used to return missing values as null in Spark 2.4.8, but returns them as Rows containing only null values in Spark 3.0.0+.

The regression was introduced in this commit.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added a unit test. This unit test succeeds with Spark 2.4.8 but fails with Spark 3.0.0+.

The new unit test does a left outer join on two DataFrames using the joinWith method.
The join is performed on the b field of ClassData (an Int).
The row ClassData("a", 1) on the left side of the join has no matching row on the right side, as no right-side row has value 1 for field b.
The missing value (of type Row) is represented as a GenericRowWithSchema(Array(null, null), rightFieldSchema) instead of a null value, making the test fail.

This new test is identical to this one and only differs in that it uses DataFrames instead of Datasets.
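For reference, a rough sketch of what such a test could look like. This is illustrative only, not the PR's actual code: the test title is hypothetical, and it assumes a SparkSession with implicits in scope plus Spark's QueryTest.checkAnswer test utility, so it is not standalone-runnable.

```scala
// Illustrative sketch of the regression test described above; assumes Spark's
// test harness (implicits for toDF, QueryTest.checkAnswer) is available.
case class ClassData(a: String, b: Int)

test("SPARK-37829: joinWith on DataFrames should return null for missing rows") {
  val left = Seq(ClassData("a", 1), ClassData("b", 2)).toDF()
  val right = Seq(ClassData("x", 2), ClassData("y", 3)).toDF()

  val joined = left.joinWith(right, left("b") === right("b"), "left_outer")

  // Expected (Spark 2.4.8 behavior): the unmatched right side is null.
  // Observed in Spark 3.0.0+: it is a Row(null, null), so this assertion fails.
  checkAnswer(joined.toDF(), Seq(
    Row(Row("a", 1), null),
    Row(Row("b", 2), Row("x", 2))))
}
```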

I've run unit tests for the sql-core and sql-catalyst submodules locally with ./build/mvn clean package -pl sql/core,sql/catalyst

@github-actions github-actions bot added the SQL label Jan 7, 2022
@cdegroc cdegroc force-pushed the SPARK-37829 branch 2 times, most recently from 72603a8 to 51992c8 on January 7, 2022 at 18:58
@AmplabJenkins

Can one of the admins verify this patch?

@HyukjinKwon
Member

cc @cloud-fan @viirya FYI

@cloud-fan
Contributor

> Add a unit test demonstrating the regression on DataFrame.joinWith.

Let's be clearer in "What changes were proposed in this pull request?". I thought this was a test-only PR, but it actually fixes a bug. Please describe the test in the "How was this patch tested?" section.

Contributor

Do you mean the objDeserializer can be something that doesn't propagate nulls, so we need to manually check for null input and create a null literal? If so, please add some code comments to explain it.

Author

It seems so. I'm new to this part of the code so you'll certainly have a better idea of what's going on here.
Could it happen because joinWith creates a tuple from what were top-level Rows?

Contributor

Can you run the test locally and add a print here to see which objDeserializer causes the bug?

Author

Oh sure! It's a CreateExternalRow Expression. If I println(enc.objDeserializer), it prints

createexternalrow(getcolumnbyordinal(0, StructField(a,StringType,true), StructField(b,IntegerType,false))._0.toString, getcolumnbyordinal(0, StructField(a,StringType,true), StructField(b,IntegerType,false))._1, StructField(a,StringType,true), StructField(b,IntegerType,false))

Author

Unless there's a broader change to make, we could reduce the blast radius by:

  • limiting the change to CreateExternalRow (i.e. check enc.objDeserializer.isInstanceOf[CreateExternalRow])?
  • having a dedicated tuple ExpressionEncoder for Dataset.joinWith (i.e. update ExpressionEncoder.tuple to add a nullSafe: Boolean = false flag, set it to true for Dataset.joinWith, and manually propagate nulls when true)?
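To make the second option concrete, here is a minimal sketch of the null propagation it implies, assuming Catalyst's If, IsNull and Literal expressions. The helper name is hypothetical and this is not the PR's actual code.

```scala
// Illustrative only: wrap a field deserializer so that a null input column
// deserializes to null instead of an object whose fields are all null.
import org.apache.spark.sql.catalyst.expressions.{Expression, If, IsNull, Literal}

def nullSafe(input: Expression, deserializer: Expression): Expression =
  If(IsNull(input), Literal.create(null, deserializer.dataType), deserializer)
```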

Author

The difference comes from the RowEncoder deserializer:

I'm not sure there's an easy way to solve this, as the RowEncoder should guarantee (afaik) that top-level Rows aren't null.

Actually I think everything is already summarized in your initial PR that patched the tuple encoder to wrap deserializers in a null-safe way: #13425

Contributor

> as the RowEncoder should guarantee (afaik) that top-level Rows aren't null.

This is not true for outer join. Shall we also add a null check to wrap CreateExternalRow?

Author

Breaking the assumption that top-level rows can't be null would represent a huge amount of work, as far as I understand. I tried simply wrapping CreateExternalRow in a null check, and a number of tests started failing because they assumed top-level rows couldn't be null.
Instead, updating joinWith seems more practical, as we'd just be handling what looks like a corner case?

Contributor

OK let's update joinWith then.

Author

Pushed a new commit. There are multiple ways to implement this. Please let me know what you think.

Contributor

shouldn't we check input.nullable?

Author

input is a GetStructField(GetColumnByOrdinal), and calling nullable on it throws org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to nullable on unresolved object.

I've used enc.objSerializer.nullable because that's what was used before the regression.

/*
Wrap tuple field deserializers in null checks when operating on
DataFrames, as top-level rows are not nullable and won't propagate
null values.
*/
private[sql] def nullSafe(exprEnc: ExpressionEncoder[Row]): ExpressionEncoder[Row] = {
val newDeserializerInput = GetColumnByOrdinal(0, exprEnc.objSerializer.dataType)
val newDeserializer: Expression = if (exprEnc.objSerializer.nullable) {
Contributor

Sorry the code here is a bit confusing. We check exprEnc.objSerializer.nullable and then we construct IsNull(newDeserializerInput)? What's their connection?

// As we might be running on DataFrames, we need a custom encoder that will properly
// handle null top-level Rows.
def nullSafe[V](exprEnc: ExpressionEncoder[V]): ExpressionEncoder[V] = {
if (exprEnc.clsTag.runtimeClass != classOf[Row]) {
Contributor

This looks a bit ugly.

> I've tried simply wrapping CreateExternalRow with a null check and a number of tests started failing as they were assuming top-level rows couldn't be null.

Are they UT or end-to-end tests? If they are UT, we can simply update the tests because we have changed the assumption.

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Apr 28, 2022
@github-actions github-actions bot closed this Apr 29, 2022