[SPARK-48921][SQL][3.5] ScalaUDF encoders in subquery should be resolved for MergeInto #47406

viirya · 2024-07-18T20:11:21Z

What changes were proposed in this pull request?

We received a customer report that a MergeInto query on an Iceberg table, which worked previously, fails after upgrading to Spark 3.4.

The error looks like

Caused by: org.apache.spark.SparkRuntimeException: Error while decoding: org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to nullable on unresolved object
upcast(getcolumnbyordinal(0, StringType), StringType, - root class: java.lang.String).toString.

The source table of the MergeInto query uses a ScalaUDF. The error happens when Spark invokes the deserializer of the ScalaUDF's input encoder before that deserializer has been resolved.

The encoders of a ScalaUDF are resolved by the rule ResolveEncodersInUDF, which is applied at the end of the analysis phase.

While rewriting MergeInto into a ReplaceData query, Spark creates an Exists subquery, and the ScalaUDF is part of that subquery's plan. Note that the ScalaUDF is already resolved by the analyzer at this point.

Then the ResolveSubquery rule, which resolves subqueries, only resolves a subquery plan that is not yet resolved. Because the subquery containing the ScalaUDF is already resolved, the rule skips it, so ResolveEncodersInUDF is never applied to it. As a result, the analyzed ReplaceData query contains a ScalaUDF with unresolved encoders, which causes the error.
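The skip described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Spark's code: `ToyPlan`, `toy_analyze`, and both flags are hypothetical names, and the model assumes that the late encoder rule touches only the plan it runs on and reaches a subquery plan only when the subquery is re-analyzed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyPlan:
    """Hypothetical stand-in for a logical plan that contains a UDF."""
    resolved: bool                        # names and types are resolved
    encoders_resolved: bool = False       # set only by the late encoder rule
    subquery: Optional["ToyPlan"] = None  # plan nested in an Exists-style subquery


def toy_analyze(plan: ToyPlan) -> ToyPlan:
    """Toy analyzer: early subquery resolution, then the late encoder rule."""
    sub = plan.subquery
    # Early step (models the skip): a subquery is analyzed only if it is
    # not already resolved.
    if sub is not None and not sub.resolved:
        toy_analyze(sub)
    plan.resolved = True
    # Late step (assumption of this model): encoder resolution applies to
    # this plan only and does not descend into subquery plans.
    plan.encoders_resolved = True
    return plan


# The source plan was resolved earlier, but its encoders were not.
already_resolved_source = ToyPlan(resolved=True)
rewritten = toy_analyze(ToyPlan(resolved=False, subquery=already_resolved_source))

# For contrast, a subquery that is not yet resolved gets the full analysis.
fresh_source = ToyPlan(resolved=False)
toy_analyze(ToyPlan(resolved=False, subquery=fresh_source))
```

In this model `already_resolved_source.encoders_resolved` stays `False` after analysis, while `fresh_source.encoders_resolved` becomes `True`, which mirrors the difference between the failing case and an ordinary subquery.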

This patch modifies ResolveSubquery so that it resolves a subquery plan whenever the plan is not analyzed, to make sure the subquery plan is fully analyzed.

This patch moves the ResolveEncodersInUDF rule before the MergeInto rewrite to make sure the ScalaUDF in the subquery plan is fully analyzed.
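Why rule order matters here can be sketched with a toy rule pipeline. The names below (`Node`, `resolve_encoders`, `rewrite_merge`, `run_rules`) are hypothetical and only model the idea; they are not Spark's rules or APIs. The model assumes that the rewrite wraps the existing plan in a subquery and that encoder resolution does not descend into subqueries.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Node:
    """Hypothetical plan node; `kind` is either "merge" or "replace_data"."""
    kind: str
    encoders_resolved: bool = False
    subquery: Optional["Node"] = None


def resolve_encoders(plan: Node) -> Node:
    # Toy encoder rule: marks only the plan it is given (assumption: it does
    # not descend into subquery plans).
    plan.encoders_resolved = True
    return plan


def rewrite_merge(plan: Node) -> Node:
    # Toy rewrite: the merge plan, including its UDF, ends up inside a
    # subquery of the new replace-data plan.
    if plan.kind != "merge":
        return plan
    return Node(kind="replace_data", subquery=plan)


def run_rules(plan: Node, rules: List[Callable[[Node], Node]]) -> Node:
    # Apply the rules once, in the given order.
    for rule in rules:
        plan = rule(plan)
    return plan


# Encoder rule after the rewrite: the UDF inside the subquery is missed.
encoders_last = run_rules(Node("merge"), [rewrite_merge, resolve_encoders])

# Encoder rule before the rewrite: the UDF is handled before it is wrapped.
encoders_first = run_rules(Node("merge"), [resolve_encoders, rewrite_merge])
```

With the encoder rule last, `encoders_last.subquery.encoders_resolved` is `False`; with the encoder rule first, `encoders_first.subquery.encoders_resolved` is `True`.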

Why are the changes needed?

To fix a production query error.

Does this PR introduce any user-facing change?

Yes, it fixes a user-facing issue.

How was this patch tested?

Manually tested with a MergeInto query and added a unit test.

Was this patch authored or co-authored using generative AI tooling?

No


viirya · 2024-07-18T20:19:54Z

cc @dongjoon-hyun

huaxingao · 2024-07-18T20:42:03Z

@yaooqinn Can we include this fix in 3.5.2 release? Thanks!

dongjoon-hyun

+1, LGTM. (Pending CIs).

Closes #47406 from viirya/fix_subquery_resolve_3.5. Authored-by: Liang-Chi Hsieh <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>

dongjoon-hyun · 2024-07-18T22:21:47Z

Merged to branch-3.5.

viirya · 2024-07-18T22:34:39Z

Thank you @dongjoon-hyun

yaooqinn · 2024-07-19T02:31:51Z

@dongjoon-hyun @huaxingao @viirya Thank you for the fix, I will collect it into RC2 for 3.5.2

dongjoon-hyun · 2024-07-19T03:06:20Z

Thank you so much, @yaooqinn !

viirya · 2024-07-19T05:02:59Z

Thank you @yaooqinn


SPARK-48921: ScalaUDF encoders in subquery should be resolved for MergeInto · 1fcafa7

github-actions bot added the SQL label Jul 18, 2024

viirya mentioned this pull request Jul 18, 2024

[SPARK-48921][SQL] ScalaUDF encoders in subquery should be resolved for MergeInto #47380

Closed

dongjoon-hyun approved these changes Jul 18, 2024


dongjoon-hyun closed this Jul 18, 2024
