Conversation

@viirya (Member) commented Aug 22, 2024

What changes were proposed in this pull request?

This patch avoids ArrayTransform in the resolveArrayType function if the resolved expression is the same as the input parameter.

Why are the changes needed?

Our customer encountered a significant performance regression when migrating from Spark 3.2 to Spark 3.4, on an INSERT INTO query that is analyzed as an AppendData on an Iceberg table.
We found that the root cause is that in Spark 3.4, TableOutputResolver resolves the query with an additional ArrayTransform on an ArrayType field. The ArrayTransform's lambda function is actually an identity function, i.e., the transformation is redundant.
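
A minimal sketch of the idea (illustrative names, not the actual patch): if resolving the array element yields the lambda parameter itself, the lambda is an identity function, so the ArrayTransform wrapper can be skipped and the input column used directly.

```scala
// Hypothetical helper sketching the check: skip the ArrayTransform when the
// resolved element expression is just the lambda parameter, since
// transform(arr, x => x) is a no-op.
import org.apache.spark.sql.catalyst.expressions.{
  ArrayTransform, Expression, LambdaFunction, NamedLambdaVariable}

def wrapInTransformIfNeeded(
    input: Expression,
    param: NamedLambdaVariable,
    resolvedElement: Expression): Expression = {
  if (resolvedElement == param) {
    input // identity lambda: keep the original column, no per-row transform
  } else {
    ArrayTransform(input, LambdaFunction(resolvedElement, Seq(param)))
  }
}
```

Skipping the no-op transform matters because ArrayTransform evaluates its lambda for every element of every row, which is pure overhead when the lambda returns its argument unchanged.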

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit test and manual e2e test

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Aug 22, 2024
@dongjoon-hyun (Member) left a comment

So, is this caused by the following at Apache Spark 3.5.0, @viirya? Or, do we have this regression at Apache Spark 3.4.x too?

@viirya (Member, Author) commented Aug 22, 2024

> So, is this caused by the following at Apache Spark 3.5.0, @viirya? Or, do we have this regression at Apache Spark 3.4.x too?

No, resolveArrayType existed before SPARK-42855. Yeah, the regression has been present since Spark 3.4.

@parthchandra (Contributor) left a comment

Amazing fix @viirya. Thank you!

@HyukjinKwon HyukjinKwon changed the title SPARK-49352: Avoid redundant array transform for identical expression [SPARK-49352][SQL] Avoid redundant array transform for identical expression Aug 23, 2024
@viirya (Member, Author) commented Aug 23, 2024

Hmm, any idea about the test failure below?

```
[info] - SPARK-49121: from_protobuf and to_protobuf SQL functions *** FAILED *** (609 milliseconds)
[info]   157 did not equal 153 Invalid stopIndex of a query context. Actual:SQLQueryContext(Some(3),Some(2),Some(10),Some(157),Some(
[info]   SELECT
[info]     to_protobuf(complex_struct, 42, '/home/runner/work/spark-1/spark-1/connector/protobuf/target/generated-test-sources/descriptor-set-sbt.desc', map())
[info]   FROM protobuf_test_table
[info]   ),None,None) (SparkFunSuite.scala:376)
```

I ran it locally and didn't see the error. I also don't think this change affects that test, which doesn't include any INSERT INTO command or V2 write command.

A member left a comment

This seems to have hit the scalastyle check:

```
[error] /__w/spark-1/spark-1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TableOutputResolver.scala:420: File line length exceeds 100 characters
```

@dongjoon-hyun (Member) left a comment

Hi, @viirya. It fails in my local environment.

```
[info] ProtobufFunctionsSuite:
[info] - SPARK-49121: from_protobuf and to_protobuf SQL functions *** FAILED *** (3 seconds, 918 milliseconds)
[info]   158 did not equal 153 Invalid stopIndex of a query context. Actual:SQLQueryContext(Some(3),Some(2),Some(10),Some(158),Some(
[info]   SELECT
[info]     to_protobuf(complex_struct, 42, '/Users/dongjoon/APACHE/spark-merge/connector/protobuf/target/generated-test-sources/descriptor-set-sbt.desc', map())
[info]   FROM protobuf_test_table
[info]   ),None,None) (SparkFunSuite.scala:376)
```

However, it's the queryContext of a negative-case failure. I believe we can update `stop = 153` to `stop = 158`.

@viirya (Member, Author) commented Aug 23, 2024

> Hi, @viirya. It fails in my local environment.
> However, it's the queryContext of a negative-case failure. I believe we can update `stop = 153` to `stop = 158`.

Interesting, I don't hit the failure locally. I will update it. Thanks.

@viirya (Member, Author) commented Aug 23, 2024

Updated. Thanks @dongjoon-hyun

@dongjoon-hyun (Member) left a comment

Could you make a backporting PR for each applicable release branch?

@viirya (Member, Author) commented Aug 23, 2024

> Could you make a backporting PR for each applicable release branch?

This is a performance regression fix, but not a bug fix. Do we want to backport to earlier branches?

@viirya (Member, Author) commented Aug 23, 2024

Huh, it became 157 now...

```
[info] - SPARK-49121: from_protobuf and to_protobuf SQL functions *** FAILED *** (611 milliseconds)
[info]   157 did not equal 158 Invalid stopIndex of a query context. Actual:SQLQueryContext(Some(3),Some(2),Some(10),Some(157),Some(
[info]   SELECT
[info]     to_protobuf(complex_struct, 42, '/home/runner/work/spark-1/spark-1/connector/protobuf/target/generated-test-sources/descriptor-set-sbt.desc', map())
[info]   FROM protobuf_test_table
[info]   ),None,None) (SparkFunSuite.scala:376)
```

The test seems to be flaky.

I am going to re-trigger it.

@dongjoon-hyun (Member)

Oh, so after this PR, the following ExpectedContext becomes non-deterministic? 153/157/158

```scala
queryContext = Array(ExpectedContext(
  fragment = s"to_protobuf(complex_struct, 42, '$testFileDescFile', map())",
  start = 10,
  stop = 153))
```

@viirya (Member, Author) commented Aug 23, 2024

> Oh, so after this PR, the following ExpectedContext becomes non-deterministic? 153/157/158
>
> ```scala
> queryContext = Array(ExpectedContext(
>   fragment = s"to_protobuf(complex_struct, 42, '$testFileDescFile', map())",
>   start = 10,
>   stop = 153))
> ```

Not sure. But as I mentioned earlier, the test is a simple SELECT without any INSERT INTO; I can't think of why this change would have an impact.

@viirya (Member, Author) commented Aug 23, 2024

Let me try more times to see if I can reproduce locally.

@viirya (Member, Author) commented Aug 23, 2024

Okay, I can reproduce it locally. But when I reverted this change locally, I still reproduced the protobuf test failure.
It seems flaky. I'm looking into it and will open a separate PR once I can fix it.

It is consistently 153 locally, even after running the test many times.

@dongjoon-hyun (Member)

> It is consistently 153 locally, even after running the test many times.

Interesting. It was 158 on my laptop last night. BTW, could you rebase this onto the master branch once more? When I tested your PR, I rebased onto the master branch and ran the test.

@viirya viirya force-pushed the fix_redundant_array_transform branch from 2e45331 to a895741 on August 23, 2024 18:53
@viirya (Member, Author) commented Aug 23, 2024

Rebased. Thanks.

@viirya (Member, Author) commented Aug 23, 2024

Hmm, isn't the stopIndex dependent on the file path?

```
'/home/runner/work/spark-1/spark-1/connector/protobuf/target/generated-test-sources/descriptor-set-sbt.desc'
```

Because my forked repo is named spark-1, the path has additional characters, so the stop index is different.
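
For illustration (with hypothetical paths), the stop index is the offset of the fragment's last character, and the fragment embeds the absolute descriptor-file path, so differently named checkouts produce different indices:

```scala
// Hypothetical illustration: stop = start + fragment.length - 1, and the
// fragment embeds the absolute file path, so a checkout named "spark-1"
// yields a larger stop index than one named "spark".
val start = 10
def stopFor(path: String): Int =
  start + s"to_protobuf(complex_struct, 42, '$path', map())".length - 1

stopFor("/home/runner/work/spark/spark/d.desc")     // smaller stop index
stopFor("/home/runner/work/spark-1/spark-1/d.desc") // larger stop index
```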

@dongjoon-hyun (Member)

If so, then yes, we need to fix the test case itself, because it is environment-dependent.

@dongjoon-hyun (Member)

> Could you make a backporting PR for each applicable release branch?

> This is a performance regression fix, but not a bug fix. Do we want to backport to earlier branches?

When we do a release, a performance regression is considered a bug, isn't it?

@viirya (Member, Author) commented Aug 23, 2024

> When we do a release, a performance regression is considered a bug, isn't it?

Okay. Once this is merged, I will create backport PR(s). Thanks.

@viirya (Member, Author) commented Aug 23, 2024

> If so, then yes, we need to fix the test case itself, because it is environment-dependent.

Yeah, I'll create a PR to fix it.

Created the PR: #47859

dongjoon-hyun pushed a commit that referenced this pull request Aug 23, 2024
…ironment-independent

### What changes were proposed in this pull request?

This patch modifies `ProtobufFunctionsSuite`'s test case `SPARK-49121: from_protobuf and to_protobuf SQL functions` to derive the expected stop index from the file path length.

### Why are the changes needed?

While debugging the CI failure in #47843, we found that `ProtobufFunctionsSuite`'s test case `SPARK-49121: from_protobuf and to_protobuf SQL functions` is environment-dependent.
The test checks the start and stop indices of a SQL text fragment, but the fragment length depends on the repo name of the PR author.

### Does this PR introduce _any_ user-facing change?

No, test only.

### How was this patch tested?

Unit test

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #47859 from viirya/fix_protobuf_test.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@viirya viirya force-pushed the fix_redundant_array_transform branch from a895741 to 1760a5d on August 23, 2024 20:43
@viirya (Member, Author) commented Aug 23, 2024

> I merged the following test-only fix. Could you rebase to master once more, @viirya?

Rebased. Thanks @dongjoon-hyun

@dongjoon-hyun (Member)

All tests passed. Merged to master.

Please ping me after making backporting PRs for branch-3.5 and branch-3.4, @viirya. Thank you in advance!

@viirya (Member, Author) commented Aug 24, 2024

Thank you @dongjoon-hyun @HyukjinKwon @parthchandra

@viirya viirya deleted the fix_redundant_array_transform branch August 24, 2024 00:39
@viirya (Member, Author) commented Aug 24, 2024

Backport PRs:

3.5: #47863
3.4: #47862

Thanks.

cc @dongjoon-hyun

@cloud-fan (Contributor)

late LGTM

beliefer pushed a commit that referenced this pull request Mar 13, 2025
…ssion for map type

### What changes were proposed in this pull request?

Similar to #47843, this patch avoids ArrayTransform in the `resolveMapType` function if the resolved expression is the same as the input parameter.

### Why are the changes needed?

My previous PR #47381 was not merged, but I still think it is an optimization, so I reopened it.

During the upgrade from Spark 3.1.1 to 3.5.0, I found a performance regression in map-type inserts.

There are extra conversion expressions in the Project before the insert, which don't always seem to be necessary:

```
map_from_arrays(transform(map_keys(map#516), lambdafunction(lambda key#652, lambda key#652, false)), transform(map_values(map#516), lambdafunction(lambda value#654, lambda value#654, false))) AS map#656
```
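
A minimal sketch of the analogous check (illustrative names, not the actual patch): when both the key and value lambdas are identities, the map_from_arrays/transform rewrite is a no-op, so the input map can be kept as-is.

```scala
// Hypothetical helper: skip the map rewrite when both lambdas are identities,
// since rebuilding the map from transformed key/value arrays would copy data
// for no semantic effect.
import org.apache.spark.sql.catalyst.expressions.{
  ArrayTransform, Expression, LambdaFunction, MapFromArrays, MapKeys,
  MapValues, NamedLambdaVariable}

def rewriteMapIfNeeded(
    input: Expression,
    keyParam: NamedLambdaVariable, resolvedKey: Expression,
    valueParam: NamedLambdaVariable, resolvedValue: Expression): Expression = {
  if (resolvedKey == keyParam && resolvedValue == valueParam) {
    input // both lambdas are identities: keep the original map column
  } else {
    MapFromArrays(
      ArrayTransform(MapKeys(input), LambdaFunction(resolvedKey, Seq(keyParam))),
      ArrayTransform(MapValues(input), LambdaFunction(resolvedValue, Seq(valueParam))))
  }
}
```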

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

added unit test

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #50245 from wForget/SPARK-48922.

Authored-by: wforget <[email protected]>
Signed-off-by: beliefer <[email protected]>
beliefer pushed a commit that referenced this pull request Mar 13, 2025
…ssion for map type

Closes #50245 from wForget/SPARK-48922.

Authored-by: wforget <[email protected]>
Signed-off-by: beliefer <[email protected]>
(cherry picked from commit 1be108e)
Signed-off-by: beliefer <[email protected]>
wForget added a commit to wForget/spark that referenced this pull request Mar 13, 2025
…ssion for map type

Closes apache#50245 from wForget/SPARK-48922.

Authored-by: wforget <[email protected]>
Signed-off-by: beliefer <[email protected]>
(cherry picked from commit 1be108e)
beliefer pushed a commit that referenced this pull request Mar 13, 2025
…expression for map type

Backports #50245 to 3.5.

Closes #50265 from wForget/SPARK-48922-3.5.

Authored-by: wforget <[email protected]>
Signed-off-by: beliefer <[email protected]>
(cherry picked from commit 1be108e)
zifeif2 pushed a commit to zifeif2/spark that referenced this pull request Nov 14, 2025
…ssion for map type

Closes apache#50245 from wForget/SPARK-48922.

Authored-by: wforget <[email protected]>
Signed-off-by: beliefer <[email protected]>
(cherry picked from commit 2eb50e5)
Signed-off-by: beliefer <[email protected]>