[SPARK-49352][SQL] Avoid redundant array transform for identical expression #47843
Conversation
No,
parthchandra left a comment:
Amazing fix @viirya. Thank you!
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TableOutputResolver.scala
Hmm, any idea about the test failure? I ran it locally and didn't see the error. I also don't think this change affects that test, which doesn't include any insert-into command or V2 write command.
This seems to hit the scalastyle check.
[error] /__w/spark-1/spark-1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TableOutputResolver.scala:420: File line length exceeds 100 characters
dongjoon-hyun left a comment:
Hi, @viirya. It fails in my local environment.
[info] ProtobufFunctionsSuite:
[info] - SPARK-49121: from_protobuf and to_protobuf SQL functions *** FAILED *** (3 seconds, 918 milliseconds)
[info] 158 did not equal 153 Invalid stopIndex of a query context. Actual:SQLQueryContext(Some(3),Some(2),Some(10),Some(158),Some(
[info] SELECT
[info] to_protobuf(complex_struct, 42, '/Users/dongjoon/APACHE/spark-merge/connector/protobuf/target/generated-test-sources/descriptor-set-sbt.desc', map())
[info] FROM protobuf_test_table
[info] ),None,None) (SparkFunSuite.scala:376)
However, it's the queryContext of a negative-case failure. I believe we can update stop = 153 to stop = 158.
Interesting, I don't hit the failure locally. I will update it. Thanks.
Updated. Thanks @dongjoon-hyun
cc @cloud-fan
dongjoon-hyun left a comment:
Could you make a backporting PR for each applicable release branch?
This is a performance regression fix, but not a bug fix. Do we want to backport to earlier branches?
Huh, it became 157 now... The test seems to be flaky. I am going to re-trigger it.
Oh, so, after this PR, the following: spark/connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufFunctionsSuite.scala, lines 2101 to 2104 in d84f1a3
Not sure. But as I mentioned earlier, in the test it is a simple
Let me try more times to see if I can reproduce locally.
It is 153 locally even though I ran the test many times.
Interesting. It was
2e45331 to a895741 (compare)
Rebased. Thanks.
Hmm, isn't the stopIndex dependent on the file path? Because my forked repo name is
If so, yes, we need to fix this test case itself because it is environment-dependent.
When we do a release, a performance regression is considered a bug, isn't it?
Okay. Once this is merged, I will create backport PR(s). Thanks.
Yea, I'll create a PR to fix it. Created the PR: #47859
…ironment-independent

### What changes were proposed in this pull request?

This patch modifies `ProtobufFunctionsSuite`'s test case `SPARK-49121: from_protobuf and to_protobuf SQL functions` to check the stop index depending on the file path length.

### Why are the changes needed?

During debugging CI failure in #47843, we found that `ProtobufFunctionsSuite`'s test case `SPARK-49121: from_protobuf and to_protobuf SQL functions` is environment-dependent. In the test, it checks the start and stop indices of SQL text fragment but the fragment length depends on the repo name of the author of a PR.

### Does this PR introduce _any_ user-facing change?

No, test only.

### How was this patch tested?

Unit test

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #47859 from viirya/fix_protobuf_test.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
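For readers following along, here is a minimal sketch of the idea behind that test fix, not the actual `ProtobufFunctionsSuite` code; the names `descFilePath` and `toProtobufSql` are placeholders. The point is to derive the expected stop index from the generated SQL text instead of hard-coding 153 or 158, so the assertion no longer depends on how long the checkout path is:

```scala
// Sketch only: compute the expected stopIndex from the query text itself.
// `descFilePath` is a placeholder; the real test resolves the descriptor file at runtime,
// which is why the hard-coded index differed between environments.
val descFilePath = "/path/to/descriptor-set-sbt.desc"
val toProtobufSql = s"to_protobuf(complex_struct, 42, '$descFilePath', map())"
val sqlText =
  s"""SELECT
     |$toProtobufSql
     |FROM protobuf_test_table""".stripMargin
// The failing fragment ends at the last character of the to_protobuf(...) call,
// so its stop index shifts with the path length rather than being a fixed constant.
val expectedStop = sqlText.indexOf(toProtobufSql) + toProtobufSql.length - 1
```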
…ysis/TableOutputResolver.scala

Co-authored-by: Hyukjin Kwon <[email protected]>
a895741 to 1760a5d (compare)
Rebased. Thanks @dongjoon-hyun
All tests passed. Merged to master. Please ping me after making backporting PRs for branch-3.5 and branch-3.4, @viirya. Thank you in advance!
Thank you @dongjoon-hyun @HyukjinKwon @parthchandra
Backport PRs: Thanks.
late LGTM
…ssion for map type

### What changes were proposed in this pull request?

Similar to #47843, this patch avoids ArrayTransform in `resolveMapType` function if the resolution expression is the same as input param.

### Why are the changes needed?

My previous pr #47381 was not merged, but I still think it is an optimization, so I reopened it. During the upgrade from Spark 3.1.1 to 3.5.0, I found a performance regression in map type inserts. There are some extra conversion expressions in project before insert, which doesn't seem to be always necessary.

```
map_from_arrays(transform(map_keys(map#516), lambdafunction(lambda key#652, lambda key#652, false)), transform(map_values(map#516), lambdafunction(lambda value#654, lambda value#654, false))) AS map#656
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

added unit test

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #50245 from wForget/SPARK-48922.

Authored-by: wforget <[email protected]>
Signed-off-by: beliefer <[email protected]>
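As an illustration of the technique (not the actual `TableOutputResolver` code; the helper name and parameters below are made up for this sketch), the map branch can keep the original column whenever both lambdas resolve to their own parameters, instead of always rebuilding the map:

```scala
import org.apache.spark.sql.catalyst.expressions._

// Sketch: only rebuild the map via MapFromArrays + ArrayTransform when the key or value
// resolution actually changed something; identity lambdas mean the column can pass through.
def transformMapIfNeeded(
    input: Expression,
    keyParam: NamedLambdaVariable, resolvedKey: Expression,
    valueParam: NamedLambdaVariable, resolvedValue: Expression): Expression = {
  if (resolvedKey == keyParam && resolvedValue == valueParam) {
    input  // both lambdas are identities, so the transform wrappers would be redundant
  } else {
    MapFromArrays(
      ArrayTransform(MapKeys(input), LambdaFunction(resolvedKey, Seq(keyParam))),
      ArrayTransform(MapValues(input), LambdaFunction(resolvedValue, Seq(valueParam))))
  }
}
```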
…expression for map type

### What changes were proposed in this pull request?

Backports #50245 to 3.5.

Similar to #47843, this patch avoids ArrayTransform in `resolveMapType` function if the resolution expression is the same as input param.

### Why are the changes needed?

My previous pr #47381 was not merged, but I still think it is an optimization, so I reopened it. During the upgrade from Spark 3.1.1 to 3.5.0, I found a performance regression in map type inserts. There are some extra conversion expressions in project before insert, which doesn't seem to be always necessary.

```
map_from_arrays(transform(map_keys(map#516), lambdafunction(lambda key#652, lambda key#652, false)), transform(map_values(map#516), lambdafunction(lambda value#654, lambda value#654, false))) AS map#656
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

added unit test

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #50245 from wForget/SPARK-48922.

Authored-by: wforget <643348094qq.com>
Signed-off-by: beliefer <beliefer163.com>
(cherry picked from commit 1be108e)

Closes #50265 from wForget/SPARK-48922-3.5.

Authored-by: wforget <[email protected]>
Signed-off-by: beliefer <[email protected]>
### What changes were proposed in this pull request?

This patch avoids `ArrayTransform` in the `resolveArrayType` function if the resolution expression is the same as the input param.

### Why are the changes needed?

Our customer encounters a significant performance regression when migrating from Spark 3.2 to Spark 3.4 on an `Insert Into` query which is analyzed as an `AppendData` on an Iceberg table. We found that the root cause is that, in Spark 3.4, `TableOutputResolver` resolves the query with an additional `ArrayTransform` on an `ArrayType` field. The `ArrayTransform`'s lambda function is actually an identity function, i.e., the transformation is redundant. A minimal sketch of the idea follows this description.

### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Unit test and manual e2e test
### Was this patch authored or co-authored using generative AI tooling?
No
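A minimal sketch of the core check, assuming hypothetical names (`transformArrayIfNeeded`, `resolvedElement`); the real change lives in `TableOutputResolver`'s `resolveArrayType` path and may differ in detail:

```scala
import org.apache.spark.sql.catalyst.expressions._

// Sketch: when resolving the array element yields the lambda parameter unchanged,
// the per-element transform is an identity and can be skipped entirely.
def transformArrayIfNeeded(
    input: Expression,
    param: NamedLambdaVariable,
    resolvedElement: Expression): Expression = {
  if (resolvedElement == param) {
    input  // no element conversion needed; avoid the redundant transform(...) wrapper
  } else {
    ArrayTransform(input, LambdaFunction(resolvedElement, Seq(param)))
  }
}
```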