Conversation

@viirya (Member) commented Aug 22, 2024

What changes were proposed in this pull request?

This patch avoids ArrayTransform in the resolveArrayType function if the resolved expression is the same as the input parameter.

Why are the changes needed?

Our customer encountered a significant performance regression when migrating from Spark 3.2 to Spark 3.4, on an INSERT INTO query that is analyzed as an AppendData on an Iceberg table.
We found that the root cause is that in Spark 3.4, TableOutputResolver resolves the query with an additional ArrayTransform on an ArrayType field. The ArrayTransform's lambda function is actually an identity function, i.e., the transformation is redundant.
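
A minimal sketch of the idea (illustrative names, not the actual patch): if resolving the array element yields the lambda parameter itself, the lambda is an identity function, so the ArrayTransform wrapper can be skipped and the input column used directly.

```scala
// Hypothetical helper sketching the check: skip the ArrayTransform when the
// resolved element expression is just the lambda parameter, since
// transform(arr, x => x) is a no-op.
import org.apache.spark.sql.catalyst.expressions.{
  ArrayTransform, Expression, LambdaFunction, NamedLambdaVariable}

def wrapInTransformIfNeeded(
    input: Expression,
    param: NamedLambdaVariable,
    resolvedElement: Expression): Expression = {
  if (resolvedElement == param) {
    input // identity lambda: keep the original column, no per-row transform
  } else {
    ArrayTransform(input, LambdaFunction(resolvedElement, Seq(param)))
  }
}
```

Skipping the no-op transform matters because ArrayTransform evaluates its lambda for every element of every row, which is pure overhead when the lambda returns its argument unchanged.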

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit test and manual e2e test

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Aug 22, 2024
@dongjoon-hyun (Member) left a comment

So, is this caused by the following at Apache Spark 3.5.0, @viirya? Or, do we have this regression at Apache Spark 3.4.x too?

@viirya (Member, Author) commented Aug 22, 2024

> So, is this caused by the following at Apache Spark 3.5.0, @viirya? Or, do we have this regression at Apache Spark 3.4.x too?

No, resolveArrayType existed before SPARK-42855. Yeah, the regression has been present since Spark 3.4.

@parthchandra (Contributor) left a comment

Amazing fix @viirya. Thank you!

@HyukjinKwon HyukjinKwon changed the title SPARK-49352: Avoid redundant array transform for identical expression [SPARK-49352][SQL] Avoid redundant array transform for identical expression Aug 23, 2024
@viirya (Member, Author) commented Aug 23, 2024

Hmm, any idea about the test failure below?

```
[info] - SPARK-49121: from_protobuf and to_protobuf SQL functions *** FAILED *** (609 milliseconds)
[info]   157 did not equal 153 Invalid stopIndex of a query context. Actual:SQLQueryContext(Some(3),Some(2),Some(10),Some(157),Some(
[info]   SELECT
[info]     to_protobuf(complex_struct, 42, '/home/runner/work/spark-1/spark-1/connector/protobuf/target/generated-test-sources/descriptor-set-sbt.desc', map())
[info]   FROM protobuf_test_table
[info]   ),None,None) (SparkFunSuite.scala:376)
```

I ran it locally and didn't see the error. I also don't think this change affects that test, which doesn't include any INSERT INTO command or V2 write command.

A member left a comment

This seems to have hit the scalastyle check:

```
[error] /__w/spark-1/spark-1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TableOutputResolver.scala:420: File line length exceeds 100 characters
```

@dongjoon-hyun (Member) left a comment

Hi, @viirya. It fails in my local environment.

```
[info] ProtobufFunctionsSuite:
[info] - SPARK-49121: from_protobuf and to_protobuf SQL functions *** FAILED *** (3 seconds, 918 milliseconds)
[info]   158 did not equal 153 Invalid stopIndex of a query context. Actual:SQLQueryContext(Some(3),Some(2),Some(10),Some(158),Some(
[info]   SELECT
[info]     to_protobuf(complex_struct, 42, '/Users/dongjoon/APACHE/spark-merge/connector/protobuf/target/generated-test-sources/descriptor-set-sbt.desc', map())
[info]   FROM protobuf_test_table
[info]   ),None,None) (SparkFunSuite.scala:376)
```

However, it's the queryContext of a negative-case failure. I believe we can update `stop = 153` to `stop = 158`.

@viirya (Member, Author) commented Aug 23, 2024

> Hi, @viirya. It fails in my local environment.
> However, it's the queryContext of a negative-case failure. I believe we can update `stop = 153` to `stop = 158`.

Interesting, I don't hit the failure locally. I will update it. Thanks.

@viirya (Member, Author) commented Aug 23, 2024

Updated. Thanks @dongjoon-hyun

@dongjoon-hyun (Member) left a comment

Could you make a backporting PR for each applicable release branch?

@viirya (Member, Author) commented Aug 23, 2024

> Could you make a backporting PR for each applicable release branch?

This is a performance regression fix, but not a bug fix. Do we want to backport to earlier branches?

@viirya (Member, Author) commented Aug 23, 2024

Huh, it became 157 now...

```
[info] - SPARK-49121: from_protobuf and to_protobuf SQL functions *** FAILED *** (611 milliseconds)
[info]   157 did not equal 158 Invalid stopIndex of a query context. Actual:SQLQueryContext(Some(3),Some(2),Some(10),Some(157),Some(
[info]   SELECT
[info]     to_protobuf(complex_struct, 42, '/home/runner/work/spark-1/spark-1/connector/protobuf/target/generated-test-sources/descriptor-set-sbt.desc', map())
[info]   FROM protobuf_test_table
[info]   ),None,None) (SparkFunSuite.scala:376)
```

The test seems to be flaky.

I am going to re-trigger it.

@dongjoon-hyun (Member)

Oh, so after this PR, the following ExpectedContext becomes non-deterministic? 153/157/158

```scala
queryContext = Array(ExpectedContext(
  fragment = s"to_protobuf(complex_struct, 42, '$testFileDescFile', map())",
  start = 10,
  stop = 153))
```

@viirya (Member, Author) commented Aug 23, 2024

> Oh, so after this PR, the following ExpectedContext becomes non-deterministic? 153/157/158
>
> ```scala
> queryContext = Array(ExpectedContext(
>   fragment = s"to_protobuf(complex_struct, 42, '$testFileDescFile', map())",
>   start = 10,
>   stop = 153))
> ```

Not sure. But as I mentioned earlier, the test is a simple SELECT without any INSERT INTO; I can't think of why this change would have an impact.

@viirya (Member, Author) commented Aug 23, 2024

Let me try more times to see if I can reproduce locally.

@viirya (Member, Author) commented Aug 23, 2024

Okay, I can reproduce it locally. But when I reverted this change locally, I still reproduced the protobuf test failure.
It seems flaky. I'm looking into it and will open a separate PR once I can fix it.

It is consistently 153 locally, even after running the test many times.

@dongjoon-hyun (Member)

> It is consistently 153 locally, even after running the test many times.

Interesting. It was 158 on my laptop last night. BTW, could you rebase this onto the master branch once more? When I tested your PR, I rebased onto the master branch and ran the test.

@viirya viirya force-pushed the fix_redundant_array_transform branch from 2e45331 to a895741 on August 23, 2024 18:53
@viirya (Member, Author) commented Aug 23, 2024

Rebased. Thanks.

@viirya (Member, Author) commented Aug 23, 2024

Hmm, isn't the stopIndex dependent on the file path?

```
'/home/runner/work/spark-1/spark-1/connector/protobuf/target/generated-test-sources/descriptor-set-sbt.desc'
```

Because my forked repo is named spark-1, the path has additional characters, so the stop index is different.
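
For illustration (with hypothetical paths), the stop index is the offset of the fragment's last character, and the fragment embeds the absolute descriptor-file path, so differently named checkouts produce different indices:

```scala
// Hypothetical illustration: stop = start + fragment.length - 1, and the
// fragment embeds the absolute file path, so a checkout named "spark-1"
// yields a larger stop index than one named "spark".
val start = 10
def stopFor(path: String): Int =
  start + s"to_protobuf(complex_struct, 42, '$path', map())".length - 1

stopFor("/home/runner/work/spark/spark/d.desc")     // smaller stop index
stopFor("/home/runner/work/spark-1/spark-1/d.desc") // larger stop index
```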

@dongjoon-hyun (Member)

If so, then yes, we need to fix the test case itself, because it is environment-dependent.

@dongjoon-hyun (Member)

> Could you make a backporting PR for each applicable release branch?

> This is a performance regression fix, but not a bug fix. Do we want to backport to earlier branches?

When we do a release, a performance regression is considered a bug, isn't it?

@viirya (Member, Author) commented Aug 23, 2024

> When we do a release, a performance regression is considered a bug, isn't it?

Okay. Once this is merged, I will create backport PR(s). Thanks.

@viirya (Member, Author) commented Aug 23, 2024

> If so, then yes, we need to fix the test case itself, because it is environment-dependent.

Yeah, I'll create a PR to fix it.

Created the PR: #47859

dongjoon-hyun pushed a commit that referenced this pull request Aug 23, 2024
…ironment-independent

### What changes were proposed in this pull request?

This patch modifies `ProtobufFunctionsSuite`'s test case `SPARK-49121: from_protobuf and to_protobuf SQL functions` to derive the expected stop index from the file path length.

### Why are the changes needed?

While debugging the CI failure in #47843, we found that `ProtobufFunctionsSuite`'s test case `SPARK-49121: from_protobuf and to_protobuf SQL functions` is environment-dependent.
The test checks the start and stop indices of a SQL text fragment, but the fragment length depends on the repo name of the PR author.

### Does this PR introduce _any_ user-facing change?

No, test only.

### How was this patch tested?

Unit test

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #47859 from viirya/fix_protobuf_test.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@viirya viirya force-pushed the fix_redundant_array_transform branch from a895741 to 1760a5d on August 23, 2024 20:43
@viirya (Member, Author) commented Aug 23, 2024

> I merged the following test-only fix. Could you rebase to master once more, @viirya?

Rebased. Thanks @dongjoon-hyun

@dongjoon-hyun (Member)

All tests passed. Merged to master.

Please ping me after making backporting PRs for branch-3.5 and branch-3.4, @viirya. Thank you in advance!

@viirya (Member, Author) commented Aug 24, 2024

Thank you @dongjoon-hyun @HyukjinKwon @parthchandra

@viirya viirya deleted the fix_redundant_array_transform branch August 24, 2024 00:39
@viirya (Member, Author) commented Aug 24, 2024

Backport PRs:

3.5: #47863
3.4: #47862

Thanks.

cc @dongjoon-hyun

@cloud-fan (Contributor)

late LGTM

beliefer pushed a commit that referenced this pull request Mar 13, 2025
…ssion for map type

### What changes were proposed in this pull request?

Similar to #47843, this patch avoids ArrayTransform in the `resolveMapType` function if the resolved expression is the same as the input parameter.

### Why are the changes needed?

My previous PR #47381 was not merged, but I still think it is an optimization, so I reopened it.

During the upgrade from Spark 3.1.1 to 3.5.0, I found a performance regression in map-type inserts.

There are extra conversion expressions in the Project before the insert, which don't always seem to be necessary:

```
map_from_arrays(transform(map_keys(map#516), lambdafunction(lambda key#652, lambda key#652, false)), transform(map_values(map#516), lambdafunction(lambda value#654, lambda value#654, false))) AS map#656
```
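
A minimal sketch of the analogous check (illustrative names, not the actual patch): when both the key and value lambdas are identities, the map_from_arrays/transform rewrite is a no-op, so the input map can be kept as-is.

```scala
// Hypothetical helper: skip the map rewrite when both lambdas are identities,
// since rebuilding the map from transformed key/value arrays would copy data
// for no semantic effect.
import org.apache.spark.sql.catalyst.expressions.{
  ArrayTransform, Expression, LambdaFunction, MapFromArrays, MapKeys,
  MapValues, NamedLambdaVariable}

def rewriteMapIfNeeded(
    input: Expression,
    keyParam: NamedLambdaVariable, resolvedKey: Expression,
    valueParam: NamedLambdaVariable, resolvedValue: Expression): Expression = {
  if (resolvedKey == keyParam && resolvedValue == valueParam) {
    input // both lambdas are identities: keep the original map column
  } else {
    MapFromArrays(
      ArrayTransform(MapKeys(input), LambdaFunction(resolvedKey, Seq(keyParam))),
      ArrayTransform(MapValues(input), LambdaFunction(resolvedValue, Seq(valueParam))))
  }
}
```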

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

added unit test

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #50245 from wForget/SPARK-48922.

Authored-by: wforget <[email protected]>
Signed-off-by: beliefer <[email protected]>
beliefer pushed a commit that referenced this pull request Mar 13, 2025
…ssion for map type

Closes #50245 from wForget/SPARK-48922.

Authored-by: wforget <[email protected]>
Signed-off-by: beliefer <[email protected]>
(cherry picked from commit 1be108e)
Signed-off-by: beliefer <[email protected]>
wForget added a commit to wForget/spark that referenced this pull request Mar 13, 2025
…ssion for map type

Closes apache#50245 from wForget/SPARK-48922.

Authored-by: wforget <[email protected]>
Signed-off-by: beliefer <[email protected]>
(cherry picked from commit 1be108e)
beliefer pushed a commit that referenced this pull request Mar 13, 2025
…expression for map type

Backports #50245 to 3.5.

Closes #50265 from wForget/SPARK-48922-3.5.

Authored-by: wforget <[email protected]>
Signed-off-by: beliefer <[email protected]>
(cherry picked from commit 1be108e)
zifeif2 pushed a commit to zifeif2/spark that referenced this pull request Nov 14, 2025
…ssion for map type

Closes apache#50245 from wForget/SPARK-48922.

Authored-by: wforget <[email protected]>
Signed-off-by: beliefer <[email protected]>
(cherry picked from commit 2eb50e5)
Signed-off-by: beliefer <[email protected]>