viirya (Member) commented Aug 24, 2024

What changes were proposed in this pull request?

This patch avoids `ArrayTransform` in the `resolveArrayType` function if the resolved expression is the same as the input parameter.

Why are the changes needed?

Our customer encountered a significant performance regression when migrating from Spark 3.2 to Spark 3.4 on an `Insert Into` query that is analyzed as an `AppendData` on an Iceberg table.
We found that the root cause is that in Spark 3.4, `TableOutputResolver` resolves the query with an additional `ArrayTransform` on an `ArrayType` field. The `ArrayTransform`'s lambda function is actually an identity function, i.e., the transformation is redundant.
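The idea behind the fix can be sketched outside Spark as follows. This is a simplified, hypothetical model: `Expr`, `Param`, `ArrayTransform`, and `resolveArrayType` here are stand-ins for the real Catalyst classes, not the actual Spark source. The key check is that if resolving the lambda parameter returns the parameter unchanged, the lambda is the identity and the `ArrayTransform` wrapper can be elided.

```scala
object IdentityTransformElision {
  // Minimal stand-ins for Catalyst expressions (not the real Spark classes).
  sealed trait Expr
  final case class Param(name: String) extends Expr
  final case class ArrayTransform(input: Expr, lambda: Expr => Expr) extends Expr

  // If resolving the lambda parameter yields it unchanged, the lambda is the
  // identity, so wrapping the input in ArrayTransform would be redundant and
  // the input can be returned as-is.
  def resolveArrayType(input: Expr, resolve: Expr => Expr): Expr = {
    val param = Param("element")
    val resolved = resolve(param)
    if (resolved == param) input else ArrayTransform(input, resolve)
  }

  def main(args: Array[String]): Unit = {
    val arr = Param("arr_col")
    // Identity resolution: the input column is returned untouched.
    assert(resolveArrayType(arr, identity) == arr)
    // A non-identity resolution (e.g. an element cast) is still wrapped.
    assert(resolveArrayType(arr, _ => Param("casted")).isInstanceOf[ArrayTransform])
  }
}
```

Before the fix, the wrapper was produced unconditionally, so every row paid the cost of evaluating a no-op lambda over each array element during the `AppendData` write.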

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit test and manual e2e test

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Aug 24, 2024

viirya commented Aug 24, 2024

cc @dongjoon-hyun

dongjoon-hyun (Member) left a comment

+1, LGTM (Pending CIs). Thank you, @viirya .

dongjoon-hyun pushed a commit that referenced this pull request Aug 24, 2024
… expression

### What changes were proposed in this pull request?

This patch avoids `ArrayTransform` in the `resolveArrayType` function if the resolved expression is the same as the input parameter.

### Why are the changes needed?

Our customer encountered a significant performance regression when migrating from Spark 3.2 to Spark 3.4 on an `Insert Into` query that is analyzed as an `AppendData` on an Iceberg table.
We found that the root cause is that in Spark 3.4, `TableOutputResolver` resolves the query with an additional `ArrayTransform` on an `ArrayType` field. The `ArrayTransform`'s lambda function is actually an identity function, i.e., the transformation is redundant.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test and manual e2e test

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #47863 from viirya/fix_redundant_array_transform_3.5.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@dongjoon-hyun

Merged to branch-3.5.


viirya commented Aug 24, 2024

Thanks @dongjoon-hyun

@viirya viirya deleted the fix_redundant_array_transform_3.5 branch August 24, 2024 05:27
turboFei pushed a commit to turboFei/spark that referenced this pull request Nov 6, 2025
… expression (apache#553)


Closes apache#47863 from viirya/fix_redundant_array_transform_3.5.

Authored-by: Liang-Chi Hsieh <[email protected]>

Signed-off-by: Dongjoon Hyun <[email protected]>
Co-authored-by: Liang-Chi Hsieh <[email protected]>