
dongjoon-hyun
Member

@dongjoon-hyun dongjoon-hyun commented Jan 31, 2025

What changes were proposed in this pull request?

This PR aims to increase the S3A Vectored IO thresholds for range merging.

Why are the changes needed?

Apache Spark 4.0.0 supports Hadoop Vectored IO via ORC and Parquet.

As a part of HADOOP-18855 VectorIO API tuning/stabilization, Apache Hadoop 3.4.2 will have new default threshold values. We should adopt these updates in advance, before Apache Hadoop 3.4.2 is released.
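
For readers unfamiliar with range merging, here is a minimal Python sketch of the idea, not Hadoop's actual implementation. `merge_ranges`, `min_seek`, and `max_merged` are hypothetical names for this illustration. The threshold values in the example (4K/1M versus 128K/2M) are assumptions about the old and new defaults, which this description does not state. To my understanding, the corresponding S3A settings are `fs.s3a.vectored.read.min.seek.size` and `fs.s3a.vectored.read.max.merged.size`.

```python
from typing import List, Tuple

# A read range is (offset, length), both in bytes.
Range = Tuple[int, int]


def merge_ranges(ranges: List[Range], min_seek: int, max_merged: int) -> List[Range]:
    """Coalesce ranges whose gap is smaller than min_seek, as long as the
    merged range does not grow beyond max_merged bytes (simplified sketch)."""
    merged: List[Range] = []
    for offset, length in sorted(ranges):
        if merged:
            prev_offset, prev_length = merged[-1]
            prev_end = prev_offset + prev_length
            new_end = max(prev_end, offset + length)
            gap_is_small = offset - prev_end < min_seek
            fits = new_end - prev_offset <= max_merged
            if gap_is_small and fits:
                # Reading through the gap is assumed cheaper than a new request.
                merged[-1] = (prev_offset, new_end - prev_offset)
                continue
        merged.append((offset, length))
    return merged


# Three small reads, each about 49 KB apart (illustrative numbers).
reads = [(0, 1000), (50_000, 1000), (100_000, 1000)]

# Assumed smaller thresholds (4K seek, 1M merged): nothing is merged.
small = merge_ranges(reads, 4 * 1024, 1024 * 1024)
# small == [(0, 1000), (50000, 1000), (100000, 1000)]

# Assumed larger thresholds (128K seek, 2M merged): one combined read.
large = merge_ranges(reads, 128 * 1024, 2 * 1024 * 1024)
# large == [(0, 101000)]
```

With larger thresholds, nearby ranges collapse into fewer, bigger requests, which tends to suit object stores such as S3 where each request carries noticeable latency.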

Does this PR introduce any user-facing change?

No. Hadoop Vectored IO features are new in Apache Spark 4.0.0.

How was this patch tested?

Pass the CIs.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the CORE label Jan 31, 2025
Contributor

@cnauroth cnauroth left a comment

+1 (non-binding)

Thanks for the update here, @dongjoon-hyun!

CC: @steveloughran

@dongjoon-hyun
Member Author

Thank you, @cnauroth.

@dongjoon-hyun
Member Author

Could you review this PR when you have some time, @huaxingao?

Contributor

@huaxingao huaxingao left a comment

LGTM. Pending CI

@dongjoon-hyun
Member Author

Thank you, @huaxingao. All tests passed.

dongjoon-hyun added a commit that referenced this pull request Jan 31, 2025
### What changes were proposed in this pull request?

This PR aims to increase the S3A Vectored IO thresholds for range merging.

### Why are the changes needed?

Apache Spark 4.0.0 supports Hadoop Vectored IO via ORC and Parquet.

As a part of [HADOOP-18855 VectorIO API tuning/stabilization](https://issues.apache.org/jira/browse/HADOOP-18855), Apache Hadoop 3.4.2 will have new default threshold values. We should adopt these updates in advance, before Apache Hadoop 3.4.2 is released.

- apache/hadoop#7281

### Does this PR introduce _any_ user-facing change?

No. Hadoop Vectored IO features are new in Apache Spark 4.0.0.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #49748 from dongjoon-hyun/SPARK-51049.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit b62c3f4)
Signed-off-by: Dongjoon Hyun <[email protected]>
@dongjoon-hyun
Member Author

Merged to master/4.0.

@dongjoon-hyun dongjoon-hyun deleted the SPARK-51049 branch January 31, 2025 18:54