[SPARK-42040][SQL] SPJ: Introduce a new API for V2 input partition to report partition statistics #45314

zhuqi-lucas · 2024-02-28T11:31:30Z

What changes were proposed in this pull request?

This PR introduces a new V2 mix-in, HasPartitionStatistics, which can be used to return the partition statistics of an InputPartition.

Why are the changes needed?

As part of the Storage Partitioned Join work (SPIP), we need a way for a V2 InputPartition to report its partition size (in bytes), so that Spark can use this information to decide whether partition grouping should be applied.

This will be used later in follow-up PRs.

Does this PR introduce any user-facing change?

Yes, a new V2 mix-in HasPartitionStatistics will be introduced.
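For readers unfamiliar with V2 mix-ins, the following is a minimal sketch of what a connector-side implementation might look like. It is not code from this PR: `InputPartition` and `HasPartitionStatistics` are declared locally as stand-ins so the example compiles without Spark, only the two methods visible in the review diff (`sizeInBytes` and `numRows`) are mirrored, and `FileGroupPartition` is a hypothetical class invented for illustration.

```java
import java.util.OptionalLong;

// Local stand-in for Spark's InputPartition (assumption: declared here only
// so the sketch compiles without Spark on the classpath).
interface InputPartition extends java.io.Serializable {}

// Local stand-in for the new mix-in. Only the two methods shown in the
// review diff are mirrored here.
interface HasPartitionStatistics extends InputPartition {
  OptionalLong sizeInBytes();
  OptionalLong numRows();
}

// Hypothetical connector partition backed by a group of files.
final class FileGroupPartition implements HasPartitionStatistics {
  private final long[] fileSizes;
  private final long knownRowCount; // a negative value means "unknown"

  FileGroupPartition(long[] fileSizes, long knownRowCount) {
    this.fileSizes = fileSizes.clone();
    this.knownRowCount = knownRowCount;
  }

  @Override
  public OptionalLong sizeInBytes() {
    // The partition size is the sum of the sizes of its files.
    long total = 0L;
    for (long size : fileSizes) {
      total += size;
    }
    return OptionalLong.of(total);
  }

  @Override
  public OptionalLong numRows() {
    // Report an empty value rather than a guess when the count is unknown.
    return knownRowCount >= 0 ? OptionalLong.of(knownRowCount) : OptionalLong.empty();
  }
}
```

Returning `OptionalLong.empty()` lets a source that cannot cheaply compute a statistic say so explicitly, instead of reporting a misleading zero.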

How was this patch tested?

Extended InMemoryTable to support the new interface, and added a new unit test that verifies the API using mocked partition statistics.

Was this patch authored or co-authored using generative AI tooling?

no


zhuqi-lucas · 2024-02-28T11:32:58Z

cc @sunchao @szehon-ho I'm trying to help with this task. Could you take a look? Thanks!

sunchao

cc @aokolnychyi @RussellSpitzer @rdblue do you think this could be useful for Iceberg to pass partition stats to Spark? SPJ could leverage this to make better decisions on how to combine partitions (like which side to choose during partially clustered distribution), but I'm not sure whether there are more use cases.
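To make the "which side to choose" idea concrete, here is a rough sketch of comparing the total reported size of each join side. This is not Spark's actual selection logic; `BigSideSketch`, `Side`, `totalSize`, and `biggerSide` are hypothetical names, and the inputs are assumed to be the per-partition `sizeInBytes()` values reported by each side.

```java
import java.util.List;
import java.util.OptionalLong;

// Hypothetical helper, not part of Spark: picks the larger join side from
// per-partition size statistics.
final class BigSideSketch {
  enum Side { LEFT, RIGHT, UNKNOWN }

  // Sums the reported sizes; returns empty if any partition did not report one.
  static OptionalLong totalSize(List<OptionalLong> sizes) {
    long total = 0L;
    for (OptionalLong size : sizes) {
      if (!size.isPresent()) {
        return OptionalLong.empty();
      }
      total += size.getAsLong();
    }
    return OptionalLong.of(total);
  }

  // Returns the side with the larger total size (ties go to LEFT), or
  // UNKNOWN when either side has incomplete statistics.
  static Side biggerSide(List<OptionalLong> left, List<OptionalLong> right) {
    OptionalLong leftTotal = totalSize(left);
    OptionalLong rightTotal = totalSize(right);
    if (!leftTotal.isPresent() || !rightTotal.isPresent()) {
      return Side.UNKNOWN;
    }
    return leftTotal.getAsLong() >= rightTotal.getAsLong() ? Side.LEFT : Side.RIGHT;
  }
}
```

Treating incomplete statistics as `UNKNOWN` is one possible design choice; a planner could equally fall back to table-level statistics in that case.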

sunchao · 2024-03-05T18:24:20Z

sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/HasPartitionSize.java

+ * @see org.apache.spark.sql.connector.read.SupportsReportPartitioning
+ * @since 4.0.0
+ */
+public interface HasPartitionSize extends InputPartition {


I wonder if we can make this more general and support partition level stats as well, like number of rows.

Thank you @sunchao for the review; this is a good suggestion. I checked the Iceberg code, and it includes sizeBytes, estimatedRowsCount, and filesCount. Let me address this!

@Override
default long sizeBytes() {
  return tasks().stream().mapToLong(ScanTask::sizeBytes).sum();
}

@Override
default long estimatedRowsCount() {
  return tasks().stream().mapToLong(ScanTask::estimatedRowsCount).sum();
}

@Override
default int filesCount() {
  return tasks().stream().mapToInt(ScanTask::filesCount).sum();
}

Addressed in latest PR.

sunchao

LGTM with one nit, cc @cloud-fan too

sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/HasPartitionStatistics.java

beliefer · 2024-03-19T10:59:23Z

sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/HasPartitionStatistics.java

+ * @see org.apache.spark.sql.connector.read.SupportsReportPartitioning
+ * @since 4.0.0
+ */
+public interface HasPartitionStatistics extends InputPartition {


How about ReportStatisticsPartition ?

Thanks @beliefer for the review. Because we already use HasPartitionKey for the partition key, I kept the name HasPartitionStatistics so that it stays consistent within the SPJ feature.

sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/HasPartitionStatistics.java

sunchao · 2024-03-26T05:52:45Z

sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/HasPartitionStatistics.java

+   * Returns the value of the partition statistics associated to this partition.
+   */
+  OptionalLong sizeInBytes();
+  OptionalLong numRows();


@zhuqi-lucas could we add some comments for numRows and fileCount too?

Thank you @sunchao for this suggestion; addressed in the latest PR.

sunchao · 2024-03-26T05:53:15Z

sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/HasPartitionStatistics.java

+ */
+public interface HasPartitionStatistics extends InputPartition {
+  /**
+   * Returns the value of the partition statistics associated to this partition.


Hmm I think this comment is not correct?

Addressed in latest PR.

szehon-ho · 2024-03-26T18:49:30Z

> cc @aokolnychyi @RussellSpitzer @rdblue do you think this could be useful for Iceberg to pass partition stats to Spark? SPJ could leverage this to make better decisions on how to combine partitions (like which side to choose during partially clustered distribution), but I'm not sure whether there are more use cases.

@sunchao Aside from picking the side for partially clustered distribution, would we also be able to use it to group smaller partitions? For example, if a table is partitioned by date and older days have little data (on both sides), we could group many of the older days into the same task.

This is similar to AQE partition coalescing, but that appears to apply only after a shuffle, so it doesn't seem to apply to SPJ?

szehon-ho · 2024-03-26T19:01:50Z

sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/HasPartitionStatistics.java

+   * Returns the size in bytes of the partition statistics associated to this partition.
+   */
+  OptionalLong sizeInBytes();
+  /**


Nit: can we add a blank line between each method and the next Javadoc comment?

Thank you @szehon-ho for review, addressed in latest PR.

sunchao · 2024-03-27T04:50:33Z

> @sunchao Aside from picking the side for partially clustered distribution, would we also be able to use it to group smaller partitions? For example, if a table is partitioned by date and older days have little data (on both sides), we could group many of the older days into the same task.

Yeah, I think that would be an interesting use case. If we know the partitions from both sides of the join AND the size of each partition, we can probably make better decisions.

> This is similar to AQE partition coalescing, but that appears to apply only after a shuffle, so it doesn't seem to apply to SPJ?

Right, this doesn't apply to SPJ.
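The grouping idea discussed in this exchange (packing many small, older date partitions into one task) could be sketched as a simple greedy pass over partition sizes. This is only an illustration under stated assumptions, not something this PR implements: `SmallPartitionGrouping`, `group`, and `targetBytes` are hypothetical names, and `sizes[i]` is assumed to be the size in bytes for partition key `i`, in key order (for a join, this could be the combined size of both sides for that key).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Spark: greedily groups consecutive
// partitions so that each group's total size stays within a target.
final class SmallPartitionGrouping {
  // Returns groups of partition indices. A single partition larger than
  // targetBytes is kept alone in its own group.
  static List<List<Integer>> group(long[] sizes, long targetBytes) {
    List<List<Integer>> groups = new ArrayList<>();
    List<Integer> current = new ArrayList<>();
    long currentBytes = 0L;
    for (int i = 0; i < sizes.length; i++) {
      // Close the current group if adding this partition would exceed the target.
      if (!current.isEmpty() && currentBytes + sizes[i] > targetBytes) {
        groups.add(current);
        current = new ArrayList<>();
        currentBytes = 0L;
      }
      current.add(i);
      currentBytes += sizes[i];
    }
    if (!current.isEmpty()) {
      groups.add(current);
    }
    return groups;
  }
}
```

For example, with sizes `{10, 20, 30, 500, 40}` and a target of 100 bytes, the three small leading partitions share one group, while the large partition and the trailing one each get their own. In a storage-partitioned join, both sides would need to apply the same grouping so that matching keys still land in the same task.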

sunchao · 2024-03-27T04:57:57Z

Thanks, merged to master!

… report partition statistics

### What changes were proposed in this pull request?

This PR introduces a new V2 mix-in, HasPartitionStatistics, which can be used to return the partition statistics of an InputPartition.

### Why are the changes needed?

As part of the Storage Partitioned Join work ([SPIP](https://issues.apache.org/jira/browse/SPARK-37166)), we need a way for a V2 InputPartition to report its partition size (in bytes), so that Spark can use this information to decide whether partition grouping should be applied.

This will be used later in follow-up PRs.

### Does this PR introduce _any_ user-facing change?

Yes, a new V2 mix-in HasPartitionStatistics will be introduced.

### How was this patch tested?

Extended InMemoryTable to support the new interface, and added a new unit test that verifies the API using mocked partition statistics.

### Was this patch authored or co-authored using generative AI tooling?

no

Closes apache#45314 from zhuqi-lucas/SPARK-42040.

Authored-by: qzhu <[email protected]>
Signed-off-by: Your Name <[email protected]>

SPARK-42040: SPJ: Introduce a new API for V2 input partition to report partition size

5961fe7

github-actions bot added the SQL label Feb 28, 2024

HyukjinKwon changed the title ~~SPARK-42040: SPJ: Introduce a new API for V2 input partition to …t partition size~~ [SPARK-42040][SQL]: SPJ: Introduce a new API for V2 input partition to …t partition size Feb 29, 2024

HyukjinKwon changed the title ~~[SPARK-42040][SQL]: SPJ: Introduce a new API for V2 input partition to …t partition size~~ [SPARK-42040][SQL] SPJ: Introduce a new API for V2 input partition to partition size Feb 29, 2024

zhuqi-lucas changed the title ~~[SPARK-42040][SQL] SPJ: Introduce a new API for V2 input partition to partition size~~ [SPARK-42040][SQL] SPJ: Introduce a new API for V2 input partition to report partition size Feb 29, 2024

sunchao reviewed Mar 5, 2024


zhuqi-lucas changed the title ~~[SPARK-42040][SQL] SPJ: Introduce a new API for V2 input partition to report partition size~~ [SPARK-42040][SQL] SPJ: Introduce a new API for V2 input partition to report partition statistics Mar 6, 2024

qzhu added 2 commits March 6, 2024 11:39

Address comments.

c7e464f

Merge remote-tracking branch 'upstream/master' into SPARK-42040

a606175

zhuqi-lucas requested a review from sunchao March 11, 2024 08:36

sunchao approved these changes Mar 18, 2024


cloud-fan approved these changes Mar 19, 2024


beliefer reviewed Mar 19, 2024


qzhu added 2 commits March 26, 2024 11:22

Merge remote-tracking branch 'upstream/master' into SPARK-42040

8c6c1e0

Address new comments

c3bcc09

sunchao reviewed Mar 26, 2024


Address new comments

48742a3

szehon-ho reviewed Mar 26, 2024


Address new comments

9ceaea0

sunchao closed this in eef44f0 Mar 27, 2024

IgorBerman mentioned this pull request Jun 18, 2025

Feature/spark 44647 backport 3.5.6 #51218

Closed


5 participants
