[SPARK-45036][FOLLOWUP][SQL] SPJ: Make sure result partitions are sorted according to partition values #42839

sunchao · 2023-09-06T20:09:10Z

What changes were proposed in this pull request?

This PR makes sure that the grouped partitions returned by DataSourceV2ScanExec#groupPartitions are sorted according to their partition values. Previously, in #42757, we assumed that Scala would preserve the input ordering, but apparently that is not the case.
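For illustration only, here is a minimal sketch of the pitfall in plain Scala rather than the actual Spark code. `groupBy` returns a `Map`, and the iteration order of a general `Map` is not something callers can rely on, so the grouped result has to be sorted explicitly. The object name `GroupThenSort`, the `Int` keys, and the string partition labels are all hypothetical stand-ins for the real partition keys and input partitions.

```scala
object GroupThenSort {
  // Hypothetical (partition key, input partition) pairs, deliberately unsorted.
  val results: Seq[(Int, String)] =
    Seq(3 -> "p3a", 1 -> "p1a", 2 -> "p2a", 3 -> "p3b", 1 -> "p1b")

  // groupBy yields a Map, so the order of the groups after toSeq is not
  // guaranteed to match the input order. Sorting by key afterwards makes
  // the order of the grouped partitions deterministic.
  val grouped: Seq[(Int, Seq[String])] =
    results
      .groupBy(_._1)
      .toSeq
      .map { case (key, group) => key -> group.map(_._2) }
      .sortBy(_._1)

  def main(args: Array[String]): Unit = grouped.foreach(println)
}
```

Within each group the elements keep their relative input order; only the order of the groups themselves needs the explicit sort.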

Why are the changes needed?

See #42757 (comment) for diagnosis. The partition ordering is a fundamental property for SPJ and thus must be guaranteed.

Does this PR introduce any user-facing change?

No

How was this patch tested?

We have tests in KeyGroupedPartitioningSuite to cover this.

Was this patch authored or co-authored using generative AI tooling?

.../src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExecBase.scala

dongjoon-hyun

Could you re-trigger the CI, @sunchao?

LuciferYang · 2023-09-07T03:26:14Z

.../src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExecBase.scala

-        val sortedKeyToPartitions = results.sorted(partitionOrdering)
-        val groupedPartitions = sortedKeyToPartitions
+        val rowOrdering = RowOrdering.createNaturalAscendingOrdering(partitionDataTypes)
+        val sortedKeyToPartitions = results.sorted(rowOrdering.on(_._1))


Suggested change

val sortedKeyToPartitions = results.sorted(rowOrdering.on(_._1))

val sortedKeyToPartitions = results.sorted(rowOrdering.on((t: (InternalRow, _)) => t._1))

This fixes the Scala 2.13 build:

[error] /home/runner/work/spark/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExecBase.scala:147:67: missing parameter type for expanded function ((<x$7: error>) => x$7._1)
[error] val sortedKeyToPartitions = results.sorted(rowOrdering.on(_._1))

Oops, I didn't realize it doesn't compile this way.

LuciferYang · 2023-09-07T03:26:29Z

.../src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExecBase.scala

            .groupBy(_._1)
            .toSeq
            .map { case (key, s) => KeyGroupedPartition(key.row, s.map(_._2)) }
+            .sorted(rowOrdering.on(_.value))


Suggested change

.sorted(rowOrdering.on(_.value))

.sorted(rowOrdering.on((k: KeyGroupedPartition) => k.value))
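As a rough sketch of what the two suggestions do, the snippet below uses only the Scala standard library. `Key`, `Grouped`, and `keyOrdering` are hypothetical stand-ins for `InternalRow`, `KeyGroupedPartition`, and `rowOrdering`; they are not the Spark types. Annotating the lambda parameter tells `Ordering.on` which element type to produce, so the compiler does not have to infer it from the surrounding `sorted` call.

```scala
object OrderingOnDemo {
  // Hypothetical stand-ins for the partition key and the grouped partition.
  final case class Key(id: Int)
  final case class Grouped(value: Key, parts: Seq[String])

  // Plays the role of rowOrdering: an Ordering defined on the key type only.
  val keyOrdering: Ordering[Key] = Ordering.by[Key, Int](_.id)

  val pairs: Seq[(Key, String)] = Seq(Key(2) -> "b", Key(1) -> "a")
  val groups: Seq[Grouped] =
    Seq(Grouped(Key(2), Seq("b")), Grouped(Key(1), Seq("a")))

  // Explicit parameter types, mirroring the two suggested changes:
  // `on` turns the Ordering[Key] into an Ordering on the element type.
  val sortedPairs: Seq[(Key, String)] =
    pairs.sorted(keyOrdering.on((t: (Key, String)) => t._1))
  val sortedGroups: Seq[Grouped] =
    groups.sorted(keyOrdering.on((g: Grouped) => g.value))
}
```

With these definitions, `sortBy(_._1)(keyOrdering)` would be another way to express the same sort, though the PR itself uses `sorted` with `on`.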

LuciferYang

LGTM

LuciferYang · 2023-09-07T10:20:34Z

Merged into master. Thanks @sunchao @viirya @dongjoon-hyun @Hisoka-X

dongjoon-hyun · 2023-09-07T18:28:46Z

Thank you, all!

…ted according to partition values

### What changes were proposed in this pull request?

This PR makes sure the result grouped partitions from `DataSourceV2ScanExec#groupPartitions` are sorted according to the partition values. Previously in the apache#42757 we were assuming Scala would preserve the input ordering but apparently that's not the case.

### Why are the changes needed?

See apache#42757 (comment) for diagnosis. The partition ordering is a fundamental property for SPJ and thus must be guaranteed.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

We have tests in `KeyGroupedPartitioningSuite` to cover this.

### Was this patch authored or co-authored using generative AI tooling?

Closes apache#42839 from sunchao/SPARK-45036-followup.

Authored-by: Chao Sun <[email protected]>
Signed-off-by: yangjie01 <[email protected]>

initial commit

2aa29f3

github-actions bot added the SQL label Sep 6, 2023

sunchao mentioned this pull request Sep 6, 2023

[SPARK-45036][SQL] SPJ: Simplify the logic to handle partially clustered distribution #42757

Closed

viirya reviewed Sep 6, 2023

.../src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExecBase.scala (outdated)

viirya approved these changes Sep 6, 2023

Hisoka-X approved these changes Sep 7, 2023

dongjoon-hyun reviewed Sep 7, 2023

LuciferYang reviewed Sep 7, 2023

fix compilation

700ee83

LuciferYang approved these changes Sep 7, 2023

LuciferYang closed this in af1615d Sep 7, 2023

IgorBerman mentioned this pull request Jun 18, 2025

Feature/spark 44647 backport 3.5.6 #51218

Closed
