-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-45036][SQL] SPJ: Simplify the logic to handle partially clustered distribution #42757
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
szehon-ho
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, its cleaner and a bit easier to read than before
| * | ||
| * @param expressions partition expressions for the partitioning. | ||
| * @param numPartitions the number of partitions | ||
| * @param partitionValues the values for the cluster keys of the distribution, must be |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think to add 'final cluster keys' to the javadoc , to make it even more clear?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure
| * Information about key-grouped partitions, which contains a list of grouped partitions as well | ||
| * as the original input partitions before the grouping. | ||
| */ | ||
| private[v2] case class KeyGroupedPartitionInfo( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like it would refer to info about one KeyGroupedPartition. How about KeyGroupedPartitionInfos ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about it but is Infos a proper plural noun?
| (splits.head.asInstanceOf[HasPartitionKey].partitionKey(), splits) | ||
| }) | ||
|
|
||
| // This means the input partitions are not grouped by partition values. We'll need to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: can we clarify 'this' When partially-clustered, input partitions are not grouped by partition values
Nit: groupByPartitionValues seems never actually defined, can we fix it? Does it refer to groupedPartitions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, let me improve this comments too.
|
cc @viirya @dongjoon-hyun @cloud-fan @Hisoka-X too |
Hisoka-X
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking great!
dongjoon-hyun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM.
|
Merged to master for Apache Spark 4.0.0. |
|
Thanks all! |
|
@sunchao After this pr is merged, there are test failures in the daily tests of Scala 2.13, we can reproduce the issue locally by executing the following command: run Do you have time to fix this ? @sunchao https://github.com/apache/spark/actions/runs/6088706713/job/16519965988
|
| val sortedKeyToPartitions = results.sorted(partitionOrdering) | ||
| val groupedPartitions = sortedKeyToPartitions | ||
| .map(t => (InternalRowComparableWrapper(t._1, expressions), t._2)) | ||
| .groupBy(_._1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem likely comes from this groupBy, as there are some differences between Scala 2.12 and Scala 2.13.
For example:
- Scala 2.12.18
Welcome to Scala 2.12.18 (OpenJDK 64-Bit Server VM, Java 1.8.0_382).
Type in expressions for evaluation. Or try :help.
scala> val input = Seq((50,50),(51,51),(52,52))
input: Seq[(Int, Int)] = List((50,50), (51,51), (52,52))
scala> input.groupBy(_._1).toSeq
res0: Seq[(Int, Seq[(Int, Int)])] = Vector((50,List((50,50))), (51,List((51,51))), (52,List((52,52))))
- Scala 2.13.8
Welcome to Scala 2.13.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_382).
Type in expressions for evaluation. Or try :help.
scala> val input = Seq((50,50),(51,51),(52,52))
val input: Seq[(Int, Int)] = List((50,50), (51,51), (52,52))
scala> input.groupBy(_._1).toSeq
val res0: Seq[(Int, Seq[(Int, Int)])] = List((52,List((52,52))), (50,List((50,50))), (51,List((51,51))))
We can see that when using Scala 2.13.8, the order of the results has changed.
The possible fix maybe:
- Using another function to replace
groupByto maintain the output order, such asfoldLeft with LinkedHashMap? - Re-sorting the
groupedPartitions?
Perhaps there are other better ways to fix it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @LuciferYang for the findings! Yes it's a bug as I was assuming the order will be preserved in the groupBy. Let me open a follow-up PR to fix this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I opened #42839 to fix this.
…ted according to partition values ### What changes were proposed in this pull request? This PR makes sure the result grouped partitions from `DataSourceV2ScanExec#groupPartitions` are sorted according to the partition values. Previously in the #42757 we were assuming Scala would preserve the input ordering but apparently that's not the case. ### Why are the changes needed? See #42757 (comment) for diagnosis. The partition ordering is a fundamental property for SPJ and thus must be guaranteed. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? We have tests in `KeyGroupedPartitioningSuite` to cover this. ### Was this patch authored or co-authored using generative AI tooling? Closes #42839 from sunchao/SPARK-45036-followup. Authored-by: Chao Sun <[email protected]> Signed-off-by: yangjie01 <[email protected]>
…red distribution In SPJ, currently the logic to handle partially clustered distribution is a bit complicated. For instance, when the feature is eanbled (by enabling both `conf.v2BucketingPushPartValuesEnabled` and `conf.v2BucketingPartiallyClusteredDistributionEnabled`), Spark should postpone the combining of input splits until it is about to create an input RDD in `BatchScanExec`. To implement this, `groupPartitions` in `DataSourceV2ScanExecBase` currently takes the flag as input and has two different behaviors, which could be confusing. This PR introduces a new field in `KeyGroupedPartitioning`, named `originalPartitionValues`, that is used to store the original partition values from input before splits combining has been applied. The field is used when partially clustered distribution is enabled. With this, `groupPartitions` becomes easier to understand. In addition, this also simplifies `BatchScanExec.inputRDD` by combining two branches where partially clustered distribution is not enabled. To simplify the current logic in the SPJ w.r.t partially clustered distribution. No Existing tests. Closes apache#42757 from sunchao/SPARK-45036. Authored-by: Chao Sun <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
…ted according to partition values ### What changes were proposed in this pull request? This PR makes sure the result grouped partitions from `DataSourceV2ScanExec#groupPartitions` are sorted according to the partition values. Previously in the apache#42757 we were assuming Scala would preserve the input ordering but apparently that's not the case. ### Why are the changes needed? See apache#42757 (comment) for diagnosis. The partition ordering is a fundamental property for SPJ and thus must be guaranteed. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? We have tests in `KeyGroupedPartitioningSuite` to cover this. ### Was this patch authored or co-authored using generative AI tooling? Closes apache#42839 from sunchao/SPARK-45036-followup. Authored-by: Chao Sun <[email protected]> Signed-off-by: yangjie01 <[email protected]>
…red distribution In SPJ, currently the logic to handle partially clustered distribution is a bit complicated. For instance, when the feature is eanbled (by enabling both `conf.v2BucketingPushPartValuesEnabled` and `conf.v2BucketingPartiallyClusteredDistributionEnabled`), Spark should postpone the combining of input splits until it is about to create an input RDD in `BatchScanExec`. To implement this, `groupPartitions` in `DataSourceV2ScanExecBase` currently takes the flag as input and has two different behaviors, which could be confusing. This PR introduces a new field in `KeyGroupedPartitioning`, named `originalPartitionValues`, that is used to store the original partition values from input before splits combining has been applied. The field is used when partially clustered distribution is enabled. With this, `groupPartitions` becomes easier to understand. In addition, this also simplifies `BatchScanExec.inputRDD` by combining two branches where partially clustered distribution is not enabled. To simplify the current logic in the SPJ w.r.t partially clustered distribution. No Existing tests. Closes apache#42757 from sunchao/SPARK-45036. Authored-by: Chao Sun <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>

What changes were proposed in this pull request?
In SPJ, currently the logic to handle partially clustered distribution is a bit complicated. For instance, when the feature is eanbled (by enabling both
conf.v2BucketingPushPartValuesEnabledandconf.v2BucketingPartiallyClusteredDistributionEnabled), Spark should postpone the combining of input splits until it is about to create an input RDD inBatchScanExec. To implement this,groupPartitionsinDataSourceV2ScanExecBasecurrently takes the flag as input and has two different behaviors, which could be confusing.This PR introduces a new field in
KeyGroupedPartitioning, namedoriginalPartitionValues, that is used to store the original partition values from input before splits combining has been applied. The field is used when partially clustered distribution is enabled. With this,groupPartitionsbecomes easier to understand.In addition, this also simplifies
BatchScanExec.inputRDDby combining two branches where partially clustered distribution is not enabled.Why are the changes needed?
To simplify the current logic in the SPJ w.r.t partially clustered distribution.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Existing tests.
Was this patch authored or co-authored using generative AI tooling?