[SPARK-30751][SQL] Combine the skewed readers into one in AQE skew join optimizations #27493

cloud-fan · 2020-02-07T18:15:56Z

What changes were proposed in this pull request?

This is a followup of #26434

This PR uses one special shuffle reader for skew join, so that we only have one join after optimization. In order to do that, this PR:

1. adds a very general CustomShuffledRowRDD which supports all kinds of partition arrangements.
2. moves the logic of coalescing shuffle partitions to a util function and calls it during skew join optimization, to fully decouple it from the ReduceNumShufflePartitions rule. Making skew join interact with ReduceNumShufflePartitions is too complicated, because you need to account for the sizes of split partitions, which already don't respect the target size.
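To make "all kinds of partition arrangements" concrete, here is a minimal sketch of how partition specs could map to shuffle blocks. It is a simplified model, not the Spark classes: `SinglePartitionSpec` and `CoalescedPartitionSpec` mirror names that appear in this review, while `PartialPartitionSpec` and the `blocks` helper are assumptions introduced only for illustration.

```scala
object PartitionSpecSketch {
  sealed trait ShufflePartitionSpec

  // Read one reducer partition in full.
  final case class SinglePartitionSpec(reducerIndex: Int) extends ShufflePartitionSpec

  // Read reducer partitions [startReducerIndex, endReducerIndex) as one task.
  final case class CoalescedPartitionSpec(startReducerIndex: Int, endReducerIndex: Int)
    extends ShufflePartitionSpec

  // Assumed name: read one reducer partition, but only the output of mappers
  // [startMapIndex, endMapIndex). This models one slice of a skewed partition.
  final case class PartialPartitionSpec(reducerIndex: Int, startMapIndex: Int, endMapIndex: Int)
    extends ShufflePartitionSpec

  // Hypothetical helper: expand a spec into the (mapIndex, reducerIndex) blocks it covers.
  def blocks(spec: ShufflePartitionSpec, numMappers: Int): Seq[(Int, Int)] = spec match {
    case SinglePartitionSpec(r) =>
      (0 until numMappers).map(m => (m, r))
    case CoalescedPartitionSpec(start, end) =>
      for (r <- start until end; m <- 0 until numMappers) yield (m, r)
    case PartialPartitionSpec(r, startMap, endMap) =>
      (startMap until endMap).map(m => (m, r))
  }
}
```

With a single RDD that understands every spec, a reader only has to supply a list of specs; the plan no longer needs one node per split.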

Why are the changes needed?

The current skew join optimization has a serious performance issue: the size of the query plan depends on the number and size of skewed partitions.

Does this PR introduce any user-facing change?

no

How was this patch tested?

existing tests

test UI manually:

explain output

AdaptiveSparkPlan(isFinalPlan=true)
+- OverwriteByExpression org.apache.spark.sql.execution.datasources.noop.NoopTable$@403a2ed5, [AlwaysTrue()], org.apache.spark.sql.util.CaseInsensitiveStringMap@1f
   +- *(5) SortMergeJoin(skew=true) [key1#2L], [key2#6L], Inner
      :- *(3) Sort [key1#2L ASC NULLS FIRST], false, 0
      :  +- SkewJoinShuffleReader 2 skewed partitions with size(max=5 KB, min=5 KB, avg=5 KB)
      :     +- ShuffleQueryStage 0
      :        +- Exchange hashpartitioning(key1#2L, 200), true, [id=#53]
      :           +- *(1) Project [(id#0L % 2) AS key1#2L]
      :              +- *(1) Filter isnotnull((id#0L % 2))
      :                 +- *(1) Range (0, 100000, step=1, splits=6)
      +- *(4) Sort [key2#6L ASC NULLS FIRST], false, 0
         +- SkewJoinShuffleReader 2 skewed partitions with size(max=5 KB, min=5 KB, avg=5 KB)
            +- ShuffleQueryStage 1
               +- Exchange hashpartitioning(key2#6L, 200), true, [id=#64]
                  +- *(2) Project [((id#4L % 2) + 1) AS key2#6L]
                     +- *(2) Filter isnotnull(((id#4L % 2) + 1))
                        +- *(2) Range (0, 100000, step=1, splits=6)

cloud-fan · 2020-02-07T18:16:28Z

sql/core/src/main/scala/org/apache/spark/sql/execution/ShuffledRowRDD.scala

just revert changes made to this file in #26434

cloud-fan · 2020-02-07T18:21:22Z

...core/src/main/scala/org/apache/spark/sql/execution/adaptive/ShufflePartitionsCoalescer.scala

moved from https://github.com/apache/spark/pull/27493/files#diff-e23b4656e59b73d313271d62329eefc2L136

cloud-fan · 2020-02-07T18:21:49Z

sql/core/src/test/scala/org/apache/spark/sql/execution/ShufflePartitionsCoalescerSuite.scala

moved from https://github.com/apache/spark/pull/27493/files#diff-ab139fa5ac5c1a7f4c8bd15970db3567L55

cloud-fan · 2020-02-07T18:22:13Z

cc @hvanhovell @maryannxue @JkSelf

SparkQA · 2020-02-07T18:30:19Z

Test build #118043 has finished for PR 27493 at commit 716964f.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-02-10T08:05:02Z

Test build #118107 has finished for PR 27493 at commit 1cdf84d.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-02-10T08:39:28Z

retest this please

SparkQA · 2020-02-10T13:43:26Z

Test build #118144 has finished for PR 27493 at commit 1cdf84d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maryannxue

Awesome! A few minor comments.

maryannxue · 2020-02-10T15:11:27Z

sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/CustomShuffledRowRDD.scala

+private final case class CustomShufflePartition(
+    index: Int, spec: ShufflePartitionSpec) extends Partition
+
+// TODO: merge this with `ShuffledRowRDD`.


nit: Add "TODO, replace LocalShuffledRowRDD with this RDD"

maryannxue · 2020-02-10T15:38:32Z

sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeSkewedJoin.scala

-              getNumMappers(left)
-            } else {
-              leftMapIdStartIndices(i + 1)
+          if (!isLeftSkew) {


Can we simplify this if block? Basically, we create partition specs on both sides respectively and then do a cartesian product afterwards.
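A minimal sketch of the suggestion above, assuming a skewed partition is split by mapper ranges: build the specs for each side independently, then pair every left split with every right split. `MapRangeSpec`, `sideSpecs`, and `pairUp` are hypothetical names, not code from this PR.

```scala
object SkewSplitSketch {
  // Hypothetical stand-in for a partition spec: a reducer partition plus a mapper range.
  final case class MapRangeSpec(reducerIndex: Int, startMapIndex: Int, endMapIndex: Int)

  // Build the specs for one side. `mapStartIndices` are the mapper indices at which a
  // skewed partition is split; a non-skewed side yields one spec covering all mappers.
  def sideSpecs(
      reducerIndex: Int,
      numMappers: Int,
      isSkewed: Boolean,
      mapStartIndices: Seq[Int]): Seq[MapRangeSpec] = {
    if (!isSkewed) {
      Seq(MapRangeSpec(reducerIndex, 0, numMappers))
    } else {
      // Each split ends where the next one starts; the last one ends at numMappers.
      val ends = mapStartIndices.drop(1) :+ numMappers
      mapStartIndices.zip(ends).map { case (s, e) => MapRangeSpec(reducerIndex, s, e) }
    }
  }

  // Cartesian product: every left split must meet every right split,
  // otherwise matching rows in different splits would never be joined.
  def pairUp(
      left: Seq[MapRangeSpec],
      right: Seq[MapRangeSpec]): Seq[(MapRangeSpec, MapRangeSpec)] =
    for (l <- left; r <- right) yield (l, r)
}
```

Because the non-skewed case is just a one-element sequence, the same product covers left-only, right-only, and both-sides skew without a separate branch for each.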

maryannxue · 2020-02-10T15:43:31Z

sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeSkewedJoin.scala

-case class PartialShuffleReaderExec(
-    child: QueryStageExec,
-    excludedPartitions: Set[Int]) extends UnaryExecNode {
+case class SkewJoinShuffleReaderExec(


Eventually we should be able to just have one reader, right?

I think we still need multiple shuffle readers, but use the same RDD.

For example, LocalShuffleReaderExec overrides outputPartitioning.

You could have just one reader that takes the output partitioning as a parameter. I am kind of in favor of this, since it eliminates three nearly identical nodes.

maryannxue · 2020-02-10T15:45:23Z

sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala

    left: SparkPlan,
    right: SparkPlan,
-    isPartial: Boolean = false) extends BinaryExecNode with CodegenSupport {
+    isSkewJoin: Boolean = false) extends BinaryExecNode with CodegenSupport {


+1. Let's make it more explicit in the explain plan

SparkQA · 2020-02-10T19:48:02Z

Test build #118168 has finished for PR 27493 at commit f5708f2.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-02-11T00:29:21Z

Test build #118178 has finished for PR 27493 at commit 54c2fa5.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class PassThroughPartitioning(key: Attribute, base: Int, numPartitions: Int)

SparkQA · 2020-02-11T11:21:41Z

Test build #118230 has finished for PR 27493 at commit eed2aaf.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-02-11T16:09:10Z

Test build #118234 has finished for PR 27493 at commit 9420d0e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

hvanhovell · 2020-02-11T22:39:11Z

sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/CustomShuffledRowRDD.scala

+        tracker.getPreferredLocationsForShuffle(dependency, reducerIndex)
+
+      case CoalescedPartitionSpec(startReducerIndex, endReducerIndex) =>
+        startReducerIndex.until(endReducerIndex).flatMap { reducerIndex =>


More for a follow-up: is there a way we can order the preferred locations by size? Note that this is already a net improvement over the ShuffledRowRDD, where we would use the incorrect reducer.

Sounds like a good idea. We may need tracker.getPreferredLocationsForShuffle to return size as well so it involves more changes. Let's leave it for followup.
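A sketch of what the follow-up could look like, under the assumption that per-host byte counts are available for each reducer. As noted above, tracker.getPreferredLocationsForShuffle does not return sizes today, so the input shape and the `orderBySize` helper are hypothetical.

```scala
object PreferredLocationsSketch {
  // Assumed input: for each reducer in a coalesced range, a map from host to the
  // number of shuffle bytes stored on that host. This is NOT what the tracker returns.
  def orderBySize(perReducer: Seq[Map[String, Long]]): Seq[String] = {
    // Sum the bytes per host across all reducers in the range.
    val totals = perReducer
      .flatMap(_.toSeq)
      .groupBy { case (host, _) => host }
      .map { case (host, entries) => host -> entries.map(_._2).sum }
    // Largest hosts first; ties are broken by host name to keep the order deterministic.
    totals.toSeq.sortBy { case (host, bytes) => (-bytes, host) }.map(_._1)
  }
}
```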

hvanhovell · 2020-02-12T09:14:19Z

sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeSkewedJoin.scala

+      // This is used to delay the creation of non-skew partitions so that we can potentially
+      // coalesce them like `ReduceNumShufflePartitions` does.
+      val nonSkewPartitionIndices = mutable.ArrayBuffer.empty[Int]
+      val skewDesc = mutable.ArrayBuffer.empty[String]


Won't this become impractically large for shuffles with a high number of input partitions? I would probably log the skewed partitions to the debug log, and create a summary in the skewDesc.

Is it common to have many skewed partitions? I'm fine with a summary. What do you think we should put in the summary? The number of skewed partitions and the min/avg/max size?

It is just that I would not add a string somewhere (which is printable) whose length depends on the size of the data. It can be more than you bargained for. I think your proposal makes a lot of sense.
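The agreed summary (count plus min/avg/max) can be kept in constant space. The sketch below assumes a `SkewDesc` shaped like the one instantiated in the diff; its fields, the `addPartitionSize` method, the plain-byte formatting, and the empty-case message are illustrative guesses, not the merged code.

```scala
object SkewSummarySketch {
  // Accumulates a fixed-size summary instead of one string per skewed partition.
  final class SkewDesc {
    private var count = 0
    private var minSize = Long.MaxValue
    private var maxSize = Long.MinValue
    private var totalSize = 0L

    def addPartitionSize(size: Long): Unit = {
      count += 1
      minSize = math.min(minSize, size)
      maxSize = math.max(maxSize, size)
      totalSize += size
    }

    // Sizes are printed as raw bytes here; the empty-case text is an assumption.
    override def toString: String = {
      if (count == 0) {
        "no skewed partition"
      } else {
        val avg = totalSize / count
        s"$count skewed partitions with size(max=$maxSize B, min=$minSize B, avg=$avg B)"
      }
    }
  }
}
```

The string length no longer grows with the number of skewed partitions, which addresses the concern about printable plan descriptions.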

SparkQA · 2020-02-12T21:30:02Z

Test build #118312 has finished for PR 27493 at commit d9474f0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

hvanhovell · 2020-02-12T22:39:01Z

sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeSkewedJoin.scala

+    if (nonSkewPartitionIndices.length == 1) {
+      Seq(NormalPartitionSpec(nonSkewPartitionIndices.head))
+    } else {
+      val startIndices = ShufflePartitionsCoalescer.coalescePartitions(


This is pretty neat.

hvanhovell · 2020-02-12T22:39:29Z

sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeSkewedJoin.scala

+        } else {
+          startIndices(i + 1)
+        }
+        CoalescedPartitionSpec(startIndex, endIndex)


OCD: Only create a coalesced spec when we need to coalesce?
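As an illustration of this point, the sketch below coalesces adjacent partitions greedily and emits a coalesced spec only when a group spans more than one partition. It assumes contiguous partition indices and a single size per partition; the greedy rule and both helpers are simplifications, not the actual ShufflePartitionsCoalescer.coalescePartitions logic.

```scala
object CoalesceSketch {
  sealed trait Spec
  final case class SinglePartitionSpec(reducerIndex: Int) extends Spec
  final case class CoalescedPartitionSpec(startReducerIndex: Int, endReducerIndex: Int)
    extends Spec

  // Greedy grouping: open a new group whenever adding the next partition
  // would push the current group past targetSize. Returns group start indices.
  def startIndices(sizes: Seq[Long], targetSize: Long): Seq[Int] = {
    val starts = scala.collection.mutable.ArrayBuffer.empty[Int]
    var current = 0L
    for (i <- sizes.indices) {
      if (i == 0 || current + sizes(i) > targetSize) {
        starts += i
        current = sizes(i)
      } else {
        current += sizes(i)
      }
    }
    starts.toSeq
  }

  // Turn start indices into specs; a group of exactly one partition
  // gets a single-partition spec instead of a coalesced one.
  def toSpecs(starts: Seq[Int], numPartitions: Int): Seq[Spec] = {
    val ends = starts.drop(1) :+ numPartitions
    starts.zip(ends).map {
      case (s, e) if e - s == 1 => SinglePartitionSpec(s)
      case (s, e) => CoalescedPartitionSpec(s, e)
    }
  }
}
```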

hvanhovell · 2020-02-12T22:42:20Z

sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeSkewedJoin.scala

+      val rightSkewDesc = new SkewDesc
+      for (partitionIndex <- 0 until numPartitions) {
+        val leftSize = leftStats.bytesByPartitionId(partitionIndex)
+        val isLeftSkew = isSkewed(leftSize, leftMedSize) && canSplitLeftSide(joinType)


NIT you could move canSplitLeftSide(joinType) & canSplitRightSide(joinType) outside of the loop.
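A sketch of the hoisting suggested here. The join-type rules, the median computation, and the `factor`/`threshold` parameters are assumptions made so the example is self-contained; only the names `isSkewed`, `canSplitLeftSide`, and `canSplitRightSide` come from the diff.

```scala
object SkewDetectionSketch {
  sealed trait JoinType
  case object Inner extends JoinType
  case object LeftOuter extends JoinType
  case object RightOuter extends JoinType

  // Assumed rule: a side may be split only if replicating the other side's
  // rows across the splits cannot change the join result.
  def canSplitLeftSide(joinType: JoinType): Boolean =
    joinType == Inner || joinType == LeftOuter
  def canSplitRightSide(joinType: JoinType): Boolean =
    joinType == Inner || joinType == RightOuter

  // Assumed rule: skewed means much larger than the median and above an absolute floor.
  def isSkewed(size: Long, medianSize: Long, factor: Int, threshold: Long): Boolean =
    size > medianSize * factor && size > threshold

  // Simplified median: the middle element of the sorted sizes (input must be non-empty).
  private def median(sizes: Seq[Long]): Long = sizes.sorted.apply(sizes.length / 2)

  // Returns (partitionIndex, isLeftSkew, isRightSkew) for every partition.
  def detect(
      leftSizes: Seq[Long],
      rightSizes: Seq[Long],
      joinType: JoinType,
      factor: Int,
      threshold: Long): Seq[(Int, Boolean, Boolean)] = {
    val leftMed = median(leftSizes)
    val rightMed = median(rightSizes)
    // Loop-invariant: these depend only on the join type, so evaluate them once.
    val splitLeft = canSplitLeftSide(joinType)
    val splitRight = canSplitRightSide(joinType)
    leftSizes.indices.map { i =>
      val isLeftSkew = splitLeft && isSkewed(leftSizes(i), leftMed, factor, threshold)
      val isRightSkew = splitRight && isSkewed(rightSizes(i), rightMed, factor, threshold)
      (i, isLeftSkew, isRightSkew)
    }
  }
}
```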

hvanhovell

LGTM

SparkQA · 2020-02-13T17:44:20Z

Test build #118357 has finished for PR 27493 at commit b4a0606.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class SinglePartitionSpec(reducerIndex: Int) extends ShufflePartitionSpec

hvanhovell · 2020-02-13T19:08:37Z

Merging to master/3.0

…in optimizations

### What changes were proposed in this pull request?

This is a followup of #26434

This PR uses one special shuffle reader for skew join, so that we only have one join after optimization. In order to do that, this PR:

1. adds a very general `CustomShuffledRowRDD` which supports all kinds of partition arrangements.
2. moves the logic of coalescing shuffle partitions to a util function and calls it during skew join optimization, to fully decouple it from the `ReduceNumShufflePartitions` rule. Making skew join interact with `ReduceNumShufflePartitions` is too complicated, because you need to account for the sizes of split partitions, which already don't respect the target size.

### Why are the changes needed?

The current skew join optimization has a serious performance issue: the size of the query plan depends on the number and size of skewed partitions.

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

existing tests

test UI manually:
![image](https://user-images.githubusercontent.com/3182036/74357390-cfb30480-4dfa-11ea-83f6-825d1b9379ca.png)

explain output
```
AdaptiveSparkPlan(isFinalPlan=true)
+- OverwriteByExpression org.apache.spark.sql.execution.datasources.noop.NoopTable$@403a2ed5, [AlwaysTrue()], org.apache.spark.sql.util.CaseInsensitiveStringMap@1f
   +- *(5) SortMergeJoin(skew=true) [key1#2L], [key2#6L], Inner
      :- *(3) Sort [key1#2L ASC NULLS FIRST], false, 0
      :  +- SkewJoinShuffleReader 2 skewed partitions with size(max=5 KB, min=5 KB, avg=5 KB)
      :     +- ShuffleQueryStage 0
      :        +- Exchange hashpartitioning(key1#2L, 200), true, [id=#53]
      :           +- *(1) Project [(id#0L % 2) AS key1#2L]
      :              +- *(1) Filter isnotnull((id#0L % 2))
      :                 +- *(1) Range (0, 100000, step=1, splits=6)
      +- *(4) Sort [key2#6L ASC NULLS FIRST], false, 0
         +- SkewJoinShuffleReader 2 skewed partitions with size(max=5 KB, min=5 KB, avg=5 KB)
            +- ShuffleQueryStage 1
               +- Exchange hashpartitioning(key2#6L, 200), true, [id=#64]
                  +- *(2) Project [((id#4L % 2) + 1) AS key2#6L]
                     +- *(2) Filter isnotnull(((id#4L % 2) + 1))
                        +- *(2) Range (0, 100000, step=1, splits=6)
```

Closes #27493 from cloud-fan/aqe.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: herman <[email protected]>
(cherry picked from commit a4ceea6)
Signed-off-by: herman <[email protected]>

### What changes were proposed in this pull request?

When the skewed join optimization splits skewed partitions into many readers, the plan can become very large and slow to display in the UI. The config `spark.sql.adaptive.skewedJoinOptimization.skewedPartitionMaxSplits` exists to work around that UI issue. Now that [PR#27493](#27493) has combined the skewed readers into one, this config is no longer needed.

### Why are the changes needed?

Remove the unnecessary config.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Existing tests.

Closes #27673 from JkSelf/removeMaxSplitNum.

Authored-by: jiake <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit f4696ba)
Signed-off-by: Wenchen Fan <[email protected]>

Combine the skewed readers into one in AQE skew join optimizations

1cdf84d

cloud-fan force-pushed the aqe branch from 716964f to 1cdf84d on February 10, 2020 03:23

improve

f5708f2

maryannxue reviewed Feb 10, 2020

dongjoon-hyun added the SQL label Feb 10, 2020

address comments

54c2fa5

restore original test

9420d0e

cloud-fan force-pushed the aqe branch from eed2aaf to 9420d0e on February 11, 2020 11:45

hvanhovell reviewed Feb 11, 2020

hvanhovell reviewed Feb 12, 2020

improve

d9474f0

hvanhovell reviewed Feb 12, 2020

hvanhovell approved these changes Feb 12, 2020

address comments

b4a0606

hvanhovell closed this in a4ceea6 Feb 13, 2020

JkSelf mentioned this pull request Feb 22, 2020

[SPARK-30922] [SQL] remove the max splits config in skewed join #27673

Closed
