[SPARK-24588][SS] streaming join should require HashClusteredPartitioning from children #21587

cloud-fan · 2018-06-19T00:48:08Z

What changes were proposed in this pull request?

In #19080 we simplified the distribution/partitioning framework and made all the join-like operators require HashClusteredDistribution from their children. Unfortunately, the streaming join operator was missed.

This can cause wrong results. Consider:

val input1 = MemoryStream[Int]
val input2 = MemoryStream[Int]

val df1 = input1.toDF.select('value as 'a, 'value * 2 as 'b)
val df2 = input2.toDF.select('value as 'a, 'value * 2 as 'b).repartition('b)
val joined = df1.join(df2, Seq("a", "b")).select('a)

The physical plan is

*(3) Project [a#5]
+- StreamingSymmetricHashJoin [a#5, b#6], [a#10, b#11], Inner, condition = [ leftOnly = null, rightOnly = null, both = null, full = null ], state info [ checkpoint = <unknown>, runId = 54e31fce-f055-4686-b75d-fcd2b076f8d8, opId = 0, ver = 0, numPartitions = 5], 0, state cleanup [ left = null, right = null ]
   :- Exchange hashpartitioning(a#5, b#6, 5)
   :  +- *(1) Project [value#1 AS a#5, (value#1 * 2) AS b#6]
   :     +- StreamingRelation MemoryStream[value#1], [value#1]
   +- Exchange hashpartitioning(b#11, 5)
      +- *(2) Project [value#3 AS a#10, (value#3 * 2) AS b#11]
         +- StreamingRelation MemoryStream[value#3], [value#3]

The left side is hash-partitioned by a, b, while the right side is hash-partitioned by b only. As a result, two records that match on the join keys can end up in different partitions, so a row that should be in the output is silently missing.
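To make the failure mode concrete, here is a minimal, Spark-free sketch. `Record`, `toyHashPartition`, and `misplaced` are hypothetical names used only for illustration, and the toy hash is not the hash function Spark uses. The sketch only shows that hashing different key sets can send records with the same join key to different partition ids.

```scala
object CoPartitioningSketch {
  // Hypothetical record type mirroring the example: a = value, b = value * 2.
  final case class Record(a: Int, b: Int)

  // Toy hash partitioner (an assumption for illustration, NOT Spark's hash):
  // fold the keys into one int and map it onto [0, numPartitions).
  def toyHashPartition(keys: Seq[Int], numPartitions: Int): Int = {
    val h = keys.foldLeft(0)((acc, k) => acc * 31 + k)
    Math.floorMod(h, numPartitions)
  }

  // Records whose left-side partition id differs from their right-side one.
  // A partition-local join would never see these pairs together.
  def misplaced(
      records: Seq[Record],
      leftKeys: Record => Seq[Int],
      rightKeys: Record => Seq[Int],
      numPartitions: Int): Seq[Record] =
    records.filter { r =>
      toyHashPartition(leftKeys(r), numPartitions) !=
        toyHashPartition(rightKeys(r), numPartitions)
    }

  def main(args: Array[String]): Unit = {
    val records = (1 to 20).map(v => Record(v, v * 2))
    val byAB: Record => Seq[Int] = r => Seq(r.a, r.b)
    val byB: Record => Seq[Int] = r => Seq(r.b)

    // Left partitioned by (a, b), right partitioned by b: some matches are split up.
    println(s"left (a, b) vs right (b): ${misplaced(records, byAB, byB, 5).size} misplaced")
    // Both sides partitioned by (a, b): every match is co-located.
    println(s"left (a, b) vs right (a, b): ${misplaced(records, byAB, byAB, 5).size} misplaced")
  }
}
```

When both sides use the same keys in the same order, no record is misplaced. That is the guarantee the join needs from its children.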

How was this patch tested?

N/A

cloud-fan · 2018-06-19T00:48:19Z

cc @tdas @zsxwing

SparkQA · 2018-06-19T03:56:03Z

Test build #92055 has finished for PR 21587 at commit 1f3d9df.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class HashClusteredDistribution(

SparkQA · 2018-06-19T04:07:38Z

Test build #92056 has finished for PR 21587 at commit b69a727.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class HashClusteredDistribution(

HeartSaVioR

LGTM

gatorsmile · 2018-06-19T05:55:04Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala

Do we need to update https://github.com/cloud-fan/spark/blob/b69a7271e4c5c4c1b46f6a4837e12ac714ab33b4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala#L214-L217?

SparkQA · 2018-06-19T09:52:58Z

Test build #92072 has finished for PR 21587 at commit d102da3.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
abstract class ClusteredDistributionBase(exprs: Seq[Expression]) extends Distribution
case class ClusteredDistribution(
case class HashClusteredDistribution(

bogdanrdc · 2018-06-19T15:03:27Z

Maybe also fix SinglePartition.satisfies: it only checks for ClusteredDistribution and defaults to true otherwise. Luckily, SinglePartition.numPartitions is 1, so EnsureRequirements will still add a shuffle to make numPartitions match.

SparkQA · 2018-06-19T21:26:59Z

Test build #92094 has finished for PR 21587 at commit 6fc7913.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

tdas · 2018-06-20T00:07:56Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala

  val numPartitions = 1

-  override def satisfies(required: Distribution): Boolean = required match {
+  override def satisfies0(required: Distribution): Boolean = required match {


Can we add docs to explain what satisfies0 is and how it differs from satisfies?
Otherwise it's quite confusing.
When does one override satisfies, and when does one override satisfies0?

added in the base class
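For readers wondering what such a split can look like, here is a simplified toy model. It is not the Spark source. It assumes the split follows a template-method pattern: a final `satisfies` holds the check shared by every partitioning, and subclasses override only `satisfies0`. All types below are hypothetical stand-ins that key on plain strings instead of expressions.

```scala
object SatisfiesSketch {
  // Hypothetical, simplified distributions keyed by column names.
  sealed trait Distribution {
    def requiredNumPartitions: Option[Int]
  }
  case object UnspecifiedDistribution extends Distribution {
    val requiredNumPartitions: Option[Int] = None
  }
  final case class ClusteredDistribution(
      keys: Seq[String],
      requiredNumPartitions: Option[Int] = None) extends Distribution

  trait Partitioning {
    def numPartitions: Int

    // Entry point for callers. Final, so the shared number-of-partitions
    // check cannot be skipped by a subclass.
    final def satisfies(required: Distribution): Boolean =
      required.requiredNumPartitions.forall(_ == numPartitions) && satisfies0(required)

    // Extension point for subclasses: decide whether the data layout itself
    // matches, without repeating the number-of-partitions check.
    protected def satisfies0(required: Distribution): Boolean = required match {
      case UnspecifiedDistribution => true
      case _ => false
    }
  }

  final case class HashPartitioning(keys: Seq[String], numPartitions: Int)
      extends Partitioning {
    override protected def satisfies0(required: Distribution): Boolean =
      super.satisfies0(required) || (required match {
        // Partitioning by a subset of the required keys keeps equal keys together.
        case ClusteredDistribution(requiredKeys, _) =>
          keys.forall(k => requiredKeys.contains(k))
        case _ => false
      })
  }

  def main(args: Array[String]): Unit = {
    val p = HashPartitioning(Seq("a"), 5)
    println(p.satisfies(ClusteredDistribution(Seq("a", "b"), Some(5))))
    println(p.satisfies(ClusteredDistribution(Seq("a", "b"), Some(10))))
  }
}
```

In this model a subclass never overrides `satisfies`; it overrides `satisfies0` and inherits the common check.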

tdas · 2018-06-20T00:14:41Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala

 * This is a strictly stronger guarantee than [[ClusteredDistribution]]. Given a tuple and the
 * number of partitions, this distribution strictly requires which partition the tuple should be in.
 */
-case class HashClusteredDistribution(expressions: Seq[Expression]) extends Distribution {


I do not see any new tests in DistributionSuite. I feel that issues like this should have dedicated unit tests in DistributionSuite and shouldn't have to rely on StreamingJoinSuite.
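A distribution-level check of the kind asked for here could, in spirit, look like the sketch below. It is a toy model and not code from DistributionSuite: `satisfiesClustered` and `satisfiesHashClustered` are hypothetical helpers that compare plain column names, assuming a hash partitioning satisfies a clustered requirement when its keys are a subset of the required keys, and a hash-clustered requirement only when the keys match exactly and in order.

```scala
object DistributionCheckSketch {
  // Clustered requirement (toy model): rows with equal required keys must share
  // a partition, so hash partitioning by any subset of those keys is enough.
  def satisfiesClustered(partitionKeys: Seq[String], requiredKeys: Seq[String]): Boolean =
    partitionKeys.forall(k => requiredKeys.contains(k))

  // Hash-clustered requirement (toy model): the partition id itself must be
  // predictable, so the keys must be the same and in the same order.
  def satisfiesHashClustered(partitionKeys: Seq[String], requiredKeys: Seq[String]): Boolean =
    partitionKeys == requiredKeys

  def main(args: Array[String]): Unit = {
    val joinKeys = Seq("a", "b")
    // The right side of the example in the description is partitioned by b only.
    println(s"clustered accepts (b): ${satisfiesClustered(Seq("b"), joinKeys)}")
    println(s"hash-clustered accepts (b): ${satisfiesHashClustered(Seq("b"), joinKeys)}")
    println(s"hash-clustered accepts (a, b): ${satisfiesHashClustered(joinKeys, joinKeys)}")
  }
}
```

Under these assumptions the weaker requirement accepts the `repartition('b)` side as-is, and the stricter one forces both children onto the same keys.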

cloud-fan · 2018-06-20T05:31:37Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/DistributionSuite.scala

I've reorganized this test suite and added a bunch of new test cases, to improve the test coverage.

SparkQA · 2018-06-20T07:05:01Z

Test build #92117 has finished for PR 21587 at commit 0795e40.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-06-20T07:05:02Z

Test build #92119 has finished for PR 21587 at commit 08da2e6.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2018-06-20T07:32:13Z

retest this please

SparkQA · 2018-06-20T11:24:05Z

Test build #92123 has finished for PR 21587 at commit 08da2e6.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-06-21T22:02:36Z

Test build #92177 has finished for PR 21587 at commit 72466b0.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class ClusteredDistribution(

gatorsmile · 2018-06-21T22:20:24Z

LGTM

Thanks! Merged to master/2.3

…ning from children

Author: Wenchen Fan <[email protected]>

Closes #21587 from cloud-fan/join.

(cherry picked from commit dc8a6be)
Signed-off-by: Xiao Li <[email protected]>

### What changes were proposed in this pull request?

The changed [unit test](https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingJoinSuite.scala#L566) was introduced in #21587 to fix the planner side of things for stream-stream join. Ideally, checking the query result should catch the bug, but it would be better to add a plan check to make the purpose of the unit test clearer and to catch future bugs from planner changes.

### Why are the changes needed?

Improve unit test.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Changed test itself.

Closes #32836 from c21/ss-test.

Authored-by: Cheng Su <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

HeartSaVioR · 2022-01-28T06:25:19Z

In retrospect: we have to use HashClusteredPartitioning for all stateful operators, but we only addressed stream-stream join here and missed the others. (Even though I reviewed this PR before.)

cloud-fan force-pushed the join branch from 1f3d9df to b69a727 (June 19, 2018 01:03)

HeartSaVioR approved these changes Jun 19, 2018

gatorsmile reviewed Jun 19, 2018

StreamingSymmetricHashJoinExec should require HashClusteredPartitioning from children (d102da3)

cloud-fan force-pushed the join branch from b69a727 to d102da3 (June 19, 2018 06:59)

address comment (6fc7913)

tdas reviewed Jun 20, 2018

cloud-fan commented Jun 20, 2018

cloud-fan force-pushed the join branch from 0795e40 to e5ebb80 (June 20, 2018 05:33)

address comment (08da2e6)

cloud-fan force-pushed the join branch from e5ebb80 to 08da2e6 (June 20, 2018 05:35)

reduce diff (72466b0)

cloud-fan force-pushed the join branch from eecbe05 to 72466b0 (June 21, 2018 18:17)

asfgit closed this in dc8a6be Jun 21, 2018

c21 mentioned this pull request Jun 9, 2021

[SPARK-35693][SS][TEST] Add plan check for stream-stream join unit test #32836 (Closed)
