[SPARK-36290][SQL] Pull out join condition #33522
Conversation
Merge commit message:

```
# Conflicts:
#   sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q72.sf100/explain.txt
#   sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q72/explain.txt
#   sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v2_7/q72.sf100/explain.txt
#   sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v2_7/q72/explain.txt
```
CI status updates:
- Test build #141646 has finished for PR 33522 (Kubernetes integration test: success)
- Test build #141777 has finished for PR 33522 (Kubernetes integration test: success)
- Test build #141921 has finished for PR 33522 (Kubernetes integration test: success)
- Test build #141923 has finished for PR 33522 (Kubernetes integration test: success)
- Test build #141928 has finished for PR 33522 (Kubernetes integration test: success)
- Test build #141930 has finished for PR 33522 (Kubernetes integration test: success)
- Test build #144900 has finished for PR 33522 (Kubernetes integration test: failure)
Merge commit message:

```
# Conflicts:
#   sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
```
CI status updates:
- Test build #144902 has finished for PR 33522 (Kubernetes integration test: failure)
- Test build #144904 has finished for PR 33522 (Kubernetes integration test: failure)
- Test build #144918 has finished for PR 33522 (Kubernetes integration test: failure)
```scala
private val y = testRelation1.subquery('y)
```

```scala
test("Pull out join keys evaluation(String expressions)") {
  val joinType = Inner
```
can we just inline it? the join type doesn't matter anyway.
ditto for other places in this test suite
```scala
val joinType = Inner
Seq(Upper("y.d".attr), Substring("y.d".attr, 1, 5)).foreach { udf =>
  val originalQuery = x.join(y, joinType, Option("x.a".attr === udf))
    .select("x.a".attr, "y.e".attr)
```
we don't need to use qualified column names. There is no name conflict.
ditto for other places in this test suite
Removed it. But the UDF cases still need the qualifier, otherwise:

```
== FAIL: Plans do not match ===
 'Project [a#0, e#0]                                   'Project [a#0, e#0]
!+- 'Join Inner, (a#0 = upper(y.d)#0)                  +- 'Join Inner, (upper(d)#0 = a#0)
    :- Project [a#0, b#0, c#0]                            :- Project [a#0, b#0, c#0]
    :  +- LocalRelation <empty>, [a#0, b#0, c#0]          :  +- LocalRelation <empty>, [a#0, b#0, c#0]
!   +- Project [d#0, e#0, upper(d#0) AS upper(y.d)#0]     +- Project [d#0, e#0, upper(d#0) AS upper(d)#0]
       +- LocalRelation <empty>, [d#0, e#0]                  +- LocalRelation <empty>, [d#0, e#0]
```
```scala
test("Pull out join keys evaluation(null expressions)") {
  val joinType = Inner
  val udf = Coalesce(Seq("x.b".attr, "x.c".attr))
```
Can't we merge this test case with the one above?
`Seq(Upper("d".attr), Substring("d".attr, 1, 5), Coalesce(Seq("b".attr, "c".attr)))...`
Moved this test to SQLQuerySuite to check that more IsNotNull filters are inferred:
spark/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
Lines 4231 to 4253 in 4aebe0a

```scala
test("SPARK-36290: Pull out join condition can infer more filter conditions") {
  import org.apache.spark.sql.catalyst.dsl.expressions.DslString
  withTable("t1", "t2") {
    spark.sql("CREATE TABLE t1(a int, b int) using parquet")
    spark.sql("CREATE TABLE t2(a string, b string, c string) using parquet")
    spark.sql("SELECT t1.* FROM t1 RIGHT JOIN t2 ON coalesce(t1.a, t1.b) = t2.a")
      .queryExecution.optimizedPlan.find(_.isInstanceOf[Filter]) match {
        case Some(Filter(condition, _)) =>
          condition === IsNotNull(Coalesce(Seq("a".attr, "b".attr)))
        case _ =>
          fail("It should contains Filter")
      }
    spark.sql("SELECT t1.* FROM t1 LEFT JOIN t2 ON t1.a = t2.a")
      .queryExecution.optimizedPlan.find(_.isInstanceOf[Filter]) match {
        case Some(Filter(condition, _)) =>
          condition === IsNotNull(Cast("a".attr, IntegerType))
        case _ =>
          fail("It should contains Filter")
      }
  }
}
```
```scala
test("Pull out EqualNullSafe join condition") {
  withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
```
do we need this?
Removed it.
```diff
   case p: BatchEvalPythonExec => p
 }
-assert(pythonEvals.size == 2)
+assert(pythonEvals.size == 4)
```
This looks like a big problem. Can you investigate why we run the Python UDF two more times?
It is because we infer two `isnotnull(cast(pythonUDF0 as int))` filters:

```
== Optimized Logical Plan ==
Project [a#225, b#226, c#236, d#237]
+- Join Inner, (CAST(udf(cast(a as string)) AS INT)#250 = CAST(udf(cast(c as string)) AS INT)#251)
   :- Project [_1#220 AS a#225, _2#221 AS b#226, cast(pythonUDF0#253 as int) AS CAST(udf(cast(a as string)) AS INT)#250]
   :  +- BatchEvalPython [udf(cast(_1#220 as string))], [pythonUDF0#253]
   :     +- Project [_1#220, _2#221]
   :        +- Filter isnotnull(cast(pythonUDF0#252 as int))
   :           +- BatchEvalPython [udf(cast(_1#220 as string))], [pythonUDF0#252]
   :              +- LocalRelation [_1#220, _2#221]
   +- Project [_1#231 AS c#236, _2#232 AS d#237, cast(pythonUDF0#255 as int) AS CAST(udf(cast(c as string)) AS INT)#251]
      +- BatchEvalPython [udf(cast(_1#231 as string))], [pythonUDF0#255]
         +- Project [_1#231, _2#232]
            +- Filter isnotnull(cast(pythonUDF0#254 as int))
               +- BatchEvalPython [udf(cast(_1#231 as string))], [pythonUDF0#254]
                  +- LocalRelation [_1#231, _2#232]
```

Before this PR:

```
== Optimized Logical Plan ==
Project [a#225, b#226, c#236, d#237]
+- Join Inner, (cast(pythonUDF0#250 as int) = cast(pythonUDF0#251 as int))
   :- BatchEvalPython [udf(cast(a#225 as string))], [pythonUDF0#250]
   :  +- Project [_1#220 AS a#225, _2#221 AS b#226]
   :     +- LocalRelation [_1#220, _2#221]
   +- BatchEvalPython [udf(cast(c#236 as string))], [pythonUDF0#251]
      +- Project [_1#231 AS c#236, _2#232 AS d#237]
         +- LocalRelation [_1#231, _2#232]
```
OK, so it's because filter push down can lead to extra expression evaluation?
Yes. If we disable `spark.sql.constraintPropagation.enabled`, the plan is:

```
== Optimized Logical Plan ==
Project [a#225, b#226, c#236, d#237]
+- Join Inner, (CAST(udf(cast(a as string)) AS INT)#250 = CAST(udf(cast(c as string)) AS INT)#251)
   :- Project [_1#220 AS a#225, _2#221 AS b#226, cast(pythonUDF0#252 as int) AS CAST(udf(cast(a as string)) AS INT)#250]
   :  +- BatchEvalPython [udf(cast(_1#220 as string))], [pythonUDF0#252]
   :     +- LocalRelation [_1#220, _2#221]
   +- Project [_1#231 AS c#236, _2#232 AS d#237, cast(pythonUDF0#253 as int) AS CAST(udf(cast(c as string)) AS INT)#251]
      +- BatchEvalPython [udf(cast(_1#231 as string))], [pythonUDF0#253]
         +- LocalRelation [_1#231, _2#232]
```
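A toy illustration of the duplication (my own sketch, not Spark code; `Node` and `evalCount` are hypothetical names): the inferred `IsNotNull` filter sits below the Project that computes the pulled-out alias, so both operators evaluate the UDF.

```scala
// Hypothetical toy model, not Spark's plan representation: a plan node
// carries the expression strings it evaluates, and we count how many
// operators end up evaluating a given (expensive) expression.
case class Node(name: String, exprs: Seq[String], children: Seq[Node])

def evalCount(p: Node, expr: String): Int =
  p.exprs.count(_.contains(expr)) + p.children.map(evalCount(_, expr)).sum

val udf = "udf(a)"

// With constraint propagation: the inferred IsNotNull filter below the
// pulled-out Project re-evaluates the same UDF.
val withInferredFilter =
  Node("Project", Seq(s"$udf AS key"),
    Seq(Node("Filter", Seq(s"isnotnull($udf)"),
      Seq(Node("Scan", Nil, Nil)))))

// Without it: only the Project evaluates the UDF.
val withoutInferredFilter =
  Node("Project", Seq(s"$udf AS key"), Seq(Node("Scan", Nil, Nil)))
```

Counting evaluations of `udf(a)` gives 2 with the inferred filter and 1 without, matching the two extra `BatchEvalPython` nodes above.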
@wangyum @cloud-fan Seems we can just put the rule after InferFiltersFromConstraints, and actually after the earlyScanPushDownRules. The python UDF test passed in my pr, see #36874.
I think pull-out has three advantages:
- Reduces complex join key evaluations from 3 to 2 for SMJ.
- Infers additional filters, which can sometimes avoid data skew. For example: [SPARK-31809][SQL] Infer IsNotNull from some special equality join keys #28642
- Avoids other rules having to handle this case. For example: https://github.com/apache/spark/blob/dee7396204e2f6e7346e220867953fc74cd4253d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PushPartialAggregationThroughJoin.scala#L325-L327

It has two disadvantages:
- Increases complex join key evaluations from 1 to 2 for BHJ.
- It may increase the shuffle data size, for example when the join key is `concat(col1, col2, col3, col4 ...)`.

Personally, I think this rule is valuable. We have been using this rule for half a year.
> Increases complex join key evaluations from 1 to 2 for BHJ.

We can check whether the pulled-out side can be broadcast, so this should not be a blocker?

> It may increase the shuffle data size, for example when the join key is `concat(col1, col2, col3, col4 ...)`.

This is really a trade-off. One conservative option may be: we only pull out the complex keys whose underlying attributes are not part of the final output, so we avoid the extra shuffle data as far as possible. For example:

```sql
SELECT a FROM t1 JOIN t2 ON t1.a = t2.x + 1;
```

And a config should be introduced to enable or disable it easily.
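The conservative heuristic above can be sketched as follows (a minimal toy model, not Spark's actual API; `JoinKey`, `isComplex`, and `shouldPullOut` are hypothetical names):

```scala
// Toy sketch of the conservative pull-out heuristic: only pull out a
// complex join key when the attributes it references are NOT part of the
// query's final output, so the pre-computed key replaces (rather than
// adds to) the columns flowing through the shuffle.
case class JoinKey(expr: String, referencedAttrs: Set[String])

// A key is "complex" unless it is a bare attribute reference.
def isComplex(key: JoinKey): Boolean =
  key.referencedAttrs.size != 1 || key.expr != key.referencedAttrs.head

def shouldPullOut(key: JoinKey, finalOutput: Set[String]): Boolean =
  isComplex(key) && key.referencedAttrs.intersect(finalOutput).isEmpty
```

For `SELECT a FROM t1 JOIN t2 ON t1.a = t2.x + 1`, the key `t2.x + 1` references only `x`, which is not in the final output, so it would qualify; a key like `concat(c1, c2)` whose columns are also selected would not.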
CI status updates:
- Test build #144936 has finished for PR 33522 (Kubernetes integration test: failure)
```scala
  p.isInstanceOf[WholeStageCodegenExec] &&
    p.asInstanceOf[WholeStageCodegenExec].child.isInstanceOf[BroadcastHashJoinExec]).isDefined)
val broadcastHashJoin = df.queryExecution.executedPlan.find {
  case WholeStageCodegenExec(ProjectExec(_, _: BroadcastHashJoinExec)) => true
```
Wondering why we now need an extra Project after the join here? Can we remove it? The join does not seem to have a complex key here.
There is a cast:

```
== Optimized Logical Plan ==
Project [id#6L, k#2, v#3]
+- Join Inner, (id#6L = CAST(k AS BIGINT)#14L), rightHint=(strategy=broadcast)
   :- Range (0, 10, step=1, splits=Some(2))
   +- Project [k#2, v#3, cast(k#2 as bigint) AS CAST(k AS BIGINT)#14L]
      +- Filter isnotnull(cast(k#2 as bigint))
         +- LogicalRDD [k#2, v#3], false
```
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
Similar to PullOutGroupingExpressions, this PR adds a new rule (PullOutJoinCondition) to pull out the join condition. Otherwise, an expression in the join condition may be evaluated three times (in ShuffleExchangeExec, SortExec, and the join itself). For example:

Before this PR:

After this PR:
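The before/after plans above were elided, but the core transformation can be sketched on a toy expression tree (a simplified, hypothetical model; the real PullOutJoinCondition rule works on Catalyst plans and expressions):

```scala
// Toy model of "pulling out" complex join keys: each non-trivial key
// expression is replaced by an attribute reference, and the expression is
// computed once in a Project (as an alias) below the join, instead of at
// every operator (shuffle, sort, join) that needs the key.
// All names here are hypothetical, not Spark's real API.
sealed trait Expr
case class Attr(name: String) extends Expr
case class Upper(child: Expr) extends Expr // stands in for any complex expression
case class Alias(child: Expr, name: String) extends Expr

case class Join(leftKeys: Seq[Expr], rightKeys: Seq[Expr])
case class Rewritten(leftProject: Seq[Alias], rightProject: Seq[Alias], join: Join)

def isComplex(e: Expr): Boolean = e match {
  case _: Attr => false
  case _       => true
}

def pullOut(j: Join): Rewritten = {
  // Replace each complex key with an aliased pre-computed column.
  def rewrite(keys: Seq[Expr], side: String): (Seq[Expr], Seq[Alias]) = {
    val pairs = keys.zipWithIndex.map {
      case (k, i) if isComplex(k) =>
        val a = Alias(k, s"${side}key$i")
        (Attr(a.name): Expr, Some(a): Option[Alias])
      case (k, _) => (k, None: Option[Alias])
    }
    (pairs.map(_._1), pairs.flatMap(_._2))
  }
  val (lk, lp) = rewrite(j.leftKeys, "l")
  val (rk, rp) = rewrite(j.rightKeys, "r")
  Rewritten(lp, rp, Join(lk, rk))
}
```

For a join condition like `x.a = upper(y.d)`, only the right key is complex, so `upper(d)` is aliased into the right side's Project and the join compares two plain attributes.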
Why are the changes needed?
Improve query performance.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Unit test and benchmark test:
Benchmark result: