[SPARK-33427][SQL] Add subexpression elimination for interpreted expression evaluation #30341

viirya · 2020-11-11T22:07:18Z

What changes were proposed in this pull request?

This patch adds subexpression elimination to interpreted expression evaluation. Interpreted expression evaluation is used when codegen cannot be used, for example with a complex schema.

Why are the changes needed?

Currently we only do subexpression elimination for codegen. There are several reasons we may need to run interpreted expression evaluation instead: for example, codegen fails to compile and falls back to interpreted mode, or the input/output schema of the expressions is complex. Complex expression schemas are commonly seen and can also be caused by the query optimizer, e.g. SPARK-32945.

We should also support subexpression elimination for interpreted evaluation. That could reduce the performance difference when Spark falls back from codegen to interpreted expression evaluation, and improve Spark usability.
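As a rough illustration of the idea (not the code in this patch), the sketch below uses a hypothetical toy expression tree. `Expr`, `Expensive`, `Runtime`, and `Proxy` are names made up for this example; `Expensive` stands in for a costly expression such as `from_json`. A shared subexpression is wrapped in a proxy that caches its result for the current row, so it is evaluated once even though it appears twice.

```scala
import scala.collection.mutable

object SubExprSketch {
  // Hypothetical toy expression tree; this is NOT Spark's Expression hierarchy.
  sealed trait Expr { def eval(row: Map[String, Int]): Int }

  final case class Col(name: String) extends Expr {
    def eval(row: Map[String, Int]): Int = row(name)
  }

  final case class Add(left: Expr, right: Expr) extends Expr {
    def eval(row: Map[String, Int]): Int = left.eval(row) + right.eval(row)
  }

  // Counts how many times the "expensive" expression is really evaluated.
  var evalCount = 0

  final case class Expensive(child: Expr) extends Expr {
    def eval(row: Map[String, Int]): Int = {
      evalCount += 1
      child.eval(row) * 2
    }
  }

  // Per-row result cache shared by all proxies; clear it before each new row.
  final class Runtime {
    private val cache = mutable.Map.empty[Expr, Int]

    def getOrCompute(e: Expr, row: Map[String, Int]): Int =
      cache.get(e) match {
        case Some(v) => v
        case None =>
          val v = e.eval(row)
          cache(e) = v
          v
      }

    def newRow(): Unit = cache.clear()
  }

  // Wraps a shared subexpression so that repeated evaluations hit the cache.
  final case class Proxy(child: Expr, runtime: Runtime) extends Expr {
    def eval(row: Map[String, Int]): Int = runtime.getOrCompute(child, row)
  }

  // Evaluates (expensive(a) + expensive(a)) on one row and reports
  // the result together with the number of real evaluations.
  def demo(): (Int, Int) = {
    val runtime = new Runtime
    val shared = Proxy(Expensive(Col("a")), runtime)
    evalCount = 0
    val result = Add(shared, shared).eval(Map("a" -> 3))
    (result, evalCount)
  }
}
```

In this sketch `SubExprSketch.demo()` yields `(12, 1)`: the sum is computed from two uses of the subexpression, but the expensive part runs only once per row.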

Benchmark

Update SubExprEliminationBenchmark:

Before:

OpenJDK 64-Bit Server VM 1.8.0_265-b01 on Mac OS X 10.15.6
 Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
 from_json as subExpr:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 -------------------------------------------------------------------------------------------------------------------------
subexpressionElimination on, codegen off           24707          25688         903          0.0   247068775.9       1.0X

After:

OpenJDK 64-Bit Server VM 1.8.0_265-b01 on Mac OS X 10.15.6
 Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
 from_json as subExpr:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 -------------------------------------------------------------------------------------------------------------------------
subexpressionElimination on, codegen off            2360           2435          87          0.0    23604320.7      11.2X

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit tests. Benchmarked manually.

viirya

The main code change should be ready for review. I marked it as WIP because I'd like to add some tests later.

SparkQA · 2020-11-12T02:44:24Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35561/

SparkQA · 2020-11-12T03:06:19Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35561/

viirya · 2020-11-12T04:57:17Z

GitHub Actions actually passed.

viirya · 2020-11-12T04:58:03Z

cc @cloud-fan @dongjoon-hyun @maropu @HyukjinKwon

SparkQA · 2020-11-12T05:42:28Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35571/

SparkQA · 2020-11-12T06:04:54Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35571/

dongjoon-hyun · 2020-11-12T06:06:59Z

Thank you for pinging me, @viirya .

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EvaluationRunTime.scala

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExpressionProxy.scala

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

SparkQA · 2020-11-12T06:22:47Z

Test build #130955 has finished for PR 30341 at commit 6bab83c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-11-15T08:14:49Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35709/

viirya · 2020-11-15T08:31:09Z

retest this please

SparkQA · 2020-11-15T08:45:10Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35709/

SparkQA · 2020-11-15T09:15:31Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35710/

SparkQA · 2020-11-15T09:37:17Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35710/

maropu · 2020-11-15T09:48:19Z

Please let us know if you have any other comments, @maropu and @cloud-fan .

Yea, I don't have any more comments, so I'll leave this to @cloud-fan .

SparkQA · 2020-11-15T12:53:40Z

Test build #131107 has finished for PR 30341 at commit 77168fe.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2020-11-15T19:25:54Z

sql/core/benchmarks/SubExprEliminationBenchmark-jdk11-results.txt

+subexpressionElimination off, codegen on           25932          26908         916          0.0   259320042.3       1.0X
+subexpressionElimination off, codegen off          26085          26159          65          0.0   260848905.0       1.0X
+subexpressionElimination on, codegen on             2860           2939          72          0.0    28603312.9       9.1X
+subexpressionElimination on, codegen off            2517           2617          93          0.0    25165157.7      10.3X


Thank you for this additional 10.3x.

cloud-fan · 2020-11-16T08:07:36Z

...lyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SubExprEvaluationRuntime.scala

+  private def replaceWithProxy(
+      expr: Expression,
+      proxyMap: Map[Expression, ExpressionProxy]): Expression = {
+    proxyMap.getOrElse(expr, expr.mapChildren(replaceWithProxy(_, proxyMap)))


This is a top-down traversal, which means we will recursively replace expr with a proxy even for ExpressionProxy. Is that expected?

Once we replace an expression with an ExpressionProxy, we stop traversing down. We only traverse down to the children if we cannot find the current expression in proxyMap. Does this answer your question?

Ah sorry I misread the code. You are right.
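To make the traversal behaviour concrete, here is a minimal sketch on a hypothetical toy tree (`Expr`, `Leaf`, `Mul`, and `Proxy` are invented for this example and only mimic the shape of the real classes). Because the second argument of `getOrElse` is passed by name, the children are visited only when the current node is not found in the map.

```scala
object ReplaceWithProxySketch {
  // Hypothetical toy tree that mimics the mapChildren shape used in the diff.
  sealed trait Expr { def mapChildren(f: Expr => Expr): Expr }

  final case class Leaf(name: String) extends Expr {
    def mapChildren(f: Expr => Expr): Expr = this
  }

  final case class Mul(left: Expr, right: Expr) extends Expr {
    def mapChildren(f: Expr => Expr): Expr = Mul(f(left), f(right))
  }

  final case class Proxy(child: Expr) extends Expr {
    def mapChildren(f: Expr => Expr): Expr = Proxy(f(child))
  }

  // Same structure as the method under review: a hit in the map ends the
  // descent; only a miss recurses into the children.
  def replaceWithProxy(expr: Expr, proxyMap: Map[Expr, Proxy]): Expr =
    proxyMap.getOrElse(expr, expr.mapChildren(replaceWithProxy(_, proxyMap)))

  // ((one * two) * (one * two)) with a proxy registered for (one * two).
  def demo(): Expr = {
    val mul = Mul(Leaf("one"), Leaf("two"))
    replaceWithProxy(Mul(mul, mul), Map[Expr, Proxy](mul -> Proxy(mul)))
  }
}
```

Here `demo()` returns `Mul(Proxy(mul), Proxy(mul))`: both occurrences are replaced, and the expression held inside each proxy is left untouched rather than being wrapped again.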

cloud-fan · 2020-11-16T08:10:30Z

...lyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SubExprEvaluationRuntime.scala

+
+    expressions.foreach(equivalentExpressions.addExprTree(_))
+
+    val proxyMap = mutable.Map.empty[Expression, ExpressionProxy]


Is it OK to use a simple map here? Two expressions may not be equal to each other even if they are semantically equal.

For semantically equal exprs, we put an expr -> proxy pair into the map for each of them, and note that the proxy is the same. So later, when we traverse down into the expressions, we look them up in the map. We don't do semantic comparison when looking at this map.

Ah I get it now. Seems we can use IdentityHashMap to be more explicit.
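A small sketch of what the `IdentityHashMap` suggestion buys, using a hypothetical `Node` case class in place of real expressions: each instance is its own key (compared by reference), while several keys can still point at the same shared proxy object.

```scala
import java.util.IdentityHashMap

object IdentityMapSketch {
  // Hypothetical stand-in for an expression; two instances with the same
  // name are == to each other but are distinct objects.
  final case class Node(name: String)

  def demo(): (Int, Boolean, Boolean) = {
    val first = Node("a * b")
    val second = Node("a * b")
    val proxy = new Object // stand-in for the shared proxy

    val proxyMap = new IdentityHashMap[Node, Object]()
    proxyMap.put(first, proxy)
    proxyMap.put(second, proxy)

    val sameProxy = proxyMap.get(first) eq proxyMap.get(second)
    // A third, equal-but-different instance is not a key of the map.
    val thirdFound = proxyMap.containsKey(Node("a * b"))
    (proxyMap.size, sameProxy, thirdFound)
  }
}
```

`demo()` gives `(2, true, false)`: two entries, one shared proxy, and no accidental match for an instance that was never inserted.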

cloud-fan · 2020-11-16T08:17:51Z

...src/test/scala/org/apache/spark/sql/catalyst/expressions/SubExprEvaluationRuntimeSuite.scala

+    // ( (one * two) * (one * two) )
+    assert(proxys.size == 2)
+    val expected = ExpressionProxy(mul2, runtime)
+    assert(proxys.head == expected)


should this be proxys.forall(_ == expected)?

yeah, you're right.

cloud-fan · 2020-11-16T08:19:13Z

...src/test/scala/org/apache/spark/sql/catalyst/expressions/SubExprEvaluationRuntimeSuite.scala

+    })
+    assert(proxys.isEmpty)
+  }
+}


Can we test attributes?

val attr1 = AttributeReference("a", ...)
val attr2 = attr1.withName("A")

To make sure 2 semantically-equal attributes can be optimized.

EquivalentExpressions skips LeafExpression, so attributes won't be counted as subexpressions.
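The effect of skipping leaves can be sketched on a hypothetical toy tree. This is not how `EquivalentExpressions` is implemented; `Expr`, `Attr`, `Mul`, and `commonSubexprs` are names invented for the example, and plain equality stands in for semantic equality. Only non-leaf subtrees that occur more than once are reported.

```scala
object EquivalentExprSketch {
  // Hypothetical toy tree; leaves have no children.
  sealed trait Expr { def children: Seq[Expr] }

  final case class Attr(name: String) extends Expr {
    def children: Seq[Expr] = Nil
  }

  final case class Mul(left: Expr, right: Expr) extends Expr {
    def children: Seq[Expr] = Seq(left, right)
  }

  // Collects every non-leaf subtree, including the root itself.
  def nonLeafSubtrees(e: Expr): Seq[Expr] =
    if (e.children.isEmpty) Nil else e +: e.children.flatMap(nonLeafSubtrees)

  // A subtree is a common subexpression if it occurs more than once.
  def commonSubexprs(root: Expr): Set[Expr] =
    nonLeafSubtrees(root)
      .groupBy(identity)
      .collect { case (e, occurrences) if occurrences.size > 1 => e }
      .toSet

  // ((one * two) * (one * two)): the attributes repeat too, but are skipped.
  def demo(): Set[Expr] = {
    val mul = Mul(Attr("one"), Attr("two"))
    commonSubexprs(Mul(mul, mul))
  }
}
```

For this input `demo()` contains only `Mul(Attr("one"), Attr("two"))`, even though `Attr("one")` and `Attr("two")` each appear twice.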

...lyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SubExprEvaluationRuntime.scala

SparkQA · 2020-11-17T01:46:42Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35786/

SparkQA · 2020-11-17T02:11:31Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35786/

SparkQA · 2020-11-17T05:37:02Z

Test build #131184 has finished for PR 30341 at commit db115d6.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-11-17T14:29:36Z

thanks, merging to master!

dongjoon-hyun · 2020-11-17T16:41:28Z

Thank you, @viirya , @maropu , @cloud-fan !

viirya · 2020-11-17T17:16:11Z

Thanks @dongjoon-hyun @maropu @cloud-fan

HyukjinKwon · 2020-11-18T02:38:35Z

Sorry for a late comment. +1, nice.

viirya · 2020-11-18T02:51:10Z

Thanks @HyukjinKwon!

…equantially

### What changes were proposed in this pull request?

This follow-up fixes an issue when inserting key/value pairs into `IdentityHashMap` in `SubExprEvaluationRuntime`.

### Why are the changes needed?

The last commits to #30341 followed a review comment to use `IdentityHashMap`. Because we rely on `IdentityHashMap` to compare keys by reference, we should not convert the expression pairs to a Scala map before inserting them. A Scala map compares keys by equality, so we would lose keys that are equal but have different references.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Run benchmark to verify.

Closes #30459 from viirya/SPARK-33427-map.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
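The pitfall described in this follow-up can be reproduced in isolation. The sketch below is an assumption-laden toy (a hypothetical `Node` case class and string values instead of real expressions and proxies), not the actual fix: it compares inserting pairs through an intermediate Scala `Map` with inserting them directly.

```scala
import java.util.IdentityHashMap

object IdentityInsertSketch {
  // Hypothetical stand-in for an expression; equal names mean == instances.
  final case class Node(name: String)

  def demo(): (Int, Int) = {
    // Two distinct objects that are == to each other, sharing one proxy value.
    val pairs = Seq(Node("a * b") -> "proxy", Node("a * b") -> "proxy")

    // Going through a Scala Map first merges keys that are == to each other,
    // so one of the two references never reaches the IdentityHashMap.
    val viaScalaMap = new IdentityHashMap[Node, String]()
    pairs.toMap.foreach { case (k, v) => viaScalaMap.put(k, v) }

    // Inserting the pairs directly keeps one entry per reference.
    val direct = new IdentityHashMap[Node, String]()
    pairs.foreach { case (k, v) => direct.put(k, v) }

    (viaScalaMap.size, direct.size)
  }
}
```

`demo()` returns `(1, 2)`: the detour through an equality-based map silently drops one of the two references.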

Add subexpression elimination for interpreted expression evaluation.

c7fdae5

viirya marked this pull request as draft November 11, 2020 22:07

github-actions bot added the SQL label Nov 11, 2020

viirya commented Nov 11, 2020

Catch cache exception and throw original exception.

33ac8b4

Use invalidateAll instead of cleanUp.

6bab83c

Add tests.

ddd3a96

viirya changed the title ~~[WIP][SPARK-33427][SQL] Add subexpression elimination for interpreted expression evaluation~~ [SPARK-33427][SQL] Add subexpression elimination for interpreted expression evaluation Nov 12, 2020

viirya marked this pull request as ready for review November 12, 2020 04:57