@viirya
Member

@viirya viirya commented Nov 11, 2020

What changes were proposed in this pull request?

This patch adds subexpression elimination to interpreted expression evaluation. Interpreted evaluation is used when codegen cannot be used, for example with a complex schema.

Why are the changes needed?

Currently, subexpression elimination is only done for codegen. In some cases, however, expressions have to be evaluated in interpreted mode: for example, when codegen fails to compile and falls back to interpreted mode, or when the input/output schema of the expressions is complex. Such complex schemas are common and can even be produced by the query optimizer, e.g. SPARK-32945.

We should also support subexpression elimination for interpreted evaluation. That would reduce the performance gap when Spark falls back from codegen to interpreted expression evaluation, and improve Spark's usability.
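The idea can be illustrated with a minimal sketch, assuming a toy expression tree rather than Spark's actual `Expression` classes. All names here (`SubExprSketch`, `Col`, `Mul`, `Add`, `evalCached`, `mulEvaluations`) are hypothetical and only meant to show why caching a repeated subexpression per row saves work in interpreted mode; the patch itself may be structured differently.

```scala
import scala.collection.mutable

object SubExprSketch {
  // Toy expression tree (illustrative only, not Spark's Expression hierarchy).
  sealed trait Expr
  case class Col(index: Int) extends Expr
  case class Mul(left: Expr, right: Expr) extends Expr
  case class Add(left: Expr, right: Expr) extends Expr

  // Counts how many Mul nodes were actually computed.
  var mulEvaluations = 0

  // Plain interpreted evaluation: every occurrence of a subtree is recomputed.
  def eval(e: Expr, row: Array[Long]): Long = e match {
    case Col(i) => row(i)
    case Mul(l, r) =>
      mulEvaluations += 1
      eval(l, row) * eval(r, row)
    case Add(l, r) => eval(l, row) + eval(r, row)
  }

  // Interpreted evaluation with a per-row cache: equal Mul subtrees are computed once.
  def evalCached(e: Expr, row: Array[Long], cache: mutable.Map[Expr, Long]): Long = e match {
    case Col(i) => row(i)
    case m @ Mul(l, r) =>
      cache.get(m) match {
        case Some(v) => v
        case None =>
          mulEvaluations += 1
          val v = evalCached(l, row, cache) * evalCached(r, row, cache)
          cache.update(m, v)
          v
      }
    case Add(l, r) => evalCached(l, row, cache) + evalCached(r, row, cache)
  }
}
```

For an expression such as `Add(Mul(Col(0), Col(1)), Mul(Col(0), Col(1)))`, `eval` computes the product twice per row, while `evalCached` with a fresh cache per row computes it once. With an expensive subexpression such as `from_json`, that difference is what the benchmark below measures.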

Benchmark

Update SubExprEliminationBenchmark:

Before:

OpenJDK 64-Bit Server VM 1.8.0_265-b01 on Mac OS X 10.15.6
 Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
 from_json as subExpr:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 -------------------------------------------------------------------------------------------------------------------------
subexpressionElimination on, codegen off           24707          25688         903          0.0   247068775.9       1.0X

After:

OpenJDK 64-Bit Server VM 1.8.0_265-b01 on Mac OS X 10.15.6
 Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
 from_json as subExpr:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 -------------------------------------------------------------------------------------------------------------------------
subexpressionElimination on, codegen off            2360           2435          87          0.0    23604320.7      11.2X

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit tests, plus a manual benchmark run.

@viirya viirya marked this pull request as draft November 11, 2020 22:07
@github-actions github-actions bot added the SQL label Nov 11, 2020
Member Author

@viirya viirya left a comment

The main code change should be ready for review. I marked it as WIP because I'd like to add some tests later.

@SparkQA

SparkQA commented Nov 12, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35561/

@SparkQA

SparkQA commented Nov 12, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35561/

@viirya
Member Author

viirya commented Nov 12, 2020

The GitHub Actions checks actually passed.

@viirya viirya changed the title from "[WIP][SPARK-33427][SQL] Add subexpression elimination for interpreted expression evaluation" to "[SPARK-33427][SQL] Add subexpression elimination for interpreted expression evaluation" Nov 12, 2020
@viirya viirya marked this pull request as ready for review November 12, 2020 04:57
@viirya
Member Author

viirya commented Nov 12, 2020

cc @cloud-fan @dongjoon-hyun @maropu @HyukjinKwon

@SparkQA

SparkQA commented Nov 12, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35571/

@SparkQA

SparkQA commented Nov 12, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35571/

@dongjoon-hyun
Member

Thank you for pinging me, @viirya .

@SparkQA

SparkQA commented Nov 12, 2020

Test build #130955 has finished for PR 30341 at commit 6bab83c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 15, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35709/

@viirya
Member Author

viirya commented Nov 15, 2020

retest this please

@SparkQA

SparkQA commented Nov 15, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35709/

@SparkQA

SparkQA commented Nov 15, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35710/

@SparkQA

SparkQA commented Nov 15, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35710/

@maropu
Member

maropu commented Nov 15, 2020

> Please let us know if you have any other comments, @maropu and @cloud-fan.

Yea, I don't have any more comments, so I'll leave this to @cloud-fan.

@SparkQA

SparkQA commented Nov 15, 2020

Test build #131107 has finished for PR 30341 at commit 77168fe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

subexpressionElimination off, codegen on           25932          26908         916          0.0   259320042.3       1.0X
subexpressionElimination off, codegen off          26085          26159          65          0.0   260848905.0       1.0X
subexpressionElimination on, codegen on             2860           2939          72          0.0    28603312.9       9.1X
subexpressionElimination on, codegen off            2517           2617          93          0.0    25165157.7      10.3X
Member

Thank you for this additional 10.3x.

private def replaceWithProxy(
    expr: Expression,
    proxyMap: Map[Expression, ExpressionProxy]): Expression = {
  proxyMap.getOrElse(expr, expr.mapChildren(replaceWithProxy(_, proxyMap)))
Contributor

This is a top-down traversal, which means we will recursively replace expr with a proxy even for ExpressionProxy. Is that expected?

Member Author

Once we replace an expression with an ExpressionProxy, we stop traversing down. We only traverse down to the children if we cannot find the current expression in proxyMap. Does this answer your question?

Contributor

Ah sorry I misread the code. You are right.
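The behaviour discussed in this thread can be reproduced with a small sketch, assuming a toy tree in place of Spark's `Expression`. The types `Leaf`, `Mul`, and `Proxy` and their `mapChildren` implementations are hypothetical stand-ins. The point is only that the second argument of `getOrElse` is passed by name, so children are visited only when the current node is not a key of the map.

```scala
object ProxyReplaceSketch {
  // Toy tree with a mapChildren method (illustrative, not Spark's TreeNode API).
  sealed trait Expr { def mapChildren(f: Expr => Expr): Expr }
  case class Leaf(name: String) extends Expr {
    def mapChildren(f: Expr => Expr): Expr = this
  }
  case class Mul(left: Expr, right: Expr) extends Expr {
    def mapChildren(f: Expr => Expr): Expr = Mul(f(left), f(right))
  }
  case class Proxy(child: Expr) extends Expr {
    def mapChildren(f: Expr => Expr): Expr = Proxy(f(child))
  }

  // Same shape as the method under review: a hit in the map returns the proxy
  // and stops; only a miss descends into the children.
  def replaceWithProxy(expr: Expr, proxyMap: Map[Expr, Proxy]): Expr =
    proxyMap.getOrElse(expr, expr.mapChildren(replaceWithProxy(_, proxyMap)))
}
```

If both `one * two` and `(one * two) * (one * two)` are keys of the map, replacing a tree that contains the larger expression yields a proxy for the larger expression only. The inner `one * two` inside that proxy's child is left untouched, because the traversal never descends past the first hit.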


expressions.foreach(equivalentExpressions.addExprTree(_))

val proxyMap = mutable.Map.empty[Expression, ExpressionProxy]
Contributor

Is it OK to use a simple map here? Two expressions may not be equal to each other even if they are semantically equal.

Member Author

For semantically equal exprs, we put one expr -> proxy pair per expression into the map, and all of those pairs share the same proxy. Later, when we traverse down into the expressions, we just look each one up in this map. We don't do any semantic comparison when looking up in this map.

Contributor

Ah I get it now. Seems we can use IdentityHashMap to be more explicit.
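A minimal sketch of the scheme described in this thread, under stated assumptions: `canonical` is a hypothetical stand-in for semantic equality (it only treats `Add` as commutative), and `Proxy` is a placeholder class. Grouping uses the semantic comparison once, and afterwards every member of a group is registered by reference and points to the one proxy of its group.

```scala
import java.util.IdentityHashMap

object ProxyMapSketch {
  sealed trait Expr
  case class Col(name: String) extends Expr
  case class Add(left: Expr, right: Expr) extends Expr

  // Hypothetical canonical form: orders the operands of the commutative Add.
  def canonical(e: Expr): Expr = e match {
    case Add(l, r) =>
      val (cl, cr) = (canonical(l), canonical(r))
      if (cl.toString <= cr.toString) Add(cl, cr) else Add(cr, cl)
    case other => other
  }

  final class Proxy(val child: Expr)

  // Each expression in a group of semantically equal expressions becomes its
  // own key (compared by reference); all keys of a group share one proxy.
  def buildProxyMap(exprs: Seq[Expr]): IdentityHashMap[Expr, Proxy] = {
    val map = new IdentityHashMap[Expr, Proxy]()
    exprs.groupBy(canonical).values.filter(_.size > 1).foreach { group =>
      val proxy = new Proxy(group.head)
      group.foreach(e => map.put(e, proxy))
    }
    map
  }
}
```

With `a + b` and `b + a` as input, the two expressions are not equal under `==`, yet both end up as keys of the map with the same proxy instance as their value, so a later lookup needs no semantic comparison.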

// ( (one * two) * (one * two) )
assert(proxys.size == 2)
val expected = ExpressionProxy(mul2, runtime)
assert(proxys.head == expected)
Contributor

should this be proxys.forall(_ == expected)?

Member Author

yeah, you're right.

})
assert(proxys.isEmpty)
}
}
Contributor

can we test attributes?

val attr1 = AttributeReference("a", ...)
val attr2 = attr1.withName("A")

To make sure 2 semantically-equal attributes can be optimized.

Member Author

EquivalentExpressions skips LeafExpression, so attributes are not counted as subexpressions.
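As a rough illustration of why attributes never show up as common subexpressions, here is a simplified sketch that counts repeated subtrees while skipping leaves. It is an assumption-laden toy: `Attr`, `Mul`, `countSubExprs`, and `common` are hypothetical names, and it compares by case-class equality rather than by the semantic equality that `EquivalentExpressions` uses.

```scala
import scala.collection.mutable

object LeafSkipSketch {
  sealed trait Expr { def children: Seq[Expr] }
  case class Attr(name: String) extends Expr { val children: Seq[Expr] = Nil }
  case class Mul(left: Expr, right: Expr) extends Expr {
    def children: Seq[Expr] = Seq(left, right)
  }

  // Counts occurrences of every non-leaf subtree; leaves are never recorded.
  def countSubExprs(root: Expr): Map[Expr, Int] = {
    val counts = mutable.Map.empty[Expr, Int].withDefaultValue(0)
    def visit(e: Expr): Unit =
      if (e.children.nonEmpty) {
        counts(e) += 1
        e.children.foreach(visit)
      }
    visit(root)
    counts.toMap
  }

  // Subtrees that occur more than once are candidates for elimination.
  def common(root: Expr): Set[Expr] =
    countSubExprs(root).filter(_._2 > 1).keySet
}
```

In `(a * b) * (a * b)`, the attribute `a` occurs twice, but only `a * b` is reported as a common subexpression.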

@SparkQA

SparkQA commented Nov 17, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35786/

@SparkQA

SparkQA commented Nov 17, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35786/

@SparkQA

SparkQA commented Nov 17, 2020

Test build #131184 has finished for PR 30341 at commit db115d6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 9283484 Nov 17, 2020
@dongjoon-hyun
Member

Thank you, @viirya , @maropu , @cloud-fan !

@viirya
Member Author

viirya commented Nov 17, 2020

Thanks @dongjoon-hyun @maropu @cloud-fan

@HyukjinKwon
Member

Sorry for a late comment. +1, nice.

@viirya
Member Author

viirya commented Nov 18, 2020

Thanks @HyukjinKwon!

HyukjinKwon pushed a commit that referenced this pull request Nov 23, 2020
…equantially

### What changes were proposed in this pull request?

This follow-up fixes an issue when inserting key/value pairs into `IdentityHashMap` in `SubExprEvaluationRuntime`.

### Why are the changes needed?

The last commits to #30341 followed a review comment and switched to `IdentityHashMap`. Because we rely on `IdentityHashMap` to compare keys by reference, we should not convert the expression pairs to a Scala map before inserting them. A Scala map compares keys by equality, so we would lose keys that are equal but have different references.
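The pitfall can be shown with a short sketch. The names `Mul`, `Proxy`, `viaScalaMap`, and `direct` are hypothetical and not taken from `SubExprEvaluationRuntime`; only `java.util.IdentityHashMap` and the Scala collection methods are real APIs.

```scala
import java.util.IdentityHashMap

object IdentityInsertSketch {
  // Case classes compare structurally, so two instances can be == but not eq.
  case class Mul(left: String, right: String)
  final class Proxy(val id: Int)

  // Problematic: toMap deduplicates keys by equality before the
  // IdentityHashMap ever sees them.
  def viaScalaMap(pairs: Seq[(Mul, Proxy)]): IdentityHashMap[Mul, Proxy] = {
    val result = new IdentityHashMap[Mul, Proxy]()
    pairs.toMap.foreach { case (k, v) => result.put(k, v) }
    result
  }

  // Inserting pair by pair keeps one entry per reference.
  def direct(pairs: Seq[(Mul, Proxy)]): IdentityHashMap[Mul, Proxy] = {
    val result = new IdentityHashMap[Mul, Proxy]()
    pairs.foreach { case (k, v) => result.put(k, v) }
    result
  }
}
```

Given two distinct `Mul("one", "two")` instances that share a proxy, `viaScalaMap` ends up with a single key while `direct` keeps both, so only the latter can resolve every occurrence by reference.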

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Run benchmark to verify.

Closes #30459 from viirya/SPARK-33427-map.

Authored-by: Liang-Chi Hsieh
Signed-off-by: HyukjinKwon
@viirya viirya deleted the SPARK-33427 branch December 27, 2023 18:24