[SPARK-28916][SQL] Split subexpression elimination functions code for Generate[Mutable|Unsafe]Projection #25642

mgaido91 · 2019-09-01T07:02:16Z

What changes were proposed in this pull request?

The PR proposes to split the subexpression elimination code for Generate[Mutable|Unsafe]Projection before it is inlined, instead of inlining all of the function calls in the apply method.

Why are the changes needed?

Before this PR, code generation could fail due to the JVM's 64KB limit on method code size if many subexpression elimination functions were generated. The added UT is a reproducer for the issue (thanks to the JIRA reporter and @HyukjinKwon for it).

Does this PR introduce any user-facing change?

No.

How was this patch tested?

added UT

mgaido91 · 2019-09-01T07:02:39Z

cc @cloud-fan @kiszk @maropu

SparkQA · 2019-09-01T10:35:45Z

Test build #109992 has finished for PR 25642 at commit 9295731.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2019-09-02T07:14:09Z

sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala

    }
  }
+
+  test("SPARK-28916: subexrepssion elimination can cause 64kb code limit") {


How long does this test take to run? We can write a unit test instead if the end-to-end test is too expensive to run.

Yes, it takes about 3 minutes. I'll try to find a way to create a UT...

cloud-fan · 2019-09-02T07:14:39Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala

+   * Returns the code for subexpression elimination after splitting it if necessary.
+   */
+  def subexprFunctionsCode: String = {
+    // Wholestage codegen does not allow subexpression elimination


Shall we add assert(currentVars == null) to guarantee it?

cc @maropu, I vaguely remember that we do subexpression elimination in hash aggregate in one of your PRs.

I added it, thanks. Anyway, when we generate subexpression elimination functions they all take an InternalRow as input. Unless that mechanism changes, I doubt it is usable in wholestage codegen.

> shall we add assert(currentVars == null) to guarantee it?

I cannot do that, it doesn't work. In that case just an empty string is returned because the subexpression functions seq is empty.

If so, how about assert(currentVars == null || subexprFunctions.isEmpty) for strict checks?

I can add that, thanks

cloud-fan · 2019-09-02T07:15:49Z

@mgaido91 thanks for the fix! One question: AFAIK we will fall back to the interpreted code path if codegen fails compilation. Why doesn't it work here?

mgaido91 · 2019-09-02T08:10:00Z

> AFAIK we will fall back to the interpreted code path if codegen fails compilation. Why doesn't it work here?

It does work. We could see this as an improvement, indeed.

SparkQA · 2019-09-02T09:05:32Z

Test build #110006 has finished for PR 25642 at commit 0167929.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

kiszk · 2019-09-02T09:37:57Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala

+   * Returns the code for subexpression elimination after splitting it if necessary.
+   */
+  def subexprFunctionsCode: String = {
+    // Wholestage codegen does not allow subexpression elimination


Could you elaborate on your thinking behind this comment?
IMHO, this change is not directly related to whether whole-stage codegen performs subexpression elimination, since that is controlled in other places.

The reason for this comment is that here we pass the InternalRow as the argument, not the currentVars that would be needed in whole-stage codegen. This is because the subexpression elimination functions always take an InternalRow as their argument. For this reason, whole-stage codegen disables subexpression elimination.
So I put this comment here to highlight that if there is future work to support subexpression elimination in whole-stage codegen as well, this method needs to be modified too.

mgaido91 · 2019-09-02T09:56:21Z

retest this please

SparkQA · 2019-09-02T10:24:35Z

Test build #110010 has finished for PR 25642 at commit 0167929.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2019-09-02T12:19:42Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala

+   * Returns the code for subexpression elimination after splitting it if necessary.
+   */
+  def subexprFunctionsCode: String = {
+    // Wholestage codegen does not allow subexpression elimination: in that case, subexprFunctions


I think this is not true, see subexpressionEliminationForWholeStageCodegen in this class. @mgaido91 can you double-check whether this fix works for whole-stage codegen as well?

Seems like whole-stage-codegen has the same issue, but unfortunately splitExpressionsWithCurrentInputs is not completed yet.

I think the right description here is: whole-stage-codegen supports subexpression elimination, but we are not able to split the code for it yet.

cc @maropu @viirya to double-check.

@cloud-fan yes, sorry, this was not explained very well. I mean that whole-stage codegen does not honor spark.sql.subexpressionElimination.enabled, and even when it is true, it doesn't use subexprFunctions. So in this method we don't have to deal with whole-stage codegen. Do you have a suggestion for rewording this comment?

How about: whole-stage codegen supports subexpression elimination, and it is handled by another code path.

currently looks better.

looks ok, too.

SparkQA · 2019-09-02T15:38:13Z

Test build #110013 has finished for PR 25642 at commit 2c6b64e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-09-02T18:31:19Z

Test build #110019 has finished for PR 25642 at commit 6f4c524.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2019-09-02T20:20:52Z

sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala

+    df.createOrReplaceTempView("spark64kb")
+    val data = spark.sql("select * from spark64kb limit 10")
+    // This fails if 64Kb limit is reached in code generation
+    data.describe()


Shall we disable fallback to interpreter in this test?

I need to revisit it according to Wenchen's suggestion... unfortunately this might take some time as I am a bit busy these days...

maropu · 2019-09-04T02:11:33Z

sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala

+  test("SPARK-28916: subexrepssion elimination can cause 64kb code limit") {
+    val df = spark.range(2).selectExpr((0 to 5000).map(i => s"id as field_$i"): _*)
+    df.createOrReplaceTempView("spark64kb")
+    val data = spark.sql("select * from spark64kb limit 10")


Is this a test for GenerateMutableProjection? How about the GenerateUnsafeProjection case?

maropu · 2019-09-04T02:17:07Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala

+   */
+  def subexprFunctionsCode: String = {
+    // Whole-stage codegen's subexpression elimination is handled in another code path
+    splitExpressions(subexprFunctions, "subexprFunc_split", Seq("InternalRow" -> INPUT_ROW))


Don't we need to check whether it's empty?

if (subexprFunctions.nonEmpty) { splitExpressions(... } else { "" }

It is not necessary: if it is empty, splitExpressions returns an empty string. Anyway, I can add the check if you think it is clearer.

maropu · 2019-09-04T04:44:56Z

btw (this is off-topic, though), does the HashAggregateExec code for common subexpression elimination have the same issue? It also expands all of the generated CSE code in a single method now:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala#L287

maropu · 2019-09-04T04:54:48Z

To make the title more precise, how about adding "for Generate[Mutable|Unsafe]Projection" to it?

rednaxelafx · 2019-09-04T06:35:04Z

The PR as it is now is one level better than the status quo. That's probably good enough. But I was curious whether it makes more sense to perform tree splitting instead of fixed-level splitting.

Basically, CodegenContext.splitExpressions only performs a fixed one-level splitting, so it splits

orig_func() {
  expr1
  expr2
  expr3
  expr4
  expr5
  expr6
}

into something like the following, assuming the split threshold is an imaginary 2 expressions:

func1() { expr1; expr2 }
func2() { expr3; expr4 }
func3() { expr5; expr6 }

orig_func() {
  func1()
  func2()
  func3()
}

Now, given that we assume the split threshold is 2, after the split this top-level code is still above the split threshold, which is not good.

Instead, it'd be really nice if the splitExpressions utility method can perform tree splitting within itself, without a 1- or 2-level fixed depth split limit.

Doing so would make it more likely that the codegen method size stays below not only 64KB but also lower thresholds such as 8KB.
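To make the difference concrete, here is a minimal toy sketch in Scala. It is not Spark's CodegenContext.splitExpressions: the names `SplitSketch`, `Method`, `splitOnce` and `splitTree` are hypothetical, statements are plain strings, and the threshold counts statements rather than bytes, as in the example above.

```scala
object SplitSketch {
  // A generated method: a name plus the statements in its body.
  final case class Method(name: String, body: Seq[String])

  /** One-level split: chunk the statements into helpers; the caller invokes every helper. */
  def splitOnce(name: String, stmts: Seq[String], threshold: Int): Seq[Method] = {
    if (stmts.length <= threshold) {
      Seq(Method(name, stmts))
    } else {
      val helpers = stmts.grouped(threshold).zipWithIndex.map {
        case (chunk, i) => Method(s"${name}_$i", chunk)
      }.toVector
      // The caller may still exceed the threshold if there are many helpers.
      helpers :+ Method(name, helpers.map(h => s"${h.name}();"))
    }
  }

  /** Tree split: keep splitting the list of helper calls until the caller fits too. */
  def splitTree(name: String, stmts: Seq[String], threshold: Int, level: Int = 0): Seq[Method] = {
    require(threshold >= 2, "a threshold below 2 would never shrink the caller")
    if (stmts.length <= threshold) {
      Seq(Method(name, stmts))
    } else {
      // The level is part of the helper name so that names stay unique across levels.
      val helpers = stmts.grouped(threshold).zipWithIndex.map {
        case (chunk, i) => Method(s"${name}_l${level}_$i", chunk)
      }.toVector
      helpers ++ splitTree(name, helpers.map(h => s"${h.name}();"), threshold, level + 1)
    }
  }
}
```

With six expressions and a threshold of 2, `splitOnce` leaves `orig_func` with three calls, while `splitTree` adds one more level so that every method holds at most two statements.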

kiszk · 2019-09-04T13:08:24Z

@maropu good catch. In addition to that, this line may also produce a very large method. I think another PR can address these issues.

kiszk · 2019-09-04T13:10:14Z

@rednaxelafx I had a similar thought. I was optimistic that the HotSpot compiler can apply inlining wherever possible.

It would be great if splitExpressions can handle splitting better.

mgaido91 · 2019-09-04T13:48:37Z

> btw (this is off-topic, though), does the HashAggregateExec code for common subexpression elimination have the same issue? It also expands all of the generated CSE code in a single method now:
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala#L287

@maropu I think the point there is that with whole-stage codegen we are not able to split things at the moment, so I am not sure whether it is possible to do anything better there.

cloud-fan · 2019-09-04T13:53:02Z

@mgaido91 It will be possible after #20965 . We can apply this patch in HashAggregateExec later.

mgaido91 · 2019-09-04T13:54:32Z

> But I was curious whether or not it makes more sense to perform a tree-splitting instead of a fixed-level splitting.

@rednaxelafx actually, we are doing something similar to what you are suggesting. We already have a splitting feature with two or more levels. The point is that the split points are given by the inner classes. In other words, we currently assume that the number of function calls that fits in a single inner class is a safe threshold for the number of function calls inside a single function. This is not exactly what you are proposing, as it is not driven by the method size conf, but it is close. For more details, please check generateInnerClassesFunctionCalls and SPARK-22226.
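A rough illustration of that grouping idea, as a toy model only: this is not the code of generateInnerClassesFunctionCalls, and `InnerClassGroupingSketch`, `Func`, `Wrapper`, `groupCalls` and the `_callAll` suffix are hypothetical names. It assumes each generated function is hosted by some inner class, emits one wrapper per class, and lets the top-level method call the wrappers.

```scala
object InnerClassGroupingSketch {
  // A generated helper function and the (hypothetical) inner class that hosts it.
  final case class Func(ownerClass: String, name: String)

  // A wrapper method emitted once per inner class; it calls that class's functions.
  final case class Wrapper(name: String, body: Seq[String])

  /**
   * Groups calls by hosting class. Returns one wrapper per class plus the body of
   * the top-level method, which holds one call per class instead of one per function.
   */
  def groupCalls(funcs: Seq[Func]): (Seq[Wrapper], Seq[String]) = {
    val wrappers = funcs.groupBy(_.ownerClass).toSeq.sortBy(_._1).map {
      case (cls, fs) => Wrapper(s"${cls}_callAll", fs.map(f => s"${f.name}();"))
    }
    (wrappers, wrappers.map(w => s"${w.name}();"))
  }
}
```

In this model the size of the top-level method grows with the number of inner classes rather than with the number of functions, which is the sense in which the inner classes act as the split points.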

SparkQA · 2019-09-06T21:10:28Z

Test build #110259 has finished for PR 25642 at commit 2d4b8f8.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2019-09-06T22:32:58Z

retest this please

SparkQA · 2019-09-07T02:19:25Z

Test build #110267 has finished for PR 25642 at commit 2d4b8f8.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2019-09-09T05:30:47Z

thanks, merging to master!

… Generate[Mutable|Unsafe]Projection

### What changes were proposed in this pull request?

The PR proposes to split the code for subexpression elimination before inlining the function calls all in the apply method for `Generate[Mutable|Unsafe]Projection`.

### Why are the changes needed?

Before this PR, code generation can fail due to the 64KB code size limit if a lot of subexpression elimination functions are generated. The added UT is a reproducer for the issue (thanks to the JIRA reporter and HyukjinKwon for it).

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

added UT

Closes apache#25642 from mgaido91/SPARK-28916.

Authored-by: Marco Gaido <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

… Generate[Mutable|Unsafe]Projection

The PR proposes to split the code for subexpression elimination before inlining the function calls all in the apply method for `Generate[Mutable|Unsafe]Projection`.

Before this PR, code generation can fail due to the 64KB code size limit if a lot of subexpression elimination functions are generated. The added UT is a reproducer for the issue (thanks to the JIRA reporter and HyukjinKwon for it).

No.

added UT

Closes apache#25642 from mgaido91/SPARK-28916.

Authored-by: Marco Gaido <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit c411579)
(cherry picked from commit 50c88609dc7e7dea6f747f4edc7b6349c0e2d644)

[SPARK-28916][SQL] Split subexpression elimination functions code to …

9295731

…avoid 64KB limit

cloud-fan reviewed Sep 2, 2019

address comment

0167929

kiszk reviewed Sep 2, 2019

revert

2c6b64e

cloud-fan reviewed Sep 2, 2019

reword comment

6f4c524

dongjoon-hyun added the SQL label Sep 2, 2019

viirya reviewed Sep 2, 2019

maropu reviewed Sep 4, 2019

mgaido91 changed the title ~~[SPARK-28916][SQL] Split subexpression elimination functions code~~ [SPARK-28916][SQL] Split subexpression elimination functions code for Generate[Mutable|Unsafe]Projection Sep 4, 2019

viirya mentioned this pull request Sep 6, 2019

[SPARK-29013][SQL] Structurally equivalent subexpression elimination #25717

Closed

change uts

2d4b8f8

mgaido91 force-pushed the SPARK-28916 branch from d23dee9 to 2d4b8f8 Compare September 6, 2019 19:45

maropu mentioned this pull request Sep 6, 2019

[SPARK-29008][SQL] Define an individual method for each common subexpression in HashAggregateExec #25710

Closed

cloud-fan closed this in c411579 Sep 9, 2019
