[SPARK-22520][SQL] Support code generation for large CaseWhen #19752

mgaido91 · 2017-11-14T22:30:49Z

What changes were proposed in this pull request?

Code generation is disabled for CaseWhen when the number of branches exceeds spark.sql.codegen.maxCaseBranches (which defaults to 20). This was done to prevent the well-known 64KB method limit exception.
This PR proposes to support code generation in those cases as well, without causing exceptions. As a side effect, we could get rid of the spark.sql.codegen.maxCaseBranches configuration.
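
For context on the limit mentioned above: the JVM rejects any method whose bytecode reaches 64KB, so a generator that emits every branch into a single method eventually fails. The general mitigation is to group the generated snippets into several smaller helper methods. The sketch below illustrates only that idea. It is not Spark's `splitExpressions`; the class `SplitSketch`, the character threshold, and the emitted method names are hypothetical, and source length is used as a crude stand-in for bytecode size.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: groups generated code snippets into chunks
// so that each chunk can be emitted as its own small helper method.
class SplitSketch {

    // Greedily packs snippets into groups whose total length stays at or
    // below maxChars. A snippet longer than maxChars gets a group of its own.
    static List<List<String>> group(List<String> snippets, int maxChars) {
        List<List<String>> groups = new ArrayList<>();
        List<String> current = new ArrayList<>();
        int currentSize = 0;
        for (String snippet : snippets) {
            if (!current.isEmpty() && currentSize + snippet.length() > maxChars) {
                groups.add(current);
                current = new ArrayList<>();
                currentSize = 0;
            }
            current.add(snippet);
            currentSize += snippet.length();
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }

    // Renders each group as the source text of a helper method <prefix>_<n>.
    static List<String> render(List<List<String>> groups, String prefix) {
        List<String> methods = new ArrayList<>();
        for (int n = 0; n < groups.size(); n++) {
            methods.add("private void " + prefix + "_" + n + "(InternalRow i) {\n"
                + String.join("\n", groups.get(n)) + "\n}");
        }
        return methods;
    }

    public static void main(String[] args) {
        List<String> snippets = Arrays.asList("aaaaaaaaaa", "bbbbbbbbbb", "cccccccccc");
        for (String method : render(group(snippets, 25), "caseWhen")) {
            System.out.println(method);
        }
    }
}
```

A real generator would also have to thread shared state (such as a "condition already met" flag) through the helpers, which is what most of the review discussion in this PR is about.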

How was this patch tested?

existing UTs

SparkQA · 2017-11-15T01:21:15Z

Test build #83866 has finished for PR 19752 at commit 98eaae9.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class CaseWhen(

mgaido91 · 2017-11-17T15:18:31Z

@kiszk may I kindly ask you to review this please? Thanks.

kiszk · 2017-11-19T05:58:19Z

Sure, I will review this.
It may overlap with #18641. To avoid conflicts, I will review this after #18641.

mgaido91 · 2017-11-21T18:13:01Z

@kiszk actually, after checking your PR, I think the issue addressed there would also be handled here by default. What do you think?

gatorsmile · 2017-11-22T07:04:28Z

...talyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/conditionalExpressions.scala

+    val casesCode = if (ctx.INPUT_ROW == null || ctx.currentVars != null) {
+      cases.mkString("\n")
+    } else {
+      ctx.splitExpressions(cases, "caseWhen",


After PR #19767 is merged, we would not need to call splitExpressions in almost all cases, right?

WDYT?

I think that we need to call it, indeed, as explained in this comment: #19767 (comment)

But I think this also implicitly covers #18641, even though its main goal is different.

Then, could you show us a test case? It can be a performance test if it is hard to make the function hit the 64KB limit.

I can reuse the same UT added in #18641 for the 64KB limit, if it is ok for you.

As far as performance is concerned, I'd need to create a test with many rows; otherwise the overhead is higher than the execution time, and such a test is not going to finish in a few seconds. Do you want me to post some code here in the PR, with the execution times before and after the PR, without adding it as a test?

mgaido91 · 2017-11-24T15:40:40Z

@gatorsmile I added a test case to check that the execution plan is WholeStageCodegenExec as expected. I also ran a performance test using almost the same code, i.e.:

val N = 30
val nRows = 1000000
var expr1 = when($"id" === lit(0), 0)
var expr2 = when($"id" === lit(0), 10)
(1 to N).foreach { i =>
  expr1 = expr1.when($"id" === lit(i), -i)
  expr2 = expr2.when($"id" === lit(i + 10), i)
}
time { spark.range(nRows).select(expr1.as("c1"), expr2.otherwise(0).as("c2")).sort("c1").show }

Before this PR, it takes on average 1091.690996ms. After the PR, it takes on average 106.894443ms.

Actually, there is a problem which is fixed in #18641 and is not fixed here, i.e. when the code contains deeply nested expressions, the 64KB limit exception can still happen. But this should be handled in a more generic way in #19813.

@kiszk What do you think?

SparkQA · 2017-11-24T18:31:44Z

Test build #84168 has finished for PR 19752 at commit 6225c8e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-11-26T19:28:06Z

...talyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/conditionalExpressions.scala

+            """
+        },
+        foldFunctions = { funcCalls =>
+          funcCalls.map(funcCall => s"$conditionMet = $funcCall;").mkString("\n")


When caseWhenConditionMet is true, we do not need to call funcCall.

You are right, but we already have checks for it inside the functions. Do we also need to check it outside?

We want to avoid the extra function calls here. It is not cheap when the number of rows is large. Now, we split the functions pretty aggressively. I saw many new functions are generated.

OK, I'll do it. Then I'd suggest doing the same in other places too. I can check where an analogous pattern is used and create a PR, if that is OK.
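
The cost being discussed can be illustrated outside Spark. The sketch below is plain Java, not generated code: it compares folding the split functions unconditionally with folding them behind a guard on the flag. `GuardedCallsSketch`, `branch`, and the call counter are hypothetical names introduced for illustration. Both variants compute the same flag; only the number of invocations differs.

```java
// Hypothetical illustration: each branch(...) call stands in for one split
// caseWhen_N function and reports whether a condition has matched.
class GuardedCallsSketch {
    static int calls = 0;

    // Stand-in for a generated function. It does no work if an earlier
    // branch already matched, mirroring the check "inside the functions".
    static boolean branch(boolean alreadyMet, int id, int target) {
        calls++;
        if (alreadyMet) {
            return true;
        }
        return id == target;
    }

    // Unguarded fold: every function is invoked, even after a match.
    // Returns the number of invocations performed.
    static int unguarded(int id, int branches) {
        calls = 0;
        boolean conditionMet = false;
        for (int t = 0; t < branches; t++) {
            conditionMet = branch(conditionMet, id, t);
        }
        return calls;
    }

    // Guarded fold: once a branch has matched, the remaining calls are skipped.
    // Returns the number of invocations performed.
    static int guarded(int id, int branches) {
        calls = 0;
        boolean conditionMet = false;
        for (int t = 0; t < branches && !conditionMet; t++) {
            conditionMet = branch(conditionMet, id, t);
        }
        return calls;
    }

    public static void main(String[] args) {
        System.out.println("unguarded calls: " + unguarded(2, 30));
        System.out.println("guarded calls:   " + guarded(2, 30));
    }
}
```

With many rows and aggressively split functions, the skipped invocations are where the saving would come from; how large it is in practice depends on the JIT and on where matches tend to occur.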

gatorsmile · 2017-11-26T19:29:24Z

...talyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/conditionalExpressions.scala

-    //     }
-    //   }
-    // }
+    val conditionMet = ctx.freshName("caseWhenConditionMet")


Add a comment to explain what it is.

gatorsmile · 2017-11-26T19:29:53Z

...talyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/conditionalExpressions.scala

-    //   }
-    // }
+    val conditionMet = ctx.freshName("caseWhenConditionMet")
+    ctx.addMutableState("boolean", ev.isNull, "")


ctx.JAVA_BOOLEAN

gatorsmile · 2017-11-26T19:34:15Z

...talyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/conditionalExpressions.scala

+      allConditions.mkString("\n")
+    } else {
+      ctx.splitExpressions(allConditions, "caseWhen",
+        ("InternalRow", ctx.INPUT_ROW) :: ("boolean", conditionMet) :: Nil, returnType = "boolean",


ctx.JAVA_BOOLEAN

gatorsmile · 2017-11-26T19:37:24Z

...talyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/conditionalExpressions.scala

+    val code = if (ctx.INPUT_ROW == null || ctx.currentVars != null) {
+      allConditions.mkString("\n")
+    } else {
+      ctx.splitExpressions(allConditions, "caseWhen",


Style issue. Indent

gatorsmile · 2017-11-26T19:44:25Z

...talyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/conditionalExpressions.scala

+       1
+      > SELECT CASE WHEN 1 < 0 THEN 1 WHEN 2 > 0 THEN 2.0 ELSE 1.2 END;
+       2
+      > SELECT CASE WHEN 1 < 0 THEN 1 WHEN 2 < 0 THEN 2.0 ELSE null END;


SELECT CASE WHEN 1 < 0 THEN 1 WHEN 2 < 0 THEN 2.0 ELSE null END;
->
SELECT CASE WHEN 1 < 0 THEN 1 WHEN 2 < 0 THEN 2.0 END;

Could you double check Hive returns NULL in the following case?

SELECT CASE WHEN 1 < 0 THEN 1 WHEN 2 < 0 THEN 2.0 END;

I can follow your first suggestion and test this on Hive, but I haven't actually changed this part of the code. I will post the Hive result ASAP.

I confirm that Hive returns NULL. Then I am updating the description as requested.

gatorsmile · 2017-11-26T19:48:38Z

sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala

+      expr2 = expr2.when($"id" === lit(i + 10), i)
+    }
+    val df = spark.range(1).select(expr1, expr2.otherwise(0))
+    df.show


compare the results?

SparkQA · 2017-11-27T00:10:22Z

Test build #84198 has finished for PR 19752 at commit f9c20be.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-11-27T00:15:59Z

Test build #84199 has finished for PR 19752 at commit 9063583.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-11-27T03:17:08Z

LGTM cc @cloud-fan @kiszk

cloud-fan · 2017-11-27T03:25:39Z

...talyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/conditionalExpressions.scala

+    // evaluates to `true` is met and therefore is not needed to go on anymore on the computation
+    // of the following conditions.
+    val conditionMet = ctx.freshName("caseWhenConditionMet")
+    ctx.addMutableState(ctx.JAVA_BOOLEAN, ev.isNull, "")


nit: ctx.addMutableState(ctx.JAVA_BOOLEAN, ev.isNull), as empty string is the default value of the 3rd parameter.

thanks, I branched from a version when there was no default value. I merged and fixed it.

cloud-fan · 2017-11-27T03:29:33Z

...talyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/conditionalExpressions.scala

+                if ($conditionMet) {
+                  continue;
+                }"""
+            }.mkString("do {", "", "\n} while (false);")


nit: do we need a \n after do {?

no, since there is a newline at the beginning of each expression.
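
For readers unfamiliar with the `do { ... } while (false);` shape used in the generated code: inside a `do` loop, `continue` jumps to the loop test, and since the test is `false` the loop ends, so `continue` acts as an early exit from the block. A minimal hand-written sketch follows. It is not actual Spark output; the class `DoWhileSketch` and the three-branch expression are assumptions made up for illustration.

```java
// Hypothetical hand-written equivalent of a three-branch CASE WHEN using the
// do { ... } while (false) shape; this is not Spark-generated code.
class DoWhileSketch {

    // CASE WHEN id = 0 THEN 10 WHEN id = 1 THEN 20 WHEN id = 2 THEN 30 END
    // Returns null when no branch matches, since there is no ELSE clause.
    static Integer caseWhen(long id) {
        boolean isNull = true;
        int value = -1;
        do {
            if (id == 0) {
                isNull = false;
                value = 10;
                continue; // jumps to the loop test, which is false, so we exit
            }
            if (id == 1) {
                isNull = false;
                value = 20;
                continue;
            }
            if (id == 2) {
                isNull = false;
                value = 30;
                continue;
            }
        } while (false);
        return isNull ? null : Integer.valueOf(value);
    }

    public static void main(String[] args) {
        System.out.println(caseWhen(1));
        System.out.println(caseWhen(5));
    }
}
```

Compared with nested `if`/`else` blocks, this shape keeps every branch at the same nesting depth, which makes it straightforward to cut the sequence of branches into separate methods.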

cloud-fan · 2017-11-27T03:35:41Z

...talyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/conditionalExpressions.scala

-  extends CaseWhenBase(branches, elseValue) with Serializable {

  override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
-    // Generate code that looks like:


shall we keep this comment and update it?

I don't think it is necessary since now the generated code is way easier and more standard and nowhere else a comment like this is provided. Anyway, if you feel it is needed, I can add it.

cloud-fan · 2017-11-27T03:35:58Z

LGTM except a few minor comments

kiszk · 2017-11-27T07:43:23Z

...talyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/conditionalExpressions.scala

+        allConditions.mkString("\n")
+      } else {
+        ctx.splitExpressions(allConditions, "caseWhen",
+          ("InternalRow", ctx.INPUT_ROW) :: (ctx.JAVA_BOOLEAN, conditionMet) :: Nil,


Do we need to pass conditionMet as an argument? I think conditionMet is always false when a function is called.
If true, we can declare conditionMet as a local variable.

After the latest changes, conditionMet can be changed to a local variable.

SparkQA · 2017-11-27T08:05:01Z

Test build #84206 has finished for PR 19752 at commit f4c7896.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2017-11-27T11:58:39Z

...talyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/conditionalExpressions.scala

+              s"""
+                ${ctx.JAVA_BOOLEAN} $conditionMet = false;
+                $func
+                return $conditionMet;


shall we apply the same do while optimization here? i.e.

do {
  conditionMet = caseWhen_1(i);
  if (conditionMet) { continue; }
} while (false);

boolean caseWhen_1(InternalRow i) {
  do {
    if (!conditionMet) {
      code...
      set value and isnull...
      $conditionMet = true;
      continue;
    }
  }
}

I think this would complicate the code, and I don't think it is worth it: if the code is not split, it means that we don't have many conditions, so we would save only a few if (conditionMet) evaluations. What do you think?

I think in most cases we just split the codes into a few methods, which means, it's more important to apply the do while optimization inside the method(a method may have a lot of conditions checks), than between the methods.

yes, but having this optimization outside means skipping whole methods. Anyway, if you think that this optimization is needed I can do it. I think only that the code readability would be a bit worse but I'll try to address this problem with comments.

cloud-fan · 2017-11-27T12:00:48Z

...talyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/conditionalExpressions.scala

+        //     continue;
+        //   }
+        //   ...
+        // } while (false);


Can we simplify this to

do {
  if (caseWhen_1(i)) { continue; }
  if (caseWhen_2(i)) { continue; }
}

No, because this way we would not set conditionMet. That can cause an error when we have so many functions that their invocations go beyond the 64KB limit. #19480 fixes that problem by creating a method which calls the methods of that class. In that case, if we don't set conditionMet properly, we would have a bug in the implementation.

I don't get it, are you saying

do {
  if (caseWhen_1(i)) { continue; }
  if (caseWhen_2(i)) { continue; }
}

is worse than

do {
  condition = caseWhen_1(i);
  if (condition) { continue; }
  condition = caseWhen_2(i);
  if (condition) { continue; }
}

?

I am saying that the first one is wrong because it doesn't set condition (which can be returned at the end of the method as of #19480), so it may return wrong results.

I'll try to explain with an example. If we have a lot of methods, this can exceed 64KB:

do {
  condition = caseWhen_1(i);
  if (condition) { continue; }
  condition = caseWhen_2(i);
  if (condition) { continue; }
  // a lot of other methods here
}

Thus, #19480 can split them into:

do {
  condition = bunchOf_caseWhen_1(i);
  if (condition) { continue; }
  condition = bunchOf_caseWhen_2(i);
  if (condition) { continue; }
  // maybe some other here
} while (false);
...
InnerClass1 {
  boolean bunchOf_caseWhen_1(InternalRow i) {
    boolean condition = false;
    do {
      condition = caseWhen_1(i);
      if (condition) { continue; }
      condition = caseWhen_2(i);
      if (condition) { continue; }
      // a lot of other methods here
    } while (false);
    return condition;
  }
  ...
}

in this case, the implementation you are suggesting can return a wrong value in bunchOf_caseWhen_1 and this will affect the correctness of the code.

ah i see, we have to set the condition if the code is inside a method.

do {
  if (bunchOf_caseWhen_1(i)) { continue; }
  if (bunchOf_caseWhen_2(i)) { continue; }
  // maybe some other here
} while (false);
...
InnerClass1 {
  boolean bunchOf_caseWhen_1(InternalRow i) {
    do {
      if (caseWhen_1(i)) { return true; }
      if (caseWhen_2(i)) { return true; }
      // a lot of other methods here
    } while (false);
    return false;
  }
  ...
}

This would be optimal, but it adds too much complexity for only a little gain.

This would require refactoring the splitExpressions method, which is really not worth it IMHO. I will leave this as it is now, while I address your other comment. Thanks.
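
The correctness concern in this thread can be reproduced in plain Java. The sketch below assumes, as described above for #19480, that a wrapper method is generated mechanically around the same loop body and ends with `return condition;`. All names (`WrapperSketch`, `caseWhen1`, `bunchCorrect`, `bunchBroken`, and so on) are hypothetical. The broken variant never assigns the flag, so the wrapper reports "no match" and a later branch overwrites the result.

```java
// Hypothetical illustration of why the flag must be assigned inside the body.
class WrapperSketch {
    static int value = 0;

    // Stand-ins for split branch functions: set the result and report a match.
    static boolean caseWhen1(int id) { if (id == 1) { value = 100; return true; } return false; }
    static boolean caseWhen2(int id) { if (id >= 1) { value = 200; return true; } return false; }
    static boolean caseWhen3(int id) { if (id >= 0) { value = 300; return true; } return false; }

    // Wrapper around a body that stores each result in the flag.
    static boolean bunchCorrect(int id) {
        boolean condition = false;
        do {
            condition = caseWhen1(id);
            if (condition) { continue; }
            condition = caseWhen2(id);
            if (condition) { continue; }
        } while (false);
        return condition;
    }

    // Wrapper around the "simplified" body: the flag is never assigned,
    // so this method always returns false, even after a match.
    static boolean bunchBroken(int id) {
        boolean condition = false;
        do {
            if (caseWhen1(id)) { continue; }
            if (caseWhen2(id)) { continue; }
        } while (false);
        return condition;
    }

    static int evalCorrect(int id) {
        value = 0;
        do {
            if (bunchCorrect(id)) { continue; }
            if (caseWhen3(id)) { continue; }
        } while (false);
        return value;
    }

    static int evalBroken(int id) {
        value = 0;
        do {
            if (bunchBroken(id)) { continue; }
            if (caseWhen3(id)) { continue; } // runs even though a branch matched
        } while (false);
        return value;
    }

    public static void main(String[] args) {
        System.out.println("correct: " + evalCorrect(1));
        System.out.println("broken:  " + evalBroken(1));
    }
}
```

The variant with `return true;` inside the wrapper, proposed later in the thread, would avoid the flag entirely, but it requires the splitting logic to know it is emitting a boolean-returning method, which is the refactoring judged not worth it here.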

SparkQA · 2017-11-27T12:07:08Z

Test build #84212 has finished for PR 19752 at commit 5adb513.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-11-27T18:44:39Z

Test build #84224 has finished for PR 19752 at commit c7347b1.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-11-27T22:57:15Z

Test build #84229 has finished for PR 19752 at commit dd5f455.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2017-11-27T23:46:40Z

thanks, merging to master!

## What changes were proposed in this pull request?

A minor cleanup for apache#19752. Remove the outer if as the code is inside `do while`.

## How was this patch tested?

existing tests

Author: Wenchen Fan

Closes apache#19830 from cloud-fan/minor.

## What changes were proposed in this pull request?

In many parts of the code generation codebase, we split the code to avoid exceptions due to the 64KB method size limit. This generates a lot of methods which are called every time, even though sometimes this is not needed. As pointed out here: apache#19752 (comment), this is a non-negligible overhead which can be avoided.

The PR applies the same approach used in apache#19752 to the other places where this was feasible.

## How was this patch tested?

existing UTs.

Author: Marco Gaido

Closes apache#19860 from mgaido91/SPARK-22669.

[SPARK-22520][SQL] Support code generation for large CaseWhen

98eaae9

gatorsmile reviewed Nov 22, 2017

adding test case

6225c8e

gatorsmile reviewed Nov 26, 2017

mgaido91 added 2 commits November 26, 2017 22:15

review comments

f9c20be

change description example

9063583

cloud-fan reviewed Nov 27, 2017

mgaido91 added 2 commits November 27, 2017 08:06

Merge remote-tracking branch 'apache/master' into SPARK-22520

c7f0a92

nit: remove useless init empty string

f4c7896

kiszk reviewed Nov 27, 2017

kiszk mentioned this pull request Nov 27, 2017

[SPARK-21413][SQL] Fix 64KB JVM bytecode limit problem in multiple projections with CASE WHEN #18641

Closed

making conditionMet a local variable

5adb513

cloud-fan reviewed Nov 27, 2017

mgaido91 added 2 commits November 27, 2017 17:23

implement do while optimization also inside the methods

6b280fd

minor: test style warn

c7347b1

fix bug

dd5f455

asfgit closed this in 087879a Nov 27, 2017

cloud-fan mentioned this pull request Nov 28, 2017

[SPARK-22520][SQL][followup] remove outer if for case when codegen #19830

Closed

mgaido91 mentioned this pull request Dec 1, 2017

[SPARK-22669][SQL] Avoid unnecessary function calls in code generation #19860

Closed
