-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-22520][SQL] Support code generation for large CaseWhen #19752
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #83866 has finished for PR 19752 at commit
|
|
@kiszk may I kindly ask you to review this please? Thanks. |
|
@kiszk actually checking your PR I think that the same issue addressed there would be handled also here by default. What do you think? |
| val casesCode = if (ctx.INPUT_ROW == null || ctx.currentVars != null) { | ||
| cases.mkString("\n") | ||
| } else { | ||
| ctx.splitExpressions(cases, "caseWhen", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In almost all the cases, we do not need to call splitExpressions after merging the PR #19767?
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that we need to call it, indeed, as explained in this comment: #19767 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But I think that this implicitly covers also #18641, even though its main goal is another.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then, could you show us a test case? Can be a performance test if the function is hard to hit a 64KB limit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can reuse the same UT added in #18641 for the 64KB limit, if it is ok for you.
As far as the performance is regarded, I'd need to create a test with many rows, otherwise the overhead is higher than the execution time and such a case is not going to finish in few seconds. Do you want me to post here in the PR some code and the times of execution before and after the PR, without adding it as a test?
|
@gatorsmile I added a test case to check that the execution plan is before this PR, it takes on average 1091.690996ms. After the PR, it takes on average 106.894443ns. Actually there is a problem which is fixed in #18641 and it is not fixed here, ie. when the code contains deeply nested exceptions, the 64KB limit exception can still happen. But this should be handled in a more generic way in #19813. @kiszk What do you think? |
|
Test build #84168 has finished for PR 19752 at commit
|
| """ | ||
| }, | ||
| foldFunctions = { funcCalls => | ||
| funcCalls.map(funcCall => s"$conditionMet = $funcCall;").mkString("\n") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When caseWhenConditionMet is false, we do not need to call funcCall .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are right, but we already have checks about it inside the functions. Do we need also to check it outside?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We want to avoid the extra function calls here. It is not cheap when the number of rows is large. Now, we split the functions pretty aggressively. I saw many new functions are generated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I'll do. Then I'd suggest to do the same also in other places. I can check where an analogous pattern is used and create a PR if it is ok.
| // } | ||
| // } | ||
| // } | ||
| val conditionMet = ctx.freshName("caseWhenConditionMet") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a comment to explain what it is.
| // } | ||
| // } | ||
| val conditionMet = ctx.freshName("caseWhenConditionMet") | ||
| ctx.addMutableState("boolean", ev.isNull, "") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ctx.JAVA_BOOLEAN
| allConditions.mkString("\n") | ||
| } else { | ||
| ctx.splitExpressions(allConditions, "caseWhen", | ||
| ("InternalRow", ctx.INPUT_ROW) :: ("boolean", conditionMet) :: Nil, returnType = "boolean", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ctx.JAVA_BOOLEAN
| val code = if (ctx.INPUT_ROW == null || ctx.currentVars != null) { | ||
| allConditions.mkString("\n") | ||
| } else { | ||
| ctx.splitExpressions(allConditions, "caseWhen", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Style issue. Indent
| 1 | ||
| > SELECT CASE WHEN 1 < 0 THEN 1 WHEN 2 > 0 THEN 2.0 ELSE 1.2 END; | ||
| 2 | ||
| > SELECT CASE WHEN 1 < 0 THEN 1 WHEN 2 < 0 THEN 2.0 ELSE null END; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SELECT CASE WHEN 1 < 0 THEN 1 WHEN 2 < 0 THEN 2.0 ELSE null END;
->
SELECT CASE WHEN 1 < 0 THEN 1 WHEN 2 < 0 THEN 2.0 END;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you double check Hive returns NULL in the following case?
SELECT CASE WHEN 1 < 0 THEN 1 WHEN 2 < 0 THEN 2.0 END;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can follow your first suggestion and I can test this on hive, but actually I haven't changed this part of code. I will post ASAP the result in Hive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I confirm that Hive returns NULL. Then I am updating the description as requested.
| expr2 = expr2.when($"id" === lit(i + 10), i) | ||
| } | ||
| val df = spark.range(1).select(expr1, expr2.otherwise(0)) | ||
| df.show |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
compare the results?
|
Test build #84198 has finished for PR 19752 at commit
|
|
Test build #84199 has finished for PR 19752 at commit
|
|
LGTM cc @cloud-fan @kiszk |
| // evaluates to `true` is met and therefore is not needed to go on anymore on the computation | ||
| // of the following conditions. | ||
| val conditionMet = ctx.freshName("caseWhenConditionMet") | ||
| ctx.addMutableState(ctx.JAVA_BOOLEAN, ev.isNull, "") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: ctx.addMutableState(ctx.JAVA_BOOLEAN, ev.isNull), as empty string is the default value of the 3rd parameter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, I branched from a version when there was no default value. I merged and fixed it.
| if ($conditionMet) { | ||
| continue; | ||
| }""" | ||
| }.mkString("do {", "", "\n} while (false);") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: do we need a \n after do {?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, since there is a newline at the beginning of each expression.
| extends CaseWhenBase(branches, elseValue) with Serializable { | ||
|
|
||
| override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = { | ||
| // Generate code that looks like: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we keep this comment and update it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it is necessary since now the generated code is way easier and more standard and nowhere else a comment like this is provided. Anyway, if you feel it is needed, I can add it.
|
LGTM except a few minor comments |
| allConditions.mkString("\n") | ||
| } else { | ||
| ctx.splitExpressions(allConditions, "caseWhen", | ||
| ("InternalRow", ctx.INPUT_ROW) :: (ctx.JAVA_BOOLEAN, conditionMet) :: Nil, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to pass conditionMet as an argument? I think conditionMet is always false when a function is called.
If true, we can declare conditionMet as a local variable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After the latest changes, conditionMet can be changed to a local variable.
|
Test build #84206 has finished for PR 19752 at commit
|
| s""" | ||
| ${ctx.JAVA_BOOLEAN} $conditionMet = false; | ||
| $func | ||
| return $conditionMet; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we apply the same do while optimization here? i.e.
do {
conditionMet = caseWhen_1(i);
if(conditionMet) {
continue;
}
} while (false)
boolean caseWhen_1(IntenralRow i) {
do {
if (!conditionMet) {
code...
set value and isnull...
$conditionMet = true;
continue;
}
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this would complicate the code and I don't think it is worth, since if the code is not split, it means that we don't have many conditions, thus we would save only few if (conditionMet) evaluations. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in most cases we just split the codes into a few methods, which means, it's more important to apply the do while optimization inside the method(a method may have a lot of conditions checks), than between the methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, but having this optimization outside means skipping whole methods. Anyway, if you think that this optimization is needed I can do it. I think only that the code readability would be a bit worse but I'll try to address this problem with comments.
| // continue; | ||
| // } | ||
| // ... | ||
| // } while (false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we simplified to
do {
if (caseWhen_1(i)) {
continue;
}
if (caseWhen_2(i)) {
continue;
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, because in this way we would not set conditionMet. This can cause an error, in the case we have so many functions that their invocation goes beyond the 64KB limit: this problem is fixed by #19480, by creating a method which calls the methods of that class, In this case, if we don't set properly conditionMet, we would have a bug in the implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't get it, are you saying
do {
if (caseWhen_1(i)) {
continue;
}
if (caseWhen_2(i)) {
continue;
}
}
is worse than
do {
condition = caseWhen_1(i);
if (condition) {
continue;
}
condition = caseWhen_2(i);
if (condition) {
continue;
}
}
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am saying that the first one is wrong since it doesn't set condition (which can be returned at the end of the method as of #19480), since it may return wrong results.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll try to explain with an example. If we have a lot of methods, this can exceed 64KB:
do {
condition = caseWhen_1(i);
if (condition) {
continue;
}
condition = caseWhen_2(i);
if (condition) {
continue;
}
// a lot of other methods here
}
Thus, #19480 can split them into:
do {
condition = bunchOf_caseWhen_1(i);
if (condition) {
continue;
}
condition = bunchOf_caseWhen_2(i);
if (condition) {
continue;
}
// maybe some other here
} while (false);
...
InnerClass1 {
boolean bunchOf_caseWhen_1(InternalRow i) {
boolean condition = false;
do {
condition = caseWhen_1(i);
if (condition) {
continue;
}
condition = caseWhen_2(i);
if (condition) {
continue;
}
// a lot of other methods here
} while (false);
return condition;
}
...
}
in this case, the implementation you are suggesting can return a wrong value in bunchOf_caseWhen_1 and this will affect the correctness of the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah i see, we have to set the condition if the code is inside a method.
do {
if (bunchOf_caseWhen_1(i)) {
continue;
}
if (bunchOf_caseWhen_2(i)) {
continue;
}
// maybe some other here
} while (false);
...
InnerClass1 {
boolean bunchOf_caseWhen_1(InternalRow i) {
do {
if (caseWhen_1(i)) {
return true;
}
if (caseWhen_2(i)) {
return true;
}
// a lot of other methods here
} while (false);
return false;
}
...
}
This would be the optimal but it's too much complexity with only a little gain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this would require a refactoring of the splitExpressions method which is really not worth IMHO. I will leave this as it is now, while I am addressing your other comment, thanks.
|
Test build #84212 has finished for PR 19752 at commit
|
|
Test build #84224 has finished for PR 19752 at commit
|
|
Test build #84229 has finished for PR 19752 at commit
|
|
thanks, merging to master! |
## What changes were proposed in this pull request? a minor cleanup for apache#19752 . Remove the outer if as the code is inside `do while` ## How was this patch tested? existing tests Author: Wenchen Fan <[email protected]> Closes apache#19830 from cloud-fan/minor.
## What changes were proposed in this pull request? In many parts of the codebase for code generation, we are splitting the code to avoid exceptions due to the 64KB method size limit. This is generating a lot of methods which are called every time, even though sometime this is not needed. As pointed out here: apache#19752 (comment), this is a not negligible overhead which can be avoided. The PR applies the same approach used in apache#19752 also to the other places where this was feasible. ## How was this patch tested? existing UTs. Author: Marco Gaido <[email protected]> Closes apache#19860 from mgaido91/SPARK-22669.
What changes were proposed in this pull request?
Code generation is disabled for CaseWhen when the number of branches is higher than
spark.sql.codegen.maxCaseBranches(which defaults to 20). This was done to prevent the well known 64KB method limit exception.This PR proposes to support code generation also in those cases (without causing exceptions of course). As a side effect, we could get rid of the
spark.sql.codegen.maxCaseBranchesconfiguration.How was this patch tested?
existing UTs