-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-28916][SQL] Split subexpression elimination functions code for Generate[Mutable|Unsafe]Projection #25642
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #109992 has finished for PR 25642 at commit
|
| } | ||
| } | ||
|
|
||
| test("SPARK-28916: subexrepssion elimination can cause 64kb code limit") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how long does this test run? We can write a unit test instead if the end-to-end test is too expensive to run.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, it takes about 3 mins. I'll try and find a way to create a UT...
| * Returns the code for subexpression elimination after splitting it if necessary. | ||
| */ | ||
| def subexprFunctionsCode: String = { | ||
| // Wholestage codegen does not allow subexpression elimination |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we add assert(currentVars == null) to guarantee it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @maropu , I vaguely remember that we do subexpression elimination in hash aggregate in one of your PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added it, thanks. Anyway, when we generate subexpression elimination functions they all take an InternalRow as input. Unless that mechanism changes, I doubt it is usable in wholestage codegen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we add
assert(currentVars == null)to guarantee it?
I cannot do that, it doesn't work. In that case just an empty string is returned because the subexpression functions seq is empty.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If so, how about assert(currentVars != null && subexprFunctions.isEmpty) for strict checks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can add that, thanks
|
@mgaido91 thanks for the fix! One question: AFAIK we will fallback to interpreted code path if codegen fails compilation. Why it doesn't work here? |
It does work. We could see this as an improvement, indeed. |
|
Test build #110006 has finished for PR 25642 at commit
|
| * Returns the code for subexpression elimination after splitting it if necessary. | ||
| */ | ||
| def subexprFunctionsCode: String = { | ||
| // Wholestage codegen does not allow subexpression elimination |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you elaborate on your thought on this comment?
IMHO, this change is not directly related to whether wholestage codegen performs subexpression elimination by controlling in other places.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason of this comment is that here we are passing the InternalRow to the variables and not the currentVars which would be needed in whole-stage codegen. This is done because in subexpression elimination we always put an InternalRow as argument. For this reason, whole-stage codegen disables subexpression elimination.
So I put here this comment in order to highlight that if in the future there will be a work to support subexpression elimination also in wholestage codegen, we need to modify this method too.
|
retest this please |
|
Test build #110010 has finished for PR 25642 at commit
|
| * Returns the code for subexpression elimination after splitting it if necessary. | ||
| */ | ||
| def subexprFunctionsCode: String = { | ||
| // Wholestage codegen does not allow subexpression elimination: in that case, subexprFunctions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is not true, see subexpressionEliminationForWholeStageCodegen in this class. @mgaido91 can you doule check to see if this fix works for whole-stage-codegen as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like whole-stage-codegen has the same issue, but unfortunately splitExpressionsWithCurrentInputs is not completed yet.
I think the right description here is: whole-stage-codegen supports subexpression elimination, but we are not able to split the code for it yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan yes, sorry, this is not very well explained. I mean the wholestage codegen does not honor spark.sql.subexpressionElimination.enabled and even though it is true, it doesn't use subexprFunctions. So in this method we don't have to deal with wholestage codegen. Do you have suggestion on rewording this comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about whole-stage-codegen supports subexpression elimination, and is handled by another code path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
currently looks better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks ok, too.
|
Test build #110013 has finished for PR 25642 at commit
|
|
Test build #110019 has finished for PR 25642 at commit
|
| df.createOrReplaceTempView("spark64kb") | ||
| val data = spark.sql("select * from spark64kb limit 10") | ||
| // This fails if 64Kb limit is reached in code generation | ||
| data.describe() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we disable fallback to interpreter in this test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need to revisit it according to Wenchen's suggestion... unfortunately this might take some time as I am a bit busy these days...
| test("SPARK-28916: subexrepssion elimination can cause 64kb code limit") { | ||
| val df = spark.range(2).selectExpr((0 to 5000).map(i => s"id as field_$i"): _*) | ||
| df.createOrReplaceTempView("spark64kb") | ||
| val data = spark.sql("select * from spark64kb limit 10") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a test for GenerateMutableProjection? How about the case GenerateUnsafeProjection?
| */ | ||
| def subexprFunctionsCode: String = { | ||
| // Whole-stage codegen's subexpression elimination is handled in another code path | ||
| splitExpressions(subexprFunctions, "subexprFunc_split", Seq("InternalRow" -> INPUT_ROW)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to check if its empty?
if (subexprFunctions.nonEmpty) {
splitExpressions(...
} else {
""
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not necessary: if it is empty splitExpressions would return an empty string, anyway I can add it if you think it is more clear
|
btw, (this is a off-topic though) the HashAggregateExec code for common subexpr elimination has the same issue? That also expands all generated the code for CSE in a single method now; |
|
To make the title more precise, how about adding |
|
The PR as is it now is one level better than the status quo. That's probably good enough. But I was curious whether or not it makes more sense to perform a tree-splitting instead of a fixed-level splitting. Basically, into something like the following, assuming the split threshold is an imaginary Now, given that we assume the split threshold is Instead, it'd be really nice if the Doing so would help us be more likely to cap the codegen method size below not only 64KB but also some lower thresholds like 8KB. |
|
@rednaxelafx I had a similar thought in my mind. I was optimistic that HotSpot compiler can apply inlining as possible. It would be great if |
@maropu I think the point there is that in case of wholestage codegen we are not able to split things at the moment, so I am not sure whether is is possible doing anything better there. |
@rednaxelafx actually we are doing something similar to what you are suggesting. We already have a 2 or more levels splitting feature. The point is that the splitting point is given by the inner classes. We can say that now we assume that the number of function calls which fits in a single inner classes are a safe threshold for the number of function calls inside a specific function. This is not exactly what you are proposing, as it is not driven by the method size conf, but it is close. For more details please check |
d23dee9 to
2d4b8f8
Compare
|
Test build #110259 has finished for PR 25642 at commit
|
|
retest this please |
|
Test build #110267 has finished for PR 25642 at commit
|
|
thanks, merging to master! |
… Generate[Mutable|Unsafe]Projection ### What changes were proposed in this pull request? The PR proposes to split the code for subexpression elimination before inlining the function calls all in the apply method for `Generate[Mutable|Unsafe]Projection`. ### Why are the changes needed? Before this PR, code generation can fail due to the 64KB code size limit if a lot of subexpression elimination functions are generated. The added UT is a reproducer for the issue (thanks to the JIRA reporter and HyukjinKwon for it). ### Does this PR introduce any user-facing change? No. ### How was this patch tested? added UT Closes apache#25642 from mgaido91/SPARK-28916. Authored-by: Marco Gaido <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
… Generate[Mutable|Unsafe]Projection The PR proposes to split the code for subexpression elimination before inlining the function calls all in the apply method for `Generate[Mutable|Unsafe]Projection`. Before this PR, code generation can fail due to the 64KB code size limit if a lot of subexpression elimination functions are generated. The added UT is a reproducer for the issue (thanks to the JIRA reporter and HyukjinKwon for it). No. added UT Closes apache#25642 from mgaido91/SPARK-28916. Authored-by: Marco Gaido <[email protected]> Signed-off-by: Wenchen Fan <[email protected]> (cherry picked from commit c411579) (cherry picked from commit 50c88609dc7e7dea6f747f4edc7b6349c0e2d644)
What changes were proposed in this pull request?
The PR proposes to split the code for subexpression elimination before inlining the function calls all in the apply method for
Generate[Mutable|Unsafe]Projection.Why are the changes needed?
Before this PR, code generation can fail due to the 64KB code size limit if a lot of subexpression elimination functions are generated. The added UT is a reproducer for the issue (thanks to the JIRA reporter and @HyukjinKwon for it).
Does this PR introduce any user-facing change?
No.
How was this patch tested?
added UT