-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-18395][SQL] Evaluate common subexpression like lazy variable with a function approach #15837
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Part of Generated codes: |
|
Test build #68445 has finished for PR 15837 at commit
|
c0153a6 to
63accf8
Compare
|
Test build #68450 has finished for PR 15837 at commit
|
|
Test build #68452 has finished for PR 15837 at commit
|
|
Test build #68455 has finished for PR 15837 at commit
|
|
ping @cloud-fan @hvanhovell @kiszk |
|
ping @cloud-fan @hvanhovell @kiszk Can you help review this? Thanks. |
|
re-ping @cloud-fan @kiszk @hvanhovell |
|
Test build #68967 has finished for PR 15837 at commit
|
|
Sorry for being late while I was travelling. |
|
Sorry for the delay, but I may not have time to review it before the 2.1 release, can you hold it off until 2.1 release? thanks! |
|
@cloud-fan Sure, no problem. |
|
@kiszk Do you mean to avoid subexpression elimination? |
|
IMHO, it would be good to stop applying subexpression elimination to a very simple expression (e.g. |
|
@kiszk I am hesitant to set a threshold for that. One reason is it will increase the complexity of subexpression elimination implementation and the benefit can be low or uncertain. Another reason is how we set up a good threshold. |
|
ping @cloud-fan Since 2.1 is released now, can you have time to review this? Thanks. |
|
Test build #71240 has finished for PR 15837 at commit
|
|
I'm worried about performance, now we have an extra function call and a if branch to retrieve the result of common subexpressions. Maybe we should just disable subexpression elimination for branch expressions like if and case when. |
…tional expressions ## What changes were proposed in this pull request? As I pointed out in #15807 (comment) , the current subexpression elimination framework has a problem, it always evaluates all common subexpressions at the beginning, even they are inside conditional expressions and may not be accessed. Ideally we should implement it like scala lazy val, so we only evaluate it when it gets accessed at lease once. #15837 tries this approach, but it seems too complicated and may introduce performance regression. This PR simply stops common subexpression elimination for conditional expressions, with some cleanup. ## How was this patch tested? regression test Author: Wenchen Fan <[email protected]> Closes #16659 from cloud-fan/codegen.
|
Close this as alternative one #16659 is merged. |
…tional expressions ## What changes were proposed in this pull request? As I pointed out in apache#15807 (comment) , the current subexpression elimination framework has a problem, it always evaluates all common subexpressions at the beginning, even they are inside conditional expressions and may not be accessed. Ideally we should implement it like scala lazy val, so we only evaluate it when it gets accessed at lease once. apache#15837 tries this approach, but it seems too complicated and may introduce performance regression. This PR simply stops common subexpression elimination for conditional expressions, with some cleanup. ## How was this patch tested? regression test Author: Wenchen Fan <[email protected]> Closes apache#16659 from cloud-fan/codegen.
…tional expressions ## What changes were proposed in this pull request? As I pointed out in apache#15807 (comment) , the current subexpression elimination framework has a problem, it always evaluates all common subexpressions at the beginning, even they are inside conditional expressions and may not be accessed. Ideally we should implement it like scala lazy val, so we only evaluate it when it gets accessed at lease once. apache#15837 tries this approach, but it seems too complicated and may introduce performance regression. This PR simply stops common subexpression elimination for conditional expressions, with some cleanup. ## How was this patch tested? regression test Author: Wenchen Fan <[email protected]> Closes apache#16659 from cloud-fan/codegen.
What changes were proposed in this pull request?
As per the discussion at #15807, we need to change the way of subexpression elimination.
In current approach, common subexpressions are evaluated no matter they are really used or not later. E.g., in the following generated codes,
subexpr2is evaluated even only theifbranch is run.Besides possible performance regression, the expression like
AssertNotNullcan throw exception. So wrongly evaluatingsubexpr2will throw exception unexceptedly..With this patch, now common subexpressions are not evaluated until they are used. We create a function for each common subexpression which evaluates and stores the result as a member variable. We have an initialization status variable to record whether the subexpression is evaluated.
Thus, when an expression using the subexpression is going to be evaluated, we check if the subexpression is initialized, if yes directly returning the result, if no call the function to evaluate it.
How was this patch tested?
Jenkins tests.
Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before opening a pull request.