[SPARK-22747][SQL] Localize lifetime of mutable states in HashAggregateExec #19938
|
Test build #84691 has finished for PR 19938 at commit
|
|
Jenkins, retest this please
|
Test build #84694 has finished for PR 19938 at commit
|
|
IIUC, the problem only happens when we wrongly pass global variables into split functions and change their values. Will we change the result variables from aggregation? I think parent operators just use the variables when evaluating expressions and won't change the values, will they?
|
Yes, the problem would happen only when we pass global variables into split functions.
|
They may be passed into split functions, but it is hard to imagine a case where we would change their values there. In Spark SQL, the output of a child operator is only used as input to evaluate new output in the parent operator; we don't use the output as mutable state. The possibly problematic case is when we create a global variable in an operator/expression codegen and use it to carry mutable state (e.g., the condition-met status in CaseWhen) during the evaluation of the op/expr. Then it is possible that we pass it to split functions and modify it there. If we don't change the values, it seems to me this change just creates redundant local variables. That doesn't cause any harm, so I have no strong opinion on this.
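To make the hazard described above concrete, here is a minimal sketch in plain Java. It is not Spark-generated code: the class, the field `conditionMet`, and the method names are hypothetical stand-ins for a global mutable state and a split function. It only illustrates that assigning to a parameter inside a split function does not update the field that was passed in.

```java
// Hypothetical illustration; not actual Spark-generated code.
class SplitFunctionHazard {
    // Stands in for a global mutable state declared by codegen.
    private boolean conditionMet = false;

    // Stands in for a split function that receives the global as a parameter.
    // The assignment updates only the parameter copy, never the field.
    private void evalBranch(boolean conditionMet, int input) {
        if (input > 0) {
            conditionMet = true; // lost: writes to the local parameter
        }
    }

    // Same logic, but the split function writes to the field directly.
    private void evalBranchOnField(int input) {
        if (input > 0) {
            this.conditionMet = true;
        }
    }

    boolean runPassingGlobal(int input) {
        conditionMet = false;
        evalBranch(conditionMet, input);
        return conditionMet;
    }

    boolean runUsingField(int input) {
        conditionMet = false;
        evalBranchOnField(input);
        return conditionMet;
    }

    public static void main(String[] args) {
        SplitFunctionHazard h = new SplitFunctionHazard();
        System.out.println(h.runPassingGlobal(5)); // prints "false"
        System.out.println(h.runUsingField(5));    // prints "true"
    }
}
```

If the split function only reads its parameters, as parent operators normally do with child output, both variants behave the same, which matches the point that the issue needs a write inside the split function.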
|
Thank you for the great thoughts. Let me think about it.
|
Actually, there is one real problem: after we fold many global variables into an array, the variable name may become something like … Localizing the global variables in the current expression/operator is one solution; another is generating parameter names instead of reusing the input variable name.
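The second option, generating fresh parameter names, could look roughly like the sketch below. Everything in it is an assumption made for illustration: the helper class, the `_arg<i>` naming scheme, and the sample input `mutableStateArray[0]` (a hypothetical expression that is not a valid Java identifier) are not taken from Spark's code generator. The idea is that the declaration uses generated names while the call site passes the original input expressions unchanged.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; not part of Spark's codegen API.
class ParamNameSketch {
    // Builds a split-function signature with a fresh name per parameter,
    // so the declaration never depends on how the input is spelled.
    static String signature(String funcName, List<String> javaTypes) {
        List<String> params = new ArrayList<>();
        for (int i = 0; i < javaTypes.size(); i++) {
            params.add(javaTypes.get(i) + " " + funcName + "_arg" + i);
        }
        return "private void " + funcName + "(" + String.join(", ", params) + ")";
    }

    // Builds the call site, passing the input expressions as they are.
    static String call(String funcName, List<String> inputExprs) {
        return funcName + "(" + String.join(", ", inputExprs) + ");";
    }

    public static void main(String[] args) {
        List<String> types = Arrays.asList("long", "boolean");
        // "mutableStateArray[0]" is a made-up example of an array-backed state.
        List<String> inputs = Arrays.asList("mutableStateArray[0]", "agg_bufIsNull");
        System.out.println(signature("project_doConsume", types));
        System.out.println(call("project_doConsume", inputs));
    }
}
```

With this approach the body of the split function has to refer to the generated parameter names rather than the input names, which is the extra bookkeeping compared with localizing the variables first.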
|
Good point.
|
This makes sense to me. Currently, based on the previous discussion, we are fixing code generation and inserting an assertion in #19865 to ensure no global variables are passed. Which one is the better solution?
|
Is this the only place where we need this localization (I mean, other operators don't need this logic)? I'm also neutral about this PR, though I'd feel better making this more general to avoid the same situation in other existing (and new) operators.
|
I prefer to generate new parameter name in |
|
Sure, in #19865, I will generate new parameter name in |
What changes were proposed in this pull request?
This PR localizes the lifetime of mutable states, which are used for `isNull` and `value` of aggregation results, in code generated by `HashAggregateExec`. These states are passed to successor operations through the `consume()` method. This may violate the assumption at #19865 when operations that use these variables are split. In the following example, `agg_localBufValue` and `agg_localBufisNull` are passed to a successor operation (projection). The lifetime of the mutable states `agg_bufValue` and `agg_bufIsNull` ends at Line 120.

This PR is based on @cloud-fan's suggestion.
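As a rough illustration of the localization idea (not the actual generated code, and not the example referred to above), the sketch below copies the mutable states into local variables before handing them to a successor. The variable names follow the description; the class, the `project` method, and the aggregation logic are hypothetical stand-ins.

```java
// Hypothetical illustration; not actual HashAggregateExec output.
class LocalizeSketch {
    // Stand-ins for the global mutable states of the aggregation buffer.
    private long agg_bufValue;
    private boolean agg_bufIsNull;

    // Stand-in for a successor operation (projection). It only sees
    // plain parameters, so it can be split without touching the fields.
    private long project(long value, boolean isNull) {
        return isNull ? -1L : value + 1;
    }

    long produce(long[] inputs) {
        agg_bufIsNull = true;
        agg_bufValue = 0L;
        for (long v : inputs) {
            agg_bufIsNull = false;
            agg_bufValue += v;
        }
        // The lifetime of the mutable states ends here: the results are
        // copied into locals, and only the locals are passed downstream.
        long agg_localBufValue = agg_bufValue;
        boolean agg_localBufisNull = agg_bufIsNull;
        return project(agg_localBufValue, agg_localBufisNull);
    }

    public static void main(String[] args) {
        System.out.println(new LocalizeSketch().produce(new long[] {1, 2, 3})); // prints "7"
    }
}
```

The cost is one extra local per state, which is the "redundant local variables" trade-off raised in the conversation.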
Without this PR
With this PR
How was this patch tested?
Existing test suites