[SPARK-31091] Revert SPARK-24640 Return NULL from size(NULL) by default
#27834
For the full picture, -1 may lead to wrong query results. Here is the example I got from @ssimeonov: "A client discovered this behavior while investigating post-click user engagement in their AdTech system. The schema was per ad placement and post-click user engagements were in an array of structs. The culprit was `df.groupBy('placementId).agg(sum(size('engagements)).as("engagement_count"), ...)`, which subtracted 1 for every click without post-click engagement. Luckily, the behavior led to negative engagement counts in some periods, which alerted them to the problem and this bizarre behavior."
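To make the aggregation problem concrete, here is a minimal sketch in plain Python rather than Spark. The helpers `size_legacy`, `size_null` and `sql_sum` are hypothetical stand-ins that only model the semantics discussed above (a -1 versus NULL result for a NULL array, and a SUM that skips NULLs); they are not Spark APIs, and the sample data is invented for illustration.

```python
def size_legacy(arr):
    """Model of size() that returns -1 for a NULL (None) array."""
    return -1 if arr is None else len(arr)


def size_null(arr):
    """Model of size() that returns NULL (None) for a NULL array."""
    return None if arr is None else len(arr)


def sql_sum(values):
    """Model of SQL SUM: NULL inputs are skipped; all-NULL input gives NULL."""
    non_null = [v for v in values if v is not None]
    return sum(non_null) if non_null else None


# Hypothetical engagements for one placement: two clicks have no
# post-click engagement, so their array is NULL.
engagements = [["view", "share"], None, None, ["view"]]

legacy_total = sql_sum(size_legacy(e) for e in engagements)
null_total = sql_sum(size_null(e) for e in engagements)

print(legacy_total)  # 1: each NULL array silently subtracts 1
print(null_total)    # 3: NULL arrays are ignored by the sum
```

With the -1 convention the total is silently too low by one per NULL array, which is how enough NULL rows can push a count negative.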
Test build #119465 has finished for PR 27834 at commit
Thanks @MaxGekk. Yes, after discovering this, we had to comb through our entire codebase to refactor Taking a step back, a
Thank you for pinging me, @cloud-fan.
I prefer the AS-IS status of 3.0, but this seems to fall into the same category where @marmbrus and @gatorsmile asked for a revert of the behavior fix at two-parameter Technically, two-parameter
dongjoon-hyun
left a comment
+1, LGTM. I'll leave this PR for the other committers' review especially @marmbrus .
dongjoon-hyun
left a comment
BTW, please hold on until the vote.
This doesn't seem to be just a cosmetic change, given that we have heard of an actual use case being affected. Returning -1 would be OK for non-aggregate use, since -1 can still be distinguished from valid values, but in aggregations -1 is silently treated as a "valid" value, which creates a correctness issue. That would require "pre-process" (having a column for the result of I feel we should have a clear reason for returning -1: what benefit do we get from it? At the very least there should be some kind of "trade-off" if we decide to keep it; if there isn't even a trade-off, it's clearly a correctness issue we should fix. @ssimeonov
@HeartSaVioR the workaround was simple, unpleasant and unstable:
As for the idea to have the
No, I didn't mean it as an idea. I used it to describe the issue that a NULL result wouldn't have.
Thank you all for making and reviewing this PR. Since the vote finished, I'll merge this.
…efault

### What changes were proposed in this pull request?
This PR reverts #26051 and #26066

### Why are the changes needed?
There is no standard requiring that `size(null)` must return null, and returning -1 looks reasonable as well. This is kind of a cosmetic change and we should avoid it if it breaks existing queries. This is similar to reverting TRIM function parameter order change.

### Does this PR introduce any user-facing change?
Yes, change the behavior of `size(null)` back to be the same as 2.4.

### How was this patch tested?
N/A

Closes #27834 from cloud-fan/revert.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 8efb710)
Signed-off-by: Dongjoon Hyun <[email protected]>
cc @rxin since he is a release manager for 3.0.0. (Also, cc @gatorsmile)
Btw, the configuration name How about renaming it as
For me, it looks better, @HeartSaVioR. Please make a PR for that.
Oh..
This has existed since the 2.4.x era, so it seems difficult to change.
Since it has already
Oh OK. I'm not sure I love to have One thing worth noting is that @ssimeonov wrote their own workaround even though the behavior can be changed simply by setting the legacy config. @ssimeonov, were you aware of the legacy config? If so, could you please elaborate on why you didn't use it? Thanks in advance!
We made our workaround long before this made it to Spark. After all the code was changed, it was easier to keep working in our own framework, ignoring all the mess of settings.
Ah OK, you were faster. Thanks for the info.
### What changes were proposed in this pull request?
Make `size(null)` return null under ANSI mode, regardless of the `spark.sql.legacy.sizeOfNull` config.

### Why are the changes needed?
In #27834, we change the result of `size(null)` to be -1 to match the 2.4 behavior and avoid breaking changes. However, it's true that the "return -1" behavior is error-prone when being used with aggregate functions.

The current ANSI mode controls a bunch of "better behaviors" like failing on overflow. We don't enable these "better behaviors" by default because they are too breaking. The "return null" behavior of `size(null)` is a good fit of the ANSI mode.

### Does this PR introduce any user-facing change?
No as ANSI mode is off by default.

### How was this patch tested?
new tests

Closes #27936 from cloud-fan/null.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
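The interaction between ANSI mode and the legacy config described in this follow-up can be sketched as a small decision function. This is a plain-Python model, not Spark's implementation: `size_of` and its parameters `ansi_enabled` and `legacy_size_of_null` are hypothetical names. The default for `legacy_size_of_null` is an assumption, chosen so that the default result is -1 as in 2.4.

```python
def size_of(arr, ansi_enabled=False, legacy_size_of_null=True):
    """Model of size() for a possibly-NULL (None) array.

    ansi_enabled        -- stands in for ANSI mode (off by default, per the PR).
    legacy_size_of_null -- stands in for spark.sql.legacy.sizeOfNull
                           (default here is an assumption).
    """
    if arr is not None:
        return len(arr)
    if ansi_enabled:
        # ANSI mode wins regardless of the legacy config.
        return None
    # Outside ANSI mode the legacy config decides between -1 and NULL.
    return -1 if legacy_size_of_null else None


print(size_of(None))                                        # -1
print(size_of(None, ansi_enabled=True))                     # None
print(size_of(None, legacy_size_of_null=False))             # None
print(size_of([1, 2], ansi_enabled=True))                   # 2
```

The point of ordering the checks this way is that turning ANSI mode on is enough to get the NULL result, even if the legacy config is left at its default.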
What changes were proposed in this pull request?
This PR reverts #26051 and #26066
Why are the changes needed?
There is no standard requiring that `size(null)` must return null, and returning -1 looks reasonable as well. This is kind of a cosmetic change and we should avoid it if it breaks existing queries. This is similar to reverting TRIM function parameter order change.
Does this PR introduce any user-facing change?
Yes, change the behavior of `size(null)` back to be the same as 2.4.
How was this patch tested?
N/A