
Conversation

@cloud-fan
Contributor

What changes were proposed in this pull request?

This PR reverts #26051 and #26066

Why are the changes needed?

There is no standard requiring that size(null) must return null, and returning -1 is also reasonable. The original change was essentially cosmetic, and we should avoid cosmetic changes that break existing queries. This is similar to reverting the TRIM function parameter-order change.

Does this PR introduce any user-facing change?

Yes, the behavior of size(null) is changed back to match 2.4.
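
For illustration, a minimal spark-shell sketch of the restored behavior (illustrative only, not a test; the legacy config mentioned later in this thread is assumed to be at its default):

```scala
// Spark 3.0 after this revert, same as 2.4: size of a NULL collection is -1, not NULL.
spark.sql("SELECT size(CAST(NULL AS ARRAY<INT>))").show()    // -1
spark.sql("SELECT size(CAST(NULL AS MAP<INT, INT>))").show() // -1 (size also applies to maps)
```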

How was this patch tested?

N/A

@cloud-fan
Contributor Author

cloud-fan commented Mar 6, 2020

@MaxGekk
Member

MaxGekk commented Mar 6, 2020

returning -1 looks reasonable as well.

For the full picture, -1 may lead to wrong query results. Here is an example I got from @ssimeonov:

"A client discovered this behavior while investigating post-click user engagement in their AdTech system. The schema was per ad placement and post-click user engagements were in an array of structs. The culprit was df.groupBy('placementId).agg(sum(size('engagements)).as("engagement_count"), ...), which subtracted 1 for every click without post-click engagement. Luckily, the behavior led to negative engagement counts in some periods, which alerted them to the problem and this bizarre behavior."

@SparkQA

SparkQA commented Mar 6, 2020

Test build #119465 has finished for PR 27834 at commit 9ebf183.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ssimeonov
Contributor

Thanks @MaxGekk. Yes, after discovering this, we had to comb through our entire codebase to refactor size(column) into column.size, adding our own implicit.

Taking a step back, a -1 return from a function reeks of non-SQL thinking by whoever implemented this in Hive originally. As Spark grows, its audience is expanding rapidly. We need to move away from random engineer decisions made many years ago in a different product towards what makes sense for the future (much larger) audience of Spark users.

@dongjoon-hyun
Member

Thank you for pinging me, @cloud-fan.

@dongjoon-hyun
Member

dongjoon-hyun commented Mar 6, 2020

I prefer the as-is status of 3.0, but this seems to fall into the same category as when @marmbrus and @gatorsmile asked us to revert the behavior fix for the two-parameter TRIM function.

Technically, two-parameter TRIM was a correctness fix. However, we reverted it and preserved the old behavior because two-parameter TRIM is not part of the SQL standard.

@dongjoon-hyun
Member

+1, LGTM. I'll leave this PR for the other committers' review, especially @marmbrus.

@dongjoon-hyun
Member

BTW, please hold on until the vote.

@HeartSaVioR
Contributor

This doesn't seem to be just a cosmetic change, given that we are hearing about actual use cases affected by it. It would be OK to return -1 in non-aggregate contexts, since -1 can still be distinguished from valid values, but in aggregations -1 is silently treated as a valid value, which causes a correctness issue. Avoiding that requires pre-processing (computing a column from size(col) and changing the value to NULL when it is -1, as sketched below) or some other hacky workaround.
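
For concreteness, one shape that pre-processing could take (a sketch with hypothetical names, not code from this PR):

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{lit, size, when}

// Map the -1 sentinel back to NULL so aggregates treat it as "unknown" again.
def sizeOrNull(c: Column): Column =
  when(size(c) === -1, lit(null)).otherwise(size(c))

// e.g. df.groupBy("placementId").agg(sum(sizeOrNull($"engagements")))
```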

I feel we should have a clear reason for returning -1, i.e. what benefit we get from it. At the very least it should be a trade-off if we decide to pick one behavior; if it isn't even a trade-off, it's clearly a correctness issue we should fix.

@ssimeonov
Could you please share the details of the workaround you used?

@ssimeonov
Contributor

@HeartSaVioR the workaround was simple, unpleasant and unstable:

  • Simple, because we realized that we should never use size() as its behavior is (a) inconsistent with SQL principles and (b) dangerous for analytics.

  • Unpleasant, because it involved replacing every single use of the function in our codebase with a Column implicit that we created (size(x) -> x.size); see the sketch after this list. This was easy for IDE-managed code but hard for notebook code. After consideration, we decided that our implicit would return 0 for null, because in every single use case we could think of for our data, we would be converting null returns to 0. (I would not recommend this for Spark itself, as the behavior is inconsistent with SQL.) This is also the reason why we went with an implicit as opposed to, say, using import order to hide functions.size() with our own size(). Caveat: if we had more people using Spark SQL, we would have gone with size0(x) and sizen(x) instead of an implicit, for versions that return 0 or null for null input, respectively.

  • Unstable, because there is no way to guarantee the problem will not happen again due to user error. It's part of our Spark DOs and DON'Ts not to use size(), but discipline alone does not offer guaranteed protection. The only way to guarantee the behavior is via a change to Spark. A settings-driven change is an excellent way to have the best of both worlds: sane and consistent behavior for future users and an easy backward-compatibility mode for others.
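
A sketch of what such an implicit might look like (names here are hypothetical, and returning 0 for null is the in-house choice described above, not a recommendation for Spark itself):

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{lit, when, size => sqlSize}

object ColumnSizeSyntax {
  // Hypothetical implicit: write x.size instead of size(x); null input yields 0.
  implicit class SizeOps(private val c: Column) extends AnyVal {
    def size: Column = when(c.isNull, lit(0)).otherwise(sqlSize(c))
  }
}

// usage:
//   import ColumnSizeSyntax._
//   df.agg(sum($"engagements".size))
```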

As for the idea to have the size() behavior differ for aggregation vs. non-aggregation processing, I think this would be a bad pattern to introduce into an already very complex system such as Spark.

@HeartSaVioR
Contributor

As for the idea to have the size() behavior differ for aggregation vs. non-aggregation processing

No, I didn't mean it as a proposal. I used it to describe an issue that returning NULL would not have.

@dongjoon-hyun changed the title from "revert [SPARK-24640][SQL] Return NULL from size(NULL) by default" to "[SPARK-31091] Revert SPARK-24640 Return NULL from size(NULL) by default" on Mar 9, 2020
@dongjoon-hyun
Member

Thank you all for making and reviewing this PR. Since the vote finished, I'll merge this.

dongjoon-hyun pushed a commit that referenced this pull request Mar 11, 2020
[SPARK-31091] Revert SPARK-24640 Return NULL from size(NULL) by default

### What changes were proposed in this pull request?

This PR reverts #26051 and #26066

### Why are the changes needed?

There is no standard requiring that `size(null)` must return null, and returning -1 looks reasonable as well. This is kind of a cosmetic change and we should avoid it if it breaks existing queries. This is similar to reverting TRIM function parameter order change.

### Does this PR introduce any user-facing change?

Yes, change the behavior of `size(null)` back to be the same as 2.4.

### How was this patch tested?

N/A

Closes #27834 from cloud-fan/revert.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 8efb710)
Signed-off-by: Dongjoon Hyun <[email protected]>
@dongjoon-hyun
Member

cc @rxin since he is the release manager for 3.0.0. (Also cc @gatorsmile.)

@HeartSaVioR
Contributor

Btw, the configuration name spark.sql.legacy.sizeOfNull doesn't seem intuitive: it doesn't convey how the function will behave. It's even more confusing now that the default is true, even though the configuration exists to retain legacy behavior.

How about renaming it to spark.sql.legacy.sizeOfNullAsMinusOne and setting it to true (so that no translation is needed)? Personally I would prefer spark.sql.legacy.sizeOfNullAsNull set to false (which would require reversing the value in translation), so either is fine with me. Once the configuration name is intuitive, we could guide users to use it instead.
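
For reference, my understanding of how the current flag maps to behavior after this revert (worth double-checking):

```scala
// spark.sql.legacy.sizeOfNull = true   (default after this revert) -> size(null) == -1
// spark.sql.legacy.sizeOfNull = false                              -> size(null) == NULL
spark.conf.set("spark.sql.legacy.sizeOfNull", "false")
spark.sql("SELECT size(CAST(NULL AS ARRAY<INT>))").show()  // NULL
```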

@dongjoon-hyun
Member

That looks better to me, @HeartSaVioR. Please make a PR for that.

@dongjoon-hyun
Member

Oh..

@dongjoon-hyun
Member

This config already exists in the 2.4.x line, so it seems difficult to change.

@dongjoon-hyun
Member

Since it already has legacy in the name, spark.sql.legacy.sizeOfNull, true just means preserving the old behavior, whatever it was. So, in that sense, the current name is also okay.

@HeartSaVioR
Contributor

HeartSaVioR commented Mar 13, 2020

Oh, OK. I'm not sure I love having legacy configurations that force end users to remember the old behavior (it would be nicer to give the config a proper name and still control the behavior via the default value), but given that this has been one of our conventions, it looks OK for now.

One thing worth paying attention to: @ssimeonov built their own workaround even though the behavior can be changed simply by touching the legacy config.

@ssimeonov Were you aware of the legacy config? If so, could you please elaborate on why you didn't leverage it? Thanks in advance!

@ssimeonov
Contributor

We built our workaround long before this config made it to Spark. Once all the code was changed, it was easier to keep working in our own framework and ignore the mess of settings.

@HeartSaVioR
Contributor

Ah OK, you were faster. Thanks for the info.

dongjoon-hyun pushed a commit that referenced this pull request Mar 17, 2020
### What changes were proposed in this pull request?

Make `size(null)` return null under ANSI mode, regardless of the `spark.sql.legacy.sizeOfNull` config.

### Why are the changes needed?

In #27834, we change the result of `size(null)` to be -1 to match the 2.4 behavior and avoid breaking changes.

However, it's true that the "return -1" behavior is error-prone when being used with aggregate functions. The current ANSI mode controls a bunch of "better behaviors" like failing on overflow. We don't enable these "better behaviors" by default because they are too breaking. The "return null" behavior of `size(null)` is a good fit of the ANSI mode.

### Does this PR introduce any user-facing change?

No as ANSI mode is off by default.

### How was this patch tested?

new tests

Closes #27936 from cloud-fan/null.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
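
A hedged sketch of the interaction that commit describes (the behavior is taken from the commit message above; the ANSI flag name used here is the standard spark.sql.ansi.enabled and should be double-checked against the docs):

```scala
// ANSI mode off (default): the legacy flag decides the result of size(null).
spark.conf.set("spark.sql.ansi.enabled", "false")
spark.conf.set("spark.sql.legacy.sizeOfNull", "true")
spark.sql("SELECT size(CAST(NULL AS ARRAY<INT>))").show()  // -1

// ANSI mode on: size(null) is NULL regardless of spark.sql.legacy.sizeOfNull.
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("SELECT size(CAST(NULL AS ARRAY<INT>))").show()  // NULL
```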
dongjoon-hyun pushed a commit that referenced this pull request Mar 17, 2020
Closes #27936 from cloud-fan/null.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit dc5ebc2)
Signed-off-by: Dongjoon Hyun <[email protected]>
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
Closes apache#27834 from cloud-fan/revert.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
Closes apache#27936 from cloud-fan/null.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>