
Conversation

@cloud-fan
Contributor

What changes were proposed in this pull request?

This PR reverts #26051 and #26066

Why are the changes needed?

There is no standard requiring that size(null) must return null, and returning -1 is also reasonable. The original change was essentially cosmetic, and we should avoid cosmetic changes that break existing queries. This is similar to reverting the TRIM function parameter-order change.

Does this PR introduce any user-facing change?

Yes, the behavior of size(null) is changed back to match 2.4.
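
For illustration, a minimal spark-shell sketch of the restored behavior (illustrative only, not a test; the legacy config mentioned later in this thread is assumed to be at its default):

```scala
// Spark 3.0 after this revert, same as 2.4: size of a NULL collection is -1, not NULL.
spark.sql("SELECT size(CAST(NULL AS ARRAY<INT>))").show()    // -1
spark.sql("SELECT size(CAST(NULL AS MAP<INT, INT>))").show() // -1 (size also applies to maps)
```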

How was this patch tested?

N/A

@cloud-fan
Contributor Author

cloud-fan commented Mar 6, 2020

@MaxGekk
Member

MaxGekk commented Mar 6, 2020

returning -1 looks reasonable as well.

For the full picture, -1 may lead to wrong query results. Here is an example I got from @ssimeonov:

"A client discovered this behavior while investigating post-click user engagement in their AdTech system. The schema was per ad placement and post-click user engagements were in an array of structs. The culprit was df.groupBy('placementId).agg(sum(size('engagements)).as("engagement_count"), ...), which subtracted 1 for every click without post-click engagement. Luckily, the behavior led to negative engagement counts in some periods, which alerted them to the problem and this bizarre behavior."

@SparkQA

SparkQA commented Mar 6, 2020

Test build #119465 has finished for PR 27834 at commit 9ebf183.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ssimeonov
Contributor

Thanks @MaxGekk. Yes, after discovering this, we had to comb through our entire codebase to refactor size(column) into column.size, adding our own implicit.

Taking a step back, a -1 return from a function reeks of non-SQL thinking by whoever implemented this in Hive originally. As Spark grows, its audience is expanding rapidly. We need to move away from random engineer decisions made many years ago in a different product towards what makes sense for the future (much larger) audience of Spark users.

@dongjoon-hyun
Member

Thank you for pinging me, @cloud-fan.

@dongjoon-hyun
Member

dongjoon-hyun commented Mar 6, 2020

I prefer the as-is status of 3.0, but this seems to fall into the same category as when @marmbrus and @gatorsmile asked us to revert the behavior fix for the two-parameter TRIM function.

Technically, two-parameter TRIM was a correctness fix. However, we reverted it and preserved the old behavior because two-parameter TRIM is not part of the SQL standard.

@dongjoon-hyun
Member

+1, LGTM. I'll leave this PR for the other committers' review, especially @marmbrus.

@dongjoon-hyun
Member

BTW, please hold on until the vote.

@HeartSaVioR
Contributor

This doesn't seem to be just a cosmetic change, given that we are hearing about actual use cases affected by it. It would be OK to return -1 in non-aggregate contexts, since -1 can still be distinguished from valid values, but in aggregations -1 is silently treated as a valid value, which causes a correctness issue. Avoiding that requires pre-processing (computing a column from size(col) and changing the value to NULL when it is -1, as sketched below) or some other hacky workaround.
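
For concreteness, one shape that pre-processing could take (a sketch with hypothetical names, not code from this PR):

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{lit, size, when}

// Map the -1 sentinel back to NULL so aggregates treat it as "unknown" again.
def sizeOrNull(c: Column): Column =
  when(size(c) === -1, lit(null)).otherwise(size(c))

// e.g. df.groupBy("placementId").agg(sum(sizeOrNull($"engagements")))
```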

I feel we should have a clear reason for returning -1, i.e. what benefit we get from it. At the very least it should be a trade-off if we decide to pick one behavior; if it isn't even a trade-off, it's clearly a correctness issue we should fix.

@ssimeonov
Could you please share the details of the workaround you used?

@ssimeonov
Contributor

@HeartSaVioR the workaround was simple, unpleasant and unstable:

  • Simple, because we realized that we should never use size() as its behavior is (a) inconsistent with SQL principles and (b) dangerous for analytics.

  • Unpleasant, because it involved replacing every single use of the function in our codebase with a Column implicit that we created (size(x) -> x.size); see the sketch after this list. This was easy for IDE-managed code but hard for notebook code. After consideration, we decided that our implicit would return 0 for null, because in every single use case we could think of for our data, we would be converting null returns to 0. (I would not recommend this for Spark itself, as the behavior is inconsistent with SQL.) This is also the reason why we went with an implicit as opposed to, say, using import order to hide functions.size() with our own size(). Caveat: if we had more people using Spark SQL, we would have gone with size0(x) and sizen(x) instead of an implicit, for versions that return 0 or null for null input, respectively.

  • Unstable, because there is no way to guarantee the problem will not happen again due to user error. It's part of our Spark DOs and DON'Ts not to use size(), but discipline alone does not offer guaranteed protection. The only way to guarantee the behavior is via a change to Spark. A settings-driven change is an excellent way to have the best of both worlds: sane and consistent behavior for future users and an easy backward-compatibility mode for others.
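
A sketch of what such an implicit might look like (names here are hypothetical, and returning 0 for null is the in-house choice described above, not a recommendation for Spark itself):

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{lit, when, size => sqlSize}

object ColumnSizeSyntax {
  // Hypothetical implicit: write x.size instead of size(x); null input yields 0.
  implicit class SizeOps(private val c: Column) extends AnyVal {
    def size: Column = when(c.isNull, lit(0)).otherwise(sqlSize(c))
  }
}

// usage:
//   import ColumnSizeSyntax._
//   df.agg(sum($"engagements".size))
```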

As for the idea to have the size() behavior differ for aggregation vs. non-aggregation processing, I think this would be a bad pattern to introduce into an already very complex system such as Spark.

@HeartSaVioR
Contributor

As for the idea to have the size() behavior differ for aggregation vs. non-aggregation processing

No, I didn't mean it as a proposal. I used it to describe an issue that returning NULL would not have.

@dongjoon-hyun changed the title from "revert [SPARK-24640][SQL] Return NULL from size(NULL) by default" to "[SPARK-31091] Revert SPARK-24640 Return NULL from size(NULL) by default" on Mar 9, 2020
@dongjoon-hyun
Member

Thank you all for making and reviewing this PR. Since the vote finished, I'll merge this.

dongjoon-hyun pushed a commit that referenced this pull request Mar 11, 2020
[SPARK-31091] Revert SPARK-24640 Return NULL from size(NULL) by default

### What changes were proposed in this pull request?

This PR reverts #26051 and #26066

### Why are the changes needed?

There is no standard requiring that `size(null)` must return null, and returning -1 looks reasonable as well. This is kind of a cosmetic change and we should avoid it if it breaks existing queries. This is similar to reverting TRIM function parameter order change.

### Does this PR introduce any user-facing change?

Yes, change the behavior of `size(null)` back to be the same as 2.4.

### How was this patch tested?

N/A

Closes #27834 from cloud-fan/revert.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 8efb710)
Signed-off-by: Dongjoon Hyun <[email protected]>
@dongjoon-hyun
Member

cc @rxin since he is the release manager for 3.0.0. (Also cc @gatorsmile.)

@HeartSaVioR
Contributor

Btw, the configuration name spark.sql.legacy.sizeOfNull doesn't seem intuitive: it doesn't convey how the function will behave. It's even more confusing now that the default is true, even though the configuration exists to retain legacy behavior.

How about renaming it to spark.sql.legacy.sizeOfNullAsMinusOne and setting it to true (so that no translation is needed)? Personally I would prefer spark.sql.legacy.sizeOfNullAsNull set to false (which would require reversing the value in translation), so either is fine with me. Once the configuration name is intuitive, we could guide users to use it instead.
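
For reference, my understanding of how the current flag maps to behavior after this revert (worth double-checking):

```scala
// spark.sql.legacy.sizeOfNull = true   (default after this revert) -> size(null) == -1
// spark.sql.legacy.sizeOfNull = false                              -> size(null) == NULL
spark.conf.set("spark.sql.legacy.sizeOfNull", "false")
spark.sql("SELECT size(CAST(NULL AS ARRAY<INT>))").show()  // NULL
```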

@dongjoon-hyun
Member

That looks better to me, @HeartSaVioR. Please make a PR for that.

@dongjoon-hyun
Member

Oh..

@dongjoon-hyun
Member

This config already exists in the 2.4.x line, so it seems difficult to change.

@dongjoon-hyun
Member

Since it already has legacy in the name, spark.sql.legacy.sizeOfNull, true just means preserving the old behavior, whatever it was. So, in that sense, the current name is also okay.

@HeartSaVioR
Contributor

HeartSaVioR commented Mar 13, 2020

Oh, OK. I'm not sure I love having legacy configurations that force end users to remember the old behavior (it would be nicer to give the config a proper name and still control the behavior via the default value), but given that this has been one of our conventions, it looks OK for now.

One thing worth paying attention to: @ssimeonov built their own workaround even though the behavior can be changed simply by touching the legacy config.

@ssimeonov Were you aware of the legacy config? If so, could you please elaborate on why you didn't leverage it? Thanks in advance!

@ssimeonov
Contributor

We built our workaround long before this config made it to Spark. Once all the code was changed, it was easier to keep working in our own framework and ignore the mess of settings.

@HeartSaVioR
Contributor

Ah OK, you were faster. Thanks for the info.

dongjoon-hyun pushed a commit that referenced this pull request Mar 17, 2020
### What changes were proposed in this pull request?

Make `size(null)` return null under ANSI mode, regardless of the `spark.sql.legacy.sizeOfNull` config.

### Why are the changes needed?

In #27834, we change the result of `size(null)` to be -1 to match the 2.4 behavior and avoid breaking changes.

However, it's true that the "return -1" behavior is error-prone when being used with aggregate functions. The current ANSI mode controls a bunch of "better behaviors" like failing on overflow. We don't enable these "better behaviors" by default because they are too breaking. The "return null" behavior of `size(null)` is a good fit of the ANSI mode.

### Does this PR introduce any user-facing change?

No as ANSI mode is off by default.

### How was this patch tested?

new tests

Closes #27936 from cloud-fan/null.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
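
A hedged sketch of the interaction that commit describes (the behavior is taken from the commit message above; the ANSI flag name used here is the standard spark.sql.ansi.enabled and should be double-checked against the docs):

```scala
// ANSI mode off (default): the legacy flag decides the result of size(null).
spark.conf.set("spark.sql.ansi.enabled", "false")
spark.conf.set("spark.sql.legacy.sizeOfNull", "true")
spark.sql("SELECT size(CAST(NULL AS ARRAY<INT>))").show()  // -1

// ANSI mode on: size(null) is NULL regardless of spark.sql.legacy.sizeOfNull.
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("SELECT size(CAST(NULL AS ARRAY<INT>))").show()  // NULL
```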
dongjoon-hyun pushed a commit that referenced this pull request Mar 17, 2020
Closes #27936 from cloud-fan/null.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit dc5ebc2)
Signed-off-by: Dongjoon Hyun <[email protected]>
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
Closes apache#27834 from cloud-fan/revert.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
Closes apache#27936 from cloud-fan/null.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>