[SPARK-26448][SQL] retain the difference between 0.0 and -0.0 #23388
Conversation
moved to JoinSuite.
moved to the object, so that we can reuse it.
The comments are moved to the new rule.
Test build #100458 has finished for PR 23388 at commit
The error looks legitimate. Are there side effects on the decimal logic?
Force-pushed from 67c694f to 0cd1bcb
Test build #100459 has finished for PR 23388 at commit
Test build #100460 has finished for PR 23388 at commit
This is to prove that the test framework can distinguish -0.0 and 0.0.
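For context, a tiny illustration (not from the PR) of why bit-level comparison is what lets a test framework tell the two apart:

```scala
// 0.0 == -0.0 under IEEE/JVM comparison, but the raw bit patterns differ,
// so a framework that compares bits (or rendered strings) can distinguish them.
val pos = 0.0d
val neg = -0.0d
println(pos == neg)                                              // true
println(java.lang.Double.doubleToRawLongBits(pos).toHexString)   // 0
println(java.lang.Double.doubleToRawLongBits(neg).toHexString)   // 8000000000000000
```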
The sum of an int column is long; we shouldn't use double here.
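For illustration (hypothetical column name `i`, assuming a SparkSession `spark`), the aggregated type is long, not double:

```scala
import spark.implicits._

// sum over an IntegerType column yields a LongType result.
Seq(1, 2, 3).toDF("i").selectExpr("sum(i)").printSchema()
// root
//  |-- sum(i): long (nullable = true)
```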
uh... checkAnswer was unable to detect this
percentile always returns double; we need to cast max to double so that we can compare the results.
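A hypothetical illustration of the point (table and column names are made up, assuming a SparkSession `spark`):

```scala
// percentile(...) yields a double, so max(a) is cast to double to make the
// two results directly comparable as the same type.
spark.sql("SELECT percentile(a, 0.5) AS p, CAST(max(a) AS DOUBLE) AS m FROM t").printSchema()
// both p and m are double columns
```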
the result is decimal type, not double.
It's good to fix this.
Do you think this PR also affects Spark's behavior for existing apps?
It does not.
This test case was wrong in the first place; my change to checkAnswer exposed it.
docs/sql-migration-guide-upgrade.md
Outdated
Is the motivation of this fix to avoid this behavior change?
yea
But for join keys and GROUP BY groups, the previous difference between 0.0 and -0.0 is treated as a bug, so we don't need to mention it in the migration guide?
Check out the test case; "distinguish -0.0" is not about agg or join.
Weren't 0.0 and -0.0 treated as distinct groups for agg before the recent fix?
Yes, and it's a bug. But if -0.0 is not used in grouping keys (and other similar places), users should still be able to distinguish it.
Ah, I see what you mean. Are you saying we should add a migration guide entry for the behavior changes of grouping keys/window partition keys?
Yes, sorry for the confusion. I'm not sure whether a migration guide entry is needed, because it is a bug.
Test build #100465 has finished for PR 23388 at commit
Test build #100474 has finished for PR 23388 at commit
docs/sql-migration-guide-upgrade.md
Outdated
I keep this migration guide entry because this bug is not very intuitive: literally, -0.0 is not 0.0.
Is it better to explicitly state that outputs still distinguish 0.0 and -0.0? For example, Seq(-0.0d).toDS().show() returns -0.0 in any version.
I think we only need to mention the difference between new and old versions.
Test build #100477 has finished for PR 23388 at commit
Test build #100485 has finished for PR 23388 at commit
It seems to be a big behavior change in Spark testing.
So is this PR going to force us to use an explicit collect().toSeq for checkAnswer in some cases?
The result type of the above SQL statement is decimal(31,6). Can we use decimal type here?
If NormalizeFloatingNumbers is an optimizer rule, NormalizeNaNAndZero will only go through the optimizer, so does it need to extend ExpectsInputTypes?
It doesn't, but I do it for safety. IIUC the test framework will throw an exception if a plan becomes unresolved after a rule.
@dongjoon-hyun I've implemented a safer way to let the test framework distinguish -0.0; now we don't need to change a lot of existing test cases.
That's a big relief. Thank you, @cloud-fan !
Nit. All the window expressions in the project list also refer to the partitionSpec. Should we also normalize these?
Assume the query is select a, a + sum(a) over (partition by a) ....
Since the project list is evaluated for each input row, I think the a in the project list should retain the difference of -0.0. Thus I think only the partitionSpec needs to be normalized.
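A hedged sketch of that shape of query (assuming a SparkSession `spark`; names are illustrative):

```scala
import spark.implicits._

val df = Seq(-0.0d, 0.0d).toDF("a")
df.createOrReplaceTempView("t")
// Only the PARTITION BY key is normalized, so both rows fall into the same
// window partition; the bare `a` in the select list is evaluated per input
// row and still surfaces -0.0 to the user.
spark.sql("SELECT a, a + SUM(a) OVER (PARTITION BY a) AS s FROM t").show()
```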
Then make this clear by writing it up in a comment please? If the answer to this question is not obvious to the reviewer then it may also not be obvious to a later reader of the code, so in general it is advisable to answer misguided reviewer questions by adding comments. :)
Technically an optimizer rule should not change the result of a query. This rule does exactly that. Perhaps we should add a little bit of documentation for this.
The major reason is that we create Joins during optimization (for subqueries), and I'm also worried that join reorder may break it. I'll add a comment for it.
Also add it to nonExcludableRules?
ah good catch!
retest this please
Test build #100892 has finished for PR 23388 at commit
NaNs are never equal to anything including other NaNs, so there is no reason to normalize them for join keys. It is fine to do it anyway for simplicity, but it should be made clear in the comments that this is not because we have to but just because it is easier.
That's a good point. In Spark SQL, the EQUAL operator treats 0.0 and -0.0 as the same, so we have to follow that for join keys. I'm not sure how the SQL standard defines it, but changing the equality semantics of Spark SQL is another topic.
But you are right that we don't have to do it for all joins; we only need to do normalization for the types of join that do binary comparison.
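A small sketch of the equality semantics being discussed (assuming a SparkSession `spark`; the expected outputs reflect the behavior described above):

```scala
// Plain JVM/IEEE comparison: NaN is not equal to itself.
println(Double.NaN == Double.NaN)                                          // false
// Spark SQL's EQUAL, as described above, treats NaN = NaN and 0.0 = -0.0 as true.
spark.sql("SELECT CAST('NaN' AS DOUBLE) = CAST('NaN' AS DOUBLE)").show()
spark.sql("SELECT CAST('0.0' AS DOUBLE) = CAST('-0.0' AS DOUBLE)").show()
```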
"correct"
This reads as if the code is wrong. But it is not. The fact that we have to do this normalization for joins at least is not something that needs to be an analyzer rule for query correctness. Without the normalization the join query is perfectly fine if we execute it as a cross product with a filter applied as a post-join condition. In this case the requirement for normalization is an artifact of the fact that we use a shortcut for executing the join (binary comparison, sometimes hashing) which doesn't have the correct semantics for comparison. On the other hand for aggregation and window function partitioning the normalization is required for correctness.
Then make this clear by writing it up in a comment please? If the answer to this question is not obvious to the reviewer then it may also not be obvious to a later reader of the code, so in general it is advisable to answer misguided reviewer questions by adding comments. :)
Should this also check the right keys? Or is that implied by the fact that the keys of both sides have the same type? If so, please leave a comment to make it clear why the right keys are not checked.
The analyzer will make sure the left and right join keys are of the same data type. I'll add a comment to explain it, thanks!
This code is not future-proof against the situation where map types become comparable in the future. It could be made future-proof by throwing an exception if a map type is encountered here. If I understand correctly, this code should never encounter a map type unless map types are comparable.
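For concreteness, a hypothetical version of such a guard (illustrative names, not the rule's actual code):

```scala
import org.apache.spark.sql.types._

// Recurse into nested types, but fail fast on MapType so that this code is
// revisited if map types ever become comparable.
def needsNormalization(dt: DataType): Boolean = dt match {
  case FloatType | DoubleType => true
  case StructType(fields)     => fields.exists(f => needsNormalization(f.dataType))
  case ArrayType(et, _)       => needsNormalization(et)
  case _: MapType =>
    throw new IllegalStateException("grouping/join keys cannot be of map type")
  case _                      => false
}
```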
good point! I'll update soon.
This test relies on the specific property that division of zero by zero returns a different kind of NaN than Float.NaN. That is subtle and needs to be documented with a comment. You could also test with floatToRawIntBits that the values actually have different bits. Because if they do not, then you are actually not testing what the test is purportedly testing.
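A sketch of the suggested sanity check (assuming the test builds its second NaN as 0.0f / 0.0f, which is an assumption here):

```scala
import java.lang.Float.floatToRawIntBits

val canonicalNaN = Float.NaN
val otherNaN = 0.0f / 0.0f   // whether this has different bits is platform/JIT dependent
assert(canonicalNaN.isNaN && otherNaN.isNaN)
assert(floatToRawIntBits(canonicalNaN) != floatToRawIntBits(otherNaN),
  "the two NaNs have identical bits, so the test would not exercise NaN normalization")
```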
fixed.
And maybe also arrays of structs and structs of arrays?
Also select the v2 columns for clarity on why the result is what it is?
Why the style difference compared to the previous test cases?
* to the same group.
* 2. In aggregate grouping keys, different NaNs should belong to the same group, -0.0 and 0.0
*    should belong to the same group.
* 3. In join keys, different NaNs should be treated as same, `-0.0` and `0.0` should be
Still remove "different NaNs should be treated as same" here?
Wait. This isn't right. NaNs in joins should actually be treated as *not* equal. That can't be done with binary value comparison. Well, maybe we can normalize NaNs to nulls? That would do the right thing.
Hi @bart-samwel , in Spark SQL, different NaN values are treated as the same, and -0.0 and 0.0 are treated as the same. So the normalization proposed by this PR keeps the current behavior and semantics unchanged. If later on we find that NaN values should not be treated as the same, we basically need to change 2 places:
1. the EQUAL operator should drop the special handling of NaNs, and always return false for 2 NaNs.
2. the normalization here should not apply to NaNs.
Furthermore, even the same NaN value should not equal itself, so the binary comparison won't work. As you proposed, we should normalize NaN to null at that time. Anyway, I think we should think about NaN later, as it will be a behavior change. What do you think?
Let's be consistent with the equals operator now in this PR, but then we may want to consider changing that to be consistent with the IEEE floating point standard before Spark 3.0 as well.
Test build #100927 has finished for PR 23388 at commit
Test build #100928 has finished for PR 23388 at commit
@cloud-fan open a JIRA and revisit NaN handling before Spark 3.0?
Will make one pass tonight. Thanks!
@gatorsmile I have created https://issues.apache.org/jira/browse/SPARK-26575 to track the followup.
LGTM Thanks! Merged to master.
## What changes were proposed in this pull request?
In apache#23043 , we introduced a behavior change: Spark users are not able to distinguish 0.0 and -0.0 anymore. This PR proposes an alternative fix to the original bug, to retain the difference between 0.0 and -0.0 inside Spark.
The idea is, we can rewrite the window partition key, join key and grouping key during the logical phase, to normalize the special floating numbers. Thus only operators that care about special floating numbers need to pay the perf overhead, and end users can distinguish -0.0.
## How was this patch tested?
existing test
Closes apache#23388 from cloud-fan/minor.
Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
…s for final aggregate
## What changes were proposed in this pull request?
A followup of apache#23388 . `AggUtils.createAggregate` is not the right place to normalize the grouping expressions, as the final aggregate is also created by it. The grouping expressions of the final aggregate should be attributes which refer to the grouping expressions in the partial aggregate. This PR moves the normalization to the caller side of `AggUtils`.
## How was this patch tested?
existing tests
Closes apache#23692 from cloud-fan/follow.
Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request?
This is a followup of #23388 . #23388 has an issue: it doesn't handle subquery expressions and assumes they will be turned into joins. However, this is not true for non-correlated subquery expressions.
This PR fixes this issue. It now doesn't skip `Subquery`, and subquery expressions will be handled by `OptimizeSubqueries`, which runs the optimizer with the subquery. Note that correlated subquery expressions will be handled twice: once in `OptimizeSubqueries`, once later when they become joins. This is OK as `NormalizeFloatingNumbers` is idempotent now.
### Why are the changes needed?
fix a bug
### Does this PR introduce _any_ user-facing change?
yes, see the newly added test.
### How was this patch tested?
new test
Closes #28785 from cloud-fan/normalize.
Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?
In #23043 , we introduced a behavior change: Spark users are not able to distinguish 0.0 and -0.0 anymore.
This PR proposes an alternative fix to the original bug, to retain the difference between 0.0 and -0.0 inside Spark.
The idea is, we can rewrite the window partition key, join key and grouping key during the logical phase, to normalize the special floating numbers. Thus only operators that care about special floating numbers need to pay the perf overhead, and end users can distinguish -0.0.
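A minimal value-level sketch of that normalization idea (illustrative only; the actual NormalizeFloatingNumbers rule rewrites expressions in the logical plan rather than values):

```scala
// Map every NaN to one canonical NaN and -0.0 to 0.0, so that hashing and
// binary comparison of normalized keys agree with SQL equality.
def normalize(d: Double): Double = {
  if (d.isNaN) Double.NaN       // one canonical NaN bit pattern
  else if (d == -0.0d) 0.0d     // also matches +0.0, which is harmless
  else d
}
```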
How was this patch tested?
existing test