[SPARK-26448][SQL] retain the difference between 0.0 and -0.0 #23388
Conversation
moved to JoinSuite.
moved to the object, so that we can reuse it.
The comments are moved to the new rule.
Test build #100458 has finished for PR 23388 at commit
The error looks legitimate. Are there side effects on the decimal logic?
Force-pushed from 67c694f to 0cd1bcb
Test build #100459 has finished for PR 23388 at commit
Test build #100460 has finished for PR 23388 at commit
This is to prove that the test framework can distinguish -0.0 and 0.0.
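For context, a tiny illustration (not from the PR) of why bit-level comparison is what lets a test framework tell the two apart:

```scala
// 0.0 == -0.0 under IEEE/JVM comparison, but the raw bit patterns differ,
// so a framework that compares bits (or rendered strings) can distinguish them.
val pos = 0.0d
val neg = -0.0d
println(pos == neg)                                              // true
println(java.lang.Double.doubleToRawLongBits(pos).toHexString)   // 0
println(java.lang.Double.doubleToRawLongBits(neg).toHexString)   // 8000000000000000
```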
The sum of an int column is long; we shouldn't use double here.
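For illustration (hypothetical column name `i`, assuming a SparkSession `spark`), the aggregated type is long, not double:

```scala
import spark.implicits._

// sum over an IntegerType column yields a LongType result.
Seq(1, 2, 3).toDF("i").selectExpr("sum(i)").printSchema()
// root
//  |-- sum(i): long (nullable = true)
```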
uh... checkAnswer was unable to detect this
percentile always returns double; we need to cast max to double so that we can compare the results.
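A hypothetical illustration of the point (table and column names are made up, assuming a SparkSession `spark`):

```scala
// percentile(...) yields a double, so max(a) is cast to double to make the
// two results directly comparable as the same type.
spark.sql("SELECT percentile(a, 0.5) AS p, CAST(max(a) AS DOUBLE) AS m FROM t").printSchema()
// both p and m are double columns
```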
the result is decimal type, not double.
It's good to fix this.
Do you think this PR also affects Spark's behavior for existing apps?
It does not.
This test case was wrong in the first place; my change to checkAnswer exposed it.
docs/sql-migration-guide-upgrade.md
Outdated
Is the motivation of this fix to avoid this behavior change?
yea
But for join keys and GROUP BY groups, the previous difference between 0.0 and -0.0 is treated as a bug, so we don't need to mention it in the migration guide?
Check out the test case; "distinguish -0.0" is not about agg or join.
Weren't 0.0 and -0.0 treated as distinct groups for agg before the recent fix?
Yes, and it's a bug. But if -0.0 is not used in grouping keys (and other similar places), users should still be able to distinguish it.
Ah, I see what you mean. Are you saying we should add a migration guide entry for the behavior changes of grouping keys/window partition keys?
Yes, sorry for the confusion. I'm not sure whether a migration guide entry is needed, because it is a bug.
Test build #100465 has finished for PR 23388 at commit
Test build #100474 has finished for PR 23388 at commit
docs/sql-migration-guide-upgrade.md
Outdated
I keep this migration guide entry because this bug is not very intuitive: literally, -0.0 is not 0.0.
Is it better to explicitly state that outputs still distinguish 0.0 and -0.0? For example, Seq(-0.0d).toDS().show() returns -0.0 in any version.
I think we only need to mention the difference between new and old versions.
Test build #100477 has finished for PR 23388 at commit
Test build #100485 has finished for PR 23388 at commit
It seems to be a big behavior change in Spark testing.
So is this PR going to force us to use an explicit collect().toSeq for checkAnswer in some cases?
The result type of the above SQL statement is decimal(31,6). Can we use decimal type here?
If NormalizeFloatingNumbers is an optimizer rule, NormalizeNaNAndZero will only go through the optimizer, so does it need to extend ExpectsInputTypes?
It doesn't, but I do it for safety. IIUC the test framework will throw an exception if a plan becomes unresolved after a rule.
@dongjoon-hyun I've implemented a safer way to let the test framework distinguish -0.0; now we don't need to change a lot of existing test cases.
That's a big relief. Thank you, @cloud-fan !
Nit. All the window expressions in the project list also refer to the partitionSpec. Should we also normalize these?
Assume the query is select a, a + sum(a) over (partition by a) ....
Since the project list is evaluated for each input row, I think the a in the project list should retain the difference of -0.0. Thus I think only the partitionSpec needs to be normalized.
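A hedged sketch of that shape of query (assuming a SparkSession `spark`; names are illustrative):

```scala
import spark.implicits._

val df = Seq(-0.0d, 0.0d).toDF("a")
df.createOrReplaceTempView("t")
// Only the PARTITION BY key is normalized, so both rows fall into the same
// window partition; the bare `a` in the select list is evaluated per input
// row and still surfaces -0.0 to the user.
spark.sql("SELECT a, a + SUM(a) OVER (PARTITION BY a) AS s FROM t").show()
```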
Then make this clear by writing it up in a comment please? If the answer to this question is not obvious to the reviewer then it may also not be obvious to a later reader of the code, so in general it is advisable to answer misguided reviewer questions by adding comments. :)
Technically an optimizer rule should not change the result of a query. This rule does exactly that. Perhaps we should add a little bit of documentation for this.
The major reason is that we create Joins during optimization (for subqueries), and I'm also worried that join reorder may break it. I'll add a comment for it.
Also add it to nonExcludableRules?
ah good catch!
retest this please
Test build #100892 has finished for PR 23388 at commit
NaNs are never equal to anything including other NaNs, so there is no reason to normalize them for join keys. It is fine to do it anyway for simplicity, but it should be made clear in the comments that this is not because we have to but just because it is easier.
That's a good point. In Spark SQL, the EQUAL operator treats 0.0 and -0.0 as the same, so we have to follow that for join keys. I'm not sure how the SQL standard defines it, but changing the equality semantics of Spark SQL is another topic.
But you are right that we don't have to do it for all joins; we only need to do normalization for the types of join that do binary comparison.
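A small sketch of the equality semantics being discussed (assuming a SparkSession `spark`; the expected outputs reflect the behavior described above):

```scala
// Plain JVM/IEEE comparison: NaN is not equal to itself.
println(Double.NaN == Double.NaN)                                          // false
// Spark SQL's EQUAL, as described above, treats NaN = NaN and 0.0 = -0.0 as true.
spark.sql("SELECT CAST('NaN' AS DOUBLE) = CAST('NaN' AS DOUBLE)").show()
spark.sql("SELECT CAST('0.0' AS DOUBLE) = CAST('-0.0' AS DOUBLE)").show()
```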
"correct"
This reads as if the code is wrong. But it is not. The fact that we have to do this normalization for joins at least is not something that needs to be an analyzer rule for query correctness. Without the normalization the join query is perfectly fine if we execute it as a cross product with a filter applied as a post-join condition. In this case the requirement for normalization is an artifact of the fact that we use a shortcut for executing the join (binary comparison, sometimes hashing) which doesn't have the correct semantics for comparison. On the other hand for aggregation and window function partitioning the normalization is required for correctness.
Then make this clear by writing it up in a comment please? If the answer to this question is not obvious to the reviewer then it may also not be obvious to a later reader of the code, so in general it is advisable to answer misguided reviewer questions by adding comments. :)
Should this also check the right keys? Or is that implied by the fact that the keys of both sides have the same type? If so, please leave a comment to make it clear why the right keys are not checked.
The analyzer will make sure the left and right join keys are of the same data type. I'll add a comment to explain it, thanks!
This code is not future-proof against the situation where map types become comparable in the future. It could be made future-proof by throwing an exception if a map type is encountered here. If I understand correctly, this code should never encounter a map type unless map types are comparable.
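For concreteness, a hypothetical version of such a guard (illustrative names, not the rule's actual code):

```scala
import org.apache.spark.sql.types._

// Recurse into nested types, but fail fast on MapType so that this code is
// revisited if map types ever become comparable.
def needsNormalization(dt: DataType): Boolean = dt match {
  case FloatType | DoubleType => true
  case StructType(fields)     => fields.exists(f => needsNormalization(f.dataType))
  case ArrayType(et, _)       => needsNormalization(et)
  case _: MapType =>
    throw new IllegalStateException("grouping/join keys cannot be of map type")
  case _                      => false
}
```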
good point! I'll update soon.
This test relies on the specific property that division of zero by zero returns a different kind of NaN than Float.NaN. That is subtle and needs to be documented with a comment. You could also test with floatToRawIntBits that the values actually have different bits. Because if they do not, then you are actually not testing what the test is purportedly testing.
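A sketch of the suggested sanity check (assuming the test builds its second NaN as 0.0f / 0.0f, which is an assumption here):

```scala
import java.lang.Float.floatToRawIntBits

val canonicalNaN = Float.NaN
val otherNaN = 0.0f / 0.0f   // whether this has different bits is platform/JIT dependent
assert(canonicalNaN.isNaN && otherNaN.isNaN)
assert(floatToRawIntBits(canonicalNaN) != floatToRawIntBits(otherNaN),
  "the two NaNs have identical bits, so the test would not exercise NaN normalization")
```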
fixed.
And maybe also arrays of structs and structs of arrays?
Also select the v2 columns for clarity on why the result is what it is?
Why the style difference compared to the previous test cases?
* to the same group.
* 2. In aggregate grouping keys, different NaNs should belong to the same group, -0.0 and 0.0
*    should belong to the same group.
* 3. In join keys, different NaNs should be treated as same, `-0.0` and `0.0` should be
Still remove "different NaNs should be treated as same" here?
Wait. This isn't right. NaNs in joins should actually be treated as *not* equal. That can't be done with binary value comparison. Well, maybe we can normalize NaNs to nulls? That would do the right thing.
Hi @bart-samwel , in Spark SQL, different NaN values are treated as the same, and -0.0 and 0.0 are treated as the same. So the normalization proposed by this PR keeps the current behavior and semantics unchanged. If later on we find that NaN values should not be treated as the same, we basically need to change 2 places:
1. the EQUAL operator should drop the special handling of NaNs, and always return false for 2 NaNs.
2. the normalization here should not apply to NaNs.
Furthermore, even the same NaN value should not equal itself, so the binary comparison won't work. As you proposed, we should normalize NaN to null at that time. Anyway, I think we should think about NaN later, as it will be a behavior change. What do you think?
Let's be consistent with the equals operator now in this PR, but then we may want to consider changing that to be consistent with the IEEE floating point standard before Spark 3.0 as well.
Test build #100927 has finished for PR 23388 at commit
Test build #100928 has finished for PR 23388 at commit
@cloud-fan open a JIRA and revisit NaN handling before Spark 3.0?
Will make one pass tonight. Thanks!
@gatorsmile I have created https://issues.apache.org/jira/browse/SPARK-26575 to track the followup.
LGTM Thanks! Merged to master.
## What changes were proposed in this pull request?
In apache#23043 , we introduced a behavior change: Spark users are not able to distinguish 0.0 and -0.0 anymore. This PR proposes an alternative fix to the original bug, to retain the difference between 0.0 and -0.0 inside Spark.
The idea is, we can rewrite the window partition key, join key and grouping key during the logical phase, to normalize the special floating numbers. Thus only operators that care about special floating numbers need to pay the perf overhead, and end users can distinguish -0.0.
## How was this patch tested?
existing test
Closes apache#23388 from cloud-fan/minor.
Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
…s for final aggregate
## What changes were proposed in this pull request?
A followup of apache#23388 . `AggUtils.createAggregate` is not the right place to normalize the grouping expressions, as the final aggregate is also created by it. The grouping expressions of the final aggregate should be attributes which refer to the grouping expressions in the partial aggregate. This PR moves the normalization to the caller side of `AggUtils`.
## How was this patch tested?
existing tests
Closes apache#23692 from cloud-fan/follow.
Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request?
This is a followup of #23388 . #23388 has an issue: it doesn't handle subquery expressions and assumes they will be turned into joins. However, this is not true for non-correlated subquery expressions.
This PR fixes this issue. It now doesn't skip `Subquery`, and subquery expressions will be handled by `OptimizeSubqueries`, which runs the optimizer with the subquery. Note that correlated subquery expressions will be handled twice: once in `OptimizeSubqueries`, once later when they become joins. This is OK as `NormalizeFloatingNumbers` is idempotent now.
### Why are the changes needed?
fix a bug
### Does this PR introduce _any_ user-facing change?
yes, see the newly added test.
### How was this patch tested?
new test
Closes #28785 from cloud-fan/normalize.
Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?
In #23043 , we introduced a behavior change: Spark users are not able to distinguish 0.0 and -0.0 anymore.
This PR proposes an alternative fix to the original bug, to retain the difference between 0.0 and -0.0 inside Spark.
The idea is, we can rewrite the window partition key, join key and grouping key during the logical phase, to normalize the special floating numbers. Thus only operators that care about special floating numbers need to pay the perf overhead, and end users can distinguish -0.0.
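A minimal value-level sketch of that normalization idea (illustrative only; the actual NormalizeFloatingNumbers rule rewrites expressions in the logical plan rather than values):

```scala
// Map every NaN to one canonical NaN and -0.0 to 0.0, so that hashing and
// binary comparison of normalized keys agree with SQL equality.
def normalize(d: Double): Double = {
  if (d.isNaN) Double.NaN       // one canonical NaN bit pattern
  else if (d == -0.0d) 0.0d     // also matches +0.0, which is harmless
  else d
}
```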
How was this patch tested?
existing test