[SPARK-30082][SQL][2.4] Depend on Scala type coercion when building replace query #26749
Closed
Conversation
### What changes were proposed in this pull request?
Depend on Scala's type coercion when building the replace query. Do not cast `NaN` to an `Integer`, `Long`, `Short`, or `Byte`: casting `NaN` to those types results in a `0`, which erroneously replaces `0`s when only `NaN`s should be replaced.
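A minimal sketch of the idea, with a hypothetical helper name (`nanKeyApplies`); this is not the actual patch in `DataFrameNaFunctions`, just an illustration of the guard the description calls for: a `NaN` replacement key can only meaningfully match `Float`/`Double` columns, so integral columns are skipped instead of letting `NaN` coerce to `0`:
```
import org.apache.spark.sql.types._

// Hypothetical predicate: should a NaN replacement key apply to this column type?
// (Sketch only -- not the code in DataFrameNaFunctions.)
def nanKeyApplies(dataType: DataType): Boolean = dataType match {
  case FloatType | DoubleType => true   // NaN is representable here; replace as usual
  case ByteType | ShortType | IntegerType | LongType => false  // NaN.toLong == 0, so it must never match
  case _ => false                       // NaN cannot occur in non-numeric columns either
}
```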
### Why are the changes needed?
This Scala code snippet:
```
println(Double.NaN.toLong)   // prints 0: NaN narrows to 0 for integral types
```
prints `0`, which is problematic: if you run the following Spark code, the `0`s get replaced as well:
```
>>> df = spark.createDataFrame([(1.0, 0), (0.0, 3), (float('nan'), 0)], ("index", "value"))
>>> df.show()
+-----+-----+
|index|value|
+-----+-----+
| 1.0| 0|
| 0.0| 3|
| NaN| 0|
+-----+-----+
>>> df.replace(float('nan'), 2).show()
+-----+-----+
|index|value|
+-----+-----+
| 1.0| 2|
| 0.0| 3|
| 2.0| 2|
+-----+-----+
```
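For reference, a rough Scala-API equivalent of the repro above; `DataFrameNaFunctions.replace` is real Spark API, while the session setup, data, and column names simply mirror the PySpark example:
```
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("nan-replace-repro").getOrCreate()
import spark.implicits._

val df = Seq((1.0, 0), (0.0, 3), (Double.NaN, 0)).toDF("index", "value")
// Before this fix, the 0s in the integer "value" column were rewritten too.
df.na.replace(Seq("index", "value"), Map(Double.NaN -> 2.0)).show()
```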
### Does this PR introduce any user-facing change?
Yes. After this PR, running the same code snippet returns the expected results:
```
>>> df = spark.createDataFrame([(1.0, 0), (0.0, 3), (float('nan'), 0)], ("index", "value"))
>>> df.show()
+-----+-----+
|index|value|
+-----+-----+
| 1.0| 0|
| 0.0| 3|
| NaN| 0|
+-----+-----+
>>> df.replace(float('nan'), 2).show()
+-----+-----+
|index|value|
+-----+-----+
| 1.0| 0|
| 0.0| 3|
| 2.0| 0|
+-----+-----+
```
Additionally, query results may change as a result of now depending on Scala's type coercion rules.
### How was this patch tested?
Added unit tests to verify that replacing `NaN` only affects columns of type `Float` and `Double`.
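A self-contained sketch of what such a check could look like, continuing from the DataFrame built in the Scala repro above (assumed data and expected rows; not the actual suite added by the PR):
```
import org.apache.spark.sql.Row

val replaced = df.na.replace(Seq("index", "value"), Map(Double.NaN -> 2.0))

// Only the NaN in the Double "index" column changes; the Int "value" column keeps its 0s.
assert(replaced.collect().toSet == Set(Row(1.0, 0), Row(0.0, 3), Row(2.0, 0)))
```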
Closes apache#26738 from johnhany97/SPARK-30082.
Lead-authored-by: John Ayad <[email protected]>
Co-authored-by: John Ayad <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
johnhany97 (Contributor, Author): @cloud-fan can you take a look?
Contributor: ok to test
Member: ok to test
Test build #114808 has finished for PR 26749 at commit
cloud-fan pushed a commit that referenced this pull request on Dec 4, 2019:
Closes #26749 from johnhany97/SPARK-30082-2.4.
Authored-by: John Ayad <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan (Contributor): thanks, merging to 2.4!
johnhany97 added a commit to johnhany97/spark that referenced this pull request on Jan 15, 2020:
Closes apache#26749 from johnhany97/SPARK-30082-2.4.
Authored-by: John Ayad <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
bulldozer-bot pushed a commit to palantir/spark that referenced this pull request on Jan 15, 2020:
…e query (#628) apache#26738 apache#26749