[SPARK-30082][SQL][2.4] Depend on Scala type coercion when building replace query #26749

johnhany97 · 2019-12-03T16:30:38Z

What changes were proposed in this pull request?

Depend on type coercion when building the replace query. This fixes an edge case where replacing NaNs also replaced 0s.

Why are the changes needed?

This Scala code snippet:

import scala.math;

println(Double.NaN.toLong)

prints 0. This is a problem because, when you run the following Spark code, 0s are replaced as well:

>>> df = spark.createDataFrame([(1.0, 0), (0.0, 3), (float('nan'), 0)], ("index", "value"))
>>> df.show()
+-----+-----+
|index|value|
+-----+-----+
|  1.0|    0|
|  0.0|    3|
|  NaN|    0|
+-----+-----+
>>> df.replace(float('nan'), 2).show()
+-----+-----+
|index|value|
+-----+-----+
|  1.0|    2|
|  0.0|    3|
|  2.0|    2|
+-----+-----+
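
The mechanism can be illustrated without a Spark cluster. The sketch below is a deliberately simplified pure-Python model, not Spark's actual implementation: `jvm_double_to_long`, `replace_buggy` and `replace_coerced` are hypothetical names introduced here, and the column types are passed in by hand. It assumes the old behaviour cast the replacement key to each column's type, and models the new behaviour as comparing integral cells in the double domain.

```python
import math

def jvm_double_to_long(x: float) -> int:
    """Mimic the JVM double -> long conversion: NaN becomes 0, out-of-range values saturate."""
    if math.isnan(x):
        return 0
    if x >= 2**63:
        return 2**63 - 1
    if x <= -(2**63):
        return -(2**63)
    return int(x)  # truncates toward zero, like the JVM

def _double_match(cell: float, source: float) -> bool:
    # In this model, NaN in a float/double column matches a NaN key.
    return (math.isnan(cell) and math.isnan(source)) or cell == source

def replace_buggy(rows, types, source, target):
    """Assumed old behaviour: the key is cast to the column type, so NaN turns into 0."""
    out = []
    for row in rows:
        new_row = []
        for cell, col_type in zip(row, types):
            if col_type == "long":
                key = jvm_double_to_long(source)  # NaN -> 0
                new_row.append(int(target) if cell == key else cell)
            else:
                new_row.append(float(target) if _double_match(cell, source) else cell)
        out.append(tuple(new_row))
    return out

def replace_coerced(rows, types, source, target):
    """Assumed new behaviour: integral cells are widened to double before comparing."""
    out = []
    for row in rows:
        new_row = []
        for cell, col_type in zip(row, types):
            if col_type == "long":
                # float(cell) == NaN is always False, so integral columns are left alone.
                new_row.append(int(target) if float(cell) == source else cell)
            else:
                new_row.append(float(target) if _double_match(cell, source) else cell)
        out.append(tuple(new_row))
    return out

rows = [(1.0, 0), (0.0, 3), (float("nan"), 0)]
types = ("double", "long")
print(replace_buggy(rows, types, float("nan"), 2))
print(replace_coerced(rows, types, float("nan"), 2))
```

In this model the first call reproduces the table above, with the 0s in `value` replaced, while the second call only touches the NaN in `index`.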

Does this PR introduce any user-facing change?

Yes. After this PR, running the same snippet returns the expected results:

>>> df = spark.createDataFrame([(1.0, 0), (0.0, 3), (float('nan'), 0)], ("index", "value"))
>>> df.show()
+-----+-----+
|index|value|
+-----+-----+
|  1.0|    0|
|  0.0|    3|
|  NaN|    0|
+-----+-----+

>>> df.replace(float('nan'), 2).show()
+-----+-----+
|index|value|
+-----+-----+
|  1.0|    0|
|  0.0|    3|
|  2.0|    0|
+-----+-----+

Additionally, because the replace query now depends on Scala's type coercion rules, some query results change.

How was this patch tested?

Added unit tests to verify replacing NaN only affects columns of type Float and Double.
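
On a build without this fix, the same restriction can be approximated from user code by limiting the replacement to floating-point columns. This is only a sketch of a possible workaround, not part of the patch: `replace_nan_in_float_columns` is a hypothetical helper name, and it relies on the existing `DataFrame.dtypes` attribute and the `subset` argument of `DataFrame.replace`.

```python
def replace_nan_in_float_columns(df, value):
    """Replace NaN with `value`, touching only float/double columns (hypothetical helper)."""
    # DataFrame.dtypes yields (column name, type string) pairs, e.g. ("index", "double").
    float_cols = [name for name, dtype in df.dtypes if dtype in ("float", "double")]
    if not float_cols:
        return df  # no column can hold NaN, so there is nothing to replace
    return df.replace(float("nan"), value, subset=float_cols)
```

With the `df` from the example above, `replace_nan_in_float_columns(df, 2).show()` should leave the `value` column untouched, because that column is never passed to `replace`.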

Do not cast `NaN` to an `Integer`, `Long`, `Short` or `Byte`. This is because casting `NaN` to those types results in a `0` which erroneously replaces `0`s while only `NaN`s should be replaced.

This Scala code snippet:

```
import scala.math;

println(Double.NaN.toLong)
```

returns `0` which is problematic as if you run the following Spark code, `0`s get replaced as well:

```
>>> df = spark.createDataFrame([(1.0, 0), (0.0, 3), (float('nan'), 0)], ("index", "value"))
>>> df.show()
+-----+-----+
|index|value|
+-----+-----+
|  1.0|    0|
|  0.0|    3|
|  NaN|    0|
+-----+-----+
>>> df.replace(float('nan'), 2).show()
+-----+-----+
|index|value|
+-----+-----+
|  1.0|    2|
|  0.0|    3|
|  2.0|    2|
+-----+-----+
```

Yes, after the PR, running the same above code snippet returns the correct expected results:

```
>>> df = spark.createDataFrame([(1.0, 0), (0.0, 3), (float('nan'), 0)], ("index", "value"))
>>> df.show()
+-----+-----+
|index|value|
+-----+-----+
|  1.0|    0|
|  0.0|    3|
|  NaN|    0|
+-----+-----+
>>> df.replace(float('nan'), 2).show()
+-----+-----+
|index|value|
+-----+-----+
|  1.0|    0|
|  0.0|    3|
|  2.0|    0|
+-----+-----+
```

Added unit tests to verify replacing `NaN` only affects columns of type `Float` and `Double`

Closes apache#26738 from johnhany97/SPARK-30082.

Lead-authored-by: John Ayad <[email protected]>
Co-authored-by: John Ayad <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

johnhany97 · 2019-12-03T16:34:46Z

@cloud-fan can you take a look?

cloud-fan · 2019-12-03T16:37:57Z

ok to test

HyukjinKwon · 2019-12-04T00:58:31Z

ok to test

SparkQA · 2019-12-04T04:03:54Z

Test build #114808 has finished for PR 26749 at commit f23d113.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

### What changes were proposed in this pull request?

Do not cast `NaN` to an `Integer`, `Long`, `Short` or `Byte`. This is because casting `NaN` to those types results in a `0` which erroneously replaces `0`s while only `NaN`s should be replaced.

### Why are the changes needed?

This Scala code snippet:

```
import scala.math;

println(Double.NaN.toLong)
```

returns `0` which is problematic as if you run the following Spark code, `0`s get replaced as well:

```
>>> df = spark.createDataFrame([(1.0, 0), (0.0, 3), (float('nan'), 0)], ("index", "value"))
>>> df.show()
+-----+-----+
|index|value|
+-----+-----+
|  1.0|    0|
|  0.0|    3|
|  NaN|    0|
+-----+-----+
>>> df.replace(float('nan'), 2).show()
+-----+-----+
|index|value|
+-----+-----+
|  1.0|    2|
|  0.0|    3|
|  2.0|    2|
+-----+-----+
```

### Does this PR introduce any user-facing change?

Yes, after the PR, running the same above code snippet returns the correct expected results:

```
>>> df = spark.createDataFrame([(1.0, 0), (0.0, 3), (float('nan'), 0)], ("index", "value"))
>>> df.show()
+-----+-----+
|index|value|
+-----+-----+
|  1.0|    0|
|  0.0|    3|
|  NaN|    0|
+-----+-----+
>>> df.replace(float('nan'), 2).show()
+-----+-----+
|index|value|
+-----+-----+
|  1.0|    0|
|  0.0|    3|
|  2.0|    0|
+-----+-----+
```

### How was this patch tested?

Added unit tests to verify replacing `NaN` only affects columns of type `Float` and `Double`

Closes #26749 from johnhany97/SPARK-30082-2.4.

Authored-by: John Ayad <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

cloud-fan · 2019-12-04T05:26:49Z

thanks, merging to 2.4!

cloud-fan changed the title ~~[SPARK-30082][SQL] Do not replace Zeros when replacing NaNs~~ [SPARK-30082][SQL][2.4] Do not replace Zeros when replacing NaNs Dec 3, 2019

cloud-fan closed this Dec 4, 2019

johnhany97 changed the title ~~[SPARK-30082][SQL][2.4] Do not replace Zeros when replacing NaNs~~ [SPARK-30082][SQL][2.4] Depend on Scala type coercion when building replace query Jan 10, 2020

johnhany97 mentioned this pull request Jan 10, 2020

[SPARK-30082][SQL] Depend on Scala type coercion when building replace query palantir/spark#628

Merged

dongjoon-hyun added the SQL label Feb 5, 2020
