
Conversation

@johnhany97 (Contributor) commented Dec 2, 2019

### What changes were proposed in this pull request?

Rely on type coercion when building the replace query. This solves an edge case where, when trying to replace `NaN`s, `0`s would get replaced too.
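In the old code (quoted in the review below), both sides of each generated CASE WHEN branch were cast to the column's type, so for an integer column the source side became `CAST(NaN AS INT)`, i.e. `0`; the fix keeps the source as an uncast literal and lets analysis-time type coercion widen the column side instead. A quick, hedged way to see the root cause from `spark-shell` (assumes an active `spark` session):

```
// Casting NaN to an integral type yields 0, which is why the old
// replace query silently matched 0s when asked to replace NaN.
spark.sql("SELECT CAST(CAST('NaN' AS DOUBLE) AS INT) AS nan_as_int").show()
// +----------+
// |nan_as_int|
// +----------+
// |         0|
// +----------+
```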

### Why are the changes needed?

This Scala code snippet:

```
println(Double.NaN.toLong)
```

returns `0`, which is problematic: if you run the following PySpark snippet, `0`s get replaced as well:

```
>>> df = spark.createDataFrame([(1.0, 0), (0.0, 3), (float('nan'), 0)], ("index", "value"))
>>> df.show()
+-----+-----+
|index|value|
+-----+-----+
|  1.0|    0|
|  0.0|    3|
|  NaN|    0|
+-----+-----+
>>> df.replace(float('nan'), 2).show()
+-----+-----+
|index|value|
+-----+-----+
|  1.0|    2|
|  0.0|    3|
|  2.0|    2|
+-----+-----+
```
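The same zero-collapse happens for every integral width, not just `Long`; for illustration, in plain Scala:

```
// JVM narrowing conversions map NaN to 0 for all integral types.
println(Double.NaN.toInt)   // 0
println(Double.NaN.toShort) // 0
println(Double.NaN.toByte)  // 0
println(Float.NaN.toLong)   // 0
```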

### Does this PR introduce any user-facing change?

Yes. After this PR, running the same snippet returns the expected results:

```
>>> df = spark.createDataFrame([(1.0, 0), (0.0, 3), (float('nan'), 0)], ("index", "value"))
>>> df.show()
+-----+-----+
|index|value|
+-----+-----+
|  1.0|    0|
|  0.0|    3|
|  NaN|    0|
+-----+-----+

>>> df.replace(float('nan'), 2).show()
+-----+-----+
|index|value|
+-----+-----+
|  1.0|    0|
|  0.0|    3|
|  2.0|    0|
+-----+-----+
```

Additionally, other query results can change, because the replace query now relies on the analyzer's type coercion rules (the review discussion below shows an example).

### How was this patch tested?

Added unit tests to verify replacing NaN only affects columns of type Float and Double.
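As a rough sketch of such a test (hedged: a standalone approximation using the public DataFrame API, not the exact suite code added in this PR):

```
import org.apache.spark.sql.SparkSession

object ReplaceNaNSmokeTest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("replace-nan").getOrCreate()
    import spark.implicits._

    val df = Seq((1.0, 0), (0.0, 3), (Double.NaN, 0)).toDF("index", "value")
    val replaced = df.na.replace("*", Map(Double.NaN -> 2.0))

    // Only the Double column should change; the Int column's 0s stay intact.
    assert(replaced.collect().map(_.getInt(1)).toSeq == Seq(0, 3, 0))
    assert(replaced.collect().map(_.getDouble(0)).toSeq == Seq(1.0, 0.0, 2.0))
    spark.stop()
  }
}
```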

@johnhany97 johnhany97 changed the title [SPARK-30082] Do not replace Zeros when replacing NaNs [WIP][SPARK-30082] Do not replace Zeros when replacing NaNs Dec 2, 2019
@mccheah (Contributor) commented Dec 2, 2019

Ok to test

@johnhany97 johnhany97 changed the title [WIP][SPARK-30082] Do not replace Zeros when replacing NaNs [SPARK-30082] Do not replace Zeros when replacing NaNs Dec 2, 2019
@mccheah (Contributor) commented Dec 2, 2019

@dongjoon-hyun can you take a look?

@dongjoon-hyun dongjoon-hyun added SQL and removed PYSPARK labels Dec 3, 2019
@dongjoon-hyun dongjoon-hyun changed the title [SPARK-30082] Do not replace Zeros when replacing NaNs [SPARK-30082][SQL] Do not replace Zeros when replacing NaNs Dec 3, 2019
@dongjoon-hyun (Member) commented:

Thank you for pinging me, @mccheah . Sure.

@dongjoon-hyun (Member) commented:

ok to test

@SparkQA commented Dec 3, 2019

Test build #114753 has finished for PR 26738 at commit ed6f08d.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 3, 2019

Test build #114769 has finished for PR 26738 at commit 279a9fd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 3, 2019

Test build #114770 has finished for PR 26738 at commit 6b5d26d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 3, 2019

Test build #114771 has finished for PR 26738 at commit 10c91d6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 3, 2019

Test build #114772 has finished for PR 26738 at commit 1744b28.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor) commented:

thanks, merging to master!

@cloud-fan cloud-fan closed this in 8c2849a Dec 3, 2019
@cloud-fan (Contributor) commented:

@johnhany97 do you mind opening another PR for 2.4? thanks!

@johnhany97 (Contributor, Author) commented:

@cloud-fan sure thing

johnhany97 added a commit to johnhany97/spark that referenced this pull request Dec 3, 2019
Do not cast `NaN` to an `Integer`, `Long`, `Short` or `Byte`, because casting `NaN` to those types results in a `0` that erroneously replaces real `0`s when only `NaN`s should be replaced.
Closes apache#26738 from johnhany97/SPARK-30082.

Lead-authored-by: John Ayad <[email protected]>
Co-authored-by: John Ayad <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
@dongjoon-hyun (Member) commented:

Thank you, @johnhany97 , @mccheah , @cloud-fan .

@SparkQA commented Dec 3, 2019

Test build #114776 has finished for PR 26738 at commit 1295633.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 3, 2019

Test build #114781 has finished for PR 26738 at commit b3709a1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

attilapiros pushed a commit to attilapiros/spark that referenced this pull request Dec 6, 2019
@gatorsmile (Member) left a comment:

This change's impact is much bigger than what the PR description documents.

```
 def buildExpr(v: Any) = Cast(Literal(v), keyExpr.dataType)
 val branches = replacementMap.flatMap { case (source, target) =>
-  Seq(buildExpr(source), buildExpr(target))
+  Seq(Literal(source), buildExpr(target))
```
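For context, the surrounding per-column builder in `DataFrameNaFunctions` has roughly this shape (a paraphrase reconstructed from the snippet above; the `CaseKeyWhen` fallback wiring is an assumption about the unquoted surrounding code):

```
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.catalyst.expressions.{CaseKeyWhen, Cast, Literal}
import org.apache.spark.sql.types.StructField

// NOTE: paraphrase of Spark internals, not verbatim source. Builds
// WHEN source THEN target pairs for one column and falls back to the
// original column value when no replacement source matches.
def replaceCol(df: DataFrame, col: StructField, replacementMap: Map[_, _]): Column = {
  val keyExpr = df.col(col.name).expr
  def buildExpr(v: Any) = Cast(Literal(v), keyExpr.dataType)
  val branches = replacementMap.flatMap { case (source, target) =>
    Seq(Literal(source), buildExpr(target))
  }.toSeq
  new Column(CaseKeyWhen(keyExpr, branches :+ keyExpr)).as(col.name)
}
```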
Review comment (Member):

This fix relies on the type coercion rule to do the casting on the other side, which can change query results. For example:

```
def createNaNDF(): DataFrame = {
  Seq[(java.lang.Integer, java.lang.Long, java.lang.Short,
    java.lang.Byte, java.lang.Float, java.lang.Double)](
    (2, 2L, 2.toShort, 2.toByte, 2.0f, 2.0)
  ).toDF("int", "long", "short", "byte", "float", "double")
}

test("replace float with double") {
  createNaNDF().na.replace("*", Map(
    2.3 -> 9.0
  )).show()

  createNaNDF().na.replace("*", Map(
    2.3 -> 9.0
  )).explain(true)
}
```

Review comment (Member):

Before this PR:

```
+---+----+-----+----+-----+------+
|int|long|short|byte|float|double|
+---+----+-----+----+-----+------+
|  9|   9|    9|   9|  2.0|   2.0|
+---+----+-----+----+-----+------+

== Parsed Logical Plan ==
Project [CASE WHEN (int#99 = cast(2.3 as int)) THEN cast(9.0 as int) ELSE int#99 END AS int#117, CASE WHEN (long#100L = cast(2.3 as bigint)) THEN cast(9.0 as bigint) ELSE long#100L END AS long#118L, CASE WHEN (short#101 = cast(2.3 as smallint)) THEN cast(9.0 as smallint) ELSE short#101 END AS short#119, CASE WHEN (byte#102 = cast(2.3 as tinyint)) THEN cast(9.0 as tinyint) ELSE byte#102 END AS byte#120, CASE WHEN (float#103 = cast(2.3 as float)) THEN cast(9.0 as float) ELSE float#103 END AS float#121, CASE WHEN (double#104 = cast(2.3 as double)) THEN cast(9.0 as double) ELSE double#104 END AS double#122]
+- Project [_1#86 AS int#99, _2#87L AS long#100L, _3#88 AS short#101, _4#89 AS byte#102, _5#90 AS float#103, _6#91 AS double#104]
   +- LocalRelation [_1#86, _2#87L, _3#88, _4#89, _5#90, _6#91]

== Analyzed Logical Plan ==
int: int, long: bigint, short: smallint, byte: tinyint, float: float, double: double
Project [CASE WHEN (int#99 = cast(2.3 as int)) THEN cast(9.0 as int) ELSE int#99 END AS int#117, CASE WHEN (long#100L = cast(2.3 as bigint)) THEN cast(9.0 as bigint) ELSE long#100L END AS long#118L, CASE WHEN (short#101 = cast(2.3 as smallint)) THEN cast(9.0 as smallint) ELSE short#101 END AS short#119, CASE WHEN (byte#102 = cast(2.3 as tinyint)) THEN cast(9.0 as tinyint) ELSE byte#102 END AS byte#120, CASE WHEN (float#103 = cast(2.3 as float)) THEN cast(9.0 as float) ELSE float#103 END AS float#121, CASE WHEN (double#104 = cast(2.3 as double)) THEN cast(9.0 as double) ELSE double#104 END AS double#122]
+- Project [_1#86 AS int#99, _2#87L AS long#100L, _3#88 AS short#101, _4#89 AS byte#102, _5#90 AS float#103, _6#91 AS double#104]
   +- LocalRelation [_1#86, _2#87L, _3#88, _4#89, _5#90, _6#91]

== Optimized Logical Plan ==
Project [CASE WHEN (_1#86 = 2) THEN 9 ELSE _1#86 END AS int#117, CASE WHEN (_2#87L = 2) THEN 9 ELSE _2#87L END AS long#118L, CASE WHEN (_3#88 = 2) THEN 9 ELSE _3#88 END AS short#119, CASE WHEN (_4#89 = 2) THEN 9 ELSE _4#89 END AS byte#120, CASE WHEN (_5#90 = 2.3) THEN 9.0 ELSE _5#90 END AS float#121, CASE WHEN (_6#91 = 2.3) THEN 9.0 ELSE _6#91 END AS double#122]
+- LocalRelation [_1#86, _2#87L, _3#88, _4#89, _5#90, _6#91]

== Physical Plan ==
*(1) Project [CASE WHEN (_1#86 = 2) THEN 9 ELSE _1#86 END AS int#117, CASE WHEN (_2#87L = 2) THEN 9 ELSE _2#87L END AS long#118L, CASE WHEN (_3#88 = 2) THEN 9 ELSE _3#88 END AS short#119, CASE WHEN (_4#89 = 2) THEN 9 ELSE _4#89 END AS byte#120, CASE WHEN (_5#90 = 2.3) THEN 9.0 ELSE _5#90 END AS float#121, CASE WHEN (_6#91 = 2.3) THEN 9.0 ELSE _6#91 END AS double#122]
+- *(1) LocalTableScan [_1#86, _2#87L, _3#88, _4#89, _5#90, _6#91]
```

After this PR:

```
+---+----+-----+----+-----+------+
|int|long|short|byte|float|double|
+---+----+-----+----+-----+------+
|  2|   2|    2|   2|  2.0|   2.0|
+---+----+-----+----+-----+------+

== Parsed Logical Plan ==
'Project [CASE WHEN (int#99 = 2.3) THEN cast(9.0 as int) ELSE int#99 END AS int#117, CASE WHEN (long#100L = 2.3) THEN cast(9.0 as bigint) ELSE long#100L END AS long#118, CASE WHEN (short#101 = 2.3) THEN cast(9.0 as smallint) ELSE short#101 END AS short#119, CASE WHEN (byte#102 = 2.3) THEN cast(9.0 as tinyint) ELSE byte#102 END AS byte#120, CASE WHEN (float#103 = 2.3) THEN cast(9.0 as float) ELSE float#103 END AS float#121, CASE WHEN (double#104 = 2.3) THEN cast(9.0 as double) ELSE double#104 END AS double#122]
+- Project [_1#86 AS int#99, _2#87L AS long#100L, _3#88 AS short#101, _4#89 AS byte#102, _5#90 AS float#103, _6#91 AS double#104]
   +- LocalRelation [_1#86, _2#87L, _3#88, _4#89, _5#90, _6#91]

== Analyzed Logical Plan ==
int: int, long: bigint, short: smallint, byte: tinyint, float: float, double: double
Project [CASE WHEN (cast(int#99 as double) = 2.3) THEN cast(9.0 as int) ELSE int#99 END AS int#117, CASE WHEN (cast(long#100L as double) = 2.3) THEN cast(9.0 as bigint) ELSE long#100L END AS long#118L, CASE WHEN (cast(short#101 as double) = 2.3) THEN cast(9.0 as smallint) ELSE short#101 END AS short#119, CASE WHEN (cast(byte#102 as double) = 2.3) THEN cast(9.0 as tinyint) ELSE byte#102 END AS byte#120, CASE WHEN (cast(float#103 as double) = 2.3) THEN cast(9.0 as float) ELSE float#103 END AS float#121, CASE WHEN (double#104 = 2.3) THEN cast(9.0 as double) ELSE double#104 END AS double#122]
+- Project [_1#86 AS int#99, _2#87L AS long#100L, _3#88 AS short#101, _4#89 AS byte#102, _5#90 AS float#103, _6#91 AS double#104]
   +- LocalRelation [_1#86, _2#87L, _3#88, _4#89, _5#90, _6#91]

== Optimized Logical Plan ==
Project [CASE WHEN (cast(_1#86 as double) = 2.3) THEN 9 ELSE _1#86 END AS int#117, CASE WHEN (cast(_2#87L as double) = 2.3) THEN 9 ELSE _2#87L END AS long#118L, CASE WHEN (cast(_3#88 as double) = 2.3) THEN 9 ELSE _3#88 END AS short#119, CASE WHEN (cast(_4#89 as double) = 2.3) THEN 9 ELSE _4#89 END AS byte#120, CASE WHEN (cast(_5#90 as double) = 2.3) THEN 9.0 ELSE _5#90 END AS float#121, CASE WHEN (_6#91 = 2.3) THEN 9.0 ELSE _6#91 END AS double#122]
+- LocalRelation [_1#86, _2#87L, _3#88, _4#89, _5#90, _6#91]

== Physical Plan ==
*(1) Project [CASE WHEN (cast(_1#86 as double) = 2.3) THEN 9 ELSE _1#86 END AS int#117, CASE WHEN (cast(_2#87L as double) = 2.3) THEN 9 ELSE _2#87L END AS long#118L, CASE WHEN (cast(_3#88 as double) = 2.3) THEN 9 ELSE _3#88 END AS short#119, CASE WHEN (cast(_4#89 as double) = 2.3) THEN 9 ELSE _4#89 END AS byte#120, CASE WHEN (cast(_5#90 as double) = 2.3) THEN 9.0 ELSE _5#90 END AS float#121, CASE WHEN (_6#91 = 2.3) THEN 9.0 ELSE _6#91 END AS double#122]
+- *(1) LocalTableScan [_1#86, _2#87L, _3#88, _4#89, _5#90, _6#91]
```

Review comment (Contributor):

The new behavior makes more sense, but I agree the PR description needs an update to reflect all the changes. cc @johnhany97

@johnhany97 (Contributor, Author) commented:

Nice catch there @gatorsmile. I've updated the PR description. Should I also update the PR title? Let me know if you'd like me to add more details to the PR description.

Review comment (Member):

Yes, we also need to update the PR title.

@johnhany97 johnhany97 changed the title [SPARK-30082][SQL] Do not replace Zeros when replacing NaNs [SPARK-30082][SQL] Depend on Scala type coercion when building replace query Jan 4, 2020
bulldozer-bot bot pushed a commit to palantir/spark that referenced this pull request Jan 15, 2020
…e query (#628)

apache#26738
apache#26749

