[SPARK-18555][SQL] DataFrameNaFunctions.fill messes up original values in long integers #15994
Conversation
Test build #69076 has finished for PR 15994 at commit
You won't be able to remove public API methods like this. Why is this necessary if the idea is to handle Long differently?
I thought `fill[T](value: T)` could handle the String type, so I removed it.
Doesn't it change the binary signature of the API? That is the problem.
Test build #69101 has finished for PR 15994 at commit
@srowen I excluded the fill function from MiMa; is that OK?
Test build #69106 has finished for PR 15994 at commit
Test build #69107 has finished for PR 15994 at commit
cc @cloud-fan
project/MimaExcludes.scala (Outdated)
No, as I say, you definitely can't just remove public API methods.
But even though it is removed, `fill[T](value: T)` can handle those calls, so it is backward compatible at the source level. Must we keep the `fill(value: String)` and `fill(value: Double)` functions?
The thing is they are not backward compatible at the bytecode level, so applications will break if they are not rebuilt.
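To see why, here is a minimal sketch of the problem; the class and method names are hypothetical, not Spark's real `DataFrameNaFunctions`. A generic Scala method erases to an `Object`-taking JVM signature, while `fill(value: Double)` compiles to a primitive `double` signature, so code compiled against the old overload cannot find the new method at runtime.
```
// Minimal sketch of the binary-compatibility problem; names are hypothetical.
class NaFunctions {
  // Old API: compiled to the JVM descriptor fill(D)Ljava/lang/String;
  // def fill(value: Double): String = s"filled with $value"

  // Replacement generic API: erases to fill(Ljava/lang/Object;)Ljava/lang/String;
  def fill[T](value: T): String = s"filled with $value"
}

object BinaryCompatDemo extends App {
  // Recompiling this call site works fine, but an application compiled
  // against the old fill(D) descriptor throws NoSuchMethodError when run
  // against a jar that only ships the erased fill(Object) method.
  println(new NaFunctions().fill(0.0))
}
```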
OK, thanks, I will try another way~
Instead of introducing `fill[T]`, which breaks backward compatibility, can we just add a new version for the long type? i.e. `def fill(value: Long)`
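A rough sketch of such an overload, mirroring the `fill(value: Double, ...)` code quoted in the PR description below; the `fillCol` helper and surrounding names are taken from that snippet, so treat this as illustrative rather than the merged change:
```
// Illustrative Long overload: fill nulls in integral columns directly,
// avoiding the lossy round-trip through Double.
def fill(value: Long, cols: Seq[String]): DataFrame = {
  val columnEquals = df.sparkSession.sessionState.analyzer.resolver
  val projections = df.schema.fields.map { f =>
    // Only fill if the column is part of the cols list.
    if (f.dataType.isInstanceOf[IntegralType] && cols.exists(col => columnEquals(f.name, col))) {
      fillCol[Long](f, value)
    } else {
      df.col(f.name)
    }
  }
  df.select(projections : _*)
}
```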
Adding a long-type function is OK. I thought the string/long/double logic was the same, so I wanted to merge them with a generic function.
This is OK~
windpiger@360eafe
I think we can have a private def fill[T], but the public API can't be changed.
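In sketch form, using the `fillValue` name proposed later in this thread (signatures abridged, so this is illustrative rather than the merged code):
```
// Sketch: public overloads keep their binary signatures and delegate
// to one private generic implementation.
def fill(value: Double, cols: Seq[String]): DataFrame = fillValue(value, cols)
def fill(value: Long, cols: Seq[String]): DataFrame = fillValue(value, cols)
def fill(value: String, cols: Seq[String]): DataFrame = fillValue(value, cols)

// Private, so changing or removing it later cannot break user bytecode.
private def fillValue[T](value: T, cols: Seq[String]): DataFrame = ???
```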
OK, thanks a lot! I have made it private. @cloud-fan
Test build #69213 has finished for PR 15994 at commit
Test build #69214 has finished for PR 15994 at commit
nit: `fill1(value, cols)` should work; Scala has type inference.
We don't need the `(Scala-specific)` note and the `@since` tag for private methods.
nit: we can combine the check here:
```
val targetType = value match {
  case _: Long => LongType
  case _: Double => DoubleType
  case _: String => StringType
  case _ => throw ...
}
```
a clearer version:
```
val typeMatches = (targetType, f.dataType) match {
  case (LongType, dt) => dt.isInstanceOf[IntegralType]
  case (DoubleType, dt) => dt.isInstanceOf[FractionalType]
  case (StringType, dt) => dt == StringType
}
if (typeMatches && cols.exists(col => columnEquals(f.name, col))) {
  fillCol...
} else
  ...
```
Thanks!
I have made all the changes except one:
If T is a double type, should this apply to all numeric columns (including LongType/IntegerType), or just to FractionalType?
The existing `fill(value: Double)` applies to all numeric columns, and I think `fill(value: Long)` should keep that logic. A sketch of this variant follows.
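Under that reading, one illustrative variant of the suggested match, assuming Spark's `NumericType`/`IntegralType` hierarchy (not the merged code):
```
// Sketch: Double keeps filling every numeric column (the pre-existing
// behavior), while the new Long target covers only integral columns.
val typeMatches = (targetType, f.dataType) match {
  case (LongType, dt)   => dt.isInstanceOf[IntegralType]
  case (DoubleType, dt) => dt.isInstanceOf[NumericType]   // not just FractionalType
  case (StringType, dt) => dt == StringType
  case _ => false
}
```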
Test build #69229 has started for PR 15994 at commit
nit: put it on one line? i.e. `def fill... = fill1...`
Why can T be Integer and Float?
Removing them is OK.
Why do we match `jd.Double` here instead of Scala `Double`?
Fixed it.
Could I ask you to change `[[DataFrame]]` to back-ticked `DataFrame`? It seems DataFrame is unrecognisable via unidoc/genjavadoc (see #16013), which ends up in a documentation build failure with Java 8.
Would it be better to change them in #16013?
Never mind; I will let you know if that one is merged first. If this one is merged first, I will rebase and fix them together.
ok
@HyukjinKwon Yeah, the bad news is that I'm sure the javadoc generation is going to re-break periodically. We can try to catch it with reviews, and your work at least gets it to a working state. But we'll regularly clean it up again before releases.
@windpiger I don't think this was quite resolved -- you need to back-tick DataFrame unfortunately to get it to work with javadoc 8.
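For illustration, a hypothetical scaladoc snippet (not the exact lines in this PR): genjavadoc turns `[[DataFrame]]` into a Java link that javadoc 8 cannot resolve, whereas back-ticked text passes through as plain code formatting.
```
// Hypothetical scaladoc showing the fix: back-ticks instead of [[...]] links.
/**
 * Returns a new `DataFrame` that replaces null values in numeric columns.
 * (Writing [[DataFrame]] instead would break the Java 8 javadoc build.)
 */
def fill(value: Long): DataFrame = fill(value, df.columns)
```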
Test build #69231 has finished for PR 15994 at commit
Test build #69236 has finished for PR 15994 at commit
Test build #69237 has finished for PR 15994 at commit
Test build #69238 has finished for PR 15994 at commit
"fill1" isn't a good name. "fillInternal" maybe?
There is a `fill0` function in the class; are there some naming rules I can refer to?
Yeah, that's a poor choice. I'd even suggest renaming it, while you rename this new method too.
What about changing `fill1` to `fillValue`, and changing `private def fill0(values: Seq[(String, Any)])` to `private def fillMap(values: Seq[(String, Any)])`?
I think this may be in 2.2.0 in the end
OK, thanks.
Test build #69249 has finished for PR 15994 at commit
Test build #69442 has finished for PR 15994 at commit
Hi @windpiger, it seems there are conflicts here. Would you try to rebase this, please?
Force-pushed 4c9f3a0 to 7e799fc (compare), squashing the fix-up commits:

…n long integers; exclude mima na.fill method type; modify the comment; fix a style; fix a style; add long type na.fill and add a template func fill[T]; remove na.fill from MimaExcludes; fix style; optimize and simplify some code; remove boolean type logic; remove integer/float type; fix style; fix a Java document; modify 2.1.0 to 2.2.0; rename fill0 and fill1 in DataFrameNaFunctions
Test build #69471 has finished for PR 15994 at commit
Test build #69653 has finished for PR 15994 at commit
@srowen fixed it.
srowen left a comment
That looks reasonable to me though I'd prefer to have @rxin give a final OK
Thanks - merging in master.
@windpiger you should use a proper email on your GitHub commit next time. It is currently shown as:
OK, thanks!
…in long integers
## What changes were proposed in this pull request?
When `DataSet.na.fill(0)` is used on a DataSet that has a long-valued column, it will change the original long values.
The reason is that the type of fill's parameter is Double, so the numeric columns are always cast to double (`fillCol[Double](f, value)`):
```
def fill(value: Double, cols: Seq[String]): DataFrame = {
  val columnEquals = df.sparkSession.sessionState.analyzer.resolver
  val projections = df.schema.fields.map { f =>
    // Only fill if the column is part of the cols list.
    if (f.dataType.isInstanceOf[NumericType] && cols.exists(col => columnEquals(f.name, col))) {
      fillCol[Double](f, value)
    } else {
      df.col(f.name)
    }
  }
  df.select(projections : _*)
}
```
For example:
```
scala> val df = Seq[(Long, Long)]((1, 2), (-1, -2), (9123146099426677101L, 9123146560113991650L)).toDF("a", "b")
df: org.apache.spark.sql.DataFrame = [a: bigint, b: bigint]
scala> df.show
+-------------------+-------------------+
| a| b|
+-------------------+-------------------+
| 1| 2|
| -1| -2|
|9123146099426677101|9123146560113991650|
+-------------------+-------------------+
scala> df.na.fill(0).show
+-------------------+-------------------+
| a| b|
+-------------------+-------------------+
| 1| 2|
| -1| -2|
|9123146099426676736|9123146560113991680|
+-------------------+-------------------+
```
The original values changed, which is not the expected result:
```
9123146099426677101 -> 9123146099426676736
9123146560113991650 -> 9123146560113991680
```
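The rounding is exactly Double's 53-bit mantissa at work; a plain Scala sketch, independent of Spark, reproduces it:
```
// Longs above 2^53 are not all exactly representable as Double, so a
// round-trip through Double silently rounds to the nearest representable value.
val original = 9123146099426677101L
val roundTripped = original.toDouble.toLong
println(roundTripped)               // 9123146099426676736
assert(roundTripped != original)
```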
## How was this patch tested?
unit test added.
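Such a test might look roughly like this; the test name and shape are hypothetical, loosely following Spark's QueryTest style, and `checkAnswer`, `Row`, and the implicits for `$"..."` are assumed to be in scope:
```
// Hypothetical regression test: filling nulls must not disturb
// existing long values that Double cannot represent exactly.
test("SPARK-18555: na.fill should not change non-null long values") {
  val df = Seq[(java.lang.Long, java.lang.Long)](
    (1L, 2L), (null, -2L), (9123146099426677101L, 9123146560113991650L)
  ).toDF("a", "b")
  checkAnswer(
    df.na.fill(0).filter($"a" === 9123146099426677101L),
    Row(9123146099426677101L, 9123146560113991650L) :: Nil)
}
```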
Author: root <root@iZbp1gsnrlfzjxh82cz80vZ.(none)>
Closes apache#15994 from windpiger/nafillMissupOriginalValue.
…teger when the default value is in double

## What changes were proposed in this pull request?

This bug was partially addressed in SPARK-18555 #15994, but the root cause isn't completely solved. This bug is pretty critical since it changes the member id in Long in our application if the member id cannot be represented by Double losslessly when the member id is very big.

Here is an example of how this happens. With

```
Seq[(java.lang.Long, java.lang.Double)]((null, 3.14), (9123146099426677101L, null),
  (9123146560113991650L, 1.6), (null, null)).toDF("a", "b").na.fill(0.2)
```

the logical plan will be

```
== Analyzed Logical Plan ==
a: bigint, b: double
Project [cast(coalesce(cast(a#232L as double), cast(0.2 as double)) as bigint) AS a#240L, cast(coalesce(nanvl(b#233, cast(null as double)), 0.2) as double) AS b#241]
+- Project [_1#229L AS a#232L, _2#230 AS b#233]
   +- LocalRelation [_1#229L, _2#230]
```

Note that even when the value is not null, Spark will cast the Long into Double first; then, if it's not null, Spark will cast it back to Long, which loses precision. The original value should not be changed when it is not null, but Spark changes it, which is wrong.

With the PR, the logical plan will be

```
== Analyzed Logical Plan ==
a: bigint, b: double
Project [coalesce(a#232L, cast(0.2 as bigint)) AS a#240L, coalesce(nanvl(b#233, cast(null as double)), cast(0.2 as double)) AS b#241]
+- Project [_1#229L AS a#232L, _2#230 AS b#233]
   +- LocalRelation [_1#229L, _2#230]
```

which behaves correctly without changing the original Long values and also avoids the extra cost of unnecessary casting.

## How was this patch tested?

unit test added.

+cc srowen rxin cloud-fan gatorsmile Thanks.

Author: DB Tsai <[email protected]>

Closes #17577 from dbtsai/fixnafill.