[SPARK-26950][SQL][TEST] Make RandomDataGenerator use Float.NaN or Double.NaN for all NaN values #23851

dongjoon-hyun · 2019-02-21T07:39:05Z

What changes were proposed in this pull request?

Apache Spark uses the predefined Float.NaN and Double.NaN for NaN values, but there exist more NaN values with different binary representations.

scala> java.nio.ByteBuffer.allocate(4).putFloat(Float.NaN).array
res1: Array[Byte] = Array(127, -64, 0, 0)

scala> val x = java.lang.Float.intBitsToFloat(-6966608)
x: Float = NaN

scala> java.nio.ByteBuffer.allocate(4).putFloat(x).array
res2: Array[Byte] = Array(-1, -107, -78, -80)

Since users can have these values, RandomDataGenerator generates these NaN values. However, this causes checkEvaluationWithUnsafeProjection failures because the UnsafeRow binary representations differ even though both values are NaN. The following is an instance of the UT failure. This PR aims to fix this UT flakiness.

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/102528/testReport/
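As a rough illustration of the idea in the title, the sketch below maps every NaN bit pattern to the predefined constant before the value is used. The object name CanonicalNaN and its methods are hypothetical names chosen for this example; this is not the actual RandomDataGenerator change, only a minimal sketch assuming the generator draws random bit patterns via intBitsToFloat and longBitsToDouble.

```scala
import java.lang.{Double => JDouble, Float => JFloat}
import scala.util.Random

// Hypothetical helper for illustration; not part of Spark.
object CanonicalNaN {
  // Map any NaN bit pattern to the predefined Float.NaN; other values pass through.
  def canonicalizeFloat(f: Float): Float = if (f.isNaN) Float.NaN else f

  // Same idea for doubles.
  def canonicalizeDouble(d: Double): Double = if (d.isNaN) Double.NaN else d

  // Draw a random 32-bit pattern, then canonicalize it if it happens to be a NaN.
  def randomFloat(rand: Random): Float =
    canonicalizeFloat(JFloat.intBitsToFloat(rand.nextInt()))

  // Draw a random 64-bit pattern, then canonicalize it if it happens to be a NaN.
  def randomDouble(rand: Random): Double =
    canonicalizeDouble(JDouble.longBitsToDouble(rand.nextLong()))

  // Raw bits, without the NaN collapsing that floatToIntBits performs.
  def rawBits(f: Float): Int = JFloat.floatToRawIntBits(f)
}
```

With this in place, a byte-wise comparison sees the same four bytes for every generated NaN, which is the property the flaky test depends on.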

How was this patch tested?

Pass the Jenkins with the newly added test cases.

dongjoon-hyun · 2019-02-21T07:41:30Z

cc @dbtsai , @cloud-fan , @gatorsmile , @HyukjinKwon

SparkQA · 2019-02-21T08:05:02Z

Test build #102576 has finished for PR 23851 at commit 5444a62.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2019-02-21T08:35:56Z

Retest this please.

dongjoon-hyun · 2019-02-21T09:40:51Z

Retest this please.

cloud-fan · 2019-02-21T12:46:50Z

it seems the better fix is to wrap expressions with NormalizeNaNAndZero in checkEvaluationWithUnsafeProjection.

SparkQA · 2019-02-21T14:26:58Z

Test build #102579 has finished for PR 23851 at commit 5444a62.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2019-02-21T16:16:38Z

Huh! I didn't realize there were many representations of NaN.
https://en.wikipedia.org/wiki/IEEE_754-1985#NaN

This seems OK.

However this code caught my attention; it seems to be trying to generate floats 'uniformly' with intBitsToFloat(r.nextInt()) . That's not uniform; small values are way more likely. That may be desirable or not matter. If you feel like it, maybe change randomNumeric's uniformRand argument to not say 'uniform'.

dongjoon-hyun · 2019-02-21T18:08:51Z

Thank you for the review, @cloud-fan and @srowen .

To @cloud-fan .
checkEvaluationWithUnsafeProjection should handle more complex expressions like from_avro(to_avro([NaN]), {"type":"record","name":"topLevelRecord","fields":[{"name":"col_1","type":["float","null"]}]}) described in the PR description. However,

NormalizeNaNAndZero expects and handles Float and Double instances only.
NormalizeFloatingNumbers expects a Plan.

We can wrap (transform) the first argument, expression, but the second argument, expected, is of type Any.

  protected def checkEvaluationWithUnsafeProjection(
      expression: Expression,
      expected: Any,
      inputRow: InternalRow = EmptyRow): Unit = {
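To make the difficulty with the Any-typed argument concrete, here is a hypothetical sketch of what normalizing an expected value would involve. NormalizeExpected is an invented name, and the helper only understands Float, Double, and Seq; Spark's own value types (for example InternalRow or ArrayData) would each need their own case, which is an assumption about such a helper and not code from this PR.

```scala
// Hypothetical helper for illustration; not part of Spark.
object NormalizeExpected {
  // Recursively replace every NaN inside a nested value with the predefined NaN.
  // Values of any type not listed here are returned unchanged.
  def normalize(value: Any): Any = value match {
    case f: Float if f.isNaN  => Float.NaN
    case d: Double if d.isNaN => Double.NaN
    case s: Seq[_]            => s.map(normalize)
    case other                => other
  }
}
```

Every additional container type adds another case, so normalizing at the generator, as this PR does, keeps the test helper itself unchanged.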

dongjoon-hyun · 2019-02-21T19:37:22Z

Hi, @srowen . For the uniformRand argument, the name looks like an initial aspiration of randomNumeric rather than a requirement. If we can provide some uniform function instead of intBitsToFloat, that would be better. For now, I'd like to keep the argument name~

srowen · 2019-02-21T19:51:07Z

That's fine, it's not a big deal. I think the intent is to choose from all possible float values with equal probability, which isn't uniform over its range, but, 'uniform' over all possible values of a float.
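The distinction between "uniform over bit patterns" and "uniform over the numeric range" can be checked empirically. The sketch below is illustrative only (BitPatternDistribution and fraction are hypothetical names); it samples random 32-bit patterns the same way intBitsToFloat(r.nextInt()) does and measures how many land in a given set. Since 127 of the 256 exponent values encode magnitudes below 1, just under half of all patterns should satisfy |x| < 1, and roughly 1 in 256 patterns should be a NaN.

```scala
import java.lang.{Float => JFloat}
import scala.util.Random

// Hypothetical helper for illustration; not part of Spark.
object BitPatternDistribution {
  // Fraction of `samples` random 32-bit patterns whose float value satisfies `p`.
  def fraction(samples: Int, seed: Long)(p: Float => Boolean): Double = {
    val rand = new Random(seed)
    val hits = Iterator.fill(samples)(JFloat.intBitsToFloat(rand.nextInt())).count(p)
    hits.toDouble / samples
  }

  def main(args: Array[String]): Unit = {
    val n = 1000000
    // Share of sampled values with magnitude below 1.
    println(f"|x| < 1 : ${fraction(n, 42L)(x => math.abs(x) < 1.0f)}%.4f")
    // Share of sampled values that are NaN (any bit pattern).
    println(f"NaN     : ${fraction(n, 42L)(_.isNaN)}%.4f")
  }
}
```

A generator that was uniform over the numeric range of Float would almost never produce a value below 1 in magnitude, so the two notions of "uniform" differ sharply.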

SparkQA · 2019-02-21T22:54:09Z

Test build #102586 has finished for PR 23851 at commit 5444a62.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

…uble.NaN for all NaN values

Closes #23851 from dongjoon-hyun/SPARK-26950.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit ffef3d4)
Signed-off-by: Wenchen Fan <[email protected]>

cloud-fan · 2019-02-22T04:27:33Z

thanks, merging to master/2.4!

srowen · 2019-02-22T04:28:32Z

Oh ha I also tried to merge it just now and got weird errors. That's why.
I have no idea why it shows I pushed to my fork?
srowen@ffef3d4

dongjoon-hyun · 2019-02-22T06:03:23Z

Thank you, @cloud-fan , @srowen , @maropu .

…uble.NaN for all NaN values

Closes #23851 from dongjoon-hyun/SPARK-26950.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit ffef3d4)
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit ef67be3)
Signed-off-by: Dongjoon Hyun <[email protected]>

dongjoon-hyun · 2019-02-22T21:55:25Z

To prevent flakiness, I merged this to branch-2.3, too.

[SPARK-26950][SQL][TEST] Make RandomDataGenerator use Float.NaN or Double.NaN for all NaN values

5444a62

srowen approved these changes Feb 22, 2019

maropu approved these changes Feb 22, 2019

cloud-fan closed this in ffef3d4 Feb 22, 2019

dongjoon-hyun deleted the SPARK-26950 branch February 22, 2019 06:03

dongjoon-hyun mentioned this pull request Dec 4, 2019

[SPARK-30009][CORE][SQL][FOLLOWUP] Remove OrderingUtil and Utils.nanSafeCompare{Doubles,Floats} and use java.lang.{Double,Float}.compare directly #26761

Closed

5 participants
