Skip to content

Conversation

@dongjoon-hyun
Copy link
Member

@dongjoon-hyun dongjoon-hyun commented Feb 21, 2019

What changes were proposed in this pull request?

Apache Spark uses the predefined Float.NaN and Double.NaN for NaN values, but there exists more NaN values with different binary presentations.

scala> java.nio.ByteBuffer.allocate(4).putFloat(Float.NaN).array
res1: Array[Byte] = Array(127, -64, 0, 0)

scala> val x = java.lang.Float.intBitsToFloat(-6966608)
x: Float = NaN

scala> java.nio.ByteBuffer.allocate(4).putFloat(x).array
res2: Array[Byte] = Array(-1, -107, -78, -80)

Since users can have these values, RandomDataGenerator generates these NaN values. However, this causes checkEvaluationWithUnsafeProjection failures due to the difference between UnsafeRow binary presentation. The following is the UT failure instance. This PR aims to fix this UT flakiness.

How was this patch tested?

Pass the Jenkins with the newly added test cases.

@dongjoon-hyun
Copy link
Member Author

cc @dbtsai , @cloud-fan , @gatorsmile , @HyukjinKwon

@SparkQA
Copy link

SparkQA commented Feb 21, 2019

Test build #102576 has finished for PR 23851 at commit 5444a62.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member Author

Retest this please.

1 similar comment
@dongjoon-hyun
Copy link
Member Author

Retest this please.

@cloud-fan
Copy link
Contributor

it seems the better fix is to wrap expressions with NormalizeNaNAndZero in checkEvaluationWithUnsafeProjection.

@SparkQA
Copy link

SparkQA commented Feb 21, 2019

Test build #102579 has finished for PR 23851 at commit 5444a62.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented Feb 21, 2019

Huh! I didn't realize there were many representations of NaN.
https://en.wikipedia.org/wiki/IEEE_754-1985#NaN

This seems OK.

However this code caught my attention; it seems to be trying to generate floats 'uniformly' with intBitsToFloat(r.nextInt()) . That's not uniform; small values are way more likely. That may be desirable or not matter. If you feel like it, maybe change randomNumeric's uniformRand argument to not say 'uniform'.

@dongjoon-hyun
Copy link
Member Author

dongjoon-hyun commented Feb 21, 2019

Thank you for review, @cloud-fan and @srowen .

To @cloud-fan .
checkEvaluationWithUnsafeProjection should handle more complex expressions like from_avro(to_avro([NaN]), {"type":"record","name":"topLevelRecord","fields":[{"name":"col_1","type":["float","null"]}]}) described in the PR description. However,

  1. NormalizeNaNAndZero expects and handles Float and Double instances only.
  2. NormalizeFloatingNumbers expects Plan.

We can wrap (transform) the first argument expression, but the second argument expected is Any type.

  protected def checkEvaluationWithUnsafeProjection(
      expression: Expression,
      expected: Any,
      inputRow: InternalRow = EmptyRow): Unit = {

@dongjoon-hyun
Copy link
Member Author

Hi, @srowen . For uniformRand argument, it looks like randomNumeric's initial aspiration instead of its requirement. If we can provide some uniform function instead of intBitsToFloat, that would be better. For now, I'd like to keep the argument name~

@srowen
Copy link
Member

srowen commented Feb 21, 2019

That's fine, it's not a big deal. I think the intent is to choose from all possible float values with equal probability, which isn't uniform over its range, but, 'uniform' over all possible values of a float.

@SparkQA
Copy link

SparkQA commented Feb 21, 2019

Test build #102586 has finished for PR 23851 at commit 5444a62.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

cloud-fan pushed a commit that referenced this pull request Feb 22, 2019
…uble.NaN for all NaN values

## What changes were proposed in this pull request?

Apache Spark uses the predefined `Float.NaN` and `Double.NaN` for NaN values, but there exists more NaN values with different binary presentations.

```scala
scala> java.nio.ByteBuffer.allocate(4).putFloat(Float.NaN).array
res1: Array[Byte] = Array(127, -64, 0, 0)

scala> val x = java.lang.Float.intBitsToFloat(-6966608)
x: Float = NaN

scala> java.nio.ByteBuffer.allocate(4).putFloat(x).array
res2: Array[Byte] = Array(-1, -107, -78, -80)
```

Since users can have these values, `RandomDataGenerator` generates these NaN values. However, this causes `checkEvaluationWithUnsafeProjection` failures due to the difference between `UnsafeRow` binary presentation. The following is the UT failure instance. This PR aims to fix this UT flakiness.

- https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/102528/testReport/

## How was this patch tested?

Pass the Jenkins with the newly added test cases.

Closes #23851 from dongjoon-hyun/SPARK-26950.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit ffef3d4)
Signed-off-by: Wenchen Fan <[email protected]>
@cloud-fan
Copy link
Contributor

thanks, merging to master/2.4!

@srowen
Copy link
Member

srowen commented Feb 22, 2019

Oh ha I also tried to merge it just now and got weird errors. That's why.
I have no idea why it shows I pushed to my fork?
srowen@ffef3d4

@cloud-fan cloud-fan closed this in ffef3d4 Feb 22, 2019
@dongjoon-hyun
Copy link
Member Author

Thank you, @cloud-fan , @srowen , @maropu .

@dongjoon-hyun dongjoon-hyun deleted the SPARK-26950 branch February 22, 2019 06:03
dongjoon-hyun added a commit that referenced this pull request Feb 22, 2019
…uble.NaN for all NaN values

## What changes were proposed in this pull request?

Apache Spark uses the predefined `Float.NaN` and `Double.NaN` for NaN values, but there exists more NaN values with different binary presentations.

```scala
scala> java.nio.ByteBuffer.allocate(4).putFloat(Float.NaN).array
res1: Array[Byte] = Array(127, -64, 0, 0)

scala> val x = java.lang.Float.intBitsToFloat(-6966608)
x: Float = NaN

scala> java.nio.ByteBuffer.allocate(4).putFloat(x).array
res2: Array[Byte] = Array(-1, -107, -78, -80)
```

Since users can have these values, `RandomDataGenerator` generates these NaN values. However, this causes `checkEvaluationWithUnsafeProjection` failures due to the difference between `UnsafeRow` binary presentation. The following is the UT failure instance. This PR aims to fix this UT flakiness.

- https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/102528/testReport/

## How was this patch tested?

Pass the Jenkins with the newly added test cases.

Closes #23851 from dongjoon-hyun/SPARK-26950.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit ffef3d4)
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit ef67be3)
Signed-off-by: Dongjoon Hyun <[email protected]>
@dongjoon-hyun
Copy link
Member Author

To prevent flakiness, I merged this to branch-2.3, too.

kai-chi pushed a commit to kai-chi/spark that referenced this pull request Jul 23, 2019
…uble.NaN for all NaN values

## What changes were proposed in this pull request?

Apache Spark uses the predefined `Float.NaN` and `Double.NaN` for NaN values, but there exists more NaN values with different binary presentations.

```scala
scala> java.nio.ByteBuffer.allocate(4).putFloat(Float.NaN).array
res1: Array[Byte] = Array(127, -64, 0, 0)

scala> val x = java.lang.Float.intBitsToFloat(-6966608)
x: Float = NaN

scala> java.nio.ByteBuffer.allocate(4).putFloat(x).array
res2: Array[Byte] = Array(-1, -107, -78, -80)
```

Since users can have these values, `RandomDataGenerator` generates these NaN values. However, this causes `checkEvaluationWithUnsafeProjection` failures due to the difference between `UnsafeRow` binary presentation. The following is the UT failure instance. This PR aims to fix this UT flakiness.

- https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/102528/testReport/

## How was this patch tested?

Pass the Jenkins with the newly added test cases.

Closes apache#23851 from dongjoon-hyun/SPARK-26950.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit ffef3d4)
Signed-off-by: Wenchen Fan <[email protected]>
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Jul 25, 2019
…uble.NaN for all NaN values

## What changes were proposed in this pull request?

Apache Spark uses the predefined `Float.NaN` and `Double.NaN` for NaN values, but there exists more NaN values with different binary presentations.

```scala
scala> java.nio.ByteBuffer.allocate(4).putFloat(Float.NaN).array
res1: Array[Byte] = Array(127, -64, 0, 0)

scala> val x = java.lang.Float.intBitsToFloat(-6966608)
x: Float = NaN

scala> java.nio.ByteBuffer.allocate(4).putFloat(x).array
res2: Array[Byte] = Array(-1, -107, -78, -80)
```

Since users can have these values, `RandomDataGenerator` generates these NaN values. However, this causes `checkEvaluationWithUnsafeProjection` failures due to the difference between `UnsafeRow` binary presentation. The following is the UT failure instance. This PR aims to fix this UT flakiness.

- https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/102528/testReport/

## How was this patch tested?

Pass the Jenkins with the newly added test cases.

Closes apache#23851 from dongjoon-hyun/SPARK-26950.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit ffef3d4)
Signed-off-by: Wenchen Fan <[email protected]>
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Aug 1, 2019
…uble.NaN for all NaN values

## What changes were proposed in this pull request?

Apache Spark uses the predefined `Float.NaN` and `Double.NaN` for NaN values, but there exists more NaN values with different binary presentations.

```scala
scala> java.nio.ByteBuffer.allocate(4).putFloat(Float.NaN).array
res1: Array[Byte] = Array(127, -64, 0, 0)

scala> val x = java.lang.Float.intBitsToFloat(-6966608)
x: Float = NaN

scala> java.nio.ByteBuffer.allocate(4).putFloat(x).array
res2: Array[Byte] = Array(-1, -107, -78, -80)
```

Since users can have these values, `RandomDataGenerator` generates these NaN values. However, this causes `checkEvaluationWithUnsafeProjection` failures due to the difference between `UnsafeRow` binary presentation. The following is the UT failure instance. This PR aims to fix this UT flakiness.

- https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/102528/testReport/

## How was this patch tested?

Pass the Jenkins with the newly added test cases.

Closes apache#23851 from dongjoon-hyun/SPARK-26950.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit ffef3d4)
Signed-off-by: Wenchen Fan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants