
Conversation

@HyukjinKwon
Member

@HyukjinKwon HyukjinKwon commented May 12, 2017

What changes were proposed in this pull request?

This PR is based on #16199 and extracts the valid change from #9759 to resolve SPARK-18772.

This avoids an additional conversion attempt with toFloat and toDouble.

To see how the additional conversions are avoided, please refer to the examples below:

Before

scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> spark.read.schema(StructType(Seq(StructField("a", DoubleType)))).option("mode", "FAILFAST").json(Seq("""{"a": "nan"}""").toDS).show()
17/05/12 11:30:41 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
java.lang.NumberFormatException: For input string: "nan"
...

After

scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> spark.read.schema(StructType(Seq(StructField("a", DoubleType)))).option("mode", "FAILFAST").json(Seq("""{"a": "nan"}""").toDS).show()
17/05/12 11:44:30 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.RuntimeException: Cannot parse nan as DoubleType.
...

How was this patch tested?

Unit tests added in JsonSuite.

Closes #16199

@HyukjinKwon
Member Author

cc @NathanHowell, @cloud-fan and @viirya.

(I just want to note that this will not change any input/output, only the exception type, and avoids the additional conversion attempt.)

Member Author

#9759 (comment)

"infinity".toDouble, "inf".toDouble are not legal. These non-numeric numbers are case-sensitive, both for Jackson and Scala.

Contributor

how about

parser.getText match {
  case "NaN" => Float.NaN
  case "+INF" | "INF" | "+Infinity" | "Infinity" => Float.PositiveInfinity
  case "-INF" | "-Infinity" => Float.NegativeInfinity
  case other => throw new RuntimeException(s"Cannot parse $other as FloatType.")
}

@cloud-fan
Contributor

LGTM

@SparkQA

SparkQA commented May 12, 2017

Test build #76841 has finished for PR 17956 at commit 660a284.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

Shall we also test other negative cases?

@viirya
Member

viirya commented May 12, 2017

LGTM except for a minor comment.

@SparkQA

SparkQA commented May 12, 2017

Test build #76846 has finished for PR 17956 at commit aa7c658.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

What is difference between line 1995 and line 1997?

Member

Do we want to test +INF?

Member Author

Yes ...

Member

This sounds like a bug fix, because toFloat is case-sensitive?

Member Author

For input/output, it is not a bug. Please see #17956 (comment) and the PR description.

Member

@gatorsmile gatorsmile May 12, 2017

Not sure whether we should follow it. How about our CSV parsers? How about Pandas? Are they case-sensitive?

Member Author

Is the JSON format itself related to CSV and Pandas? Please see the discussion in #9759 (comment).

Also, this does not change the existing behaviour.

Member

If possible, we want to be consistent with the others. Could you check Pandas?

Member Author

@HyukjinKwon HyukjinKwon May 12, 2017

Let me check and open a JIRA/PR if there are some cases we should handle, when I have some time. Let's leave that out here. It does not sound like it blocks this PR, and I don't want extra changes to hold off this PR like #17901.

Member

This PR changes the behavior, right? Does your test case pass without the code changes?

Member Author

Okay, this adds some cases. I am still not convinced that we should take care of every other case. Let's add the options nanValue, positiveInf and negativeInf in a separate PR, like our CSV datasource.

Member

Without this PR, +INF, INF and -INF do not work.

Member Author

Yea, I added some cases here. I will open another PR/JIRA to add those options to the JSON-related functionalities.

@SparkQA

SparkQA commented May 12, 2017

Test build #76851 has started for PR 17956 at commit e858789.

@gatorsmile
Member

This PR description is misleading. This PR is actually a bug fix, I think.

Member

I do not understand which standards we follow. Are the above seven values accepted by toFloat and toDouble?

Member Author

This follows Jackson + Scala. #9759 (comment)

Member

Below is from JsonParser.Feature ALLOW_NON_NUMERIC_NUMBERS

"INF" (for positive infinity), as well as alias of "Infinity"
"-INF" (for negative infinity), alias "-Infinity"
"NaN" (for other not-a-numbers, like result of division by zero)

Then, this PR also adds +INF and +Infinity.

What is the Scala standard?

Member

Just let us assume we want to follow JsonParser.Feature ALLOW_NON_NUMERIC_NUMBERS: does our JSON option flag allowNonNumericNumbers work?

If we turn it off, these test cases should fail, right?

Member

I think we also can address https://issues.apache.org/jira/browse/SPARK-11753 in this PR, right? Simply accepting all these non-numeric numbers without respecting our flag allowNonNumericNumbers does not sound right to me.
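As a point of comparison (a minimal sketch, not Spark code): Python's json module exposes an analogous switch. It accepts the bare NaN/Infinity/-Infinity tokens by default, and a parse_constant hook can be used to reject them, which is roughly the behavior allowNonNumericNumbers is meant to toggle:

```python
import json
import math

# Python's json accepts these bare non-numeric tokens by default,
# analogous to Jackson's ALLOW_NON_NUMERIC_NUMBERS being enabled.
assert math.isnan(json.loads('{"a": NaN}')["a"])
assert json.loads('{"a": Infinity}')["a"] == float("inf")
assert json.loads('{"a": -Infinity}')["a"] == float("-inf")

# Rejecting them (the flag turned "off") can be emulated via parse_constant,
# which is invoked for exactly these three tokens:
def reject(token):
    raise ValueError("non-numeric number not allowed: " + token)

try:
    json.loads('{"a": NaN}', parse_constant=reject)
except ValueError as e:
    print(e)  # non-numeric number not allowed: NaN
```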

@HyukjinKwon
Member Author

retest this please

@HyukjinKwon HyukjinKwon changed the title [SPARK-18772][SQL] Avoid unnecessary conversion try for special floats in JSON and add related tests [SPARK-18772][SQL] Avoid unnecessary conversion try and add missing cases for special floats in JSON May 12, 2017
@HyukjinKwon
Member Author

HyukjinKwon commented May 12, 2017

@gatorsmile, I added some more cases I could identify based on http://docs.oracle.com/javase/7/docs/api/java/lang/Double.html#valueOf(java.lang.String) (this looks like it calls java.lang.Double.parseDouble, which toDouble calls).

Also, to be sure, I made this fall back to toDouble or toFloat and throw the same exceptions.

@gatorsmile
Member

Could we update the JIRA, PR title, and PR description?

Also check the JSON flag allowNonNumericNumbers when doing the conversion?

value.toFloat
} else {
throw new RuntimeException(s"Cannot parse $value as FloatType.")
parser.getText match {
Contributor

one last question is, should we be case sensitive here? can we check other systems like pandas?

Member Author

@HyukjinKwon HyukjinKwon May 12, 2017

To my knowledge, at least Python supports some case-insensitive cases. I would rather leave this to be handled via options (discussed in #17956 (comment)), treating this as a variant. At least, I think we can say here that it follows the Scala + Jackson standard.
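As a quick check of the claim about Python (an illustrative sketch, not part of this PR): Python's built-in float() parses the special tokens case-insensitively, with an optional sign, unlike Scala's toDouble / Java's Double.parseDouble, which only accept the exact-case "NaN" and "Infinity" forms:

```python
import math

# Python's float() accepts the special tokens in any case,
# with an optional sign.
assert math.isnan(float("nan"))
assert math.isnan(float("NaN"))
assert math.isnan(float("NAN"))
assert float("inf") == float("Infinity") == float("INFINITY")
assert float("-Inf") == float("-infinity")
```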

Member

It seems to me that we didn't really support case insensitivity before. Although we check the lowercased input string, we actually call toFloat on the original string, and "nan".toFloat will get java.lang.NumberFormatException: For input string: "nan".

Member

It seems to me that NaN, Infinity and INF are special tokens, and case-insensitive matching does not really make sense.

Member

Actually, here, we are not trying to educate users/customers about what is right or not. We are trying to improve usability. If the other popular platforms accept some formats, we need to follow them even if they seem strange to us.

Member Author

I think we should not add all the variants to the default behaviour. We could add options for those cases. We all already have different ideas. Is it worth adding those cases right now in this PR? This is not the end of fixing this code path.

@HyukjinKwon
Member Author

We should upgrade Jackson accordingly to support that option correctly. This looks reverted due to a dependency problem - #9759 (comment). It sounds like a bigger problem to me. (Also, strictly, it looks more related to parsing itself.) We already support the special floating-point cases, and it might be fine just to improve this case.

I updated the PR title and description already. I will fix the JIRA as well (I hesitated because it was not opened by me).

@gatorsmile
Member

If we do not upgrade Jackson, is that possible we can do it by checking the flag allowNonNumericNumbers before doing the conversion?

@gatorsmile
Member

uh... That is different... This is String. That is Float. nvm. They are different issues.

@HyukjinKwon
Member Author

Yes, thanks.

"""{"a": "-NaN"}""",
"""{"a": "Infinity"}""",
"""{"a": "+Infinity"}""",
"""{"a": "-Infinity"}""")
Member

I did not check it, but does """{"a": -Infinity}""" fail?

Member Author

It looks like -Infinity succeeds:

scala> "-Infinity".toDouble
res0: Double = -Infinity

scala> "-Infinity".toFloat
res1: Float = -Infinity

Member

It sounds like Jackson only supports the following four cases.

    val jsons = Seq(
      """{"a": NaN}""",
      """{"a": Infinity}""",
      """{"a": +Infinity}""",
      """{"a": -Infinity}""")

Could you add a separate test case for them? Also add the negative cases too.


@gatorsmile
Member

Do we need to follow allowNonNumericNumbers when parsing """{"a": "-Infinity"}"""? cc @cloud-fan @viirya

@HyukjinKwon
Member Author

I didn't understand what ^ means. Could you elaborate, please? The allowNonNumericNumbers option here is incomplete for now, to my knowledge.

} else {
throw new RuntimeException(s"Cannot parse $value as FloatType.")
parser.getText match {
case "NaN" | "+NaN" | "-NaN" => Float.NaN
Member

I remember that Jackson's JsonParser.Feature.ALLOW_NON_NUMERIC_NUMBERS doesn't support "+NaN" and "-NaN".

Member Author

Yea, but it is supported in Scala. These were added to address #17956 (comment), to say we support the Scala + Jackson standard.

scala> "+NaN".toDouble
res2: Double = NaN

scala> "-NaN".toDouble
res3: Double = NaN

@viirya
Member

viirya commented May 12, 2017

Do we need to follow allowNonNumericNumbers when parsing """{"a": "-Infinity"}"""? cc @cloud-fan @viirya

allowNonNumericNumbers supports it and Scala supports it too. It seems to me that we should follow them.

Btw, we supported it before this change too.

@SparkQA

SparkQA commented May 12, 2017

Test build #76857 has finished for PR 17956 at commit d581467.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 12, 2017

Test build #76858 has finished for PR 17956 at commit fa63ff4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

gatorsmile commented May 12, 2017

Before this PR, the option allowNonNumericNumbers affects the results of the following four cases:

{"a": NaN}
{"a": Infinity}
{"a": +Infinity}
{"a": -Infinity}

We should make this flag cover the following cases too.

{"a": "+INF"}
{"a": "INF"}
{"a": "-INF"}
{"a": "NaN"}
{"a": "+NaN"}
{"a": "-NaN"}
{"a": "Infinity"}
{"a": "+Infinity"}
{"a": "-Infinity"}

@HyukjinKwon
Member Author

allowNonNumericNumbers option is undocumented and incomplete for now - #9724 (comment)

allowNonNumericNumbers is undocumented for now, since I can't figure out how it works.

@gatorsmile
Member

This PR also introduces the following behavior changes.

    def floatRecords: Dataset[String] =
      spark.createDataset(spark.sparkContext.parallelize(
        """{"f": "18.00"}""" ::
          """{"f": "18.30"}""" ::
          """{"f": "20.00"}""" :: Nil))(Encoders.STRING)

    val jsonDF = spark.read.schema("f float").json(floatRecords)

Before this PR, the output is like

+----+
|   f|
+----+
|null|
|null|
|null|
+----+

After this PR, the output is like

+-----+
|    f|
+-----+
|18.00|
|18.30|
|20.00|
+-----+

@HyukjinKwon
Member Author

Yea, I am aware of it; since we support the special floating-point cases in strings, I think we'd expect other doubles could be in strings too?

@HyukjinKwon
Member Author

Hey @gatorsmile, how about picking up the commits here and opening a PR?

@gatorsmile
Member

We really need to be careful when making the code changes. If we support such a feature, we need to support all the data types. This is unexpected to users.

@HyukjinKwon
Member Author

WDYT? I think it'd be faster to get this merged.

@gatorsmile
Member

Let me share my understanding.

First, we do not have a deadline for any non-urgent fixes. This PR is an example. How many JSON writers write a float "INF" as a string?

Being a committer means making a judgement about how critical a PR is. My personal judgement is that we do not need to rush to merge it. I even want to wait to understand how the other platforms handle such a string-represented "INF" in a float-type column. I believe we are not the first platform to face such an issue. That is why @cloud-fan and I asked the question above.

Second, we have to be careful to avoid introducing bugs when fixing an old issue. Spark is an infrastructure open-source project. We have to be very careful when doing code reviews and code changes. If we accidentally introduce some support, it is painful to remove it, and thus we may have to keep it forever. @viirya just found this very recently. So please be careful when you do code reviews and code changes.

If users really need such a fix, they can also fix it and make a build by themselves. I believe that is why people like open-source projects.

At the end, I want to share with you a slide about bug costs. You might have already seen a similar figure before.

[screenshot: slide about bug costs, 2017-05-12]

@HyukjinKwon
Member Author

HyukjinKwon commented May 13, 2017

  1. It is not good to hold off PRs. If a PR looks good and coherent, I think we could merge it.

  2. "How many JSON writers write a float "INF" as a string?" - if that is your only worry, I will take those cases out here. This is why I think we should add such cases via options.

  3. My question was to ask about taking this over. Please answer that question.

  4. The other information looks unrelated to the topic of this PR.

@gatorsmile
Member

Based on my previous work experience, I am very conservative about introducing any behavior change unless we are sure it is the right thing to do. I do not think this is a high-priority PR. I can put it on my to-do list. When I am free, I can work on it.

To prove the significance of this change, we need to check whether the other projects/products have the same or similar behavior. We do not want to introduce anything new or different. This is my concern.

Below is the list we need to check.

{"a": "+INF"}
{"a": "INF"}
{"a": "-INF"}
{"a": "NaN"}
{"a": "+NaN"}
{"a": "-NaN"}
{"a": "Infinity"}
{"a": "+Infinity"}
{"a": "-Infinity"}

Note, the above is different from the following JSON strings.

{"a": NaN}
{"a": Infinity}
{"a": +Infinity}
{"a": -Infinity}

In my previous comment, I just wanted to share something for the people who want to become committers in Spark. It is not related to this PR.
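The distinction above, between a bare JSON token and a quoted string, shows up in any JSON parser; a quick illustration in Python (not Spark code):

```python
import json
import math

# A bare NaN token parses to a float (when the parser allows
# non-numeric numbers, as Python's json does by default) ...
assert math.isnan(json.loads('{"a": NaN}')["a"])

# ... while a quoted "NaN" is just an ordinary JSON string; turning it
# into a float is a separate, schema-driven conversion step, which is
# the code path this PR touches.
assert json.loads('{"a": "NaN"}')["a"] == "NaN"
```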

@gatorsmile
Member

BTW, if we want to support JSON strings like {"a": "-Infinity"} as the FLOAT type, I think we should also support float data in a string. Below is an example.

    def floatRecords: Dataset[String] =
      spark.createDataset(spark.sparkContext.parallelize(
        """{"f": "18.00"}""" ::
          """{"f": "18.30"}""" ::
          """{"f": "20.00"}""" :: Nil))(Encoders.STRING)
    val jsonDF = spark.read.schema("f float").json(floatRecords)

If we want to support parsing float data in a string, we should also support the other data types. Thus, this is another to-do issue; we should do an investigation.

@HyukjinKwon
Member Author

Yea, let's focus on the topic. For the cases below:

{"a": NaN}
{"a": Infinity}
{"a": +Infinity}
{"a": -Infinity}

They are related to allowNonNumericNumbers, and the related tests are already being ignored.

Okay. So the point is the behaviour change, right? Let me take a shot at removing the behaviour changes here.

@HyukjinKwon
Member Author

Given the existing logic below:

if (lowerCaseValue.equals("nan") ||
  lowerCaseValue.equals("infinity") ||
  lowerCaseValue.equals("-infinity") ||
  lowerCaseValue.equals("inf") ||
  lowerCaseValue.equals("-inf")) {
  value.toFloat
}

it looks like we have tried all the combinations (lowercase and uppercase) of "nan", "infinity", "-infinity", "inf" and "-inf" so far.

import itertools
items = ["nan", "infinity", "-infinity", "inf", "-inf"]
comb = []
for item in items:
   comb.extend(map(''.join, itertools.product(*zip(item.upper(), item.lower()))))

print("Set(%s)" % ", ".join(map(lambda c: '"%s"' % c, comb)))

This code generates the combinations. (I find Python convenient for dealing with strings.)

val s = Set("NAN", "NAn", ... "-inf")
s.foreach { w =>
  try {
    w.toDouble // or w.toFloat
    println(w)
  } catch {
    case _: Throwable => // pass
  }
}

It looks like we have supported the cases below so far:

NaN
-Infinity
Infinity

Let me update this.
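The conclusion above can be mirrored with a small sketch (illustrative Python, not Spark code): among the generated case combinations, only the exact-case forms "NaN", "Infinity" and "-Infinity" are also parsed by Scala's toDouble, so only those survive the intersection:

```python
# Exact-case special tokens accepted by Scala's toDouble / Java's
# Double.parseDouble (leaving aside the explicitly signed variants
# like "+NaN" discussed earlier in the thread).
SPECIAL = {"NaN", "Infinity", "-Infinity"}

# A few of the lowercase/uppercase combinations generated above.
candidates = ["nan", "NaN", "NAN", "infinity", "Infinity",
              "-Infinity", "-INFINITY", "inf", "INF", "-inf"]

accepted = [c for c in candidates if c in SPECIAL]
print(accepted)  # ['NaN', 'Infinity', '-Infinity']
```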

@HyukjinKwon HyukjinKwon changed the title [SPARK-18772][SQL] Avoid unnecessary conversion try and add missing cases for special floats in JSON [SPARK-18772][SQL] Avoid unnecessary conversion try for special floats in JSON May 13, 2017
@gatorsmile
Member

We are handling the JSON data sources here. Are the following inputs widely used?

{"a": "+INF"}
{"a": "INF"}
{"a": "-INF"}
{"a": "NaN"}
{"a": "+NaN"}
{"a": "-NaN"}
{"a": "Infinity"}
{"a": "+Infinity"}
{"a": "-Infinity"}

@gatorsmile
Member

Note, we do not support the following cases

    def floatRecords: Dataset[String] =
      spark.createDataset(spark.sparkContext.parallelize(
        """{"f": "18.00"}""" ::
          """{"f": "18.30"}""" ::
          """{"f": "20.00"}""" :: Nil))(Encoders.STRING)
    val jsonDF = spark.read.schema("f float").json(floatRecords)

Basically, we do not assume users use strings in double columns for JSON data sources.

@HyukjinKwon
Member Author

Unfortunately, we already support

"NaN"
"-Infinity"
"Infinity"

Now, this PR primarily targets avoiding the unnecessary conversion attempt.

@HyukjinKwon
Member Author

@gatorsmile, @cloud-fan and @viirya, could you take another look please? I tried to get rid of all the behaviour changes existing in both previous PRs and left only the change that avoids the conversion.

I tried to generate combinations of the existing supported forms - #17956 (comment) - and extracted three cases, "NaN", "-Infinity" and "Infinity", which I believe we have supported so far.

@SparkQA

SparkQA commented May 13, 2017

Test build #76894 has finished for PR 17956 at commit 90330bc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request May 13, 2017
…s in JSON

## What changes were proposed in this pull request?

This PR is based on  #16199 and extracts the valid change from #9759 to resolve SPARK-18772

This avoids additional conversion try with `toFloat` and `toDouble`.

For avoiding additional conversions, please refer the codes below:

**Before**

```scala
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> spark.read.schema(StructType(Seq(StructField("a", DoubleType)))).option("mode", "FAILFAST").json(Seq("""{"a": "nan"}""").toDS).show()
17/05/12 11:30:41 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
java.lang.NumberFormatException: For input string: "nan"
...
```

**After**

```scala
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> spark.read.schema(StructType(Seq(StructField("a", DoubleType)))).option("mode", "FAILFAST").json(Seq("""{"a": "nan"}""").toDS).show()
17/05/12 11:44:30 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.RuntimeException: Cannot parse nan as DoubleType.
...
```

## How was this patch tested?

Unit tests added in `JsonSuite`.

Closes #16199

Author: hyukjinkwon <[email protected]>
Author: Nathan Howell <[email protected]>

Closes #17956 from HyukjinKwon/SPARK-18772.

(cherry picked from commit 3f98375)
Signed-off-by: Wenchen Fan <[email protected]>
@asfgit asfgit closed this in 3f98375 May 13, 2017
@cloud-fan
Contributor

thanks, merging to master/2.2!

I think this change is pretty safe, we can discuss 2 things later:

  1. if we want to support more special strings like Inf
  2. if we want to make it case-insensitive

@HyukjinKwon
Member Author

Thank you everybody sincerely.

robert3005 pushed a commit to palantir/spark that referenced this pull request May 19, 2017
liyichao pushed a commit to liyichao/spark that referenced this pull request May 24, 2017
@HyukjinKwon HyukjinKwon deleted the SPARK-18772 branch January 2, 2018 03:42