[SPARK-18772][SQL] Avoid unnecessary conversion try for special floats in JSON #17956

Conversation
cc @NathanHowell, @cloud-fan and @viirya. (I just want to note this will not change any input/output, but just the exception type, and avoids an additional conversion try.)
"infinity".toDouble, "inf".toDouble are not legal. These non-numeric numbers are case-sensitive, both for Jackson and Scala.
how about

```scala
parser.getText match {
  case "NaN" => Float.NaN
  case "+INF" | "INF" | "+Infinity" | "Infinity" => Float.PositiveInfinity
  case "-INF" | "-Infinity" => Float.NegativeInfinity
  case other => throw new RuntimeException(s"Cannot parse $other as FloatType.")
}
```
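Presumably the DoubleType branch would mirror this; a sketch (not taken from the PR diff):

```scala
parser.getText match {
  case "NaN" => Double.NaN
  case "+INF" | "INF" | "+Infinity" | "Infinity" => Double.PositiveInfinity
  case "-INF" | "-Infinity" => Double.NegativeInfinity
  case other => throw new RuntimeException(s"Cannot parse $other as DoubleType.")
}
```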
LGTM
Test build #76841 has finished for PR 17956 at commit
Shall we also test other negative cases?
LGTM except for a minor comment.
Test build #76846 has finished for PR 17956 at commit
What is the difference between line 1995 and line 1997?
Do we want to test +INF?
Yes ...
This sounds like a bug fix, because toFloat is case-sensitive?
For input/output, it is not a bug. Please see #17956 (comment) and the PR description.
Not sure whether we should follow it. How about our CSV parsers? How about Pandas? Are they case sensitive?
Is the JSON format itself related to CSV and Pandas? Please see the discussion in #9759 (comment).
Also, this does not change the existing behaviour.
If possible, we want to be consistent with the others. Could you check Pandas?
Let me check and open a JIRA/PR when I have some time, if there are cases we should handle. Let's leave that out here. It sounds like it does not block this PR, and I don't want extra changes to hold off this PR like #17901.
This PR changes the behavior, right? Does your test case pass without the code changes?
Okay, this adds some cases. I am still not convinced that we should take care of every other case. Let's add options nanValue, positiveInf and negativeInf in a separate PR, like our CSV datasource.
Without this PR, +INF, INF and -INF do not work.
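A quick REPL check of why: the fallback conversion calls toFloat on the original text, and these spellings are rejected:

```scala
scala> "INF".toFloat
java.lang.NumberFormatException: For input string: "INF"

scala> "+INF".toFloat
java.lang.NumberFormatException: For input string: "+INF"
```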
Yea, I added some cases here. I will open another PR/JIRA to add those options to the JSON-related functionalities.
Test build #76851 has started for PR 17956 at commit
This PR description is misleading. This PR is actually a bug fix, I think.
I do not understand which standards we follow. Are the above seven values accepted by toFloat and toDouble?
This follows Jackson + Scala. #9759 (comment)
Below is from JsonParser.Feature.ALLOW_NON_NUMERIC_NUMBERS:

- "INF" (for positive infinity), as well as alias of "Infinity"
- "-INF" (for negative infinity), alias "-Infinity"
- "NaN" (for other not-a-numbers, like result of division by zero)

Then, this PR also adds +INF and +Infinity.

What is the Scala standard?
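For context, enabling that Jackson feature directly looks roughly like this (a minimal sketch; the payload and the token walk are illustrative):

```scala
import com.fasterxml.jackson.core.{JsonFactory, JsonParser}

val factory = new JsonFactory()
factory.enable(JsonParser.Feature.ALLOW_NON_NUMERIC_NUMBERS)

val parser = factory.createParser("""{"a": NaN}""")
parser.nextToken() // START_OBJECT
parser.nextToken() // FIELD_NAME "a"
parser.nextToken() // the non-numeric number token
println(parser.getDoubleValue) // NaN
```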
Let us assume we want to follow JsonParser.Feature.ALLOW_NON_NUMERIC_NUMBERS. Does our JSON option flag allowNonNumericNumbers work?
If we turn it off, these test cases should fail, right?
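A sketch of what that check could look like (the data is from this thread; toDS needs spark.implicits._ in scope):

```scala
import spark.implicits._

// With allowNonNumericNumbers disabled, Jackson should reject the unquoted NaN,
// so this read is expected to fail (or produce null in permissive mode).
spark.read
  .schema("a double")
  .option("allowNonNumericNumbers", "false")
  .json(Seq("""{"a": NaN}""").toDS)
  .show()
```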
I think we also can address https://issues.apache.org/jira/browse/SPARK-11753 in this PR, right? Simply accepting all these non-numeric numbers without respecting our flag allowNonNumericNumbers does not sound right to me.
retest this please
@gatorsmile, I added some more cases I could identify based on http://docs.oracle.com/javase/7/docs/api/java/lang/Double.html#valueOf(java.lang.String) (this looks like it is calling ...). And also, for sure, I made this fall back to ...
Could we update the JIRA, PR title, and PR description? Also check the JSON flag allowNonNumericNumbers.
```diff
-      value.toFloat
-    } else {
-      throw new RuntimeException(s"Cannot parse $value as FloatType.")
+    parser.getText match {
```
One last question: should we be case-sensitive here? Can we check other systems like Pandas?
To my knowledge, at least Python supports some case-insensitive cases. I would rather leave this to work with options (discussed in #17956 (comment)), treating this as a variant. At least, I think we can say here that it follows the Scala + Jackson standard.
It seems to me that we didn't really support case insensitivity before. Although we check the lowercase of the input string, we actually call toFloat on the original string, and "nan".toFloat will get java.lang.NumberFormatException: For input string: "nan".
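Reconstructed from the diff context quoted above, the pre-PR logic had roughly this shape (a sketch, not the verbatim source):

```scala
val value = parser.getText
val lowerCaseValue = value.toLowerCase
if (lowerCaseValue.equals("nan") ||
    lowerCaseValue.equals("infinity") ||
    lowerCaseValue.equals("-infinity") ||
    lowerCaseValue.equals("inf") ||
    lowerCaseValue.equals("-inf")) {
  // The conversion runs on the original, case-sensitive text,
  // so "nan" passes the check above but still throws here.
  value.toFloat
} else {
  throw new RuntimeException(s"Cannot parse $value as FloatType.")
}
```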
It seems to me that NaN, Infinity and INF are special, and case insensitivity does not really make sense for them.
Actually, here we are not trying to educate users/customers about what is right or not. We are trying to improve usability. If the other popular platforms accept some formats, we need to follow them even if they seem strange to us.
I think we should not add all the variants to the default behaviour. We could add options for those cases. We already have different ideas. Is it worth adding those cases right now in this PR? This is not the end of fixing this code path.
We should upgrade Jackson accordingly to support that option correctly. This looks reverted due to a dependency problem - #9759 (comment). It sounds like a bigger problem to me. (Also, strictly, it looks rather related to parsing itself.) We already support the special floating cases, and it might be fine just to improve this case. I updated the PR title and description already. I will fix the JIRA as well (I hesitated because it was not opened by me).
If we do not upgrade Jackson ...
Uh... that is different... This is String; that is Float. nvm. They are different issues.
Yes, thanks.
| """{"a": "-NaN"}""", | ||
| """{"a": "Infinity"}""", | ||
| """{"a": "+Infinity"}""", | ||
| """{"a": "-Infinity"}""") |
I did not check it, but does """{"a": -Infinity}""" fail?
It looks like -Infinity succeeds:

```scala
scala> "-Infinity".toDouble
res0: Double = -Infinity

scala> "-Infinity".toFloat
res1: Float = -Infinity
```
It sounds like Jackson only supports the following four cases.

```scala
val jsons = Seq(
  """{"a": NaN}""",
  """{"a": Infinity}""",
  """{"a": +Infinity}""",
  """{"a": -Infinity}""")
```

Could you add a separate test case for them? Also add the negative cases too.
See #17956 (comment)
Do we need to follow ...
I didn't understand what ^ means. Could you elaborate please?
```diff
-    } else {
-      throw new RuntimeException(s"Cannot parse $value as FloatType.")
+    parser.getText match {
+      case "NaN" | "+NaN" | "-NaN" => Float.NaN
```
I remember that Jackson's JsonParser.Feature.ALLOW_NON_NUMERIC_NUMBERS doesn't support "+NaN" and "-NaN".
Yea, but it is supported in Scala. These were added to address #17956 (comment), to say we support the Scala + Jackson standard.

```scala
scala> "+NaN".toDouble
res2: Double = NaN

scala> "-NaN".toDouble
res3: Double = NaN
```

Btw, we supported it before this change too.
Test build #76857 has finished for PR 17956 at commit
Test build #76858 has finished for PR 17956 at commit
Before this PR, the option allowNonNumericNumbers only covers the following cases:

```
{"a": NaN}
{"a": Infinity}
{"a": +Infinity}
{"a": -Infinity}
```

We should also enable this flag for the following cases too:

```
{"a": "+INF"}
{"a": "INF"}
{"a": "-INF"}
{"a": "NaN"}
{"a": "+NaN"}
{"a": "-NaN"}
{"a": "Infinity"}
{"a": "+Infinity"}
{"a": "-Infinity"}
```
This PR also introduces the following behavior changes.

```scala
def floatRecords: Dataset[String] =
  spark.createDataset(spark.sparkContext.parallelize(
    """{"f": "18.00"}""" ::
    """{"f": "18.30"}""" ::
    """{"f": "20.00"}""" :: Nil))(Encoders.STRING)

val jsonDF = spark.read.schema("f float").json(floatRecords)
```

Before this PR, the output is like ... After this PR, the output is like ...
Yea, I am aware of it, because we support the special floating cases in strings, and I think we'd expect other doubles could be in strings too?
Hey @gatorsmile, how about picking up the commits here and opening a PR?
We really need to be careful when making code changes. If we support such a feature, we need to support it for all the data types. This is unexpected to users.
WDYT? I think it'd be faster to get this merged.
Let me share my understanding.

First, we do not have a deadline for any non-urgent fixes. This PR is an example. How many JSON writers write a float "INF" as a string? Being a committer requires making a judgement about how critical a PR is. My personal judgement is that we do not need to rush to merge it. I even want to wait until we understand how the other platforms handle such a string-represented "INF" in a float-type column. I believe we are not the first platform to face such an issue. That is why @cloud-fan and I asked the question above.

Second, we have to be careful to avoid introducing bugs when fixing old issues. Spark is an infrastructure open-source project. We have to be very careful when doing code reviews and code changes. If we accidentally introduce some support, it is painful to remove, and thus we may have to keep it forever. @viirya just found this very recently. So please be careful when you do code reviews and code changes. If users really need such a fix, they can also fix it and make a build by themselves. I believe that is why people like open source projects.

At the end, I want to share a slide deck about bug costs. You might have already seen a similar figure before.
Based on my previous work experience, I am very conservative about introducing any behavior change unless we are sure it is the right thing to do. I do not think this is a high-priority PR. I can put it in my to-do list; when I am free, I can work on it. To prove the significance of this change, we need to check whether the other projects/products have the same or similar behavior. We do not want to introduce anything new or different. This is my concern. Below is the list we need to check: ...

Note, the above is different from the following JSON strings: ...

In my previous comment, I just wanted to share something for the people who want to become a committer in Spark. It is not related to this PR.
BTW, if we want to support JSON strings like the following:

```scala
def floatRecords: Dataset[String] =
  spark.createDataset(spark.sparkContext.parallelize(
    """{"f": "18.00"}""" ::
    """{"f": "18.30"}""" ::
    """{"f": "20.00"}""" :: Nil))(Encoders.STRING)

val jsonDF = spark.read.schema("f float").json(floatRecords)
```

If we want to support parsing float data in a string, we should also support the other data types. Thus, this is another to-do issue; we should do an investigation.
Yea, let's focus on the topic. For the cases below: ... They are related to ... Okay, so the point is the behaviour change, right? Let me give it a shot to remove the behaviour changes here.
Given the existing logic below: ...

It looks like we have to try all the combinations (lowercase and uppercase) of ...

```python
import itertools

items = ["nan", "infinity", "-infinity", "inf", "-inf"]
comb = []
for item in items:
    comb.extend(map(''.join, itertools.product(*zip(item.upper(), item.lower()))))
print("Set(%s)" % ", ".join(map(lambda c: '"%s"' % c, comb)))
```

This code is to generate the combinations. (I find Python convenient for dealing with strings.)

```scala
val s = Set("NAN", "NAn", ... "-inf")
s.foreach { w =>
  try {
    w.toDouble // or w.toFloat
    println(w)
  } catch {
    case _ => // pass
  }
}
```

It looks like we have supported the cases below so far: ...

Let me update this.
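For reference, running the generated set through the Scala check above should leave exactly the spellings that java.lang.Double.parseDouble accepts; a minimal confirmation (assuming those semantics):

```scala
// Of all the case combinations of nan/infinity/-infinity/inf/-inf,
// only these three parse successfully with toDouble (and toFloat):
Seq("NaN", "Infinity", "-Infinity").foreach { w =>
  println(s"$w -> ${w.toDouble}")
}
```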
We are handling the JSON data sources here. Are the following inputs widely used? ...
Note, we do not support the following cases: ... Basically, we do not assume users use strings in double columns for JSON data sources.
Unfortunately, we already support ... Now, this PR primarily targets avoiding the unnecessary conversion try.
@gatorsmile, @cloud-fan and @viirya, could you take another look please? I tried to get rid of all the behaviour changes existing in both previous PRs and only leave the change that avoids the conversion. I tried to generate combinations of the existing supports - #17956 (comment) - and extracted the three cases "NaN", "-Infinity" and "Infinity", which I believe we have supported so far.
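Per the comments above, the kept behaviour amounts to roughly this (a sketch with a hypothetical helper name, not the verbatim merged diff):

```scala
// parseFloat is a hypothetical stand-in for the FloatType branch of the parser.
def parseFloat(text: String): Float = text match {
  case "NaN" => Float.NaN
  case "Infinity" => Float.PositiveInfinity
  case "-Infinity" => Float.NegativeInfinity
  case other => throw new RuntimeException(s"Cannot parse $other as FloatType.")
}
```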
Test build #76894 has finished for PR 17956 at commit
…s in JSON

## What changes were proposed in this pull request?

This PR is based on #16199 and extracts the valid change from #9759 to resolve SPARK-18772

This avoids additional conversion try with `toFloat` and `toDouble`.

For avoiding additional conversions, please refer the codes below:

**Before**

```scala
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> spark.read.schema(StructType(Seq(StructField("a", DoubleType)))).option("mode", "FAILFAST").json(Seq("""{"a": "nan"}""").toDS).show()
17/05/12 11:30:41 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
java.lang.NumberFormatException: For input string: "nan"
...
```

**After**

```scala
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> spark.read.schema(StructType(Seq(StructField("a", DoubleType)))).option("mode", "FAILFAST").json(Seq("""{"a": "nan"}""").toDS).show()
17/05/12 11:44:30 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.RuntimeException: Cannot parse nan as DoubleType.
...
```

## How was this patch tested?

Unit tests added in `JsonSuite`.

Closes #16199

Author: hyukjinkwon <[email protected]>
Author: Nathan Howell <[email protected]>

Closes #17956 from HyukjinKwon/SPARK-18772.

(cherry picked from commit 3f98375)
Signed-off-by: Wenchen Fan <[email protected]>
thanks, merging to master/2.2! I think this change is pretty safe; we can discuss 2 things later: ...
Thank you everybody sincerely.