
Conversation

@MaxGekk (Member) commented Nov 24, 2018

What changes were proposed in this pull request?

In this PR, I propose using the `locale` option to parse (and infer) decimals from JSON input. After the changes, `JacksonParser` converts the input string to `BigDecimal` and then to Spark's `Decimal` by using `java.text.DecimalFormat`. The new behavior is enabled only for locales other than `en-US`.
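A minimal, self-contained sketch of the parsing technique described above (illustrative only, not the PR's actual `JacksonParser` code):

```scala
import java.text.{DecimalFormat, DecimalFormatSymbols, ParsePosition}
import java.util.Locale

// DecimalFormat with parseBigDecimal enabled returns java.math.BigDecimal,
// so no precision is lost when converting to Spark's Decimal afterwards.
val format = new DecimalFormat("", new DecimalFormatSymbols(Locale.forLanguageTag("de-DE")))
format.setParseBigDecimal(true)

// In de-DE, ',' is the decimal separator.
val parsed = format.parse("1000,01", new ParsePosition(0)).asInstanceOf[java.math.BigDecimal]
// parsed == new java.math.BigDecimal("1000.01")
```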

How was this patch tested?

Added two tests to `JsonExpressionsSuite` covering the `en-US`, `ko-KR`, `ru-RU`, and `de-DE` locales (a usage sketch follows below):

  • Inferring the decimal type from JSON field values using a locale
  • Converting JSON field values to a specified decimal type using the locales
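For context, end-to-end usage of the `locale` option might look like the following sketch (the sample data, schema, and expected output are illustrative assumptions, not taken from the PR's tests):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DecimalType, StructField, StructType}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// One JSON record with a German-formatted decimal (',' as the decimal separator).
val input = Seq("""{"price": "1000,01"}""").toDS()

val df = spark.read
  .schema(StructType(Seq(StructField("price", DecimalType(10, 2)))))
  .option("locale", "de-DE")
  .json(input)

df.show()  // expected single row: 1000.01
```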

@SparkQA commented Nov 25, 2018

Test build #99232 has finished for PR 23132 at commit 83920b2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 27, 2018

Test build #99318 has finished for PR 23132 at commit 72ebd34.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk (Member, Author) commented Nov 27, 2018

@dongjoon-hyun @cloud-fan Please take a look at this PR.


## Upgrading from Spark SQL 2.4 to 3.0

- In Spark version 2.4 and earlier, the accepted format of a decimal parsed from JSON was an optional sign ('+' or '-'), followed by a sequence of zero or more decimal digits, optionally followed by a fraction, optionally followed by an exponent; any commas were removed from the input before parsing. Since Spark 3.0, the format varies with the locale, which can be set via the JSON option `locale`. The default locale is `en-US`. To switch back to the previous behavior, set `spark.sql.legacy.decimalParsing.enabled` to `true`.
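To make the difference concrete, here is a rough model of why the locale-aware path matters (this mimics the behavior described above, not Spark's actual code):

```scala
// Pre-3.0 behavior, approximately: strip commas, then parse what remains
// as a plain en-US-style decimal.
def legacyParse(s: String): BigDecimal = BigDecimal(s.replace(",", ""))

legacyParse("1,000.01")  // 1000.01 — en-US input parses as intended
legacyParse("1.000,01")  // 1.00001 — de-DE input is silently misread
```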
Member:

Is there any performance regression with `DecimalFormat`?

Contributor:

I have the same question. Do we need `DecimalFormat` when the locale is `en-US`?

Member (Author):

> Is there any performance regression with `DecimalFormat`?

I haven't benchmarked the changes. Looking at `JSONBenchmarks`, we don't even cover decimals there.

> Do we need `DecimalFormat` when the locale is `en-US`?

No, we don't need it.

@cloud-fan (Contributor) commented Nov 28, 2018:

Since the default value is `en-US`, can we skip `DecimalFormat` when the locale is `en-US`? Then nothing changes by default, and we don't even need a legacy config.

Member (Author):

I removed the SQL config and the record in the migration guide. I also applied `DecimalFormat` only to locales other than `en-US`.
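A hedged sketch of the fast-path idea agreed on above (the factory name and structure are hypothetical, not the PR's actual code):

```scala
import java.text.{DecimalFormat, DecimalFormatSymbols, ParsePosition}
import java.util.Locale

// Keep the cheap path for the default locale; pay for DecimalFormat only otherwise.
def makeDecimalParser(locale: Locale): String => java.math.BigDecimal = {
  if (locale == Locale.US) {
    // Default locale: BigDecimal's own constructor already accepts
    // en-US-style input, so DecimalFormat can be skipped entirely.
    (s: String) => new java.math.BigDecimal(s)
  } else {
    val format = new DecimalFormat("", new DecimalFormatSymbols(locale))
    format.setParseBigDecimal(true)
    (s: String) =>
      format.parse(s, new ParsePosition(0)).asInstanceOf[java.math.BigDecimal]
  }
}
```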

@SparkQA commented Nov 28, 2018

Test build #99396 has finished for PR 23132 at commit 1ec56e5.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk (Member, Author) commented Nov 28, 2018

jenkins, retest this, please

@cloud-fan (Contributor):

LGTM, does CSV need to do the same?

@SparkQA commented Nov 29, 2018

Test build #99417 has finished for PR 23132 at commit 1ec56e5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk (Member, Author) commented Nov 29, 2018

@cloud-fan I did the same for CSV: #22979

@cloud-fan (Contributor):

thanks, merging to master!

@asfgit closed this in 7a83d71 on Nov 29, 2018
@gatorsmile (Member):

`spark.sql.legacy.decimalParsing.enabled` is still shown in the PR description and commit messages.

@HyukjinKwon (Member):

@MaxGekk, mind fixing the PR description accordingly?

@MaxGekk (Member, Author) commented Dec 10, 2018

> mind fixing the PR description accordingly?

@HyukjinKwon Fixed.

jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request on Feb 18, 2019:
## What changes were proposed in this pull request?

In this PR, I propose using the `locale` option to parse (and infer) decimals from JSON input. After the changes, `JacksonParser` converts the input string to `BigDecimal` and then to Spark's Decimal by using `java.text.DecimalFormat`. The new behaviour can be switched off via the SQL config `spark.sql.legacy.decimalParsing.enabled`.

## How was this patch tested?

Added two tests to `JsonExpressionsSuite` covering the `en-US`, `ko-KR`, `ru-RU`, and `de-DE` locales:
- Inferring the decimal type from JSON field values using a locale
- Converting JSON field values to a specified decimal type using the locales

Closes apache#23132 from MaxGekk/json-decimal-parsing-locale.

Lead-authored-by: Maxim Gekk <[email protected]>
Co-authored-by: Maxim Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
@MaxGekk deleted the json-decimal-parsing-locale branch on August 17, 2019 13:33