
Conversation

@MaxGekk (Member) commented Nov 24, 2018

What changes were proposed in this pull request?

In this PR, I propose using the `locale` option to parse (and infer) decimals from JSON input. After the changes, `JacksonParser` converts the input string to `BigDecimal` and then to Spark's `Decimal` by using `java.text.DecimalFormat`. The new behavior is enabled only for locales other than `en-US`.
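A minimal, self-contained sketch of the parsing technique described above (illustrative only, not the PR's actual `JacksonParser` code):

```scala
import java.text.{DecimalFormat, DecimalFormatSymbols, ParsePosition}
import java.util.Locale

// DecimalFormat with parseBigDecimal enabled returns java.math.BigDecimal,
// so no precision is lost when converting to Spark's Decimal afterwards.
val format = new DecimalFormat("", new DecimalFormatSymbols(Locale.forLanguageTag("de-DE")))
format.setParseBigDecimal(true)

// In de-DE, ',' is the decimal separator.
val parsed = format.parse("1000,01", new ParsePosition(0)).asInstanceOf[java.math.BigDecimal]
// parsed == new java.math.BigDecimal("1000.01")
```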

How was this patch tested?

Added two tests to `JsonExpressionsSuite` covering the `en-US`, `ko-KR`, `ru-RU`, and `de-DE` locales (a usage sketch follows below):

  • Inferring the decimal type from JSON field values using a locale
  • Converting JSON field values to a specified decimal type using the locales
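For context, end-to-end usage of the `locale` option might look like the following sketch (the sample data, schema, and expected output are illustrative assumptions, not taken from the PR's tests):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DecimalType, StructField, StructType}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// One JSON record with a German-formatted decimal (',' as the decimal separator).
val input = Seq("""{"price": "1000,01"}""").toDS()

val df = spark.read
  .schema(StructType(Seq(StructField("price", DecimalType(10, 2)))))
  .option("locale", "de-DE")
  .json(input)

df.show()  // expected single row: 1000.01
```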

@SparkQA commented Nov 25, 2018

Test build #99232 has finished for PR 23132 at commit 83920b2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 27, 2018

Test build #99318 has finished for PR 23132 at commit 72ebd34.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk (Member, Author) commented Nov 27, 2018

@dongjoon-hyun @cloud-fan Please take a look at this PR.


## Upgrading from Spark SQL 2.4 to 3.0

- In Spark version 2.4 and earlier, the accepted format of a decimal parsed from JSON was an optional sign ('+' or '-'), followed by a sequence of zero or more decimal digits, optionally followed by a fraction, optionally followed by an exponent; any commas were removed from the input before parsing. Since Spark 3.0, the format varies with the locale, which can be set via the JSON option `locale`. The default locale is `en-US`. To switch back to the previous behavior, set `spark.sql.legacy.decimalParsing.enabled` to `true`.
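To make the difference concrete, here is a rough model of why the locale-aware path matters (this mimics the behavior described above, not Spark's actual code):

```scala
// Pre-3.0 behavior, approximately: strip commas, then parse what remains
// as a plain en-US-style decimal.
def legacyParse(s: String): BigDecimal = BigDecimal(s.replace(",", ""))

legacyParse("1,000.01")  // 1000.01 — en-US input parses as intended
legacyParse("1.000,01")  // 1.00001 — de-DE input is silently misread
```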
Member:

Is there any performance regression with `DecimalFormat`?

Contributor:

I have the same question. Do we need `DecimalFormat` when the locale is `en-US`?

Member (Author):

> Is there any performance regression with `DecimalFormat`?

I haven't benchmarked the changes. Looking at `JSONBenchmarks`, we don't even cover decimals there.

> Do we need `DecimalFormat` when the locale is `en-US`?

No, we don't need it.

@cloud-fan (Contributor) commented Nov 28, 2018:

Since the default value is `en-US`, can we skip `DecimalFormat` when the locale is `en-US`? Then nothing changes by default, and we don't even need a legacy config.

Member (Author):

I removed the SQL config and the record in the migration guide. I also applied `DecimalFormat` only to locales other than `en-US`.
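A hedged sketch of the fast-path idea agreed on above (the factory name and structure are hypothetical, not the PR's actual code):

```scala
import java.text.{DecimalFormat, DecimalFormatSymbols, ParsePosition}
import java.util.Locale

// Keep the cheap path for the default locale; pay for DecimalFormat only otherwise.
def makeDecimalParser(locale: Locale): String => java.math.BigDecimal = {
  if (locale == Locale.US) {
    // Default locale: BigDecimal's own constructor already accepts
    // en-US-style input, so DecimalFormat can be skipped entirely.
    (s: String) => new java.math.BigDecimal(s)
  } else {
    val format = new DecimalFormat("", new DecimalFormatSymbols(locale))
    format.setParseBigDecimal(true)
    (s: String) =>
      format.parse(s, new ParsePosition(0)).asInstanceOf[java.math.BigDecimal]
  }
}
```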

@SparkQA commented Nov 28, 2018

Test build #99396 has finished for PR 23132 at commit 1ec56e5.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk (Member, Author) commented Nov 28, 2018

jenkins, retest this, please

@cloud-fan (Contributor):

LGTM, does CSV need to do the same?

@SparkQA commented Nov 29, 2018

Test build #99417 has finished for PR 23132 at commit 1ec56e5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk (Member, Author) commented Nov 29, 2018

@cloud-fan I did the same for CSV: #22979

@cloud-fan (Contributor):

thanks, merging to master!

@asfgit closed this in 7a83d71 on Nov 29, 2018
@gatorsmile (Member):

`spark.sql.legacy.decimalParsing.enabled` is still shown in the PR description and commit messages.

@HyukjinKwon (Member):

@MaxGekk, mind fixing the PR description accordingly?

@MaxGekk (Member, Author) commented Dec 10, 2018

> mind fixing the PR description accordingly?

@HyukjinKwon Fixed.

jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request on Feb 18, 2019:
## What changes were proposed in this pull request?

In this PR, I propose using the `locale` option to parse (and infer) decimals from JSON input. After the changes, `JacksonParser` converts the input string to `BigDecimal` and then to Spark's Decimal by using `java.text.DecimalFormat`. The new behaviour can be switched off via the SQL config `spark.sql.legacy.decimalParsing.enabled`.

## How was this patch tested?

Added two tests to `JsonExpressionsSuite` covering the `en-US`, `ko-KR`, `ru-RU`, and `de-DE` locales:
- Inferring the decimal type from JSON field values using a locale
- Converting JSON field values to a specified decimal type using the locales

Closes apache#23132 from MaxGekk/json-decimal-parsing-locale.

Lead-authored-by: Maxim Gekk <[email protected]>
Co-authored-by: Maxim Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
@MaxGekk deleted the json-decimal-parsing-locale branch on August 17, 2019 13:33