[SPARK-25977][SQL] Parsing decimals from CSV using locale #22979
Conversation
This PR requires #22951 to support the `locale` option.
Test build #98599 has finished for PR 22979 at commit
Looks good. Will take a closer look.
Test build #98600 has finished for PR 22979 at commit
Also let me leave a cc for @srowen.
# Conflicts:
#	sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CsvExpressionsSuite.scala
Test build #98640 has finished for PR 22979 at commit
retest this please
Test build #98651 has finished for PR 22979 at commit
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CsvExpressionsSuite.scala

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala
```diff
 nullSafeDatum(d, name, nullable, options) { datum =>
-  val value = new BigDecimal(datum.replaceAll(",", ""))
-  Decimal(value, dt.precision, dt.scale)
+  val bigDecimal = decimalParser.parse(datum).asInstanceOf[BigDecimal]
```
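For context, here is a minimal sketch of what a locale-aware `decimalParser` could look like, assuming it is a `java.text.DecimalFormat` configured from the CSV `locale` option (the construction below is illustrative, not the PR's actual code):

```scala
import java.math.BigDecimal
import java.text.{DecimalFormat, NumberFormat}
import java.util.Locale

// NumberFormat.getInstance returns a DecimalFormat for standard locales.
val decimalParser = NumberFormat.getInstance(Locale.GERMANY).asInstanceOf[DecimalFormat]
// Make parse() return java.math.BigDecimal instead of Long/Double.
decimalParser.setParseBigDecimal(true)

// German locale: '.' groups thousands and ',' is the decimal separator.
val bigDecimal = decimalParser.parse("1.000,5").asInstanceOf[BigDecimal]
assert(bigDecimal == new BigDecimal("1000.5"))
```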
@MaxGekk, is it safe to assume this `Number` is a `BigDecimal`? It looks like there are some possibilities that it can return other types.
> is it safe to assume this `Number` is a `BigDecimal`?

I am not absolutely sure that it always returns `BigDecimal`. I found this at https://docs.oracle.com/javase/8/docs/api/java/text/DecimalFormat.html#parse(java.lang.String,java.text.ParsePosition):

> If isParseBigDecimal() is true, values are returned as BigDecimal objects. The values are the ones constructed by BigDecimal.BigDecimal(String) for corresponding strings in locale-independent format. The special cases negative and positive infinity and NaN are returned as Double instances holding the values of the corresponding Double constants.

So `isParseBigDecimal()` returns `true` when `setParseBigDecimal` was called with `true`, as in this PR.
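To illustrate the quoted caveat (a hypothetical check, not from the PR): ordinary input comes back as `BigDecimal`, but the special values are still `Double`, which is exactly what would trigger the cast exception discussed below.

```scala
import java.text.{DecimalFormat, NumberFormat}
import java.util.Locale

val df = NumberFormat.getInstance(Locale.US).asInstanceOf[DecimalFormat]
df.setParseBigDecimal(true)

// Ordinary input is returned as java.math.BigDecimal...
df.parse("1,234.5").getClass  // class java.math.BigDecimal

// ...but the infinity symbol comes back as java.lang.Double, so
// asInstanceOf[java.math.BigDecimal] would throw ClassCastException.
df.parse("\u221e").getClass   // class java.lang.Double
```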
> It looks like there are some possibilities that it can return other types.

In that case we just fail with a cast exception and the record will be handled as a bad record. Or do you want to see a clearer message in the exception?
Ah, right. The previous code will throw an exception anyway, I see. One thing I am a little unsure about is how different the behaviour is. For instance, it looks like the previous one handles sign characters as well (`+` and `-`).

Let me take a closer look. I think I need to.
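A quick illustration of the suspected difference (assumed behaviour, not taken from the PR's tests): `BigDecimal(String)` accepts an explicit plus sign, while a default `DecimalFormat` has no positive prefix and rejects it.

```scala
import java.text.{DecimalFormat, NumberFormat}
import java.util.Locale

// Previous code path: BigDecimal(String) accepts a leading '+'.
new java.math.BigDecimal("+1.5")  // OK: 1.5

// New code path: the default positive prefix is empty, so '+' is
// not consumed and parse(String) throws java.text.ParseException.
val df = NumberFormat.getInstance(Locale.US).asInstanceOf[DecimalFormat]
df.setParseBigDecimal(true)
df.parse("+1.5")                  // throws ParseException
```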
For instance, there was a similar attempt to change the date parsing library (#21363). I already know the difference is quite breaking and the workaround is difficult, as far as I know - so I suggested adding a configuration or fallback for now. Probably we should similarly just document the behaviour change in the migration guide, but I am actually less sure yet even about this. Anyway, I will take another look shortly.
> so I suggested adding a configuration or fallback for now ...

What about a SQL config `spark.sql.legacy.decimalParsing.enabled` with default value `false`?
Sounds good if that's not difficult.
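A rough sketch of how such a fallback could be wired up, assuming the config proposed above; `makeDecimalParser` and `legacyParsingEnabled` are hypothetical names, not the PR's code:

```scala
import java.text.{DecimalFormat, NumberFormat}
import java.util.Locale

// Hypothetical helper: pick the parsing strategy based on the proposed
// spark.sql.legacy.decimalParsing.enabled flag (default false).
def makeDecimalParser(
    locale: Locale,
    legacyParsingEnabled: Boolean): String => java.math.BigDecimal = {
  if (legacyParsingEnabled) {
    // Legacy behaviour: strip commas and rely on BigDecimal(String).
    (s: String) => new java.math.BigDecimal(s.replaceAll(",", ""))
  } else {
    // Locale-aware behaviour proposed in this PR.
    val format = NumberFormat.getInstance(locale).asInstanceOf[DecimalFormat]
    format.setParseBigDecimal(true)
    (s: String) => format.parse(s).asInstanceOf[java.math.BigDecimal]
  }
}
```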
Test build #98702 has finished for PR 22979 at commit

Test build #98735 has finished for PR 22979 at commit

Test build #98737 has finished for PR 22979 at commit

Test build #98738 has finished for PR 22979 at commit
HyukjinKwon left a comment:
Looks good. @MaxGekk, thanks for taking a look at this.
Test build #99185 has finished for PR 22979 at commit

jenkins, retest this, please

Test build #99195 has finished for PR 22979 at commit
No chance to pass tests in the PR ;-)

@HyukjinKwon Could it be related to recent changes in the Python tests?
Force-pushed from 15a09b8 to bab8fb2.
# Conflicts:
#	sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
Test build #99211 has finished for PR 22979 at commit

Test build #99212 has finished for PR 22979 at commit
## What changes were proposed in this pull request?

In the PR, I propose a new option for the CSV datasource - `lineSep`, similar to the Text and JSON datasources. The option allows specifying a custom line separator of maximum length of 2 characters (because of a restriction in the `uniVocity` parser). The new option can be used in reading and writing CSV files.

## How was this patch tested?

Added a few tests with a custom `lineSep` for enabled/disabled `multiLine` in read, as well as tests in write. Also I added roundtrip tests.

Closes #23080 from MaxGekk/csv-line-sep.

Lead-authored-by: Maxim Gekk <[email protected]>
Co-authored-by: Maxim Gekk <[email protected]>
Signed-off-by: hyukjinkwon <[email protected]>
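A usage sketch of the `lineSep` option described in that commit (paths are placeholders, and a `SparkSession` named `spark` is assumed):

```scala
// Read CSV records separated by a custom line separator (max 2 chars).
val df = spark.read
  .option("lineSep", "\r\n")
  .schema("id INT, name STRING")
  .csv("/tmp/input.csv")

// Per the commit message, the option is honoured on write as well.
df.write
  .option("lineSep", "\n")
  .csv("/tmp/output")
```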
Test build #99317 has finished for PR 22979 at commit

jenkins, retest this, please
Test build #99323 has finished for PR 22979 at commit
# Conflicts:
#	docs/sql-migration-guide-upgrade.md

…al-parsing-locale
# Conflicts:
#	docs/sql-migration-guide-upgrade.md
Test build #99403 has finished for PR 22979 at commit

jenkins, retest this, please

Test build #99416 has finished for PR 22979 at commit

Test build #99456 has finished for PR 22979 at commit

Merged to master.
## What changes were proposed in this pull request?

In the PR, I propose using the `locale` option to parse decimals from CSV input. After the changes, `UnivocityParser` converts the input string to `BigDecimal` and to Spark's `Decimal` by using `java.text.DecimalFormat`.

## How was this patch tested?

Added a test for the `en-US`, `ko-KR`, `ru-RU`, `de-DE` locales.

Closes apache#22979 from MaxGekk/decimal-parsing-locale.

Lead-authored-by: Maxim Gekk <[email protected]>
Co-authored-by: Maxim Gekk <[email protected]>
Signed-off-by: hyukjinkwon <[email protected]>
…ce for backward compatibility

## What changes were proposed in this pull request?

The code below currently infers as decimal but previously it was inferred as string. **In branch-2.4**, the type inference path for decimal and the data parsing path are different:

https://github.com/apache/spark/blob/2a8343121e62aabe5c69d1e20fbb2c01e2e520e7/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala#L153

https://github.com/apache/spark/blob/c284c4e1f6f684ca8db1cc446fdcc43b46e3413c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala#L125

So the code below:

```scala
scala> spark.read.option("delimiter", "|").option("inferSchema", "true").csv(Seq("1,2").toDS).printSchema()
```

produced string as its type:

```
root
 |-- _c0: string (nullable = true)
```

**In the current master**, it now infers decimal as below:

```
root
 |-- _c0: decimal(2,0) (nullable = true)
```

It happened after #22979 because, after this PR, we only have one way to parse decimal:

https://github.com/apache/spark/blob/7a83d71403edf7d24fa5efc0ef913f3ce76d88b8/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExprUtils.scala#L92

**After the fix:**

```
root
 |-- _c0: string (nullable = true)
```

This PR proposes to restore the previous behaviour back in `CSVInferSchema`.

## How was this patch tested?

Manually tested and unit tests were added.

Closes #24437 from HyukjinKwon/SPARK-27512.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
## What changes were proposed in this pull request?

In the PR, I propose using the `locale` option to parse decimals from CSV input. After the changes, `UnivocityParser` converts the input string to `BigDecimal` and to Spark's `Decimal` by using `java.text.DecimalFormat`.

## How was this patch tested?

Added a test for the `en-US`, `ko-KR`, `ru-RU`, `de-DE` locales.
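To make the change concrete, a small end-to-end usage sketch (illustrative values; assumes a `SparkSession` named `spark` and `import spark.implicits._`):

```scala
import org.apache.spark.sql.types._

val schema = new StructType().add("price", DecimalType(10, 2))
// German-formatted number: '.' for grouping, ',' for the decimal part.
val input = Seq("1.000,50").toDS()

spark.read
  .schema(schema)
  .option("locale", "de-DE")
  .csv(input)
  .show()
// Expected:
// +-------+
// |  price|
// +-------+
// |1000.50|
// +-------+
```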