
Conversation

@MaxGekk
Member

@MaxGekk MaxGekk commented Nov 8, 2018

What changes were proposed in this pull request?

In the PR, I propose using the locale option to parse decimals from CSV input. After the changes, `UnivocityParser` converts the input string to `BigDecimal` and then to Spark's Decimal by using `java.text.DecimalFormat`.

How was this patch tested?

Added a test for the en-US, ko-KR, ru-RU, de-DE locales.

@MaxGekk
Member Author

MaxGekk commented Nov 8, 2018

This PR requires #22951 to support the locale option properly.

@SparkQA

SparkQA commented Nov 8, 2018

Test build #98599 has finished for PR 22979 at commit f9438c4.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Looks good. Will take a closer look.

@SparkQA

SparkQA commented Nov 8, 2018

Test build #98600 has finished for PR 22979 at commit 3125c23.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Also let me leave a cc for @srowen.

# Conflicts:
#	sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CsvExpressionsSuite.scala
@SparkQA

SparkQA commented Nov 9, 2018

Test build #98640 has finished for PR 22979 at commit 64a97a2.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

retest this please

@SparkQA

SparkQA commented Nov 9, 2018

Test build #98651 has finished for PR 22979 at commit 64a97a2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```diff
 nullSafeDatum(d, name, nullable, options) { datum =>
-  val value = new BigDecimal(datum.replaceAll(",", ""))
-  Decimal(value, dt.precision, dt.scale)
+  val bigDecimal = decimalParser.parse(datum).asInstanceOf[BigDecimal]
```
Member


@MaxGekk, is it safe to assume this Number is a BigDecimal? It looks like there are cases where it can return other types.

Member Author


is it safe to assume this Number is a BigDecimal?

I am not absolutely sure that it always returns BigDecimal. I found this at https://docs.oracle.com/javase/8/docs/api/java/text/DecimalFormat.html#parse(java.lang.String,java.text.ParsePosition) :

If isParseBigDecimal() is true, values are returned as BigDecimal objects. The values are the ones constructed by BigDecimal.BigDecimal(String) for corresponding strings in locale-independent format. The special cases negative and positive infinity and NaN are returned as Double instances holding the values of the corresponding Double constants.

So, isParseBigDecimal() returns true when setParseBigDecimal was called with true, as is done in this PR.

It looks like there are cases where it can return other types.

In that case, we just fail with a cast exception and the record will be handled as a bad record. Or do you want to see a clearer message in the exception?
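The mechanism under discussion can be sketched in plain Java. This is a minimal illustration, not Spark's actual helper: with `setParseBigDecimal(true)`, `DecimalFormat.parse` returns a `BigDecimal` for regular numbers (only the infinity/NaN special cases come back as `Double`), and the locale decides which characters act as grouping and decimal separators.

```java
import java.math.BigDecimal;
import java.text.DecimalFormat;
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;

public class LocaleDecimalParse {
    // Parse a decimal string according to the given locale's separators.
    static BigDecimal parse(String s, Locale locale) {
        DecimalFormat df = (DecimalFormat) NumberFormat.getInstance(locale);
        // Make parse() return BigDecimal instead of Long/Double.
        df.setParseBigDecimal(true);
        try {
            return (BigDecimal) df.parse(s);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Cannot parse decimal: " + s, e);
        }
    }

    public static void main(String[] args) {
        // de-DE uses '.' for grouping and ',' as the decimal separator.
        System.out.println(parse("1.000,01", Locale.GERMANY)); // 1000.01
        // en-US is the other way around.
        System.out.println(parse("1,000.01", Locale.US));      // 1000.01
    }
}
```

The cast to `BigDecimal` is exactly the spot questioned above: it holds for ordinary numeric input, but the infinity/NaN cases documented in the javadoc would make it throw a `ClassCastException`.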

Member


Ah, right. The previous code would throw an exception anyway, I see. One thing I am a little unsure about is how different the behaviour is. For instance, it looks like the previous one handles sign characters (+ and -) as well.

Let me take a closer look. I think I need to.

Member

@HyukjinKwon HyukjinKwon Nov 11, 2018


For instance, there was a similar attempt to change the date parsing library (#21363). I already know that the difference is quite breaking and, as far as I know, the workaround is difficult - so I suggested adding a configuration or fallback for now. Probably we should similarly just document the behaviour change in the migration guide, but I am actually less sure even about this. Anyway, I will take another look shortly.

Member Author


so I suggested adding a configuration or fallback for now ...

What about a SQL config spark.sql.legacy.decimalParsing.enabled with the default value false?
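The fallback idea could be wired up roughly as below. This is a sketch only: the boolean flag stands in for the proposed `spark.sql.legacy.decimalParsing.enabled` config, and the factory method and class name are illustrative, not Spark's actual code. The legacy branch mirrors the old code path that stripped grouping commas before calling the `BigDecimal` constructor.

```java
import java.math.BigDecimal;
import java.text.DecimalFormat;
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;
import java.util.function.Function;

public class DecimalParserFactory {
    // Pick the parsing strategy based on the legacy flag.
    static Function<String, BigDecimal> makeParser(boolean legacy, Locale locale) {
        if (legacy) {
            // Old behaviour: strip grouping commas, then use the BigDecimal constructor.
            return s -> new BigDecimal(s.replaceAll(",", ""));
        }
        // New behaviour: locale-aware parsing via DecimalFormat.
        DecimalFormat df = (DecimalFormat) NumberFormat.getInstance(locale);
        df.setParseBigDecimal(true);
        return s -> {
            try {
                return (BigDecimal) df.parse(s);
            } catch (ParseException e) {
                throw new IllegalArgumentException("Cannot parse decimal: " + s, e);
            }
        };
    }
}
```

With the flag on, users who relied on the old comma-stripping behaviour would keep it; with the flag off (the proposed default), parsing follows the locale option.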

Member


Sounds good if that's not difficult.

@SparkQA

SparkQA commented Nov 11, 2018

Test build #98702 has finished for PR 22979 at commit 3dfce18.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 12, 2018

Test build #98735 has finished for PR 22979 at commit 8c5593e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 12, 2018

Test build #98737 has finished for PR 22979 at commit c28b79f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 12, 2018

Test build #98738 has finished for PR 22979 at commit 1723da2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@HyukjinKwon HyukjinKwon left a comment


Looks good. @MaxGekk, thanks for taking a look at this.

@SparkQA

SparkQA commented Nov 22, 2018

Test build #99185 has finished for PR 22979 at commit 2918d04.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Nov 22, 2018

jenkins, retest this, please

@SparkQA

SparkQA commented Nov 22, 2018

Test build #99195 has finished for PR 22979 at commit 2918d04.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Nov 22, 2018

No chance to pass tests in the PR ;-)

```
test_aggregator (pyspark.sql.tests.test_group.GroupTests) ... #
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x0000000000000080, pid=40070, tid=139648880690944
#
# JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build 1.8.0_60-b27)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  0x0000000000000080
```

@MaxGekk
Member Author

MaxGekk commented Nov 22, 2018

@HyukjinKwon Could it be related to recent changes in python tests?

@MaxGekk MaxGekk force-pushed the decimal-parsing-locale branch from 15a09b8 to bab8fb2 Compare November 23, 2018 07:56
# Conflicts:
#	sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
@SparkQA

SparkQA commented Nov 23, 2018

Test build #99211 has finished for PR 22979 at commit 15a09b8.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class CSVInferSchema(options: CSVOptions) extends Serializable

@SparkQA

SparkQA commented Nov 23, 2018

Test build #99212 has finished for PR 22979 at commit 5236336.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

MaxGekk referenced this pull request Nov 23, 2018
## What changes were proposed in this pull request?

In the PR, I propose a new option for the CSV datasource - `lineSep`, similar to the Text and JSON datasources. The option allows specifying a custom line separator of at most 2 characters (because of a restriction in the `uniVocity` parser). The new option can be used for both reading and writing CSV files.

## How was this patch tested?

Added a few tests with a custom `lineSep` for enabled/disabled `multiLine` in read, as well as tests for write. I also added roundtrip tests.

Closes #23080 from MaxGekk/csv-line-sep.

Lead-authored-by: Maxim Gekk <[email protected]>
Co-authored-by: Maxim Gekk <[email protected]>
Signed-off-by: hyukjinkwon <[email protected]>
@SparkQA

SparkQA commented Nov 27, 2018

Test build #99317 has finished for PR 22979 at commit 0d1a4f0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Nov 27, 2018

jenkins, retest this, please

@SparkQA

SparkQA commented Nov 27, 2018

Test build #99323 has finished for PR 22979 at commit 0d1a4f0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 28, 2018

Test build #99403 has finished for PR 22979 at commit e989b77.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Nov 28, 2018

jenkins, retest this, please

@SparkQA

SparkQA commented Nov 29, 2018

Test build #99416 has finished for PR 22979 at commit e989b77.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 29, 2018

Test build #99456 has finished for PR 22979 at commit 521bd45.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Merged to master.

@asfgit asfgit closed this in f97326b Nov 30, 2018
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
## What changes were proposed in this pull request?

In the PR, I propose using the locale option to parse decimals from CSV input. After the changes, `UnivocityParser` converts the input string to `BigDecimal` and then to Spark's Decimal by using `java.text.DecimalFormat`.

## How was this patch tested?

Added a test for the `en-US`, `ko-KR`, `ru-RU`, `de-DE` locales.

Closes apache#22979 from MaxGekk/decimal-parsing-locale.

Lead-authored-by: Maxim Gekk <[email protected]>
Co-authored-by: Maxim Gekk <[email protected]>
Signed-off-by: hyukjinkwon <[email protected]>
HyukjinKwon added a commit that referenced this pull request Apr 24, 2019
…ce for backward compatibility

## What changes were proposed in this pull request?

The code below currently infers the type as decimal, but previously it was inferred as string.

**In branch-2.4**, the type inference path for decimals and the data parsing path are different.

https://github.com/apache/spark/blob/2a8343121e62aabe5c69d1e20fbb2c01e2e520e7/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala#L153

https://github.com/apache/spark/blob/c284c4e1f6f684ca8db1cc446fdcc43b46e3413c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala#L125

So the code below:

```scala
scala> spark.read.option("delimiter", "|").option("inferSchema", "true").csv(Seq("1,2").toDS).printSchema()
```

produced string as its type.

```
root
 |-- _c0: string (nullable = true)
```

**In the current master**, it now infers decimal as below:

```
root
 |-- _c0: decimal(2,0) (nullable = true)
```

This happened after #22979 because, after that PR, we only have one way to parse decimals:

https://github.com/apache/spark/blob/7a83d71403edf7d24fa5efc0ef913f3ce76d88b8/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExprUtils.scala#L92

**After the fix:**

```
root
 |-- _c0: string (nullable = true)
```

This PR proposes to restore the previous behaviour back in `CSVInferSchema`.
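The root cause of the inference change can be shown in isolation (an illustrative sketch, not Spark code): the `BigDecimal(String)` constructor, used by the old inference path, rejects a comma, while a lenient `DecimalFormat` for `en-US` treats the comma as a grouping separator and parses the value, so the unified path considers `"1,2"` a valid decimal.

```java
import java.math.BigDecimal;
import java.text.DecimalFormat;
import java.text.NumberFormat;
import java.util.Locale;

public class InferenceChange {
    public static void main(String[] args) throws Exception {
        // Old inference path: the constructor rejects the comma.
        try {
            new BigDecimal("1,2");
            System.out.println("constructor: parsed");
        } catch (NumberFormatException e) {
            System.out.println("constructor: rejected"); // this branch runs
        }

        // Unified locale-aware path: lenient parsing accepts the comma
        // as a grouping separator, so the value is considered numeric.
        DecimalFormat df = (DecimalFormat) NumberFormat.getInstance(Locale.US);
        df.setParseBigDecimal(true);
        System.out.println("DecimalFormat: " + df.parse("1,2"));
    }
}
```

Restoring the constructor-based check in `CSVInferSchema`, as this PR does, brings the inference result for such values back to string.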

## How was this patch tested?

Manually tested and unit tests were added.

Closes #24437 from HyukjinKwon/SPARK-27512.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
@MaxGekk MaxGekk deleted the decimal-parsing-locale branch August 17, 2019 13:33