[SPARK-25977][SQL] Parsing decimals from CSV using locale #22979
Conversation
This PR requires #22951 to support the `locale` option.
Test build #98599 has finished for PR 22979 at commit
Looks good. Will take a closer look.
Test build #98600 has finished for PR 22979 at commit
Also let me leave a cc for @srowen.
# Conflicts:
#	sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CsvExpressionsSuite.scala
Test build #98640 has finished for PR 22979 at commit
retest this please
Test build #98651 has finished for PR 22979 at commit
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CsvExpressionsSuite.scala

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala
```diff
 nullSafeDatum(d, name, nullable, options) { datum =>
-  val value = new BigDecimal(datum.replaceAll(",", ""))
-  Decimal(value, dt.precision, dt.scale)
+  val bigDecimal = decimalParser.parse(datum).asInstanceOf[BigDecimal]
```
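For context, here is a minimal sketch of what a locale-aware `decimalParser` could look like, assuming it is a `java.text.DecimalFormat` configured from the CSV `locale` option (the construction below is illustrative, not the PR's actual code):

```scala
import java.math.BigDecimal
import java.text.{DecimalFormat, NumberFormat}
import java.util.Locale

// NumberFormat.getInstance returns a DecimalFormat for standard locales.
val decimalParser = NumberFormat.getInstance(Locale.GERMANY).asInstanceOf[DecimalFormat]
// Make parse() return java.math.BigDecimal instead of Long/Double.
decimalParser.setParseBigDecimal(true)

// German locale: '.' groups thousands and ',' is the decimal separator.
val bigDecimal = decimalParser.parse("1.000,5").asInstanceOf[BigDecimal]
assert(bigDecimal == new BigDecimal("1000.5"))
```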
@MaxGekk, is it safe to assume this `Number` is a `BigDecimal`? It looks like there are some possibilities that it can return other types.
> is it safe to assume this `Number` is a `BigDecimal`?

I am not absolutely sure that it always returns `BigDecimal`. I found this at https://docs.oracle.com/javase/8/docs/api/java/text/DecimalFormat.html#parse(java.lang.String,java.text.ParsePosition):

> If isParseBigDecimal() is true, values are returned as BigDecimal objects. The values are the ones constructed by BigDecimal.BigDecimal(String) for corresponding strings in locale-independent format. The special cases negative and positive infinity and NaN are returned as Double instances holding the values of the corresponding Double constants.

So `isParseBigDecimal()` returns `true` when `setParseBigDecimal` was called with `true`, as in this PR.
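To illustrate the quoted caveat (a hypothetical check, not from the PR): ordinary input comes back as `BigDecimal`, but the special values are still `Double`, which is exactly what would trigger the cast exception discussed below.

```scala
import java.text.{DecimalFormat, NumberFormat}
import java.util.Locale

val df = NumberFormat.getInstance(Locale.US).asInstanceOf[DecimalFormat]
df.setParseBigDecimal(true)

// Ordinary input is returned as java.math.BigDecimal...
df.parse("1,234.5").getClass  // class java.math.BigDecimal

// ...but the infinity symbol comes back as java.lang.Double, so
// asInstanceOf[java.math.BigDecimal] would throw ClassCastException.
df.parse("\u221e").getClass   // class java.lang.Double
```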
> It looks like there are some possibilities that it can return other types.

In that case we just fail with a cast exception and the record will be handled as a bad record. Or do you want to see a clearer message in the exception?
Ah, right. The previous code will throw an exception anyway, I see. One thing I am a little unsure about is how different the behaviour is. For instance, it looks like the previous one handles sign characters as well (`+` and `-`).

Let me take a closer look. I think I need to.
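A quick illustration of the suspected difference (assumed behaviour, not taken from the PR's tests): `BigDecimal(String)` accepts an explicit plus sign, while a default `DecimalFormat` has no positive prefix and rejects it.

```scala
import java.text.{DecimalFormat, NumberFormat}
import java.util.Locale

// Previous code path: BigDecimal(String) accepts a leading '+'.
new java.math.BigDecimal("+1.5")  // OK: 1.5

// New code path: the default positive prefix is empty, so '+' is
// not consumed and parse(String) throws java.text.ParseException.
val df = NumberFormat.getInstance(Locale.US).asInstanceOf[DecimalFormat]
df.setParseBigDecimal(true)
df.parse("+1.5")                  // throws ParseException
```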
For instance, there was a similar attempt to change the date parsing library (#21363). I already know the difference is quite breaking and the workaround is difficult, as far as I know - so I suggested adding a configuration or fallback for now. Probably we should similarly just document the behaviour change in the migration guide, but I am actually less sure yet even about this. Anyway, I will take another look shortly.
> so I suggested adding a configuration or fallback for now ...

What about a SQL config `spark.sql.legacy.decimalParsing.enabled` with default value `false`?
Sounds good if that's not difficult.
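A rough sketch of how such a fallback could be wired up, assuming the config proposed above; `makeDecimalParser` and `legacyParsingEnabled` are hypothetical names, not the PR's code:

```scala
import java.text.{DecimalFormat, NumberFormat}
import java.util.Locale

// Hypothetical helper: pick the parsing strategy based on the proposed
// spark.sql.legacy.decimalParsing.enabled flag (default false).
def makeDecimalParser(
    locale: Locale,
    legacyParsingEnabled: Boolean): String => java.math.BigDecimal = {
  if (legacyParsingEnabled) {
    // Legacy behaviour: strip commas and rely on BigDecimal(String).
    (s: String) => new java.math.BigDecimal(s.replaceAll(",", ""))
  } else {
    // Locale-aware behaviour proposed in this PR.
    val format = NumberFormat.getInstance(locale).asInstanceOf[DecimalFormat]
    format.setParseBigDecimal(true)
    (s: String) => format.parse(s).asInstanceOf[java.math.BigDecimal]
  }
}
```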
Test build #98702 has finished for PR 22979 at commit

Test build #98735 has finished for PR 22979 at commit

Test build #98737 has finished for PR 22979 at commit

Test build #98738 has finished for PR 22979 at commit
HyukjinKwon left a comment:
Looks good. @MaxGekk, thanks for taking a look at this.
Test build #99185 has finished for PR 22979 at commit

jenkins, retest this, please

Test build #99195 has finished for PR 22979 at commit
No chance to pass tests in the PR ;-)

@HyukjinKwon Could it be related to recent changes in the Python tests?
Force-pushed from 15a09b8 to bab8fb2.
# Conflicts:
#	sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
Test build #99211 has finished for PR 22979 at commit

Test build #99212 has finished for PR 22979 at commit
## What changes were proposed in this pull request?

In the PR, I propose a new option for the CSV datasource - `lineSep`, similar to the Text and JSON datasources. The option allows specifying a custom line separator of maximum length of 2 characters (because of a restriction in the `uniVocity` parser). The new option can be used in reading and writing CSV files.

## How was this patch tested?

Added a few tests with a custom `lineSep` for enabled/disabled `multiLine` in read, as well as tests in write. Also I added roundtrip tests.

Closes #23080 from MaxGekk/csv-line-sep.

Lead-authored-by: Maxim Gekk <[email protected]>
Co-authored-by: Maxim Gekk <[email protected]>
Signed-off-by: hyukjinkwon <[email protected]>
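A usage sketch of the `lineSep` option described in that commit (paths are placeholders, and a `SparkSession` named `spark` is assumed):

```scala
// Read CSV records separated by a custom line separator (max 2 chars).
val df = spark.read
  .option("lineSep", "\r\n")
  .schema("id INT, name STRING")
  .csv("/tmp/input.csv")

// Per the commit message, the option is honoured on write as well.
df.write
  .option("lineSep", "\n")
  .csv("/tmp/output")
```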
Test build #99317 has finished for PR 22979 at commit

jenkins, retest this, please
Test build #99323 has finished for PR 22979 at commit
# Conflicts:
#	docs/sql-migration-guide-upgrade.md

…al-parsing-locale
# Conflicts:
#	docs/sql-migration-guide-upgrade.md
Test build #99403 has finished for PR 22979 at commit

jenkins, retest this, please

Test build #99416 has finished for PR 22979 at commit

Test build #99456 has finished for PR 22979 at commit

Merged to master.
## What changes were proposed in this pull request?

In the PR, I propose using the `locale` option to parse decimals from CSV input. After the changes, `UnivocityParser` converts the input string to `BigDecimal` and to Spark's `Decimal` by using `java.text.DecimalFormat`.

## How was this patch tested?

Added a test for the `en-US`, `ko-KR`, `ru-RU`, `de-DE` locales.

Closes apache#22979 from MaxGekk/decimal-parsing-locale.

Lead-authored-by: Maxim Gekk <[email protected]>
Co-authored-by: Maxim Gekk <[email protected]>
Signed-off-by: hyukjinkwon <[email protected]>
…ce for backward compatibility

## What changes were proposed in this pull request?

The code below currently infers as decimal but previously it was inferred as string. **In branch-2.4**, the type inference path for decimal and the data parsing path are different:

https://github.com/apache/spark/blob/2a8343121e62aabe5c69d1e20fbb2c01e2e520e7/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala#L153

https://github.com/apache/spark/blob/c284c4e1f6f684ca8db1cc446fdcc43b46e3413c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala#L125

So the code below:

```scala
scala> spark.read.option("delimiter", "|").option("inferSchema", "true").csv(Seq("1,2").toDS).printSchema()
```

produced string as its type:

```
root
 |-- _c0: string (nullable = true)
```

**In the current master**, it now infers decimal as below:

```
root
 |-- _c0: decimal(2,0) (nullable = true)
```

It happened after #22979 because, after this PR, we only have one way to parse decimal:

https://github.com/apache/spark/blob/7a83d71403edf7d24fa5efc0ef913f3ce76d88b8/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExprUtils.scala#L92

**After the fix:**

```
root
 |-- _c0: string (nullable = true)
```

This PR proposes to restore the previous behaviour back in `CSVInferSchema`.

## How was this patch tested?

Manually tested and unit tests were added.

Closes #24437 from HyukjinKwon/SPARK-27512.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
## What changes were proposed in this pull request?

In the PR, I propose using the `locale` option to parse decimals from CSV input. After the changes, `UnivocityParser` converts the input string to `BigDecimal` and to Spark's `Decimal` by using `java.text.DecimalFormat`.

## How was this patch tested?

Added a test for the `en-US`, `ko-KR`, `ru-RU`, `de-DE` locales.
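To make the change concrete, a small end-to-end usage sketch (illustrative values; assumes a `SparkSession` named `spark` and `import spark.implicits._`):

```scala
import org.apache.spark.sql.types._

val schema = new StructType().add("price", DecimalType(10, 2))
// German-formatted number: '.' for grouping, ',' for the decimal part.
val input = Seq("1.000,50").toDS()

spark.read
  .schema(schema)
  .option("locale", "de-DE")
  .csv(input)
  .show()
// Expected:
// +-------+
// |  price|
// +-------+
// |1000.50|
// +-------+
```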