[SPARK-17916][SQL] Fix empty string being parsed as null when nullValue is set. #21273
Conversation
Test build #90387 has finished for PR 21273 at commit
Sounds good @MaxGekk BTW mind adding
Could we list the differences (new features and bug fixes) between v2.5.9 and v2.6.3, to check compatibility and the like?
Test build #90388 has finished for PR 21273 at commit
@HyukjinKwon @maropu Please have a look at the PR.
Could we add a micro-benchmark suite for this?
@gatorsmile In this PR or in a separate one?
LGTM, it would be nice to have a micro-benchmark suite in this PR.
@gengliangwang @gatorsmile I added a benchmark for parsing of quoted values. Parsing time dropped by 28% (see commit f3a0072).
Test build #90543 has finished for PR 21273 at commit
HyukjinKwon
left a comment
LGTM
gengliangwang
left a comment
LGTM
LGTM
Merged to master.
…ue is set.

I propose to bump version of uniVocity parser up to 2.6.3 where quoted empty strings are replaced by the empty value (passed to `setEmptyValue`) instead of `null` values as in the current version 2.5.9: https://github.com/uniVocity/univocity-parsers/blob/v2.6.3/src/main/java/com/univocity/parsers/csv/CsvParser.java#L125

Empty value for writer is set to `""`. So, empty string in dataframe/dataset is stored as empty quoted string `""`. Empty value for reader is set to empty string (zero size). In this way, saved empty quoted string will be read as just empty string. Please, look at the tests for more details.

Here are main changes made in [2.6.0](https://github.com/uniVocity/univocity-parsers/releases/tag/v2.6.0), [2.6.1](https://github.com/uniVocity/univocity-parsers/releases/tag/v2.6.1), [2.6.2](https://github.com/uniVocity/univocity-parsers/releases/tag/v2.6.2), [2.6.3](https://github.com/uniVocity/univocity-parsers/releases/tag/v2.6.3):
- CSV parser now parses quoted values ~30% faster
- CSV format detection process has option provide a list of possible delimiters, in order of priority ( i.e. settings.detectFormatAutomatically( '-', '.');) - uniVocity/univocity-parsers#214
- Implemented trim quoted values support - uniVocity/univocity-parsers#230
- NullPointer when stopping parser when nothing is parsed - uniVocity/univocity-parsers#219
- Concurrency issue when calling stopParsing() - uniVocity/univocity-parsers#231

Closes apache#20068

Added tests from the PR apache#20068

Author: Maxim Gekk <[email protected]>

Closes apache#21273 from MaxGekk/univocity-2.6.
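The round trip described in the commit message can be illustrated with a small toy model. This is only a sketch in plain Python, not Spark or uniVocity code: `write_row`, `read_row` and the three constants are hypothetical names, the null value is assumed to be the empty string, and fields are assumed to contain no commas or quotes.

```python
from typing import List, Optional

# Assumptions for this sketch (not taken from Spark's source):
NULL_VALUE = ""            # nulls are written as nothing at all
WRITER_EMPTY_VALUE = '""'  # writer side: empty string becomes a quoted empty string
READER_EMPTY_VALUE = ""    # reader side: quoted empty string becomes a zero-length string


def write_row(fields: List[Optional[str]]) -> str:
    """Serialize one row: None -> NULL_VALUE, "" -> WRITER_EMPTY_VALUE."""
    out = []
    for field in fields:
        if field is None:
            out.append(NULL_VALUE)
        elif field == "":
            out.append(WRITER_EMPTY_VALUE)
        else:
            out.append(field)
    return ",".join(out)


def read_row(line: str) -> List[Optional[str]]:
    """Parse one row of simple fields (no embedded commas or quotes)."""
    result: List[Optional[str]] = []
    for raw in line.split(","):
        if raw == '""':
            result.append(READER_EMPTY_VALUE)  # quoted empty -> empty string
        elif raw == NULL_VALUE:
            result.append(None)                # unquoted empty -> null
        else:
            result.append(raw)
    return result


line = write_row(["a", "", None])
print(line)
print(read_row(line))
```

In this model the empty string and the null survive a write followed by a read as two distinct values, which is the behaviour the PR description aims for.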
to summarize my findings from jira:
@koertkuipers, would you mind providing a reproducer, please?
@HyukjinKwon see the jira for the example code that reproduces the issue.
i would suggest at least that when the quote character is changed, the empty value should change accordingly. an empty value of
also, if we could agree on a quote character that means no quotes at all (so some non-printable character perhaps), then i would suggest changing the empty value back to null when that particular quote character is set, because a quoted empty string never makes sense if the user is trying to write out unquoted values only.
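One possible reading of this suggestion, as a minimal sketch: the writer's empty value is derived from the configured quote character, and falls back to the null value when a "no quotes" sentinel is set. The function name `writer_empty_value` and the choice of `"\u0000"` as the sentinel are assumptions made for illustration only; nothing in the thread fixes either.

```python
# Hypothetical sentinel meaning "do not quote at all"; the thread only
# says "some non-printable character perhaps".
NO_QUOTE = "\u0000"


def writer_empty_value(quote: str, null_value: str = "") -> str:
    """Return what the writer would emit for an empty string.

    With quoting disabled, a quoted empty string makes no sense, so the
    empty value falls back to the null value. Otherwise it is an empty
    field wrapped in the configured quote character.
    """
    if quote == NO_QUOTE:
        return null_value
    return quote + quote


print(writer_empty_value('"'))
print(writer_empty_value("'"))
print(repr(writer_empty_value(NO_QUOTE, null_value="NULL")))
```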
@koertkuipers you wanna make a PR to make it configurable?
@HyukjinKwon see #22312
#22234 was already open. Wouldn't making it configurable provide a workaround?
it would provide a workaround i think, yes.
  writerSettings.setIgnoreTrailingWhitespaces(ignoreTrailingWhiteSpaceFlagInWrite)
  writerSettings.setNullValue(nullValue)
- writerSettings.setEmptyValue(nullValue)
+ writerSettings.setEmptyValue("\"\"")
This needs an update in migration guide.
What changes were proposed in this pull request?
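I propose bumping the uniVocity parser version to 2.6.3, where quoted empty strings are replaced by the empty value (passed to `setEmptyValue`) instead of `null` values as in the current version 2.5.9: https://github.com/uniVocity/univocity-parsers/blob/v2.6.3/src/main/java/com/univocity/parsers/csv/CsvParser.java#L125
The empty value for the writer is set to `""`, so an empty string in a dataframe/dataset is stored as the quoted empty string `""`. The empty value for the reader is set to the empty string (zero size). In this way, a saved quoted empty string is read back as just an empty string. Please look at the tests for more details.
Here are the main changes made in 2.6.0, 2.6.1, 2.6.2, 2.6.3:
- CSV parser now parses quoted values ~30% faster
- CSV format detection process has an option to provide a list of possible delimiters, in order of priority (i.e. settings.detectFormatAutomatically( '-', '.');) - uniVocity/univocity-parsers#214
- Implemented trim quoted values support - uniVocity/univocity-parsers#230
- NullPointer when stopping parser when nothing is parsed - uniVocity/univocity-parsers#219
- Concurrency issue when calling stopParsing() - uniVocity/univocity-parsers#231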
Closes #20068
How was this patch tested?
Added tests from the PR #20068