Skip to content

Conversation

@MaxGekk
Copy link
Member

@MaxGekk MaxGekk commented May 8, 2018

What changes were proposed in this pull request?

I propose to bump version of uniVocity parser up to 2.6.3 where quoted empty strings are replaced by the empty value (passed to setEmptyValue) instead of null values as in the current version 2.5.9:
https://github.com/uniVocity/univocity-parsers/blob/v2.6.3/src/main/java/com/univocity/parsers/csv/CsvParser.java#L125

Empty value for writer is set to "". So, empty string in dataframe/dataset is stored as empty quoted string "". Empty value for reader is set to empty string (zero size). In this way, saved empty quoted string will be read as just empty string. Please, look at the tests for more details.

Here are main changes made in 2.6.0, 2.6.1, 2.6.2, 2.6.3:

Closes #20068

How was this patch tested?

Added tests from the PR #20068

@SparkQA
Copy link

SparkQA commented May 8, 2018

Test build #90387 has finished for PR 21273 at commit 598ba2d.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

Sounds good @MaxGekk BTW mind adding Closes #20068 in the PR description?

@maropu
Copy link
Member

maropu commented May 9, 2018

We need to list what're differences (new features and bugs) between v2.5.9 and v2.6.3 for checking compatibility and others?

@SparkQA
Copy link

SparkQA commented May 9, 2018

Test build #90388 has finished for PR 21273 at commit db18345.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk MaxGekk changed the title [WIP][SPARK-17916][SQL] Fix empty string being parsed as null when nullValue is set. [SPARK-17916][SQL] Fix empty string being parsed as null when nullValue is set. May 10, 2018
@MaxGekk
Copy link
Member Author

MaxGekk commented May 10, 2018

@HyukjinKwon @maropu Please, have a look at the PR.

@gatorsmile
Copy link
Member

cc @gengliangwang

@gatorsmile
Copy link
Member

CSV parser now parses quoted values ~30% faster

Could we add a micro-benmark suite for this?

@MaxGekk
Copy link
Member Author

MaxGekk commented May 10, 2018

CSV parser now parses quoted values ~30% faster

Could we add a micro-benmark suite for this?

@gatorsmile In this PR or in a separate one?

@gengliangwang
Copy link
Member

LGTM, it would be nice to have a micro-benmark suite in this PR.

@MaxGekk
Copy link
Member Author

MaxGekk commented May 12, 2018

@gengliangwang @gatorsmile I added a benchmark for parsing of quoted values. Parsing time dropped by 28% (look at the commit f3a0072)

@SparkQA
Copy link

SparkQA commented May 13, 2018

Test build #90543 has finished for PR 21273 at commit f3a0072.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@gengliangwang gengliangwang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@maropu
Copy link
Member

maropu commented May 14, 2018

LGTM

@HyukjinKwon
Copy link
Member

Merged to master.

@asfgit asfgit closed this in 7a2d489 May 14, 2018
robert3005 pushed a commit to palantir/spark that referenced this pull request Jun 24, 2018
…ue is set.

I propose to bump version of uniVocity parser up to 2.6.3 where quoted empty strings are replaced by the empty value (passed to `setEmptyValue`) instead of `null` values as in the current version 2.5.9:
https://github.com/uniVocity/univocity-parsers/blob/v2.6.3/src/main/java/com/univocity/parsers/csv/CsvParser.java#L125

Empty value for writer is set to `""`. So, empty string in dataframe/dataset is stored as empty quoted string `""`. Empty value for reader is set to empty string (zero size). In this way, saved empty quoted string will be read as just empty string. Please, look at the tests for more details.

Here are main changes made in [2.6.0](https://github.com/uniVocity/univocity-parsers/releases/tag/v2.6.0), [2.6.1](https://github.com/uniVocity/univocity-parsers/releases/tag/v2.6.1), [2.6.2](https://github.com/uniVocity/univocity-parsers/releases/tag/v2.6.2), [2.6.3](https://github.com/uniVocity/univocity-parsers/releases/tag/v2.6.3):

- CSV parser now parses quoted values ~30% faster
- CSV format detection process has option provide a list of possible delimiters, in order of priority ( i.e. settings.detectFormatAutomatically( '-', '.');) - uniVocity/univocity-parsers#214
- Implemented trim quoted values support - uniVocity/univocity-parsers#230
- NullPointer when stopping parser when nothing is parsed - uniVocity/univocity-parsers#219
- Concurrency issue when calling stopParsing() - uniVocity/univocity-parsers#231

Closes apache#20068

Added tests from the PR apache#20068

Author: Maxim Gekk <[email protected]>

Closes apache#21273 from MaxGekk/univocity-2.6.
@koertkuipers
Copy link
Contributor

to summarize my findings from jira:
this breaks any usage without quoting. for example we remove all characters from our values that need to be quoted (delimiters, newlines) so we know we will always write unquoted csv, but now we suddenly find these empty quoted strings in our output. the systems we write to cannot handle these quoted values.

@HyukjinKwon
Copy link
Member

@koertkuipers, would you mind if I ask provide a reproducer please?

@koertkuipers
Copy link
Contributor

@HyukjinKwon see the jira for the example code that reproduces the issue.
let me know if you need anything else. best, koert

@koertkuipers
Copy link
Contributor

koertkuipers commented Aug 24, 2018

i would suggest at least that when the quote character is changed that the empty value should change accordingly. an empty value of "" makes no sense if the quote character is not ".

also if we could agree on a quote character that means no quotes at all (so some non-printable character perhaps) then i would suggest to change empty value back to null if that particular quote character is set. because a quoted empty string never makes sense if the user is trying to write out unquoted values only.

@HyukjinKwon
Copy link
Member

@koertkuipers you wanna make a PR to make it configuration?

@koertkuipers
Copy link
Contributor

@HyukjinKwon see #22312

@HyukjinKwon
Copy link
Member

#22234 was already open. Wouldn't it be able to workaround if it's configurable?

@koertkuipers
Copy link
Contributor

it would provide a workaround i think, yes.

writerSettings.setIgnoreTrailingWhitespaces(ignoreTrailingWhiteSpaceFlagInWrite)
writerSettings.setNullValue(nullValue)
writerSettings.setEmptyValue(nullValue)
writerSettings.setEmptyValue("\"\"")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs an update in migration guide.

@MaxGekk MaxGekk deleted the univocity-2.6 branch August 17, 2019 13:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants