-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-17916][SQL] Fix empty string being parsed as null when nullValue is set. #20068
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@gatorsmile I've created this PR since #12904 has not been updated in a while. |
| writerSettings.setIgnoreTrailingWhitespaces(ignoreTrailingWhiteSpaceFlagInWrite) | ||
| writerSettings.setNullValue(nullValue) | ||
| writerSettings.setEmptyValue(nullValue) | ||
| writerSettings.setEmptyValue("") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we simply expose this as an option and keep the previous behaviour if this option is not set explicitly by the user?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I disagree. I don't think the previous behavior should not be exposed as an option as the previous behavior was a bug. All it did was that it always coerced empty values to nulls. If the nullValue was not set, then the it was set to "" by default which coerced "" to null. The empty value being set to "" had no affect in this case. If it was set to something else, say \N, then the empty value was also set to \N which resulted in parsing both \N and "" to null, as "" was no longer considered as an empty value and the "" being coerced to null is the Univocity parser's default.
Setting empty value explicitly to the "" literal would ensure that an empty string is always parsed as empty string, unless nullValue is not set or it is set to "", which is what people would do if they want "" to be parsed as null, which would be the old behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we leave some comments here and update the PR description too?
|
ok to test |
|
Test build #85353 has finished for PR 20068 at commit
|
| } | ||
|
|
||
| test("SPARK-17916: An empty string should not be coerced to null when nullValue is passed.") { | ||
| val sparkSession = spark |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can just use spark.
| writerSettings.setIgnoreTrailingWhitespaces(ignoreTrailingWhiteSpaceFlagInWrite) | ||
| writerSettings.setNullValue(nullValue) | ||
| writerSettings.setEmptyValue(nullValue) | ||
| writerSettings.setEmptyValue("") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we leave some comments here and update the PR description too?
| val outDir = new File(dir, "out").getCanonicalPath | ||
| val nullValue = "\\N" | ||
|
|
||
| import sparkSession.implicits._ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we don't need this (by import testImplicits._ above).
| val nullValue = "\\N" | ||
|
|
||
| import sparkSession.implicits._ | ||
| val dsIn = spark.createDataset(elems) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seq(("bar"), (""), (null: String)).toDS?
| val elems = Seq(("bar"), (""), (null: String)) | ||
|
|
||
| // Checks for new behavior where an empty string is not coerced to null. | ||
| withTempDir { dir => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could do this withTempPath { path =>
| val expected = Seq(("bar"), (null: String)) | ||
|
|
||
| assert(computed.size === 2) | ||
| assert(computed.sameElements(expected)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can use checkAnswer(..: DataFrame, .. : DataFrame)
| .csv(outDir) | ||
| .as[(String)] | ||
| val computed = dsOut.collect.toSeq | ||
| val expected = Seq(("bar"), (null: String)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is quite the expected output? Could we use the examples provided in the JIRA rather than single row ones?
|
@HyukjinKwon I made code changes based on your suggestions. I also changed the tests to use the data mentioned in the ticket. However, you're right, the tests no longer pass. But that is because the Univocity |
|
Test build #85368 has finished for PR 20068 at commit
|
|
retest this please |
|
Test build #85404 has finished for PR 20068 at commit
|
| // for Spark too, since `nullValue` defaults to an empty string and has a higher precedence to | ||
| // setEmptyValue(). But when `nullValue` is set to a different value, that would mean that the | ||
| // empty string should be parsed not as `null` but as an empty string. | ||
| writerSettings.setEmptyValue("") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please make it as a conf like what we did for nullValue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I talked about this with Hyukjin Kwon before. I think the previous behavior should not be exposed as an option as the previous behavior was a bug. All it did was that it always coerced empty values to nulls. If the nullValue was not set, then the it was set to "" by default which coerced "" to null. The empty value being set to "" had no affect in this case. If it was set to something else, say \N, then the empty value was also set to \N which resulted in parsing both \N and "" to null, as "" was no longer considered as an empty value and the "" being coerced to null is the Univocity parser's default.
Setting empty value explicitly to the "" literal would ensure that an empty string is always parsed as empty string, unless nullValue is not set or it is set to "", which is what people would do if they want "" to be parsed as null, which would be the old behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This conf is needed for users who want to change the behavior back the previous releases. This also needs to be documented in the migration guides in Spark SQL doc.
|
Can we make the tests pass BTW, @aa8y? |
|
I'll work on it in the next week or two. That would involve a PR to the Univocity CSV parser. |
|
ok to test |
|
ping @aa8y |
|
Test build #86739 has finished for PR 20068 at commit
|
|
I apologize I haven't had time to work on this. I can close this for now and reopen it when I have a working fix for it. |
…ue is set. I propose to bump version of uniVocity parser up to 2.6.3 where quoted empty strings are replaced by the empty value (passed to `setEmptyValue`) instead of `null` values as in the current version 2.5.9: https://github.com/uniVocity/univocity-parsers/blob/v2.6.3/src/main/java/com/univocity/parsers/csv/CsvParser.java#L125 Empty value for writer is set to `""`. So, empty string in dataframe/dataset is stored as empty quoted string `""`. Empty value for reader is set to empty string (zero size). In this way, saved empty quoted string will be read as just empty string. Please, look at the tests for more details. Here are main changes made in [2.6.0](https://github.com/uniVocity/univocity-parsers/releases/tag/v2.6.0), [2.6.1](https://github.com/uniVocity/univocity-parsers/releases/tag/v2.6.1), [2.6.2](https://github.com/uniVocity/univocity-parsers/releases/tag/v2.6.2), [2.6.3](https://github.com/uniVocity/univocity-parsers/releases/tag/v2.6.3): - CSV parser now parses quoted values ~30% faster - CSV format detection process has option provide a list of possible delimiters, in order of priority ( i.e. settings.detectFormatAutomatically( '-', '.');) - uniVocity/univocity-parsers#214 - Implemented trim quoted values support - uniVocity/univocity-parsers#230 - NullPointer when stopping parser when nothing is parsed - uniVocity/univocity-parsers#219 - Concurrency issue when calling stopParsing() - uniVocity/univocity-parsers#231 Closes apache#20068 Added tests from the PR apache#20068 Author: Maxim Gekk <[email protected]> Closes apache#21273 from MaxGekk/univocity-2.6.
What changes were proposed in this pull request?
When the option
nullValueis set, the empty value is also set to the same value. Therefore empty strings get parsed asnull, which should not happen. This PR explicitly changes this to be an empty string.How was this patch tested?
Tests were added without the fix. It was tested that they failed. Then the fix was added and the tests have been ensured to pass.
Please review http://spark.apache.org/contributing.html before opening a pull request.