Conversation

@aa8y

@aa8y aa8y commented Dec 23, 2017

What changes were proposed in this pull request?

When the option nullValue is set, the empty value is also set to the same value, so empty strings get parsed as null, which should not happen. This PR explicitly sets the empty value to an empty string instead.
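For illustration, here is a minimal reproduction sketch of the reported behaviour; the file path and data are made up, and it assumes an active `SparkSession` named `spark`.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

// A two-column file: the second field is empty on row 1 and \N on row 2.
val path = "/tmp/spark-17916-demo.csv"
Files.write(Paths.get(path), "1,\n2,\\N\n3,foo\n".getBytes(StandardCharsets.UTF_8))

val df = spark.read.option("nullValue", "\\N").csv(path)
df.show()
// Before this change, the empty field on row 1 and the \N field on row 2 both
// read back as null; the intent is that only \N should be treated as null.
```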

How was this patch tested?

Tests were added before the fix and confirmed to fail. The fix was then applied and the tests were verified to pass.

Please review http://spark.apache.org/contributing.html before opening a pull request.

@aa8y aa8y changed the title SPARK-17916: Fix empty string being parsed as null when nullValue is set. [SPARK-17916][SQL] Fix empty string being parsed as null when nullValue is set. Dec 23, 2017
@aa8y
Author

aa8y commented Dec 23, 2017

@gatorsmile I've created this PR since #12904 has not been updated in a while.

writerSettings.setIgnoreTrailingWhitespaces(ignoreTrailingWhiteSpaceFlagInWrite)
writerSettings.setNullValue(nullValue)
- writerSettings.setEmptyValue(nullValue)
+ writerSettings.setEmptyValue("")
Member


Can we simply expose this as an option and keep the previous behaviour if this option is not set explicitly by the user?

Author


I disagree. I don't think the previous behavior should be exposed as an option, because the previous behavior was a bug. All it did was coerce empty values to nulls. If nullValue was not set, it defaulted to "", which coerced "" to null; setting the empty value to "" had no effect in that case. If nullValue was set to something else, say \N, then the empty value was also set to \N, which resulted in both \N and "" being parsed as null, since "" was no longer considered an empty value and coercing "" to null is the Univocity parser's default.

Setting the empty value explicitly to the "" literal ensures that an empty string is always parsed as an empty string, unless nullValue is left unset or is set to "", which is what people would do if they wanted "" parsed as null, i.e. the old behavior.
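To make the interaction described above concrete, here is a rough sketch against the raw univocity-parsers API (the library Spark's CSV source delegates to), configured the way Spark effectively configured it before this change; the outputs noted in the comments are illustrative and may vary by parser version.

```scala
import com.univocity.parsers.csv.{CsvParser, CsvParserSettings}

val settings = new CsvParserSettings()
settings.setNullValue("\\N")   // substituted for unquoted empty fields
settings.setEmptyValue("\\N")  // old behaviour: the empty value mirrored nullValue

val parser = new CsvParser(settings)
val row = parser.parseLine("foo,,\"\"")
// row(1) (an unquoted empty field) and row(2) (a quoted empty field) both come
// back as \N, indistinguishable from a literal \N in the data, so downstream
// Spark turns all of them into null.
```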

Member


Could we leave some comments here and update the PR description too?

@HyukjinKwon
Member

ok to test

@SparkQA

SparkQA commented Dec 24, 2017

Test build #85353 has finished for PR 20068 at commit ebe2900.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}

test("SPARK-17916: An empty string should not be coerced to null when nullValue is passed.") {
val sparkSession = spark
Member


I think we can just use spark.


val outDir = new File(dir, "out").getCanonicalPath
val nullValue = "\\N"

import sparkSession.implicits._
Member


I think we don't need this (by import testImplicits._ above).

val nullValue = "\\N"

import sparkSession.implicits._
val dsIn = spark.createDataset(elems)
Member


Seq(("bar"), (""), (null: String)).toDS?

val elems = Seq(("bar"), (""), (null: String))

// Checks for new behavior where an empty string is not coerced to null.
withTempDir { dir =>
Member


We could do this withTempPath { path =>

val expected = Seq(("bar"), (null: String))

assert(computed.size === 2)
assert(computed.sameElements(expected))
Member


We can use checkAnswer(..: DataFrame, .. : DataFrame)

.csv(outDir)
.as[(String)]
val computed = dsOut.collect.toSeq
val expected = Seq(("bar"), (null: String))
Member


I don't think this is quite the expected output? Could we use the examples provided in the JIRA rather than single row ones?
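Pulling the suggestions above together, the test might look roughly like the sketch below. It assumes it lives in CSVSuite (which provides `spark`, `testImplicits`, `withTempPath`, and `checkAnswer`); as noted further down in the thread, the intended expectation did not actually pass with the parser version in use at the time.

```scala
test("SPARK-17916: empty string should not be coerced to null when nullValue is set") {
  import org.apache.spark.sql.Row
  import testImplicits._

  val nullValue = "\\N"
  withTempPath { path =>
    Seq(("bar"), (""), (null: String)).toDS()
      .write.option("nullValue", nullValue).csv(path.getCanonicalPath)

    val readBack = spark.read
      .option("nullValue", nullValue)
      .csv(path.getCanonicalPath)

    // Intended behaviour: "" survives the round trip and only \N becomes null.
    checkAnswer(readBack, Seq(Row("bar"), Row(""), Row(null)))
  }
}
```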

@aa8y
Author

aa8y commented Dec 25, 2017

@HyukjinKwon I made code changes based on your suggestions, and I also changed the tests to use the data mentioned in the ticket. However, you're right: the tests no longer pass. That is because the Univocity CsvParser, when it encounters an unquoted empty string while parsing the data, replaces it with the nullValue we set (see setNullValue()), and emptyValue only takes effect when the empty string being read has quotes around it (see setEmptyValue()). So I believe, at this point, the issue needs to be fixed in the underlying library.
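For reference, here is a rough sketch of that parser-level limitation using the raw univocity-parsers API (2.5.x era; the outputs in the comments are illustrative), matching the setNullValue()/setEmptyValue() behaviour described above.

```scala
import com.univocity.parsers.csv.{CsvParser, CsvParserSettings}

val settings = new CsvParserSettings()
settings.setNullValue("\\N")  // still substituted for unquoted empty fields
settings.setEmptyValue("")    // only applies when the empty field is quoted

val parser = new CsvParser(settings)
val row = parser.parseLine("bar,,\"\"")
// row(1) (unquoted empty)  -> "\N"  nullValue wins, so Spark still sees null
// row(2) (quoted empty "") -> ""    emptyValue applies only to quoted fields
```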

@SparkQA

SparkQA commented Dec 25, 2017

Test build #85368 has finished for PR 20068 at commit 156d755.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

retest this please

@SparkQA

SparkQA commented Dec 26, 2017

Test build #85404 has finished for PR 20068 at commit 156d755.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// for Spark too, since `nullValue` defaults to an empty string and takes precedence over
// setEmptyValue(). But when `nullValue` is set to a different value, that would mean that the
// empty string should be parsed not as `null` but as an empty string.
writerSettings.setEmptyValue("")

Author


I talked about this with Hyukjin Kwon before. I think the previous behavior should not be exposed as an option, because the previous behavior was a bug. All it did was coerce empty values to nulls. If nullValue was not set, it defaulted to "", which coerced "" to null; setting the empty value to "" had no effect in that case. If nullValue was set to something else, say \N, then the empty value was also set to \N, which resulted in both \N and "" being parsed as null, since "" was no longer considered an empty value and coercing "" to null is the Univocity parser's default.

Setting the empty value explicitly to the "" literal ensures that an empty string is always parsed as an empty string, unless nullValue is left unset or is set to "", which is what people would do if they wanted "" parsed as null, i.e. the old behavior.

Member

@gatorsmile gatorsmile Jan 13, 2018


This conf is needed for users who want to change the behavior back to that of previous releases. This also needs to be documented in the migration guide in the Spark SQL docs.

@HyukjinKwon
Member

Can we make the tests pass BTW, @aa8y?

@aa8y
Author

aa8y commented Jan 3, 2018

I'll work on it in the next week or two. That would involve a PR to the Univocity CSV parser.

@HyukjinKwon
Member

ok to test

@HyukjinKwon
Member

ping @aa8y

@SparkQA

SparkQA commented Jan 28, 2018

Test build #86739 has finished for PR 20068 at commit 156d755.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

@aa8y
Author

aa8y commented Apr 20, 2018

I apologize I haven't had time to work on this. I can close this for now and reopen it when I have a working fix for it.

MaxGekk added a commit to MaxGekk/spark that referenced this pull request May 8, 2018
@asfgit asfgit closed this in 7a2d489 May 14, 2018
robert3005 pushed a commit to palantir/spark that referenced this pull request Jun 24, 2018
[SPARK-17916][SQL] Fix empty string being parsed as null when nullValue is set.

I propose to bump the uniVocity parser version up to 2.6.3, where quoted empty strings are replaced by the empty value (passed to `setEmptyValue`) instead of by `null` as in the current version 2.5.9:
https://github.com/uniVocity/univocity-parsers/blob/v2.6.3/src/main/java/com/univocity/parsers/csv/CsvParser.java#L125

The empty value for the writer is set to `""`, so an empty string in a DataFrame/Dataset is stored as an empty quoted string `""`. The empty value for the reader is set to an empty (zero-length) string, so a saved empty quoted string is read back as just an empty string. Please look at the tests for more details.
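A rough round-trip sketch of the behaviour described above (a Spark build using uniVocity 2.6.3+; the session, path, and data are illustrative):

```scala
import spark.implicits._

val path = "/tmp/spark-17916-roundtrip"
Seq(("1", "bar"), ("2", ""), ("3", (null: String))).toDF("id", "name")
  .write.option("nullValue", "\\N").csv(path)
// The empty string is written as a quoted "" and the null as \N.

val readBack = spark.read.option("nullValue", "\\N").csv(path)
// Expected: row 2 reads back as "" and row 3 as null, so the two stay distinct.
```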

Here are the main changes made in [2.6.0](https://github.com/uniVocity/univocity-parsers/releases/tag/v2.6.0), [2.6.1](https://github.com/uniVocity/univocity-parsers/releases/tag/v2.6.1), [2.6.2](https://github.com/uniVocity/univocity-parsers/releases/tag/v2.6.2), [2.6.3](https://github.com/uniVocity/univocity-parsers/releases/tag/v2.6.3):

- CSV parser now parses quoted values ~30% faster
- CSV format detection process has an option to provide a list of possible delimiters, in order of priority (i.e. settings.detectFormatAutomatically('-', '.');) - uniVocity/univocity-parsers#214
- Implemented trim quoted values support - uniVocity/univocity-parsers#230
- NullPointer when stopping parser when nothing is parsed - uniVocity/univocity-parsers#219
- Concurrency issue when calling stopParsing() - uniVocity/univocity-parsers#231

Closes apache#20068

Added tests from the PR apache#20068

Author: Maxim Gekk <[email protected]>

Closes apache#21273 from MaxGekk/univocity-2.6.