@@ -152,7 +152,11 @@ class CSVOptions(
writerSettings.setIgnoreLeadingWhitespaces(ignoreLeadingWhiteSpaceFlagInWrite)
writerSettings.setIgnoreTrailingWhitespaces(ignoreTrailingWhiteSpaceFlagInWrite)
writerSettings.setNullValue(nullValue)
writerSettings.setEmptyValue(nullValue)
// The Univocity parser parses empty strings as `null` by default. This is Spark's default
// behavior too, since `nullValue` defaults to an empty string and takes precedence over
// setEmptyValue(). But when `nullValue` is set to a different value, the empty string
// should be parsed not as `null` but as an empty string.
writerSettings.setEmptyValue("")
Author:
I talked about this with Hyukjin Kwon before. I think the previous behavior should not be exposed as an option, because it was a bug: it always coerced empty values to nulls. If `nullValue` was not set, it defaulted to `""`, which coerced `""` to null; setting the empty value to `""` had no effect in that case. If `nullValue` was set to something else, say `\N`, then the empty value was also set to `\N`, which resulted in parsing both `\N` and `""` to null, since `""` was no longer considered an empty value, and coercing `""` to null is the Univocity parser's default.

Setting the empty value explicitly to the `""` literal ensures that an empty string is always parsed as an empty string, unless `nullValue` is unset or is set to `""`, which is what people would do if they want `""` to be parsed as null (the old behavior).
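A simplified model may make the precedence clearer. The sketch below is illustrative only; `CsvFieldSemantics` and `readField` are hypothetical names, and this is not Spark's or Univocity's actual code. It models the rule described above: a token equal to `nullValue` becomes null, an empty token becomes the configured empty value, and everything else passes through.

```scala
// Illustrative sketch only: a hypothetical model of how `nullValue` and the
// empty-value setting interact when a CSV field is read back. Not actual
// Spark or Univocity code.
object CsvFieldSemantics {
  // `raw` is the token as it appears in the CSV file; None models a null cell.
  def readField(raw: String, nullValue: String, emptyValue: String): Option[String] =
    if (raw == nullValue) None             // token matches nullValue => null
    else if (raw.isEmpty) Some(emptyValue) // empty token => configured empty value
    else Some(raw)                         // anything else passes through
}
```

Under this model, with the fix the empty value is always the `""` literal, so `readField("", "-", "")` yields `Some("")` and an empty string survives a round trip whenever `nullValue` is something else; when `nullValue` is left at its `""` default, `readField("", "", "")` yields `None`, preserving the old behavior.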

@gatorsmile (Member) commented on Jan 13, 2018:

This conf is needed for users who want to change the behavior back to that of previous releases. This also needs to be documented in the migration guide in the Spark SQL docs.

writerSettings.setSkipEmptyLines(true)
writerSettings.setQuoteAllFields(quoteAll)
writerSettings.setQuoteEscapingEnabled(escapeQuotes)
@@ -1248,4 +1248,50 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
Row("0,2013-111-11 12:13:14") :: Row(null) :: Nil
)
}

test("SPARK-17916: An empty string should not be coerced to null when nullValue is passed.") {
val litNull: String = null
val df = Seq(
(1, "John Doe"),
(2, ""),
(3, "-"),
(4, litNull)
).toDF("id", "name")

// Checks for new behavior where an empty string is not coerced to null when `nullValue` is
// set to anything but an empty string literal.
withTempPath { path =>
df.write
.option("nullValue", "-")
.csv(path.getAbsolutePath)
val computed = spark.read
.option("nullValue", "-")
.schema(df.schema)
.csv(path.getAbsolutePath)
val expected = Seq(
(1, "John Doe"),
(2, ""),
(3, litNull),
(4, litNull)
).toDF("id", "name")

checkAnswer(computed, expected)
}
// Keeps the old behavior: an empty string is coerced to null when `nullValue` is not passed.
withTempPath { path =>
df.write
.csv(path.getAbsolutePath)
val computed = spark.read
.schema(df.schema)
.csv(path.getAbsolutePath)
val expected = Seq(
(1, "John Doe"),
(2, litNull),
(3, "-"),
(4, litNull)
).toDF("id", "name")

checkAnswer(computed, expected)
}
}
}