
Conversation

@sureshthalamati (Contributor) commented May 4, 2016

What changes were proposed in this pull request?

Currently, quoted empty strings in the input CSV file are incorrectly recognized as null values. This patch fixes the reader so that quoted empty strings (e.g. 1,"") in the data are recognized as empty string values by default.

A new CSV option, emptyValue, is added to allow users to specify the value written to the output CSV file for empty strings, as well as the value in an input file that should be interpreted as an empty string (in addition to the quoted empty string).

DATA :
col1,col2
1,"-"
2,""
3,
4,"A"

Old default behavior:
scala> val df = spark.read.format("csv").option("nullValue", "-").option("header", true).load("/Users/suresht/sparktests/emptystring.csv")
df: org.apache.spark.sql.DataFrame = [col1: string, col2: string]

scala> df.show
+----+----+
|col1|col2|
+----+----+
| 1|null|
| 2|null|
| 3|null|
| 4| A|
+----+----+

New default behavior:

scala> val df = spark.read.format("csv").option("nullValue", "-").option("header", true).load("/Users/suresht/sparktests/emptystring.csv")
df: org.apache.spark.sql.DataFrame = [col1: string, col2: string]

scala> df.show
+----+----+
|col1|col2|
+----+----+
| 1|null|
| 2| |
| 3|null|
| 4| A|
+----+----+

Example using the emptyValue option:

val df = spark.createDataFrame(Seq((1, "")))
df.write.format("csv").option("emptyValue", "EMPTY").save("/tmp/data1")

cat part-r-00000-8d867267-c291-4277-9951-a8b969c0a4d8.csv
1,EMPTY

scala> spark.read.format("csv").option("emptyValue", "EMPTY").load("/tmp/data1").show
+---+---+
|_c0|_c1|
+---+---+
| 1| |
+---+---+
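The distinction the patch draws between a quoted empty field and a bare empty field can be illustrated with a toy Python sketch. This is not Spark's implementation; split_with_quotes and decode_field are hypothetical helpers that handle only the trivial cases in the sample data above (no embedded commas or escaped quotes):

```python
def split_with_quotes(line):
    """Split a simple one-line CSV record, remembering which fields were quoted."""
    fields = []
    for raw in line.split(","):
        if len(raw) >= 2 and raw.startswith('"') and raw.endswith('"'):
            fields.append((raw[1:-1], True))   # quoted field
        else:
            fields.append((raw, False))        # bare field
    return fields

def decode_field(value, quoted, null_value):
    # Quoted empty string stays an empty string; a bare empty field,
    # or any field matching nullValue, becomes None (null).
    if quoted and value == "":
        return ""
    if value == "" or value == null_value:
        return None
    return value

rows = ['1,"-"', '2,""', '3,', '4,"A"']
decoded = [
    [decode_field(v, q, null_value="-") for v, q in split_with_quotes(line)]
    for line in rows
]
# -> [['1', None], ['2', ''], ['3', None], ['4', 'A']]
```

Run against the sample data, this reproduces the "new default behavior" table: only row 2 (the quoted empty string) yields an empty string; rows 1 and 3 yield null.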

How was this patch tested?

Added new unit tests to the CSVSuite.

@falaki @rxin

@rxin (Contributor) commented May 5, 2016

Shouldn't we always infer these as empty strings, and then users can do a simple project to turn them into nulls?

@rxin (Contributor) commented May 5, 2016

Basically I'm asking why is this config needed if we just treat them as empty strings?

@sureshthalamati (Contributor, Author)

Thanks for reviewing the PR, Reynold. The only reason I added the option is to keep the current behavior by default. I agree with you that we can always treat them as empty strings. I am also not aware of any scenario where a user wants to treat empty strings as null values.

If there is no need to keep the current default behavior, I will update the PR. Please let me know.

@rxin (Contributor) commented May 5, 2016

Yea let's update it.

cc @HyukjinKwon @falaki

@HyukjinKwon (Member) commented May 5, 2016

+1 for treating them as empty strings without additional options.

@sureshthalamati sureshthalamati force-pushed the empstring_fix_spark-15125 branch from e6207e7 to 2629a7c Compare May 6, 2016 20:41
@sureshthalamati sureshthalamati changed the title [SPARK-15125][SQL] New option to the CSV data source to allows users to specify to how to interpret empty quoted strings. [SPARK-15125][SQL] Changing CSV data source mapping of empty quoted strings in the data to empty strings instead of null May 6, 2016
@sureshthalamati (Contributor, Author)

Thank you for the feedback, Reynold and HyukjinKwon. Updated the PR.

Review comment (Member):

Could I ask what happens if we don't set nullValue?

Review comment (Contributor, Author):

If nullValue is not set, it will return an empty string for null values by default. The reason I set it explicitly is to make sure my fix is working. Before my fix it was returning null for the quoted empty string, and an empty string for null values by default.

@HyukjinKwon (Member) commented May 7, 2016

Here is how I think the CSV datasource should handle "", the empty string, and nullValue when reading.

"","a",

should produce the records as below:

  1. With the option, nullValue set to "a", I think

    Row("", null, null)
    
  2. Without any options, I think

    Row("", "a", null)
    
  3. With the option, nullValue set to "", I think

    Row(null, "a", null)
    

Would this make sense? If so, we need to give nullValue a default value of null. (Currently the default is "".)
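The three reading rules above can be paraphrased in a short sketch. This is hypothetical illustration code, not anything from Spark or this PR; it assumes each input field arrives as a (text, was_quoted) pair, and uses a NOT_SET sentinel to stand for "option not given":

```python
NOT_SET = object()

def read_field(text, quoted, null_value=NOT_SET):
    if text == "" and not quoted:
        return None                       # a bare empty field is always null
    if null_value is not NOT_SET and text == null_value:
        return None                       # field text matches nullValue
    return text                           # otherwise keep the text as-is

record = [("", True), ("a", True), ("", False)]   # the input line: "","a",

case1 = [read_field(t, q, null_value="a") for t, q in record]
case2 = [read_field(t, q) for t, q in record]
case3 = [read_field(t, q, null_value="") for t, q in record]
# case1 -> ['', None, None]; case2 -> ['', 'a', None]; case3 -> [None, 'a', None]
```

Under these two rules, the three cases reproduce Row("", null, null), Row("", "a", null), and Row(null, "a", null) respectively.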

@HyukjinKwon (Member) commented May 7, 2016

In case of writing, I think

Row("", "a", null)

should produce the CSV as below:

  1. With the option, nullValue set to "a", I think

    ,,
    
  2. Without any options, I think

    ,a,
    
  3. With the option, nullValue set to "", I think

    ,a,
    

+Sorry I just updated the writing examples above.

@sureshthalamati (Contributor, Author)

I am not sure what the history was behind returning an empty string for null values. In my opinion it should be null by default. The current behavior is also inconsistent: for numerics it returns null, and for strings it returns an empty string by default.

Example:
See the year (int) and comment (string) columns in the following data.
year,make,model,comment,price
2017,Tesla,Mode 3,looks nice.,35000.99
,Chevy,Bolt,,29000.00
2015,Porsche,"",,
scala> val df= sqlContext.read.format("csv").option("header", "true").option("inferSchema", "true").load("/tmp/test1.csv")
df: org.apache.spark.sql.DataFrame = [year: int, make: string ... 3 more fields]

scala> df.show
+----+-------+------+-----------+--------+
|year| make| model| comment| price|
+----+-------+------+-----------+--------+
|2017| Tesla|Mode 3|looks nice.|35000.99|
|null| Chevy| Bolt| | 29000.0|
|2015|Porsche| null| | null|
+----+-------+------+-----------+--------+

I can update this PR to change the nullValue default if needed.

@sureshthalamati (Contributor, Author)

@HyukjinKwon was your previous comment meant for some other PR? This PR does not have any of the changes you mentioned above. Am I missing something?

@andrewor14

@HyukjinKwon (Member) commented May 13, 2016

@sureshthalamati oh, the comments are not related to this PR, but moving the discussion here was suggested, so I did. Sorry if that was confusing. I will move this topic to the dev mailing list or JIRA.

(I removed/cleaned up my unrelated comments).

Review comment (Contributor):

Hard-coding this is not a good idea. Please add a new option in CSVOptions and pass it to the parser. The default value could be "".

Review comment (Contributor):

Would you also add a regression unit-test to make sure this patch also fixes https://issues.apache.org/jira/browse/SPARK-17916?

@sureshthalamati (Contributor, Author)

Thanks for the input @falaki. Sorry for the delayed reply; somehow I missed the notifications. I will update the patch with the option and also add a test case for SPARK-17916.

… to specify to interpret empty quoted strings as null or an empty string.
@sureshthalamati sureshthalamati force-pushed the empstring_fix_spark-15125 branch from 2629a7c to b128fbb Compare October 20, 2016 08:22
val permissive = ParseModes.isPermissiveMode(parseMode)

val nullValue = parameters.getOrElse("nullValue", "")
val emptyValue = parameters.getOrElse("emptyValue", "")
Review comment (Member):

When nullValue and emptyValue both default to "", don't they conflict?

@HyukjinKwon (Member) commented Oct 20, 2016

+1 for documenting the explicit precedence

Review comment (Contributor, Author):

Yes, null and empty cannot be differentiated when they are set to the same value. Currently the null value check has higher precedence than the empty value check.

input.csv
1,
2,""

Output will be:
1, null
2, null

I think this behavior is OK. By default, the Univocity CSV parser used in Spark also returns null for empty strings.

I agree we should document this behavior.
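The precedence described here (the nullValue check runs before the emptyValue check) could be sketched like this. A toy illustration under that stated assumption, not the actual parser code:

```python
def decode(text, null_value="", empty_value=""):
    # nullValue is checked first, so when both options default to "",
    # every empty field decodes to null and emptyValue never fires.
    if text == null_value:
        return None
    if text == empty_value:
        return ""
    return text

# With both defaults, the bare field of "1," and the quoted field of '2,""'
# both arrive as "" and both decode to null.
print(decode(""))                                        # null wins
# Distinct settings restore the distinction:
print(decode("", null_value="\\N", empty_value=""))      # empty string survives
```

This is why the two options can only be told apart when they are configured to different sentinel strings, which is exactly the behavior worth documenting.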

@HyukjinKwon (Member)

FWIW, it seems read.csv() in R does not differentiate it from "".

> bt <- "A,B,C,D
+ ,\"\",20"
>
> b<- read.csv(text=bt, na.strings=c(""))
> b
   A  B  C  D
1 NA NA 20 NA

@felixcheung (Member)

hmm, @HyukjinKwon the behavior in your R example is because of the na.strings=c("") parameter, which tells it to treat the empty "" as NA (== JVM null in Spark)

@HyukjinKwon (Member) commented Oct 20, 2016

Oh, sure, that is true. I set na.strings to "" because the equivalent option, nullValue, has the same default value in Spark's CSV :).

EDITED: Hmm, it seems it actually produces the same output without na.strings=c("")...

> bt <- "A,B,C,D
+ ,\"\",20"
>
> b<- read.csv(text=bt)
> b
   A  B  C  D
1 NA NA 20 NA

@antoniobarbuzzi

I agree with @HyukjinKwon and comment #12904 (comment), except for the third use case; i.e., I'd rather do this:

With the option, nullValue set to ""
"","a", should be converted to Row("", "a", null)

In fact, just like the univocity parser does here:

if the parser does not read any character from the input, and the input is within quotes, the empty value is used instead of an empty string

Regarding the PR, I'd delegate the handling of this to the univocity parser, setting settings.setEmptyValue(params.emptyValue) in univocity's CSVSettings here and here too.

@SparkQA commented Nov 3, 2016

Test build #3410 has finished for PR 12904 at commit c7aa4aa.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@sureshthalamati (Contributor, Author)

I was testing the fix with the different scenarios mentioned in the comments. I cannot make the CSV writer write a quoted empty string for empty strings in the data. One of the issues I filed got fixed, but I still cannot make it work.

uniVocity/univocity-parsers#123

@gatorsmile (Member)

What is the status of this PR?

@gatorsmile (Member)

@falaki Should we continue this PR?

@MaxGekk (Member) commented Jul 8, 2018

The issue has already been solved by 7a2d489. This PR can be closed.
