[SPARK-15125][SQL] Changing CSV data source mapping of empty quoted strings in the data to empty strings instead of null #12904
Conversation
Shouldn't we always infer these as empty strings, and then users can do a simple project to turn them into nulls?
Basically I'm asking: why is this config needed if we just treat them as empty strings?
Thanks for reviewing the PR, Reynold. The only reason I added the option is to keep the current behavior by default. I agree with you that we can always treat them as empty strings. I am also not aware of any scenario where a user wants to treat empty quoted strings as null values. If there is no need to keep the current default behavior, I will update the PR. Please let me know.
Yea, let's update it.
+1 for treating them as empty strings without additional options.
Force-pushed e6207e7 to 2629a7c.
Thank you for the feedback, Reynold and @HyukjinKwon. Updated the PR.
Could I ask what happens if we don't set nullValue?
If nullValue is not set, it returns an empty string for null values by default. The reason I set it explicitly is to make sure my fix is working. Before my fix, it was returning null for an empty quoted string, and an empty string for null values by default.
Here is what I think the CSV data source should produce for the records below. Would this make sense? If so, we need to give a default value.
In the case of writing, I think it should produce the CSV as below.
Sorry, I just updated the writing examples above.
I am not sure what the history was behind returning an empty string for null values. In my opinion, it should be null by default. The current behavior is also inconsistent: for numerics it returns null, and for strings it returns an empty string by default (for example, in the output of scala> df.show). I can update this PR to change the nullValue default if needed.
@HyukjinKwon, was your previous comment meant for some other PR? This PR does not have the change you mentioned above. Am I missing something?
@sureshthalamati Oh, the comments are not related to this PR, but moving the discussion here was suggested, so I did. Sorry if that was confusing. I will move this topic to the dev mailing list or JIRA. (I removed/cleaned up my unrelated comments.)
Hard-coding this is not a good idea. Please add a new option in CSVOption and pass it to the parser. The default value could be "".
Would you also add a regression unit test to make sure this patch also fixes https://issues.apache.org/jira/browse/SPARK-17916?
Thanks for the input, @falaki. Sorry for the delayed reply; somehow I missed the notifications. I will update the patch with the option and also add a test case for SPARK-17916.
… to specify to interpret empty quoted strings as null or an empty string.
… empty string on read/write
Force-pushed 2629a7c to b128fbb.
val permissive = ParseModes.isPermissiveMode(parseMode)

val nullValue = parameters.getOrElse("nullValue", "")
val emptyValue = parameters.getOrElse("emptyValue", "")
When nullValue and emptyValue are both "" by default, don't they conflict?
+1 for documenting the explicit precedence
Yes, null and empty cannot be differentiated when they are set to the same value. Currently, the null-value check has higher precedence than the empty-value check.

input.csv:
1,
2,""

Output will be:
1, null
2, null

I think this behavior is OK. By default, the Univocity CSV parser used in Spark also returns null for empty strings. I agree we should document this behavior.
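The precedence described above can be sketched as a small standalone function. This is illustrative only, not Spark's actual conversion code; the name convert and the Option[String] return type are hypothetical choices for the sketch:

```scala
// Illustrative sketch of the described precedence (not Spark's real code):
// the nullValue comparison runs before the emptyValue comparison, so when
// both options are "" a quoted empty field still comes back as null (None).
def convert(token: String, nullValue: String, emptyValue: String): Option[String] =
  if (token == null || token == nullValue) None // null check has precedence
  else if (token == emptyValue) Some("")        // then the empty-string check
  else Some(token)
```

With nullValue = "" and emptyValue = "", both the `1,` and `2,""` rows above map to None, matching the output shown; only when the two options differ can the empty-string branch be reached.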
FWIW, it seems:

> bt <- "A,B,C,D
+ ,\"\",20"
> b <- read.csv(text=bt, na.strings=c(""))
> b
   A  B  C  D
1 NA NA 20 NA
Hmm, @HyukjinKwon, the behavior in your R example is because of the
Oh, sure it is true. I set
EDITED: Hm, it seems it produces the same output without:

> bt <- "A,B,C,D
+ ,\"\",20"
> b <- read.csv(text=bt)
> b
   A  B  C  D
1 NA NA 20 NA
I agree with @HyukjinKwon and comment #12904 (comment), except for the third use case; I'd rather do this with the option nullValue set to the same value, in fact just like the univocity parser does.
Regarding the PR, I'd delegate the handling of this to the univocity parser, setting the corresponding option.
Test build #3410 has finished for PR 12904 at commit
I was testing the fix with the different scenarios mentioned in the comments. I cannot make
What is the status of this PR?
@falaki Should we continue this PR?
The issue has already been solved by 7a2d489. This PR can be closed.
What changes were proposed in this pull request?
Currently, empty quoted strings in the input CSV file are incorrectly recognized as null values. This patch fixes the parsing so that quoted empty strings in the data (e.g. 1,"") are recognized as empty string values by default.
A new CSV option, emptyValue, is added to allow users to specify the value written to the output CSV file for empty strings, and also the value in an input file that should be interpreted as an empty string (in addition to the quoted empty string).
DATA:
col1,col2
1,"-"
2,""
3,
4,"A
Default old behavior:
scala> val df = spark.read.format("csv").option("nullValue", "-").option("header", true).load("/Users/suresht/sparktests/emptystring.csv")
df: org.apache.spark.sql.DataFrame = [col1: string, col2: string]
scala> df.show
+----+----+
|col1|col2|
+----+----+
| 1|null|
| 2|null|
| 3|null|
| 4| A|
+----+----+
Default new behavior:
scala> val df = spark.read.format("csv").option("nullValue", "-").option("header", true).load("/Users/suresht/sparktests/emptystring.csv")
df: org.apache.spark.sql.DataFrame = [col1: string, col2: string]
scala> df.show
+----+----+
|col1|col2|
+----+----+
| 1|null|
| 2| |
| 3|null|
| 4| A|
+----+----+
Example using the emptyValue option:
val df = spark.createDataFrame(Seq((1, "")))
df.write.format("csv").option("emptyValue", "EMPTY").save("/tmp/data1")
cat part-r-00000-8d867267-c291-4277-9951-a8b969c0a4d8.csv
1,EMPTY
scala> spark.read.format("csv").option("emptyValue", "EMPTY").load("/tmp/data1").show
+---+---+
|_c0|_c1|
+---+---+
| 1| |
+---+---+
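The write side of the proposed emptyValue option could be sketched as follows. This is a hypothetical helper, not the patch's actual code; the name renderField is invented, quoting/escaping is omitted, and nullValue is assumed to default to "" on write as in the examples above:

```scala
// Hypothetical sketch of rendering one CSV field on write: null values are
// replaced by nullValue, empty strings by emptyValue, and everything else
// is written as-is (quoting and escaping omitted for brevity).
def renderField(value: String,
                nullValue: String = "",
                emptyValue: String = ""): String =
  value match {
    case null => nullValue
    case ""   => emptyValue
    case v    => v
  }
```

For example, renderField("", emptyValue = "EMPTY") yields "EMPTY", matching the `1,EMPTY` line written in the example above.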
How was this patch tested?
Added new unit tests to the CSVSuite.
@falaki @rxin