[SPARK-15264][SPARK-15274][SQL] CSV Reader Error on Blank Column Names #13041
Conversation
cc: @andrewor14
(I think it would be nicer if the PR description were filled in.)
Related to #12904 and #12921. @andrewor14 Do you mind if I review this?
Test build #58308 has finished for PR 13041 at commit
Sure, you don't have to ask for permission to review a patch.
Test build #58315 has finished for PR 13041 at commit
```diff
 val header = if (csvOptions.headerFlag) {
-  firstRow
+  firstRow.zipWithIndex.map { case (value, index) =>
+    if (value == "" || value == null) s"C$index" else value
```
I see Spark allows an empty string as a field name. So, I wonder if we should rename this with the index and the prefix `C`. Also, I think `""` will throw an NPE, whereas an empty value without quotes will produce a correct field, because the default `nullValue` is `""`.
This code does rename it with the index and the prefix `C`.
I mean, if one of the values in the header is an empty string, then I think the field name should be an empty string, since apparently fields named with empty strings do work. I tested this by manually giving a schema.
Also, if the header is used for the schema, then I think the names should be kept as they are. We don't change field names specified in ORC, Parquet or JSON.
@anabranch First of all, I think it is currently (at least for me) really confusing to deal with …. Secondly, the JSON data source does not support empty strings as field names. I think we should clarify what we want for empty strings. For example, the JSON data source simply ignores empty strings as far as I know, but the CSV data source currently throws an NPE. Also, I think we need to decide whether data sources are going to support fields named with empty strings or not.
@HyukjinKwon I think we should rename both null and empty strings; in both cases there's no way to actually query the column. Also, I looked at the other two patches, and although they're related they don't really block this patch, since this one only touches the header. I think it's OK to merge this one independently of the other two.
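The renaming being discussed can be sketched in plain Scala, outside Spark (the `normalize` helper and the `C` prefix follow the diff under review; the object name is made up for illustration):

```scala
// Sketch of the header normalization discussed in this thread (not the actual
// Spark code): a null or empty header cell is replaced by a positional name,
// so that the column remains queryable.
object HeaderNormalization {
  def normalize(firstRow: Seq[String]): Seq[String] =
    firstRow.zipWithIndex.map { case (value, index) =>
      if (value == null || value.isEmpty) s"C$index" else value
    }

  def main(args: Array[String]): Unit = {
    val header = normalize(Seq("", "second column", null))
    assert(header == Seq("C0", "second column", "C2"))
    println(header.mkString(","))
  }
}
```

With both null and empty cells renamed, every column gets a non-empty, addressable name, which is the point Andrew makes above.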
```scala
    if (value == "" || value == null) s"C$index" else value
  }
} else {
  firstRow.zipWithIndex.map { case (value, index) => s"C$index" }
```
The only changes I would suggest here are: (1) change the condition to …. Otherwise this patch LGTM.
Also, once you fix those, can you add [SPARK-15274] to the title?
Test build #58399 has finished for PR 13041 at commit
@andrewor14 They are related because the first row is already parsed by the Univocity parser, in which …
@andrewor14 Also, I am careful about this because the header might intentionally be an empty string, meaning the field name is literally an empty string. For example, for me the header below might mean just … will be …. Shouldn't this be supported with a separate option, if it should be supported at all? (I am a bit less sure why duplicated field names (empty strings) should be allowed, though.)
```diff
   }
 } else {
-  firstRow.zipWithIndex.map { case (value, index) => s"C$index" }
+  firstRow.zipWithIndex.map { case (value, index) => s"_c$index" }
```
Why should this be `_c`?
To be consistent with what Spark does with unnamed columns. See my comment above.
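The consistency argument refers to the `header = false` branch in the hunk above, where every column gets a positional `_c`-prefixed name. A minimal sketch of that branch in plain Scala (the object name is made up for illustration):

```scala
// Sketch: with header = false, every column gets a positional name with the
// `_c` prefix (_c0, _c1, ...), matching Spark's convention for unnamed columns.
object NoHeaderNaming {
  def positionalNames(firstRow: Seq[String]): Seq[String] =
    firstRow.zipWithIndex.map { case (_, index) => s"_c$index" }

  def main(args: Array[String]): Unit = {
    println(positionalNames(Seq("hello", "there")).mkString(","))  // _c0,_c1
  }
}
```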
I see. Thanks.
@HyukjinKwon I'm not saying they're not related. I'm just saying it's not necessary to block progress on this patch because of the other ones. I agree that we should decide what the proper defaults for … should be.
Test build #58415 has finished for PR 13041 at commit
Could I maybe ask your thoughts on the comment above, #13041 (comment)?
This patch itself LGTM. I'm going to merge it into master and 2.0. @HyukjinKwon let's move our discussion about ….
## What changes were proposed in this pull request?

When a CSV begins with:

- `,,` OR
- `"","",`

meaning that the first column names are either empty or blank strings, and `header` is specified to be `true`, then the column name is replaced with `C` + the index number of that column. For example, if you were to read in the CSV:

```
"","second column"
"hello", "there"
```

then the column names would become `"C0", "second column"`. This behavior aligns with what currently happens when `header` is specified to be `false` in recent versions of Spark.

### Current behavior in Spark <= 1.6

In Spark <= 1.6, a blank column name becomes a blank string, `""`, meaning that the column cannot be accessed; however, the CSV reads in without issue.

### Current behavior in Spark 2.0

Spark throws a NullPointerException and will not read in the file.

#### Reproduction in 2.0

https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/346304/2828750690305044/484361/latest.html

## How was this patch tested?

A new test was added to `CSVSuite` to account for this issue. It asserts that both the renamed empty column names and the regular column names can be selected.

Author: Bill Chambers <[email protected]>
Author: Bill Chambers <[email protected]>

Closes #13041 from anabranch/master.

(cherry picked from commit 603f445)
Signed-off-by: Andrew Or <[email protected]>
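The end-to-end behavior described above can be sketched without Spark, using a naive split on commas and quote stripping in place of Spark's actual CSV parser (the object name and `parseLine` helper are made up for illustration; real CSV parsing is more involved):

```scala
// Minimal sketch of the described behavior: an empty header cell in the first
// row is replaced with "C" + column index, so the column becomes selectable.
object BlankHeaderExample {
  // Naive CSV line parser: split on commas, strip surrounding double quotes.
  def parseLine(line: String): Seq[String] =
    line.split(",", -1).map(_.trim.stripPrefix("\"").stripSuffix("\"")).toSeq

  def main(args: Array[String]): Unit = {
    val csv = Seq("\"\",\"second column\"", "\"hello\",\"there\"")
    val header = parseLine(csv.head).zipWithIndex.map { case (value, index) =>
      if (value == null || value.isEmpty) s"C$index" else value
    }
    assert(header == Seq("C0", "second column"))
    println(header.mkString(", "))  // C0, second column
  }
}
```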
@andrewor14 Then, do you think this should be supported for the JSON data source as well?
I am not a committer and I might not have the right to say this, but in my view this change should not be included in Spark 2.0 but in 2.1. I think this change will affect the consistency of the behaviour of field names with empty strings across Spark SQL, and I think that has not been clarified yet.
@HyukjinKwon Spark 2.0 has not been released yet, and thus CSV has never been in a released version of Spark. Why would you want to break compatibility between Spark 2.0 and 2.1 instead of just getting it right from the beginning in 2.0? I think these are exactly the kind of API audits we should be doing before the release.
@marmbrus I am sorry that I said that without knowing the background well enough. I wanted to say that this might break the consistency of dealing with field names in Spark SQL, but it seems it was merged quickly. I remember being told that the Spark 2.0 release is very close, and I intended to say this might have to be made consistent quickly if it turns out to be problematic.
That's okay, it is good to be cautious around release time. I just wanted to be clear why I thought merging in this case was justified :)