Conversation

@bllchmbrs
Contributor

@bllchmbrs bllchmbrs commented May 11, 2016

What changes were proposed in this pull request?

When a CSV begins with:

  • ,,
    OR
  • "","",

meaning that the first column names are either empty or blank strings and header is specified to be true, then the column name is replaced with C + the index number of that given column. For example, if you were to read in the CSV:

"","second column"
"hello", "there"

Then column names would become "C0", "second column".

This behavior aligns with what currently happens when header is specified to be false in recent versions of Spark.
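As a minimal sketch of the intended behavior (assuming a SparkSession named spark and the two-line CSV above saved as example.csv; both names are illustrative, and the prefix shown is C as described here, though it is renamed to _c later in the review):

```scala
// Reading a CSV whose first header cell is blank, with header=true,
// should yield a renamed column rather than a failure.
val df = spark.read
  .option("header", "true")
  .csv("example.csv")

df.columns             // Array("C0", "second column")
df.select("C0").show() // the renamed column is selectable
```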

Current Behavior in Spark <=1.6

In Spark <=1.6, a CSV with a blank column name becomes a blank string, "", meaning that this column cannot be accessed. However, the CSV reads in without issue.

Current Behavior in Spark 2.0

Spark throws a NullPointerException and will not read the file.

Reproduction in 2.0

https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/346304/2828750690305044/484361/latest.html

How was this patch tested?

A new test was added to CSVSuite to cover this issue. It includes asserts that verify both the previously empty column names and the regular column names can be selected.
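For context, a hypothetical sketch of the shape of such a test (the fixture name, its contents, and the expected column names are illustrative, not the actual CSVSuite code):

```scala
test("blank column names in header are replaced") {
  // Hypothetical fixture whose header row is: "","second column"
  val df = spark.read
    .option("header", "true")
    .csv(testFile("blank-column-name.csv")) // testFile: assumed suite helper

  // Both the renamed (previously blank) column and the regular column
  // should be selectable.
  assert(df.columns === Array("C0", "second column"))
  assert(df.select("C0", "second column").columns.length == 2)
}
```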

@bllchmbrs bllchmbrs changed the title fix SPARK-15264, add test cases [SPARK-15264][SQL] CSV Reader Error on Blank Columns May 11, 2016
@bllchmbrs
Contributor Author

cc: @andrewor14

@HyukjinKwon
Member

(I think it would be nicer if the PR description were filled in.)

@bllchmbrs bllchmbrs changed the title [SPARK-15264][SQL] CSV Reader Error on Blank Columns [SPARK-15264][SQL] CSV Reader Error on Blank Column Names May 11, 2016
@HyukjinKwon
Member

Related to #12904 and #12921.

@andrewor14 Do you mind if I review this?

@SparkQA

SparkQA commented May 11, 2016

Test build #58308 has finished for PR 13041 at commit 17b3c58.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Contributor

sure, you don't have to ask for permission to review a patch

@SparkQA

SparkQA commented May 11, 2016

Test build #58315 has finished for PR 13041 at commit 85d0843.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val header = if (csvOptions.headerFlag) {
-  firstRow
+  firstRow.zipWithIndex.map { case (value, index) =>
+    if (value == "" || value == null) s"C$index" else value
Member


I see Spark allows an empty string as a field name. So, I wonder if we should rename this with the index and the prefix C. Also, I think "" will throw an NPE, whereas an empty string without quotes will produce a correct field, because the default of nullValue is "".

Contributor Author


This code does rename it with the index and the prefix C.

Member

@HyukjinKwon HyukjinKwon May 11, 2016


I mean that if one of the values in the header is an empty string, then I think the field name should be an empty string, since fields named with empty strings apparently work. I tested this by manually giving a schema.

Also, if the header is used for the schema, then I think the names should be kept as they are. We don't really change field names specified in ORC, Parquet or JSON.
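A minimal sketch of the manually-given-schema case referred to above (assuming a SparkSession named spark; the file name is illustrative):

```scala
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// A StructField may be named with an empty string, so a field whose name
// is literally "" can exist when the schema is supplied explicitly
// instead of being inferred from the header row.
val schema = StructType(Seq(
  StructField("", StringType),
  StructField("second column", StringType)
))

val df = spark.read.schema(schema).csv("example.csv")
df.printSchema() // the first field prints with an empty name
```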

@HyukjinKwon
Member

HyukjinKwon commented May 11, 2016

@anabranch First of all, I think it is currently (at least for me) really, really confusing to deal with null, "" and the empty string in CSV. I am trying to make this clear here and there (e.g. #12921). So, I hope we can hold this off at least until that PR is merged.

Secondly, the JSON data source does not support empty strings as field names. I think we should clarify what we want for empty strings. For example, the JSON data source simply ignores empty strings as far as I know, but the CSV data source currently throws an NPE. Also, I think we need to decide whether data sources should support fields named with empty strings at all.

@andrewor14
Contributor

andrewor14 commented May 11, 2016

@HyukjinKwon I think we should rename both null and empty strings. In both cases there's no way to actually query the column. Also, I looked at the other two patches, and although they're related, they really don't block this patch, since this one only touches the header. I think it's OK to merge this one independently of the other two.

if (value == "" || value == null) s"C$index" else value
}
} else {
firstRow.zipWithIndex.map { case (value, index) => s"C$index" }
Contributor


I talked to @marmbrus offline. Elsewhere we use _c0 instead of C0, so we should make that consistent. You're gonna have to change the name both here and in L65.

@andrewor14
Contributor

andrewor14 commented May 11, 2016

The only changes I would suggest here are:

(1) Change the condition to value == null || value.isEmpty || value == csvOptions.nullValue
(2) Rename C0 to _c0

Otherwise this patch LGTM.
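Putting the two suggestions together, a sketch of the resulting header logic (firstRow and csvOptions are taken from the reviewed code above):

```scala
// Condition broadened to also treat csvOptions.nullValue as blank,
// and the generated prefix renamed from C to _c.
val header = if (csvOptions.headerFlag) {
  firstRow.zipWithIndex.map { case (value, index) =>
    if (value == null || value.isEmpty || value == csvOptions.nullValue) s"_c$index"
    else value
  }
} else {
  firstRow.zipWithIndex.map { case (_, index) => s"_c$index" }
}
```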

@andrewor14
Contributor

Also, once you fix those, can you add [SPARK-15274] to the title?

@bllchmbrs bllchmbrs changed the title [SPARK-15264][SQL] CSV Reader Error on Blank Column Names [SPARK-15264][SPARK-15274][SQL] CSV Reader Error on Blank Column Names May 11, 2016
@SparkQA

SparkQA commented May 11, 2016

Test build #58399 has finished for PR 13041 at commit d665bea.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

HyukjinKwon commented May 11, 2016

@andrewor14 They are related because the first row is already parsed by the Univocity parser, in which nullValue is set, here. One of the two PRs handles nullValue to treat "" and the empty string, and the other deals with reading "" and the empty string. It seems nullValue with the Univocity parser currently works differently from what Spark expects in the CSV data source library. This is partly what I meant by really, really confusing.

@HyukjinKwon
Member

HyukjinKwon commented May 11, 2016

@andrewor14 Also, I am cautious about this because the header might intentionally be an empty string, meaning the field name is literally an empty string. For example, to me the header below

,,1,2,3

might mean

["","","1","2","3"]

just like

a,a,1,2,3

will be

["a","a","1","2,"3"]

If this should be supported at all, shouldn't it be supported with a separate option? (I am a bit less sure why duplicated field names (empty strings) should be allowed, though.)

  }
} else {
-  firstRow.zipWithIndex.map { case (value, index) => s"C$index" }
+  firstRow.zipWithIndex.map { case (value, index) => s"_c$index" }
Member

@HyukjinKwon HyukjinKwon May 11, 2016


Why should this be _c?

Contributor


to be consistent with what Spark does with unnamed columns. See my comment above.

Member


I see. Thanks.

@andrewor14
Contributor

@HyukjinKwon I'm not saying they're not related. I'm just saying it's not necessary to block progress on this patch because of the other ones. I agree that we should decide what the proper defaults for nullValue and emptyValue are, but that's a separate issue.

@HyukjinKwon
Member

HyukjinKwon commented May 11, 2016

(I remember, for example, that setting nullValue to "abc" will affect the values that Univocity thinks are null, ending up with duplicated field names "abc". I think we might have to remove this from Univocity and treat all nulls in Spark. I will test and tell you what I think when I get to my computer.)

Oh, sorry, I see what you mean. Yes, this wouldn't be a reason to block this.

@SparkQA

SparkQA commented May 11, 2016

Test build #58415 has finished for PR 13041 at commit d983718.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Could I maybe ask your thoughts on the comment above #13041 (comment)?

@andrewor14
Contributor

I talked with @marmbrus and @yhuai offline about the semantics of blank and empty strings in the header row. For the latter, I can't imagine a user actually intending to name a column the empty string, because you can't actually query the column that way, so we should just rename it.

@andrewor14
Contributor

This patch itself LGTM. I'm going to merge it into master and 2.0. @HyukjinKwon let's move our discussion about nullValue and emptyValue defaults to #12904.

asfgit pushed a commit that referenced this pull request May 12, 2016
Author: Bill Chambers <[email protected]>

Closes #13041 from anabranch/master.

(cherry picked from commit 603f445)
Signed-off-by: Andrew Or <[email protected]>
@asfgit asfgit closed this in 603f445 May 12, 2016
@HyukjinKwon
Member

HyukjinKwon commented May 12, 2016

@andrewor14 Then, do you think this should be supported for the JSON data source as well?

@HyukjinKwon
Member

I am not a committer and I might not have the right to say this, but in my view this change should be included in Spark 2.1 rather than 2.0. I think this change will affect the consistency of the behaviour of field names that are empty strings across Spark SQL, and I think that is not clarified yet.

@marmbrus
Contributor

@HyukjinKwon Spark 2.0 has not been released yet, and thus CSV has never been in a released version of Spark. Why would you want to break compatibility between Spark 2.0 and 2.1 instead of just getting it right from the beginning in 2.0? I think these are exactly the kind of API audits we should be doing before the release.

@HyukjinKwon
Member

HyukjinKwon commented May 12, 2016

@marmbrus I am sorry that I said that without knowing the background well enough. I wanted to say that this might break the consistency of dealing with field names in Spark SQL, but it seems it was merged quickly. I remember being told that the Spark 2.0 release is very close, and I meant that this might have to be made consistent quickly if it is problematic.

@marmbrus
Contributor

marmbrus commented May 12, 2016 via email

