[SPARK-15264][SPARK-15274][SQL] CSV Reader Error on Blank Column Names #13041
Conversation
cc: @andrewor14
(I think it would be nicer if the PR description were filled in.)
Related to #12904 and #12921. @andrewor14 Do you mind if I review this?
Test build #58308 has finished for PR 13041 at commit
Sure, you don't have to ask for permission to review a patch.
Test build #58315 has finished for PR 13041 at commit
```diff
 val header = if (csvOptions.headerFlag) {
-  firstRow
+  firstRow.zipWithIndex.map { case (value, index) =>
+    if (value == "" || value == null) s"C$index" else value
```
I see Spark allows an empty string as a field name. So, I wonder if we should rename this with the index and the prefix `C`. Also, I think `""` will throw an NPE, whereas an empty value without quotes will produce a correct field, because the default `nullValue` is `""`.
This code does rename it with the index and the prefix `C`.
I mean, if one of the values in the header is an empty string, then I think the field name should be an empty string, since apparently fields named with empty strings do work. I tested this by manually giving a schema.
Also, if the header is used for the schema, then I think the names should be kept as they are. We don't change field names specified in ORC, Parquet or JSON.
@anabranch First of all, I think it is currently (at least for me) really confusing to deal with …. Secondly, the JSON data source does not support empty strings as field names. I think we should clarify what we want for empty strings. For example, the JSON data source simply ignores empty strings as far as I know, but the CSV data source currently throws an NPE. Also, I think we need to decide whether data sources are going to support fields named with empty strings or not.
@HyukjinKwon I think we should rename both null and empty strings; in both cases there's no way to actually query the column. Also, I looked at the other two patches, and although they're related they don't really block this patch, since this one only touches the header. I think it's OK to merge this one independently of the other two.
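The renaming being discussed can be sketched in plain Scala, outside Spark (the `normalize` helper and the `C` prefix follow the diff under review; the object name is made up for illustration):

```scala
// Sketch of the header normalization discussed in this thread (not the actual
// Spark code): a null or empty header cell is replaced by a positional name,
// so that the column remains queryable.
object HeaderNormalization {
  def normalize(firstRow: Seq[String]): Seq[String] =
    firstRow.zipWithIndex.map { case (value, index) =>
      if (value == null || value.isEmpty) s"C$index" else value
    }

  def main(args: Array[String]): Unit = {
    val header = normalize(Seq("", "second column", null))
    assert(header == Seq("C0", "second column", "C2"))
    println(header.mkString(","))
  }
}
```

With both null and empty cells renamed, every column gets a non-empty, addressable name, which is the point Andrew makes above.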
```scala
    if (value == "" || value == null) s"C$index" else value
  }
} else {
  firstRow.zipWithIndex.map { case (value, index) => s"C$index" }
```
The only changes I would suggest here are: (1) change the condition to …. Otherwise this patch LGTM.
Also, once you fix those, can you add [SPARK-15274] to the title?
Test build #58399 has finished for PR 13041 at commit
@andrewor14 They are related because the first row is already parsed by the Univocity parser, in which …
@andrewor14 Also, I am careful about this because the header might intentionally be an empty string, meaning the field name is literally an empty string. For example, for me the header below might mean just … will be …. Shouldn't this be supported with a separate option, if it should be supported at all? (I am a bit less sure why duplicated field names (empty strings) should be allowed, though.)
```diff
   }
 } else {
-  firstRow.zipWithIndex.map { case (value, index) => s"C$index" }
+  firstRow.zipWithIndex.map { case (value, index) => s"_c$index" }
```
Why should this be `_c`?
To be consistent with what Spark does with unnamed columns. See my comment above.
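The consistency argument refers to the `header = false` branch in the hunk above, where every column gets a positional `_c`-prefixed name. A minimal sketch of that branch in plain Scala (the object name is made up for illustration):

```scala
// Sketch: with header = false, every column gets a positional name with the
// `_c` prefix (_c0, _c1, ...), matching Spark's convention for unnamed columns.
object NoHeaderNaming {
  def positionalNames(firstRow: Seq[String]): Seq[String] =
    firstRow.zipWithIndex.map { case (_, index) => s"_c$index" }

  def main(args: Array[String]): Unit = {
    println(positionalNames(Seq("hello", "there")).mkString(","))  // _c0,_c1
  }
}
```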
I see. Thanks.
@HyukjinKwon I'm not saying they're not related. I'm just saying it's not necessary to block progress on this patch because of the other ones. I agree that we should decide what the proper defaults for … should be.
Test build #58415 has finished for PR 13041 at commit
Could I maybe ask your thoughts on the comment above, #13041 (comment)?
This patch itself LGTM. I'm going to merge it into master and 2.0. @HyukjinKwon let's move our discussion about ….
## What changes were proposed in this pull request?

When a CSV begins with:

- `,,` OR
- `"","",`

meaning that the first column names are either empty or blank strings, and `header` is specified to be `true`, then the column name is replaced with `C` + the index number of that column. For example, if you were to read in the CSV:

```
"","second column"
"hello", "there"
```

then the column names would become `"C0", "second column"`. This behavior aligns with what currently happens when `header` is specified to be `false` in recent versions of Spark.

### Current behavior in Spark <= 1.6

In Spark <= 1.6, a blank column name becomes a blank string, `""`, meaning that the column cannot be accessed; however, the CSV reads in without issue.

### Current behavior in Spark 2.0

Spark throws a NullPointerException and will not read in the file.

#### Reproduction in 2.0

https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/346304/2828750690305044/484361/latest.html

## How was this patch tested?

A new test was added to `CSVSuite` to account for this issue. It asserts that both the renamed empty column names and the regular column names can be selected.

Author: Bill Chambers <[email protected]>
Author: Bill Chambers <[email protected]>

Closes #13041 from anabranch/master.

(cherry picked from commit 603f445)
Signed-off-by: Andrew Or <[email protected]>
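The end-to-end behavior described above can be sketched without Spark, using a naive split on commas and quote stripping in place of Spark's actual CSV parser (the object name and `parseLine` helper are made up for illustration; real CSV parsing is more involved):

```scala
// Minimal sketch of the described behavior: an empty header cell in the first
// row is replaced with "C" + column index, so the column becomes selectable.
object BlankHeaderExample {
  // Naive CSV line parser: split on commas, strip surrounding double quotes.
  def parseLine(line: String): Seq[String] =
    line.split(",", -1).map(_.trim.stripPrefix("\"").stripSuffix("\"")).toSeq

  def main(args: Array[String]): Unit = {
    val csv = Seq("\"\",\"second column\"", "\"hello\",\"there\"")
    val header = parseLine(csv.head).zipWithIndex.map { case (value, index) =>
      if (value == null || value.isEmpty) s"C$index" else value
    }
    assert(header == Seq("C0", "second column"))
    println(header.mkString(", "))  // C0, second column
  }
}
```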
@andrewor14 Then, do you think this should be supported for the JSON data source as well?
I am not a committer and I might not have the right to say this, but in my view this change should not be included in Spark 2.0 but in 2.1. I think this change will affect the consistency of the behaviour of field names with empty strings across Spark SQL, and I think that has not been clarified yet.
@HyukjinKwon Spark 2.0 has not been released yet, and thus CSV has never been in a released version of Spark. Why would you want to break compatibility between Spark 2.0 and 2.1 instead of just getting it right from the beginning in 2.0? I think these are exactly the kind of API audits we should be doing before the release.
@marmbrus I am sorry that I said that without knowing the background well enough. I wanted to say that this might break the consistency of dealing with field names in Spark SQL, but it seems it was merged quickly. I remember being told that the Spark 2.0 release is very close, and I intended to say this might have to be made consistent quickly if it turns out to be problematic.
That's okay, it is good to be cautious around release time. I just wanted to be clear why I thought merging in this case was justified :)