[SPARK-26208][SQL] add headers to empty csv files when header=true #23173
Conversation
Test build #99397 has finished for PR 23173 at commit
Test build #99400 has finished for PR 23173 at commit
It seems this is similar to @HyukjinKwon's PR: #13252
I was not aware of SPARK-15473, thanks. Let me look at @HyukjinKwon's pull request and mark my JIRA as related.
Test build #99405 has finished for PR 23173 at commit
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/OutputWriter.scala (outdated, resolved)
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala (outdated, resolved; 4 threads)
Test build #99467 has finished for PR 23173 at commit
Test build #99477 has finished for PR 23173 at commit
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/OutputWriter.scala (outdated, resolved)
Test build #99485 has finished for PR 23173 at commit
Test build #99486 has finished for PR 23173 at commit
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala (outdated, resolved; 2 threads)
Test build #99519 has finished for PR 23173 at commit
Review comment on:

```scala
override def write(row: InternalRow): Unit = {
  val gen = getGen()
```
Wait, is this going to create a UnivocityGenerator for each record?
Ah, it's getOrElse. Okay, but can we still simplify this logic? It looks a bit confusing. For instance, I think we can do this with a lazy val.
@HyukjinKwon Do you mean creating generator in lazy val?
```scala
lazy val univocityGenerator = {
  val charset = Charset.forName(params.charset)
  val os = CodecStreams.createOutputStreamWriter(context, new Path(path), charset)
  new UnivocityGenerator(dataSchema, os, params)
}
```
The problem is in the close method: you would have to call univocityGenerator.close() there. If the lazy val wasn't instantiated before (empty partition and the header option is false), the generator would be created and closed immediately, and as a result you would get an empty file for the empty partition. That's why I prefer the approach with Option[UnivocityGenerator] in #23052.
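As a dependency-free sketch of the Option approach described above (all names here are illustrative stand-ins, not the real Spark classes; DemoGenerator plays the role of UnivocityGenerator, which writes its header when it is created):

```scala
import java.io.{Closeable, StringWriter}

// Illustrative stand-in for UnivocityGenerator: emits the header on creation.
class DemoGenerator(header: String, out: StringWriter) extends Closeable {
  out.write(header + "\n")
  def write(row: String): Unit = out.write(row + "\n")
  override def close(): Unit = out.flush()
}

// Option-based writer: the generator is created lazily on the first write,
// so close() on an empty partition touches nothing and no output is produced.
class OptionBasedWriter(header: String) extends Closeable {
  val out = new StringWriter()
  private var gen: Option[DemoGenerator] = None

  private def getGen(): DemoGenerator = gen.getOrElse {
    val g = new DemoGenerator(header, out)
    gen = Some(g)
    g
  }

  def write(row: String): Unit = getGen().write(row)
  override def close(): Unit = gen.foreach(_.close())
}

val emptyPartition = new OptionBasedWriter("a,b")
emptyPartition.close()
assert(emptyPartition.out.toString.isEmpty) // nothing written for an empty partition

val nonEmpty = new OptionBasedWriter("a,b")
nonEmpty.write("1,x")
nonEmpty.close()
assert(nonEmpty.out.toString == "a,b\n1,x\n") // header plus the row
```

The key property is that close() never forces generator creation, which is exactly what a bare lazy val cannot guarantee.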
I see the problem. OrcFileFormat uses a flag approach. For instance:
```scala
private var isGeneratorInitiated = false

lazy val univocityGenerator = {
  isGeneratorInitiated = true
  val charset = Charset.forName(params.charset)
  val os = CodecStreams.createOutputStreamWriter(context, new Path(path), charset)
  new UnivocityGenerator(dataSchema, os, params)
}
```

and in the close method:

```scala
if (isGeneratorInitiated) {
  univocityGenerator.close()
}
```
Should be okay to stick to it.
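A dependency-free sketch of this lazy val + flag pattern (names are illustrative; DemoGenerator stands in for UnivocityGenerator and writes its header when created):

```scala
import java.io.{Closeable, StringWriter}

// Illustrative stand-in for UnivocityGenerator: emits the header on creation.
class DemoGenerator(header: String, out: StringWriter) extends Closeable {
  out.write(header + "\n")
  def write(row: String): Unit = out.write(row + "\n")
  override def close(): Unit = out.flush()
}

// lazy val + flag: close() only touches the generator if it was ever forced,
// so closing an empty partition does not create output as a side effect.
class FlagBasedWriter(header: String) extends Closeable {
  val out = new StringWriter()
  private var isGeneratorInitiated = false

  private lazy val univocityGenerator: DemoGenerator = {
    isGeneratorInitiated = true
    new DemoGenerator(header, out)
  }

  def write(row: String): Unit = univocityGenerator.write(row)

  override def close(): Unit = if (isGeneratorInitiated) {
    univocityGenerator.close()
  }
}

val emptyPartition = new FlagBasedWriter("a,b")
emptyPartition.close()
assert(emptyPartition.out.toString.isEmpty) // flag never set, generator never built

val nonEmpty = new FlagBasedWriter("a,b")
nonEmpty.write("1,x")
nonEmpty.close()
assert(nonEmpty.out.toString == "a,b\n1,x\n")
```

As the discussion below notes, this is behaviorally equivalent to the Option approach; the flag simply records whether the lazy val was ever evaluated.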
OK, I changed it to the lazy val and flag.
Frankly speaking, I don't see any reason for this. Now we actually have two flags: isGeneratorInitiated and another one inside the lazy val. And two slightly different approaches: the Option type in JSON and Text, and lazy val + flag in ORC and CSV.
Yeah we have two different approaches, both of which are fine IMHO. I think it's reasonable to clean that up in a follow-up if desired. WDYT @HyukjinKwon ?
I will revert this change to lazy val for now, since it doesn't have anything to do with this pull request or JIRA: the Option approach was introduced in another pull request.
OK. Shouldn't be a big deal.
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala (outdated, resolved)
Test build #99553 has finished for PR 23173 at commit
This reverts commit 238efa5.
Test build #99559 has finished for PR 23173 at commit
HyukjinKwon left a comment:
LGTM
Merged to master.
## What changes were proposed in this pull request?

Add headers to empty CSV files when header=true, because otherwise these files are invalid when reading them back.

## How was this patch tested?

Added a test in CSVSuite for the round trip of an empty DataFrame to a CSV file with headers and back.

Closes apache#23173 from koertkuipers/feat-empty-csv-with-header.
Authored-by: Koert Kuipers <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
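A sketch of that round trip, assuming a local Spark runtime is available; the temp path and app name are illustrative, not taken from the actual CSVSuite test:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder()
  .master("local[1]")
  .appName("empty-csv-header-demo") // illustrative app name
  .getOrCreate()

val schema = StructType(Seq(
  StructField("a", IntegerType),
  StructField("b", StringType)))

// An empty DataFrame with a known schema.
val empty = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)

val path = "/tmp/empty-csv-header-demo" // illustrative path
empty.write.mode("overwrite").option("header", true).csv(path)

// With this change, the part files contain just the header line "a,b",
// so reading the data back with header=true recovers the column names.
val back = spark.read.option("header", true).csv(path)
assert(back.schema.fieldNames.sameElements(Array("a", "b")))
assert(back.count() == 0)
```

Before this change, the part files for empty partitions were fully empty, so a read with header=true could not infer any columns from them.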