Conversation

@koertkuipers (Contributor)

What changes were proposed in this pull request?

Add a header to empty CSV files when header=true, because otherwise these files are invalid when read back.

How was this patch tested?

Added a test in CSVSuite that round-trips an empty DataFrame to a CSV file with headers and back.
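The round trip described above can be sketched as follows. This is an illustrative sketch, not the actual CSVSuite test: the output path is hypothetical, and `spark` is assumed to be an active SparkSession.

```scala
// Sketch of the round trip: an empty DataFrame written with header=true
// must come back with the same columns. Before this fix the written files
// contained no header line, so they could not be read back meaningfully.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val schema = StructType(Seq(
  StructField("a", StringType),
  StructField("b", StringType)))
val empty = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)

empty.write.option("header", true).csv("/tmp/empty-csv")  // hypothetical path

val back = spark.read.option("header", true).schema(schema).csv("/tmp/empty-csv")
assert(back.schema == schema && back.count() == 0)
```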

Please review http://spark.apache.org/contributing.html before opening a pull request.


SparkQA commented Nov 28, 2018

Test build #99397 has finished for PR 23173 at commit aad5f09.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Nov 28, 2018

Test build #99400 has finished for PR 23173 at commit 1f897be.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

MaxGekk (Member) commented Nov 28, 2018

This seems similar to @HyukjinKwon's PR: #13252

koertkuipers (Contributor, author) commented Nov 28, 2018

I was not aware of SPARK-15473, thanks. Let me look at @HyukjinKwon's pull request and mark my JIRA as related.


SparkQA commented Nov 29, 2018

Test build #99405 has finished for PR 23173 at commit bfadbf9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Nov 29, 2018

Test build #99467 has finished for PR 23173 at commit 258a1e4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Nov 29, 2018

Test build #99477 has finished for PR 23173 at commit 088c710.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Nov 30, 2018

Test build #99485 has finished for PR 23173 at commit 6f498a0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class CSVInferSchema(options: CSVOptions) extends Serializable


SparkQA commented Nov 30, 2018

Test build #99486 has finished for PR 23173 at commit 29fc6b8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Nov 30, 2018

Test build #99519 has finished for PR 23173 at commit 6165e1a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

override def write(row: InternalRow): Unit = {
  val gen = getGen()
Member:

Wait... is this going to create a UnivocityGenerator for each record?

Member:

Ah, it's getOrElse. Okay, but can we still simplify this logic? It looks a bit confusing. For instance, I think we can do this with a lazy val.

Member:

@HyukjinKwon Do you mean creating the generator in a lazy val?

lazy val univocityGenerator = {
    val charset = Charset.forName(params.charset)
    val os = CodecStreams.createOutputStreamWriter(context, new Path(path), charset)
    new UnivocityGenerator(dataSchema, os, params)
}

The problem is in the close method: you will have to call univocityGenerator.close() there. If the lazy val wasn't instantiated before (an empty partition and the header option is false), the generator will be created and closed immediately, and as a result you will get an empty file for the empty partition. That's why I prefer the approach with Option[UnivocityGenerator] in #23052.
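For reference, the Option-based lifecycle can be modeled with a minimal self-contained sketch. Generator and CsvWriter below are simplified stand-ins, not Spark's actual UnivocityGenerator or CSV writer.

```scala
// Toy model of the Option[Generator] approach: the generator (and its
// output) is created on the first write, and close() only closes it if
// it was ever created, so an empty partition produces no output at all.
import java.io.{StringWriter, Writer}

class Generator(out: Writer) {
  def write(row: Seq[String]): Unit = out.write(row.mkString(",") + "\n")
  def close(): Unit = out.close()
}

class CsvWriter(newOut: () => Writer) {
  private var gen: Option[Generator] = None

  // getOrElse-style lazy creation: open the output on the first row only
  private def getGen(): Generator = gen.getOrElse {
    val g = new Generator(newOut())
    gen = Some(g)
    g
  }

  def write(row: Seq[String]): Unit = getGen().write(row)

  // no-op when no row was ever written, so no empty file is created
  def close(): Unit = gen.foreach(_.close())
}

object Demo {
  def main(args: Array[String]): Unit = {
    val buf = new StringWriter
    val w = new CsvWriter(() => buf)
    w.write(Seq("a", "b"))
    w.close()
    println(buf.toString)  // a,b
  }
}
```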

@HyukjinKwon (Member) commented Dec 1, 2018:

I see the problem. OrcFileFormat uses a flag approach. For instance:

private var isGeneratorInitiated = false

lazy val univocityGenerator = {
  isGeneratorInitiated = true
  val charset = Charset.forName(params.charset)
  val os = CodecStreams.createOutputStreamWriter(context, new Path(path), charset)
  new UnivocityGenerator(dataSchema, os, params)
}

// in close(): only touch the lazy val if it was actually forced
if (isGeneratorInitiated) {
  univocityGenerator.close()
}

Should be okay to stick to it.

Contributor (author):

OK, I changed it to a lazy val plus flag.

Member:

Frankly speaking, I don't see any reason for this. We now effectively have two flags: isGeneratorInitiated and the one the compiler generates inside the lazy val. And we have two slightly different approaches: the Option type in JSON and Text, and lazy val plus flag in ORC and CSV.

Member:

Yeah we have two different approaches, both of which are fine IMHO. I think it's reasonable to clean that up in a follow-up if desired. WDYT @HyukjinKwon ?

Contributor (author):

I will revert this change to lazy val for now, since it doesn't have anything to do with this pull request or JIRA: the Option approach was introduced in another pull request.

Member:

OK. Shouldn't be a big deal.


SparkQA commented Dec 1, 2018

Test build #99553 has finished for PR 23173 at commit 238efa5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Dec 2, 2018

Test build #99559 has finished for PR 23173 at commit 9d5cb7b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member) left a comment:

LGTM

HyukjinKwon (Member):

Merged to master.

@asfgit closed this in c7d95cc on Dec 2, 2018.
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019

Closes apache#23173 from koertkuipers/feat-empty-csv-with-header.

Authored-by: Koert Kuipers <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>