[SPARK-26208][SQL] add headers to empty csv files when header=true #23173
Conversation
Test build #99397 has finished for PR 23173 at commit
Test build #99400 has finished for PR 23173 at commit
It seems this is similar to @HyukjinKwon's PR: #13252
I was not aware of SPARK-15473, thanks. Let me look at @HyukjinKwon's pull request and mark my JIRA as related.
Test build #99405 has finished for PR 23173 at commit
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/OutputWriter.scala (outdated, resolved)
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala (outdated, resolved; 4 threads)
Test build #99467 has finished for PR 23173 at commit
Test build #99477 has finished for PR 23173 at commit
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/OutputWriter.scala (outdated, resolved)
Test build #99485 has finished for PR 23173 at commit
Test build #99486 has finished for PR 23173 at commit
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala (outdated, resolved; 2 threads)
Test build #99519 has finished for PR 23173 at commit
Review comment on:

```scala
override def write(row: InternalRow): Unit = {
  val gen = getGen()
```
Wait, is this going to create a UnivocityGenerator for each record?
Ah, it's getOrElse. Okay, but can we still simplify this logic? It looks a bit confusing. For instance, I think we can do this with a lazy val.
@HyukjinKwon Do you mean creating generator in lazy val?
```scala
lazy val univocityGenerator = {
  val charset = Charset.forName(params.charset)
  val os = CodecStreams.createOutputStreamWriter(context, new Path(path), charset)
  new UnivocityGenerator(dataSchema, os, params)
}
```
The problem is in the close method: you would have to call univocityGenerator.close() there. If the lazy val wasn't instantiated before (empty partition and the header option is false), the generator would be created and closed immediately, and as a result you would get an empty file for the empty partition. That's why I prefer the approach with Option[UnivocityGenerator] in #23052.
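As a dependency-free sketch of the Option approach described above (all names here are illustrative stand-ins, not the real Spark classes; DemoGenerator plays the role of UnivocityGenerator, which writes its header when it is created):

```scala
import java.io.{Closeable, StringWriter}

// Illustrative stand-in for UnivocityGenerator: emits the header on creation.
class DemoGenerator(header: String, out: StringWriter) extends Closeable {
  out.write(header + "\n")
  def write(row: String): Unit = out.write(row + "\n")
  override def close(): Unit = out.flush()
}

// Option-based writer: the generator is created lazily on the first write,
// so close() on an empty partition touches nothing and no output is produced.
class OptionBasedWriter(header: String) extends Closeable {
  val out = new StringWriter()
  private var gen: Option[DemoGenerator] = None

  private def getGen(): DemoGenerator = gen.getOrElse {
    val g = new DemoGenerator(header, out)
    gen = Some(g)
    g
  }

  def write(row: String): Unit = getGen().write(row)
  override def close(): Unit = gen.foreach(_.close())
}

val emptyPartition = new OptionBasedWriter("a,b")
emptyPartition.close()
assert(emptyPartition.out.toString.isEmpty) // nothing written for an empty partition

val nonEmpty = new OptionBasedWriter("a,b")
nonEmpty.write("1,x")
nonEmpty.close()
assert(nonEmpty.out.toString == "a,b\n1,x\n") // header plus the row
```

The key property is that close() never forces generator creation, which is exactly what a bare lazy val cannot guarantee.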
I see the problem. OrcFileFormat uses a flag approach. For instance:
```scala
private var isGeneratorInitiated = false

lazy val univocityGenerator = {
  isGeneratorInitiated = true
  val charset = Charset.forName(params.charset)
  val os = CodecStreams.createOutputStreamWriter(context, new Path(path), charset)
  new UnivocityGenerator(dataSchema, os, params)
}
```

and in the close method:

```scala
if (isGeneratorInitiated) {
  univocityGenerator.close()
}
```
Should be okay to stick to it.
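A dependency-free sketch of this lazy val + flag pattern (names are illustrative; DemoGenerator stands in for UnivocityGenerator and writes its header when created):

```scala
import java.io.{Closeable, StringWriter}

// Illustrative stand-in for UnivocityGenerator: emits the header on creation.
class DemoGenerator(header: String, out: StringWriter) extends Closeable {
  out.write(header + "\n")
  def write(row: String): Unit = out.write(row + "\n")
  override def close(): Unit = out.flush()
}

// lazy val + flag: close() only touches the generator if it was ever forced,
// so closing an empty partition does not create output as a side effect.
class FlagBasedWriter(header: String) extends Closeable {
  val out = new StringWriter()
  private var isGeneratorInitiated = false

  private lazy val univocityGenerator: DemoGenerator = {
    isGeneratorInitiated = true
    new DemoGenerator(header, out)
  }

  def write(row: String): Unit = univocityGenerator.write(row)

  override def close(): Unit = if (isGeneratorInitiated) {
    univocityGenerator.close()
  }
}

val emptyPartition = new FlagBasedWriter("a,b")
emptyPartition.close()
assert(emptyPartition.out.toString.isEmpty) // flag never set, generator never built

val nonEmpty = new FlagBasedWriter("a,b")
nonEmpty.write("1,x")
nonEmpty.close()
assert(nonEmpty.out.toString == "a,b\n1,x\n")
```

As the discussion below notes, this is behaviorally equivalent to the Option approach; the flag simply records whether the lazy val was ever evaluated.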
OK, I changed it to the lazy val and flag.
Frankly speaking, I don't see any reason for this. Now we actually have two flags: isGeneratorInitiated and another one inside the lazy val. And two slightly different approaches: the Option type in JSON and Text, and lazy val + flag in ORC and CSV.
Yeah we have two different approaches, both of which are fine IMHO. I think it's reasonable to clean that up in a follow-up if desired. WDYT @HyukjinKwon ?
I will revert this change to lazy val for now, since it doesn't have anything to do with this pull request or JIRA: the Option approach was introduced in another pull request.
OK. Shouldn't be a big deal.
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala (outdated, resolved)
Test build #99553 has finished for PR 23173 at commit
This reverts commit 238efa5.
Test build #99559 has finished for PR 23173 at commit
HyukjinKwon left a comment:
LGTM
Merged to master.
## What changes were proposed in this pull request?

Add headers to empty CSV files when header=true, because otherwise these files are invalid when reading them back.

## How was this patch tested?

Added a test in CSVSuite for the round trip of an empty DataFrame to a CSV file with headers and back.

Closes apache#23173 from koertkuipers/feat-empty-csv-with-header.
Authored-by: Koert Kuipers <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
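A sketch of that round trip, assuming a local Spark runtime is available; the temp path and app name are illustrative, not taken from the actual CSVSuite test:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder()
  .master("local[1]")
  .appName("empty-csv-header-demo") // illustrative app name
  .getOrCreate()

val schema = StructType(Seq(
  StructField("a", IntegerType),
  StructField("b", StringType)))

// An empty DataFrame with a known schema.
val empty = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)

val path = "/tmp/empty-csv-header-demo" // illustrative path
empty.write.mode("overwrite").option("header", true).csv(path)

// With this change, the part files contain just the header line "a,b",
// so reading the data back with header=true recovers the column names.
val back = spark.read.option("header", true).csv(path)
assert(back.schema.fieldNames.sameElements(Array("a", "b")))
assert(back.count() == 0)
```

Before this change, the part files for empty partitions were fully empty, so a read with header=true could not infer any columns from them.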