[SPARK-23271][SQL] Parquet output contains only _SUCCESS file after writing an empty dataframe #20525
Conversation
Update the title to
It's not legal to write an empty struct in parquet; this is explained by Herman in SPARK-20593. Previously we didn't set up a write task for this case, whereas now with this fix we do.
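As a concrete illustration of that constraint (a sketch only; the output path is made up), a dataframe whose schema has zero columns trips parquet's check as soon as a write task is actually set up, which is exactly the executor-side stack trace shown later in this thread:
// Illustrative: a zero-column schema cannot be written as parquet.
spark.emptyDataFrame.write.parquet("/tmp/zero-column-example")
// org.apache.parquet.schema.InvalidSchemaException:
//   Cannot write a schema with an empty group: message spark_schema { }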
Nit: extra space before df1
@gatorsmile Thank you. Fixed.
Test build #87148 has finished for PR 20525 at commit
Looks like a shuffle will happen here if the number of partitions is zero. If so, maybe another solution is possible?
Yea, the shuffle can be avoided. We can just launch a write task for an empty RDD, instead of calling rdd.repartition(1).
@cloud-fan @pashazm I was thinking, writing empty datasets would not be a regular event, right? Should we even be optimizing this path? Secondly, is shuffling an empty dataset that expensive?
@cloud-fan, actually I had tried to launch a write task for an empty RDD, but was hitting a NullPointerException from the scheduler. Looks like things are set up to only work off of partitions of an RDD. Could we try to create this empty metadata file from the driver in this case? If we go that route, then we may have to refactor the write task code. Seems like a lot for this little corner case, what do you think?
You could try coalesce(1); that should not shuffle.
@hvanhovell Thanks. I have a question. Can we go from zero partitions to one partition with coalesce()? In the code we seem to be doing a min(prevPartition, requestedPartition) to set the target number of partitions (code).
@hvanhovell Just tried. We stay at numPartitions = 0 after coalesce(). So it does not fix the problem.
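A quick illustration of why coalesce(1) does not help here (a sketch, assuming a local SparkSession named spark): coalesce only lowers the partition count toward the requested value, so an RDD that already has zero partitions stays at zero.
val empty = spark.sparkContext.emptyRDD[Int]
empty.getNumPartitions                // 0
empty.coalesce(1).getNumPartitions    // still 0, per the min(...) noted above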
One simple way to fix it: create an empty 1-partition RDD and use it here.
Yea, you can have
sparkSession.sparkContext.parallelize(Array.empty[InternalRow])
@cloud-fan @jiangxb1987 Thanks a LOT. This works perfectly.
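Putting the suggestions from this thread together, a self-contained sketch of the idea (names are illustrative and String stands in for InternalRow; this is not the exact FileFormatWriter patch):
import org.apache.spark.sql.SparkSession

object NonEmptyPartitionsExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("SPARK-23271 sketch").getOrCreate()
    val sc = spark.sparkContext
    val rdd = sc.emptyRDD[String] // stands in for the query's output RDD
    // If there are no partitions, substitute an empty single-partition RDD so that
    // exactly one (no-op) write task still runs and emits the metadata-only file.
    val rddWithNonEmptyPartitions =
      if (rdd.partitions.isEmpty) sc.parallelize(Array.empty[String], 1) else rdd
    println(rddWithNonEmptyPartitions.getNumPartitions) // 1
    spark.stop()
  }
}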
Test build #87149 has finished for PR 20525 at commit
If this is specific to parquet, can we have this in ParquetFileFormatSuite instead?
@dongjoon-hyun Thank you. Let me check if we have a similar issue for ORC. If not, I will move it to ParquetFileFormatSuite.
Thank you @dilipbiswal.
I checked with ORC, too. Your patch works for ORC as well; I mean it keeps the schema although it creates a file.
In this suite, can you extend the test case for ORC too?
Thank you very much @dongjoon-hyun. You are super quick :-). Yes, I will add the test case for ORC.
Ur, FileBasedDataSourceSuite may be more suitable. It has a similar test case. You can add your test case there in a similar manner.
https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala#L59-L73
@dongjoon-hyun Sure. Will take a look. Thanks!!
Test build #87182 has finished for PR 20525 at commit
sparkSession.sparkContext.parallelize(Array.empty[InternalRow], 1)
An easier way to create an empty dataframe: spark.emptyDataFrame.select(lit(1).as("i"))
unnecessary change.
will remove
please remove it
how about val rddWithNonEmptyPartitions ...
Sure.
Test build #87193 has finished for PR 20525 at commit
Test build #87195 has finished for PR 20525 at commit
Test build #87200 has finished for PR 20525 at commit
retest this please
BTW, this is a behavior change. We need to document it in the migration guide.
Test build #87204 has finished for PR 20525 at commit
retest this please
Test build #87211 has finished for PR 20525 at commit
@gatorsmile Thanks. I will create a doc PR and address it.
I think it's better to have the doc change in the same PR, so it's clearer which patch caused the behavior change.
@cloud-fan Actually I had already created the doc PR in the morning using the same JIRA number. Wenchen, if we want to have both changes in the same commit, will we be able to do it when we merge the patch? If not, please let me know and I will close that PR and move the change over to this branch.
No, we can't merge 2 PRs together. Please pick one of your PRs and put all the changes there, thanks!
@cloud-fan @gatorsmile Done.
docs/sql-programming-guide.md
Since Spark 2.3, writing an empty dataframe to a directory launches at least one write task, even physically the dataframe has no partition. This introduces a small behavior change that for self-described file formats like Parquet and Orc, Spark creates a metadata-only file in the target directory when writing 0-partition dataframe, so that schema inference can still work if users read that directory later. The new behavior is more reasonable and more consistent regarding writing empty dataframe.
even -> even if ?
self-described -> self-describing ?
@cloud-fan Nicely written. Thanks. Let me know if you are ok with the above two changes?
yea the above 2 changes are good!
"launches at least one write task"
Actually isn't it exactly one write task? I am okay with what you have. Just wanted to check to make sure.
How does it fail? If it's a runtime error we should fail earlier during analysis. This is worth a new JIRA.
@cloud-fan I forgot :-) I will double check and get back.
@cloud-fan It fails in the executor like this -
org.apache.parquet.schema.InvalidSchemaException: Cannot write a schema with an empty group: message spark_schema {
}
at org.apache.parquet.schema.TypeUtil$1.visit(TypeUtil.java:27)
at org.apache.parquet.schema.TypeUtil$1.visit(TypeUtil.java:37)
at org.apache.parquet.schema.MessageType.accept(MessageType.java:58)
at org.apache.parquet.schema.TypeUtil.checkValidWriteSchema(TypeUtil.java:23)
at org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:225)
at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:342)
at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:302)
at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:37)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:151)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:376)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:387)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:278)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:276)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:281)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:206)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:205)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.
Let me study the code to see how we can fail earlier.
Let's open a JIRA. We can fix it in another PR.
@cloud-fan OK Wenchen. Created SPARK-23372 - FYI
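Purely as an illustration of the "fail earlier" idea tracked by SPARK-23372 (not the actual change made for that JIRA), a pre-write validation could look roughly like this:
import org.apache.spark.sql.types.StructType

// Hypothetical helper: reject an empty schema before any write task is launched,
// instead of failing at runtime inside the executor as shown in the stack trace above.
def assertNonEmptyWriteSchema(schema: StructType, format: String): Unit = {
  require(schema.nonEmpty,
    s"Cannot write a dataframe with an empty schema using the $format data source.")
}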
nit: rddWithNonEmptyPartitions.partitions.indices
Made the change. Learnt a new trick today :-)
LGTM
LGTM
Test build #88072 has finished for PR 20525 at commit
Test build #88078 has finished for PR 20525 at commit
Test build #88099 has finished for PR 20525 at commit
thanks, merging to master!
@cloud-fan @jiangxb1987 Thank you very much!!
late LGTM too. |
What changes were proposed in this pull request?
Below are the two cases.
Case 1
When we write an empty data frame that still has at least one partition as parquet, we create a parquet file containing just the schema of the data frame.
Case 2
For the 2nd case, since the number of partitions is 0, we don't call the write task (the task has the logic to create the empty, metadata-only parquet file), so the output directory contains only the _SUCCESS file.
The fix is to create a dummy single-partition RDD and set up the write task based on it to ensure the metadata-only file is written.
How was this patch tested?
A new test is added to DataFrameReaderWriterSuite.
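For reference, a small self-contained sketch of the behavior this patch establishes (not the actual test code; the path handling and names are illustrative, and it assumes Spark 2.3+ on the classpath):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

object EmptyDataFrameWriteExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("SPARK-23271 example").getOrCreate()
    val out = java.nio.file.Files.createTempDirectory("spark-23271").resolve("data").toString

    // An empty dataframe with a schema (i: int) but zero rows and zero partitions.
    val df = spark.emptyDataFrame.select(lit(1).as("i"))
    df.write.parquet(out)

    // With this fix, the directory contains a metadata-only parquet file in
    // addition to _SUCCESS, so schema inference works when reading it back.
    val readBack = spark.read.parquet(out)
    readBack.printSchema() // i: integer
    assert(readBack.count() == 0)
    spark.stop()
  }
}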