Conversation

@dilipbiswal (Contributor) commented Feb 7, 2018

What changes were proposed in this pull request?

Below are the two cases.

Case 1

scala> List.empty[String].toDF().rdd.partitions.length
res18: Int = 1

When we write the above data frame as parquet, we create a parquet file containing
just the schema of the data frame.

Case 2

scala> val anySchema = StructType(StructField("anyName", StringType, nullable = false) :: Nil)
anySchema: org.apache.spark.sql.types.StructType = StructType(StructField(anyName,StringType,false))
scala> spark.read.schema(anySchema).csv("/tmp/empty_folder").rdd.partitions.length
res22: Int = 0

For the second case, since the number of partitions is 0, we don't launch the write task (the task has the logic to create the metadata-only parquet file).

The fix is to create a dummy single-partition RDD and set up the write task based on it, to ensure
the metadata-only file is written.
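
A minimal end-to-end reproduction in spark-shell (paths are placeholders and output is abbreviated; before this fix, the final read failed with an AnalysisException along the lines of "Unable to infer schema for Parquet" because the target directory contained only a _SUCCESS file):

scala> import org.apache.spark.sql.types._
scala> val anySchema = StructType(StructField("anyName", StringType, nullable = false) :: Nil)
scala> val emptyDf = spark.read.schema(anySchema).csv("/tmp/empty_folder")
scala> emptyDf.rdd.partitions.length
res0: Int = 0
scala> emptyDf.write.mode("overwrite").parquet("/tmp/empty_out")
scala> spark.read.parquet("/tmp/empty_out").printSchema()   // succeeds once the metadata-only file is written
root
 |-- anyName: string (nullable = true)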

How was this patch tested?

A new test is added to DataFrameReaderWriterSuite.

@gatorsmile (Member)

Update the title to [SPARK-23271] [SQL] Parquet output contains only _SUCCESS file after writing an empty dataframe

@gatorsmile (Member)

cc @cloud-fan @zsxwing

Contributor Author

It's not legal to write an empty struct in parquet; this is explained by Herman in SPARK-20593. Previously, we didn't set up a write task for this, whereas now with this fix we do.

Member

Nit: extra space before df1

Contributor Author

@gatorsmile Thank you. Fixed.

@dilipbiswal changed the title SPARK-23271 Parquet output contains only _SUCCESS file after writing an empty dataframe to [SPARK-23271][SQL] Parquet output contains only _SUCCESS file after writing an empty dataframe on Feb 7, 2018
@SparkQA commented Feb 7, 2018

Test build #87148 has finished for PR 20525 at commit 2764b1c.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.


Looks like a shuffle will happen here if the number of partitions is zero. If so, maybe another solution is possible?

Contributor

Yea, the shuffle can be avoided. We can just launch a write task for an empty RDD, instead of calling rdd.repartition(1).

Contributor Author

@cloud-fan @pashazm I was thinking, writing empty datasets would not be a regular event, right? Should we even be optimizing this path? Secondly, is shuffling an empty dataset that expensive?

@cloud-fan, actually I had tried to launch a write task for an empty RDD, but was hitting a NullPointerException from the scheduler. It looks like things are set up to only work off partitions of an RDD. Could we try to create this empty metadata file from the driver in this case? If we go that route, we may have to refactor the write task code. That seems like a lot for this little corner case; what do you think?

Contributor

You could try coalesce(1); that should not shuffle.

Contributor Author

@hvanhovell Thanks. I have a question: can we go from zero partitions to one partition with coalesce()? In the code we seem to be doing min(prevPartitions, requestedPartitions) to set the target number of partitions.

Contributor Author

@hvanhovell Just tried. We stay at numPartitions = 0 after coalesce(). So it does not fix the problem.
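
For reference, a quick spark-shell check of why coalesce(1) cannot help here (a sketch; sc is the usual SparkContext): coalesce can only reduce the partition count, so an RDD that starts with zero partitions stays at zero.

scala> sc.emptyRDD[Int].partitions.length
res0: Int = 0

scala> sc.emptyRDD[Int].coalesce(1).partitions.length
res1: Int = 0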

Contributor

One simple way to fix it: create an empty 1-partition RDD and use it here.

Contributor

Yea, you can have

sparkSession.sparkContext.parallelize(Array.empty[InternalRow])

Contributor Author

@cloud-fan @jiangxb1987 Thanks a LOT. This works perfectly.

@SparkQA commented Feb 7, 2018

Test build #87149 has finished for PR 20525 at commit 95db6d6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

If this is specific to parquet, can we have this in ParquetFileFormatSuite instead?

Contributor Author

@dongjoon-hyun Thank you. Let me check if we have a similar issue for ORC. If not, I will move it to ParquetFileFormatSuite.

@dongjoon-hyun (Member) commented Feb 7, 2018

Thank you @dilipbiswal.
I checked with ORC, too. Your patch works for ORC as well; I mean it keeps the schema, although it creates a file.
In this suite, can you extend the test case for ORC too?

Contributor Author

Thank you very much @dongjoon-hyun. You are super quick :-). Yes, I will add the test case for ORC.

@dongjoon-hyun (Member) commented Feb 7, 2018

Ur, FileBasedDataSourceSuite may be more suitable. It has a similar test case. You can add your test case there in a similar manner.
https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala#L59-L73
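
For illustration, one hypothetical shape such a test could take in FileBasedDataSourceSuite, covering both formats (a sketch only, not the exact test added by this PR; it assumes the suite's usual helpers such as withTempPath and the import of org.apache.spark.sql.functions.lit, and uses the empty-dataframe trick suggested later in this thread):

Seq("parquet", "orc").foreach { format =>
  test(s"SPARK-23271: 0-partition dataframe written as $format keeps the schema") {
    withTempPath { path =>
      // spark.emptyDataFrame has no rows; selecting a literal gives it a schema.
      val df = spark.emptyDataFrame.select(lit(1).as("i"))
      df.write.format(format).save(path.getCanonicalPath)
      // The metadata-only file lets schema inference work on read.
      val readBack = spark.read.format(format).load(path.getCanonicalPath)
      assert(readBack.schema.fieldNames.toSeq === Seq("i"))
      assert(readBack.count() === 0)
    }
  }
}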

Contributor Author

@dongjoon-hyun Sure. Will take a look. Thanks !!

@SparkQA commented Feb 8, 2018

Test build #87182 has finished for PR 20525 at commit 9536469.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

sparkSession.sparkContext.parallelize(Array.empty[InternalRow], 1)
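
Putting these suggestions together, the change in FileFormatWriter then looks roughly like the following (a paraphrased sketch, not the exact patch; plan and sparkSession come from the enclosing method, and the variable name follows the review suggestion below):

import org.apache.spark.sql.catalyst.InternalRow

// If the physical plan produced a 0-partition RDD, substitute an empty
// 1-partition RDD so that exactly one write task runs and emits the
// metadata-only file.
val rdd = plan.execute()
val rddWithNonEmptyPartitions = if (rdd.partitions.length == 0) {
  sparkSession.sparkContext.parallelize(Array.empty[InternalRow], 1)
} else {
  rdd
}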

Contributor

An easier way to create an empty dataframe: spark.emptyDataFrame.select(lit(1).as("i"))

Contributor

unnecessary change.

Contributor Author

will remove

Contributor

please remove it

Contributor

how about val rddWithNonEmptyPartitions ...

Contributor Author

Sure.

@SparkQA commented Feb 8, 2018

Test build #87193 has finished for PR 20525 at commit 37343a8.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 8, 2018

Test build #87195 has finished for PR 20525 at commit 92f490e.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 8, 2018

Test build #87200 has finished for PR 20525 at commit cb73001.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal (Contributor Author)

retest this please

@gatorsmile (Member) commented Feb 8, 2018

BTW, this is a behavior change. We need to document it in the migration guide.

@SparkQA commented Feb 8, 2018

Test build #87204 has finished for PR 20525 at commit cb73001.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal (Contributor Author)

retest this please

@SparkQA commented Feb 8, 2018

Test build #87211 has finished for PR 20525 at commit cb73001.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal (Contributor Author)

@gatorsmile Thanks. I will create a doc PR and address it.

@cloud-fan (Contributor)

I think it's better to have the doc change in the same PR; then it's clearer which patch caused the behavior change.

@dilipbiswal (Contributor Author)

@cloud-fan Actually I had already created the doc PR in the morning using the same JIRA number. Wenchen, if we want to have both changes in the same commit, will we be able to do it when we merge the patch? If not, please let me know; I will close that PR and move the change over to this branch.

@cloud-fan (Contributor)

No, we can't merge two PRs together. Please pick one of your PRs and put all the changes there, thanks!

@dilipbiswal (Contributor Author)

@cloud-fan @gatorsmile Done.

Contributor

Since Spark 2.3, writing an empty dataframe to a directory launches at least one write task, even physically the dataframe has no partition. This introduces a small behavior change that for self-described file formats like Parquet and Orc, Spark creates a metadata-only file in the target directory when writing 0-partition dataframe, so that schema inference can still work if users read that directory later. The new behavior is more reasonable and more consistent regarding writing empty dataframe.

Contributor Author

even -> even if ?
self-described -> self-describing ?
@cloud-fan Nicely written. Thanks. Let me know if you are OK with the above two changes?

Contributor

Yea, the above two changes are good!

Contributor Author

"launches at least one write task"
Actually, isn't it exactly one write task? I am okay with what you have; just wanted to check to make sure.

Contributor

How does it fail? If it's a runtime error, we should fail earlier during analysis. This is worth a new JIRA.

Contributor Author

@cloud-fan I forgot :-) I will double-check and get back.

Contributor Author

@cloud-fan It fails in the executor like this -

org.apache.parquet.schema.InvalidSchemaException: Cannot write a schema with an empty group: message spark_schema {
}

	at org.apache.parquet.schema.TypeUtil$1.visit(TypeUtil.java:27)
	at org.apache.parquet.schema.TypeUtil$1.visit(TypeUtil.java:37)
	at org.apache.parquet.schema.MessageType.accept(MessageType.java:58)
	at org.apache.parquet.schema.TypeUtil.checkValidWriteSchema(TypeUtil.java:23)
	at org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:225)
	at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:342)
	at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:302)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:37)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:151)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:376)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:387)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:278)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:276)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:281)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:206)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:205)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.
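
For reference, a minimal way to hit this failure (a hedged sketch, matching the issue filed below as SPARK-23372) is to write a dataframe whose schema has no fields:

scala> spark.emptyDataFrame.write.parquet("/tmp/empty_struct")
// Fails at execution time in the write task with the
// InvalidSchemaException above, instead of failing during analysis.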

Contributor Author

Let me study the code to see how we can fail earlier.

Contributor

Let's open a JIRA. We can fix it in another PR.

@dilipbiswal (Contributor Author) commented Feb 9, 2018

@cloud-fan OK Wenchen. Created SPARK-23372 - FYI

@zsxwing (Member) commented Feb 9, 2018

@tdas @brkyvz Do we still need the fix for 0-partition DataFrame in Structured Streaming after this change?

Contributor

nit: rddWithNonEmptyPartitions.partitions.indices

@dilipbiswal (Contributor Author) commented Mar 8, 2018

Made the change. Learnt a new trick today :-)

@cloud-fan (Contributor)

LGTM

@jiangxb1987 (Contributor)

LGTM

@SparkQA commented Mar 8, 2018

Test build #88072 has finished for PR 20525 at commit f28324a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 8, 2018

Test build #88078 has finished for PR 20525 at commit aa29eba.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 8, 2018

Test build #88099 has finished for PR 20525 at commit bc48bbd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

thanks, merging to master!

@asfgit closed this in d90e77b on Mar 8, 2018
@dilipbiswal (Contributor Author)

@cloud-fan @jiangxb1987 Thank you very much !!

@HyukjinKwon (Member)

late LGTM too.
