[SPARK-21435][SQL] Empty files should be skipped while write to file #18654
Conversation
@HyukjinKwon Thanks for your comment. As you mentioned in #18650 and #17395, the empty result for Parquet can be fixed by leaving the first partition; what about the ORC format? Should the ORC error for empty results also be considered in this patch?

I think ORC can be dealt with separately (the problem is within the ORC source, given my past investigation).
val writeTask =
  if (description.partitionColumns.isEmpty && description.bucketIdExpression.isEmpty) {
    if (sparkPartitionId != 0 && !iterator.hasNext) {
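For context, a minimal sketch of the full task selection around this hunk after the change. EmptyDirectoryWriteTask is the class this PR adds; the other task names and constructor arguments are assumptions based on the surrounding FileFormatWriter code of the time:

val writeTask =
  if (description.partitionColumns.isEmpty && description.bucketIdExpression.isEmpty) {
    if (sparkPartitionId != 0 && !iterator.hasNext) {
      // Non-first empty partitions write nothing at all. Partition 0 still
      // runs even when empty, so file-format metadata (e.g. the Parquet
      // schema and footer) is written for an empty result.
      new EmptyDirectoryWriteTask
    } else {
      new SingleDirectoryWriteTask(description, taskAttemptContext, committer)
    }
  } else {
    new DynamicPartitionWriteTask(description, taskAttemptContext, committer)
  }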
I guess this might be okay, in the sense that, to my knowledge, partition IDs are guaranteed to start from 0. Could you take a look and see if it makes sense to you, cc @cloud-fan, if you don't mind? I am not confident enough to proceed with reviewing and leave a sign-off.
This is a little hacky, but I think it is the simplest fix.
cc @hvanhovell
cc @yhuai too, who reviewed my similar PR before.
class FileFormatWriterSuite extends QueryTest with SharedSQLContext {

  test("empty file should be skipped while write to file") {
    withTempDir { dir =>
withTempPath can be used instead I believe.
+1
      dir.delete()
      spark.range(10000).repartition(10).write.parquet(dir.toString)
      val df = spark.read.parquet(dir.toString)
      val allFiles = dir.listFiles(new FilenameFilter {
Can we just do this more simply? For example:

.listFiles().filter { f =>
  f.isFile && !f.getName.startsWith(".") && !f.getName.startsWith("_")
}
Either is OK, I think; I just copied this from HadoopFsRelationSuite.
Yea. If both are okay, let's go for the shorter one.
+1 for the shorter one
          !name.startsWith(".") && !name.startsWith("_")
        }
      })
      assert(allFiles.length == 10)
Could I ask what this test targets? I think I am lost around here ...
It just makes sure the source dir has many files while the output dir only has 2 files.
If this confuses people, should I just leave a note and delete the assert?
But I guess this one (the latter) does not test this change? If this test passes regardless of this PR's change, I would rather remove it.
OK, I'll remove this assert and leave a note.
      withTempDir { dst_dir =>
        dst_dir.delete()
        df.where("id = 50").write.parquet(dst_dir.toString)
I would explicitly repartition here.
Why do we need to repartition?
I was thinking just to make sure of the (previous) number of files written out.
I mean, for example, if we happen to have only a few partitions in the df for whatever reason, I guess this test could become invalid ...
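In other words, the suggestion is something like the line below, so that the expected output file count does not depend on how many partitions the planner happens to produce (the repartition(10) value is illustrative):

df.repartition(10).where("id = 50").write.parquet(dst_dir.toString)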
Test build #79667 has finished for PR 18654 at commit

retest this please
What is the metadata we need to write?

The schema and the footer, in the case of Parquet. There is more context here - #17395 (comment). For example, if we don't write out the empty files, it breaks:

spark.range(100).filter("id > 100").write.parquet("/tmp/abc")
spark.read.parquet("/tmp/abc").show()

Yep, an empty result dir needs this metadata; otherwise reading it will throw the exception:
          !name.startsWith(".") && !name.startsWith("_")
        }
      })
      // First partition file and the data file
Ideally we only need the first partition file if all other partitions are empty, but this is hard to do right now.
Couldn't agree more. At first I tried to implement it like that, but FileFormatWriter.write can only see each task's own iterator.
Test build #79687 has finished for PR 18654 at commit
class FileFormatWriterSuite extends QueryTest with SharedSQLContext {

  test("empty file should be skipped while write to file") {
    withTempPath { dir =>
Could we maybe just do as below?

withTempPath { path =>
  spark.range(100).repartition(10).where("id = 50").write.parquet(path.getAbsolutePath)
  val partFiles = path.listFiles()
    .filter(f => f.isFile && !f.getName.startsWith(".") && !f.getName.startsWith("_"))
  assert(partFiles.length === 2)
}
Clearer :) No need to actually create source files.
Test build #79694 has finished for PR 18654 at commit

Test build #79695 has finished for PR 18654 at commit

retest this please

Test build #79700 has finished for PR 18654 at commit

retest this please

Test build #79704 has finished for PR 18654 at commit

ping @cloud-fan @HyukjinKwon

LGTM, merging to master!
What changes were proposed in this pull request?

Add an EmptyDirectoryWriteTask for empty tasks while writing files. Fix the empty result for the Parquet format by leaving the first partition for metadata writing.

How was this patch tested?

Added a new test in FileFormatWriterSuite.
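A minimal sketch of what the added EmptyDirectoryWriteTask amounts to; the class name is from this PR, while the ExecuteWriteTask signature shown is an assumption about the FileFormatWriter internals of the time:

/** Sketch: a write task for empty partitions that produces no files and updates no partitions. */
private class EmptyDirectoryWriteTask extends ExecuteWriteTask {
  override def execute(iter: Iterator[InternalRow]): Set[String] = {
    // Nothing to consume and nothing to write out.
    Set.empty[String]
  }
  override def releaseResources(): Unit = {}
}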