[SPARK-20065][SS][WIP] Avoid to output empty parquet files #17395
Conversation
-    newOutputWriter(fileCounter)
+    // Skip the empty partition to avoid creating a mass of 'empty' files.
+    if (iter.hasNext) {
+      newOutputWriter(fileCounter)
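For context, here is a self-contained toy of the idea in the diff above. The Writer class, writePartition helper, and paths are hypothetical stand-ins for the FileFormatWriter internals, not code from this PR; the point is simply that no writer (and hence no file) is created unless the partition's iterator has at least one row.

object SkipEmptyPartitionSketch {
  // Hypothetical stand-in for an output writer that creates a file when constructed.
  final class Writer(path: String) {
    println(s"created $path")
    def write(row: Long): Unit = ()
    def close(): Unit = println(s"closed $path")
  }

  def writePartition(rows: Iterator[Long], path: String): Unit = {
    // Skip the empty partition to avoid creating a mass of 'empty' files.
    if (rows.hasNext) {
      val writer = new Writer(path)
      try rows.foreach(writer.write) finally writer.close()
    }
  }

  def main(args: Array[String]): Unit = {
    writePartition(Iterator.empty, "/tmp/part-00000")   // nothing created
    writePartition(Iterator(1L, 2L), "/tmp/part-00001") // writer created, rows written
  }
}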
I proposed a similar PR before (about a year ago?) but it got reverted. In this case, Parquet would not write out the footer and schema information. Namely, this will break the case below:
spark.range(100).filter("id > 100").write.parquet("/tmp/abc")
spark.read.parquet("/tmp/abc").show()
To my knowledge we didn't have test cases for this, unless I have missed related PRs; it seems now there is one.
@HyukjinKwon IIUC, this case should fail as expected, as there is no output. Am I missing something?
spark.range(100).filter("id > 100").write.parquet("/tmp/abc")
spark.read.parquet("/tmp/abc").show()
Reading empty data should be fine too; it should preserve the schema. I am pretty sure we want this case, because mine was reverted due to the case above.
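For reference, a quick round-trip sketch of the behavior being defended here (assuming the then-current behavior of writing a footer-only Parquet file for an empty result): the schema survives even though no rows are written.

val empty = spark.range(100).filter("id > 100")   // zero rows, schema [id: bigint]
empty.write.mode("overwrite").parquet("/tmp/abc")
val back = spark.read.parquet("/tmp/abc")
back.printSchema()     // expected to still show the id column
println(back.count())  // 0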
Thanks for the prompt. How about leaving just one empty file containing the metadata when the df has empty partitions? Furthermore, we may leave just one metadata file?
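(As an aside, a rough, purely hypothetical sketch of that idea outside the writer path: if a write ends up producing no data files at all, append a single schema-only DataFrame so downstream reads can still pick up the schema. The helper name and the part-file check below are assumptions, not Spark API.)

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.DataFrame

def ensureSchemaFile(df: DataFrame, outPath: String): Unit = {
  val conf = df.sparkSession.sparkContext.hadoopConfiguration
  val fs = FileSystem.get(new java.net.URI(outPath), conf)
  val dir = new Path(outPath)
  val hasDataFiles =
    fs.exists(dir) && fs.listStatus(dir).exists(_.getPath.getName.startsWith("part-"))
  if (!hasDataFiles) {
    // One empty partition => at most one footer-only file carrying the schema.
    df.limit(0).coalesce(1).write.mode("append").parquet(outPath)
  }
}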
Yes, I was thinking along the same lines. I remember making several attempts at the time but failed to come up with a confident fix, and could not find the time to work on it further.
Another problem is that it might be a datasource-specific issue because, for example, ORC does not write out an empty df:
scala> spark.range(100).filter("id > 100").write.orc("/tmp/abc1")
scala> spark.read.orc("/tmp/abc1").show()
org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. It must be specified manually.;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:182)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:182)
This issue is described in https://issues.apache.org/jira/browse/SPARK-15474.
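A common workaround for that ORC read (a sketch, assuming the column is named id as in the snippet above) is to supply the schema explicitly, so the read does not depend on inferring it from the missing files:

import org.apache.spark.sql.types.{LongType, StructField, StructType}

val orcSchema = StructType(Seq(StructField("id", LongType)))
spark.read.schema(orcSchema).orc("/tmp/abc1").show()  // empty result instead of AnalysisException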
FWIW, I happened to see https://issues.apache.org/jira/browse/SPARK-15693 around that time, and I felt we might be able to consolidate this issue with it, although that is a rough idea.
Test build #75089 has finished for PR 17395 at commit
Test build #75094 has finished for PR 17395 at commit
Let me change this PR into WIP based on the discussion with @HyukjinKwon
Thank you for taking my opinion into account.
Test build #76286 has finished for PR 17395 at commit
Hi @uncleGen, how is it going?
@HyukjinKwon Sorry for the long absence. I will stay online for the next while. Please give me some time.
Yeah, I just pinged because I am interested in this :).
Hmm, @uncleGen, shall we close this for now? Reopening it when it's ready would be welcome.
Problem Description
Reported by Silvio Fiorito
I've got a Kafka topic which I'm querying, running a windowed aggregation, with a 30 second watermark, 10 second trigger, writing out to Parquet with append output mode.
Every 10 second trigger generates a file, regardless of whether there was any data for that trigger, or whether any records were actually finalized by the watermark.
Is this expected behavior or should it not write out these empty files?
As the query executes, do a file listing on "aggPath" and you'll see 339-byte files at a minimum until we arrive at the first watermark and the initial batch is finalized. Even after that, though, as there are empty batches it'll keep generating empty files every trigger.
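A rough reconstruction of the reported setup (the topic name, bootstrap servers, paths, and window size below are assumptions for illustration; only the 30-second watermark, 10-second trigger, append mode, and Parquet sink come from the report; Trigger.ProcessingTime assumes Spark 2.2+):

import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.Trigger

val aggPath = "/tmp/aggPath"                // hypothetical output path
val checkpointPath = "/tmp/agg-checkpoint"  // hypothetical checkpoint location

val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(value AS STRING) AS value", "timestamp")

val counts = events
  .withWatermark("timestamp", "30 seconds")
  .groupBy(window(col("timestamp"), "1 minute"))
  .count()

val query = counts.writeStream
  .outputMode("append")
  .format("parquet")
  .option("path", aggPath)
  .option("checkpointLocation", checkpointPath)
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .start()

// As described above, a new (often empty) Parquet file shows up under aggPath on
// every 10-second trigger, even when the watermark has finalized no windows.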
What changes were proposed in this pull request?
Check whether the partition is empty, and skip empty partitions to avoid writing out empty files.
How was this patch tested?
Jenkins
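As a batch-level sanity check (hypothetical, not part of this PR's test coverage), one could verify that a write with mostly empty partitions produces correspondingly few part files:

import org.apache.hadoop.fs.{FileSystem, Path}

val out = "/tmp/skip-empty-check"
// 10 rows spread over 200 partitions leaves at least 190 partitions empty.
spark.range(10).repartition(200).write.mode("overwrite").parquet(out)

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val partFiles = fs.listStatus(new Path(out)).count(_.getPath.getName.startsWith("part-"))
// With the proposed change this should be at most 10; without it, it can be up to 200.
println(s"part files: $partFiles")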