
Conversation

@sirpkt (Contributor) commented Aug 25, 2015

I added a check routine to saveAsHiveFile() in InsertIntoHiveTable.
It checks whether the given partition has data in it
and creates/writes the file only when it actually does.

@andrewor14 (Contributor)

@yhuai @liancheng

@AmplabJenkins

Can one of the admins verify this patch?

@yhuai (Contributor) commented Nov 20, 2015

@sirpkt Will you have time to bring it up to date? Also, can we add the same logic in https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriterContainer.scala (for data source table)?

@asfgit asfgit closed this in 8d05a7a May 17, 2016
asfgit pushed a commit that referenced this pull request May 17, 2016
… group by query

## What changes were proposed in this pull request?

Currently, `INSERT INTO` with a `GROUP BY` query tries to create at least 200 files (the default value of `spark.sql.shuffle.partitions`), which results in lots of empty files.

This PR avoids creating empty files when overwriting into a Hive table and when writing through internal data sources with a group-by query.

It checks whether the given partition has data in it and creates/writes the file only when it actually does.
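The idea behind the check can be sketched in a few lines of plain Scala. This is a hypothetical illustration, not the actual Spark code: `writePartition` and the `write` callback stand in for the per-partition write path, and the point is simply that the writer is never invoked (so no file is ever created) when the partition's row iterator is empty.

```scala
object EmptyPartitionSkip {
  // Hypothetical sketch of the PR's check: peek at the iterator with
  // hasNext before doing any output work. Returns true only if rows
  // were present and written, i.e. only if a file would be created.
  def writePartition[T](rows: Iterator[T])(write: T => Unit): Boolean = {
    if (rows.hasNext) {
      rows.foreach(write) // data present: open the writer and emit the file
      true
    } else {
      false // empty partition: skip file creation entirely
    }
  }
}
```

With 200 shuffle partitions and only a handful of distinct group-by keys, most partitions hit the `false` branch and no empty files are left behind.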

## How was this patch tested?

Unit tests in `InsertIntoHiveTableSuite` and `HadoopFsRelationTest`.

Closes #8411

Author: hyukjinkwon <[email protected]>
Author: Keuntae Park <[email protected]>

Closes #12855 from HyukjinKwon/pr/8411.

(cherry picked from commit 8d05a7a)
Signed-off-by: Michael Armbrust <[email protected]>
