
Conversation

@sirpkt (Contributor) commented Aug 25, 2015

I added a check routine to saveAsHiveFile() in InsertIntoHiveTable.
It checks whether the given partition has data in it
and creates/writes the file only when it actually does.

@andrewor14 (Contributor)

@yhuai @liancheng

@AmplabJenkins

Can one of the admins verify this patch?

@yhuai (Contributor) commented Nov 20, 2015

@sirpkt Will you have time to bring it up to date? Also, can we add the same logic in https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriterContainer.scala (for data source table)?

@asfgit asfgit closed this in 8d05a7a May 17, 2016
asfgit pushed a commit that referenced this pull request May 17, 2016
… group by query

## What changes were proposed in this pull request?

Currently, `INSERT INTO` with a `GROUP BY` query tries to create at least 200 files (the default value of `spark.sql.shuffle.partitions`), which results in lots of empty files.

This PR avoids creating empty files when overwriting into a Hive table and when writing through internal data sources with a group-by query.

It checks whether the given partition has data in it and creates/writes the file only when it actually does.
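The idea behind the check can be sketched in a few lines of plain Scala. This is a hypothetical illustration, not the actual Spark code: `writePartition` and the `write` callback stand in for the per-partition write path, and the point is simply that the writer is never invoked (so no file is ever created) when the partition's row iterator is empty.

```scala
object EmptyPartitionSkip {
  // Hypothetical sketch of the PR's check: peek at the iterator with
  // hasNext before doing any output work. Returns true only if rows
  // were present and written, i.e. only if a file would be created.
  def writePartition[T](rows: Iterator[T])(write: T => Unit): Boolean = {
    if (rows.hasNext) {
      rows.foreach(write) // data present: open the writer and emit the file
      true
    } else {
      false // empty partition: skip file creation entirely
    }
  }
}
```

With 200 shuffle partitions and only a handful of distinct group-by keys, most partitions hit the `false` branch and no empty files are left behind.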

## How was this patch tested?

Unit tests in `InsertIntoHiveTableSuite` and `HadoopFsRelationTest`.

Closes #8411

Author: hyukjinkwon <[email protected]>
Author: Keuntae Park <[email protected]>

Closes #12855 from HyukjinKwon/pr/8411.

(cherry picked from commit 8d05a7a)
Signed-off-by: Michael Armbrust <[email protected]>
