[SPARK-10216][SQL] Avoid creating empty files during overwriting with group by query #12855
Conversation
|
I submitted this PR because #8411 looks abandoned and the author has not responded since the last comment by a committer (it has been inactive for almost half a year). |
|
@yhuai Could you please take a look? |
|
Test build #57579 has finished for PR 12855 at commit
|
|
Test build #57581 has finished for PR 12855 at commit
|
|
Test build #57587 has finished for PR 12855 at commit
|
|
Should we have the same logic for data sources? |
|
@rxin I thought so as well, but I haven't tested it yet. Could I look into that and open another PR if this one is merged (would that be okay)? |
|
Can you look at it together with this? Seems like a good logical grouping and arguably data sources are more important than the Hive ones. |
|
@rxin Sure, I will, thanks. |
|
@rxin I found the same issue in the internal data sources. I just added the same logic and a test in HadoopFsRelationTest. |
case t: Throwable =>
  throw new SparkException("Task failed while writing rows", t)
}
if (iterator.hasNext) {
Simply added an iterator.hasNext check.
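For readers skimming the diff, a minimal, self-contained sketch of the guard being discussed; the names (`writePartition`, `newWriter`) are illustrative placeholders, not the actual Spark internals touched here:

```scala
// Sketch of the pattern: only create the output file when the partition
// iterator actually has rows, so empty partitions leave no files behind.
// `newWriter` stands in for whatever factory creates the per-task output file.
def writePartition[T](iterator: Iterator[T], newWriter: () => java.io.Writer): Unit = {
  if (iterator.hasNext) {        // skip empty partitions entirely
    val writer = newWriter()     // the output file is only created here
    try {
      iterator.foreach(row => writer.write(row.toString + "\n"))
    } finally {
      writer.close()
    }
  }
}
```

Without the `hasNext` guard, a writer (and hence a file) is created even for partitions that produce no rows.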
|
Test build #57624 has finished for PR 12855 at commit
|
|
Test build #57625 has finished for PR 12855 at commit
|
|
cc @marmbrus |
|
Hi @marmbrus, could you please take a look? |
sql(
  """
    |INSERT OVERWRITE TABLE table1
    |SELECT count(key), value FROM testDataset GROUP BY value
Seems you want to explicitly control the number of shuffle partitions? Otherwise, this test will not test anything if the number of shuffle partitions happens to be set to 2.
Ah, yes. Thank you!
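As a rough, hypothetical sketch of that suggestion (not the actual `InsertIntoHiveTableSuite` test; the dataset and output path are made up), one way to pin the shuffle partition count so the check is meaningful regardless of the ambient default:

```scala
import org.apache.spark.sql.SparkSession

// Force a large number of shuffle partitions so that, with only a few distinct
// group-by keys, most shuffle partitions are guaranteed to be empty.
val spark = SparkSession.builder()
  .master("local[2]")
  .appName("no-empty-files-sketch")
  .config("spark.sql.shuffle.partitions", "200")
  .getOrCreate()
import spark.implicits._

Seq((1, "a"), (2, "b"), (3, "a")).toDF("key", "value")
  .createOrReplaceTempView("testDataset")

spark.sql("SELECT count(key), value FROM testDataset GROUP BY value")
  .write.mode("overwrite").parquet("/tmp/grouped")

// With the fix, the empty shuffle partitions should leave no part files behind.
val partFiles = new java.io.File("/tmp/grouped")
  .listFiles().count(_.getName.startsWith("part-"))
println(s"part files written: $partFiles")
```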
|
Test build #58662 has finished for PR 12855 at commit
|
|
Test build #58666 has finished for PR 12855 at commit
|
|
Thanks, merging to master and 2.0. |
… group by query

## What changes were proposed in this pull request?

Currently, `INSERT INTO` with `GROUP BY` query tries to make at least 200 files (default value of `spark.sql.shuffle.partition`), which results in lots of empty files. This PR makes it avoid creating empty files during overwriting into Hive table and in internal data sources with group by query. This checks whether the given partition has data in it or not and creates/writes file only when it actually has data.

## How was this patch tested?

Unittests in `InsertIntoHiveTableSuite` and `HadoopFsRelationTest`. Closes #8411

Author: hyukjinkwon <[email protected]>
Author: Keuntae Park <[email protected]>

Closes #12855 from HyukjinKwon/pr/8411.
(cherry picked from commit 8d05a7a)
Signed-off-by: Michael Armbrust <[email protected]>
|
This breaks writing empty dataframes for me. Before this PR I could write empty dataframes without any problems. Now it only writes a _SUCCESS file, and no metadata. Also, it sometimes throws a NullPointerException. Edit: Added JIRA: https://issues.apache.org/jira/browse/SPARK-15393 |
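For context, a rough repro sketch along the lines of what SPARK-15393 describes (the schema and path are illustrative, not taken from the report):

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder().master("local[2]").appName("empty-df-write").getOrCreate()

// An empty DataFrame with an explicit schema.
val schema = StructType(Seq(StructField("value", StringType)))
val emptyDF = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)

// If every (empty) partition is skipped, only _SUCCESS is written and there are
// no Parquet footers left from which to recover the schema on read.
emptyDF.write.mode("overwrite").parquet("/tmp/empty_df")
spark.read.parquet("/tmp/empty_df").printSchema()
```

Before this change the write produced an empty but readable dataset; afterwards only _SUCCESS remains, as described above.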
|
Thanks for reporting. |
|
I'm going to revert this until we figure out the issues. @HyukjinKwon can you reopen? |
|
Sure, I thought you could reopen PRs you created, but if not feel free to create a new one and link. |
|
@marmbrus Sorry for making you revert this; I should have thought this through further before opening this PR. I will try to be more careful. |
|
No worries! |
…rit… This reverts commit 8d05a7a from #12855, which seems to have caused regressions when working with empty DataFrames. Author: Michael Armbrust <[email protected]> Closes #13181 from marmbrus/revert12855. (cherry picked from commit 2ba3ff0) Signed-off-by: Michael Armbrust <[email protected]>
…rit… This reverts commit 8d05a7a from #12855, which seems to have caused regressions when working with empty DataFrames. Author: Michael Armbrust <[email protected]> Closes #13181 from marmbrus/revert12855.
|
I can reproduce the issue that @jurriaan reports on 1.6.0 and on 1.5.2. The issue does not occur on 1.3.1. I have added a comment to the JIRA issue with more detailed instructions on how to reproduce it: https://issues.apache.org/jira/browse/SPARK-15393. Note that this might mean that this PR did not cause the issue. |
|
@DanielMe Yes, actually it seems to be a different issue when you use …. This was reverted because the latter case fails. So it seems the former case has been failing since older versions, and the latter does not fail after this one is reverted. |
|
The issue that @jurriaan reported is still there in Spark 2.1.0. |
What changes were proposed in this pull request?
Currently, `INSERT INTO` with `GROUP BY` query tries to make at least 200 files (default value of `spark.sql.shuffle.partition`), which results in lots of empty files.

This PR makes it avoid creating empty files during overwriting into Hive table and in internal data sources with group by query.
This checks whether the given partition has data in it or not and creates/writes file only when it actually has data.
How was this patch tested?
Unittests in `InsertIntoHiveTableSuite` and `HadoopFsRelationTest`.

Closes #8411