[SPARK-8890][SQL][WIP] Reduce memory consumption for dynamic partition insert #7514
|
Test build #37775 has finished for PR 7514 at commit
|
|
Test build #37776 has finished for PR 7514 at commit
|
|
Test build #37777 has finished for PR 7514 at commit
|
|
Test build #37780 has finished for PR 7514 at commit
|
|
@rxin Where would be the best place to add a test for this functionality? |
|
Test build #37839 has finished for PR 7514 at commit
|
|
Test build #38260 has finished for PR 7514 at commit
|
|
@rxin @davies @JoshRosen Hey all, could I please get a review of these updates? I'd love to get this fix in. |
|
Test build #39237 has finished for PR 7514 at commit
|
This is expensive, we should avoid that.
Is there a preferred way to do this? I could have the HashSet be created once to avoid creating it every time and clear it between calls?
Once we make sure that we only visit each item once, the rows will not be output twice.
The point is that after a sort, everything is reorganized so we may end up traversing some elements that have already been processed, no?
Because the iterator can only be consumed once, we only sort the items that have not been visited yet.
Got it, so just use an `ExternalSorter` based off that iterator to do the sort, to avoid potential memory problems.
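The idea discussed in this thread could be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the code from this PR: `write_partitioned`, `key_fn`, and `max_open_writers` are hypothetical names, plain lists stand in for output writers, and the in-memory `sorted` call stands in for an external sorter that would spill to disk.

```python
from itertools import chain, groupby


def write_partitioned(rows, key_fn, max_open_writers):
    """Write each row to its partition, visiting every row exactly once.

    Returns (output, peak_open), where output maps partition key -> rows
    and peak_open is the largest number of writers open at the same time.
    All names here are hypothetical and for illustration only.
    """
    output = {}            # stand-in for files on disk
    open_writers = set()   # partition keys whose writer is currently open
    peak_open = 0
    it = iter(rows)        # single-pass iterator: consumed rows never reappear

    # Phase 1: keep one open writer per partition seen so far.
    for row in it:
        key = key_fn(row)
        if key not in open_writers and len(open_writers) >= max_open_writers:
            # Too many partitions: close all writers and fall back to sorting.
            open_writers.clear()
            # Only the current row and the rows not yet visited are sorted.
            # (Assumption: a real implementation would use an external sort.)
            remaining = sorted(chain([row], it), key=key_fn)
            # Phase 2: rows arrive grouped by key, so one writer is open at a time.
            for group_key, group in groupby(remaining, key=key_fn):
                peak_open = max(peak_open, 1)
                output.setdefault(group_key, []).extend(group)
            break
        open_writers.add(key)
        peak_open = max(peak_open, len(open_writers))
        output.setdefault(key, []).append(row)

    return output, peak_open


rows = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5), ("c", 6)]
output, peak_open = write_partitioned(rows, key_fn=lambda r: r[0], max_open_writers=2)
```

Because the fallback sort draws from the same iterator that the first phase was consuming, rows already written are never revisited, so no separate set of processed rows is needed.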
|
Test build #39239 has finished for PR 7514 at commit
|
|
Hey, thanks for working on this! Since we are really close to the first 1.5 RC I went ahead and tried an alternative solution based on our external sorter. #8010 Comments welcome :) |
This patch will do the following based on discussion here: https://issues.apache.org/jira/browse/SPARK-8890: