[SPARK-10250][CORE] External group by to handle huge keys #8438
Conversation
(Not as high priority as Spark 1.5.0 things.)
Test build #41571 has finished for PR 8438 at commit
@JoshRosen can someone review? I'll fix the merge conflicts, though.
Test build #42612 has finished for PR 8438 at commit
I'm also not sure what the MiMa tests are. Can someone give me a bit of context so I can fix the build?
Test build #42618 has finished for PR 8438 at commit
Hi @mccheah, on the MiMa tests -- those check for binary compatibility across Spark versions. If you click through to the link for the test build, you'll see it complaining about certain classes in the test output. I think this is a case of false positives, since those classes are …
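For context, Spark's build suppresses MiMa false positives through an exclusion list. A minimal, hedged sketch of what such an entry looks like (the class name here is purely illustrative; the real entries would name whatever classes this build flags, which are not shown above):

```scala
// In project/MimaExcludes.scala (illustrative sketch, not the actual entries):
import com.typesafe.tools.mima.core._

// Suppress a false-positive report for a class that is internal to Spark.
// "org.apache.spark.util.SomeInternalClass" is a hypothetical placeholder.
ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.util.SomeInternalClass")
```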
Hey @mccheah, this is a really cool patch. Although we're trying to encourage users to migrate workloads to the DataFrames API, there are still many workloads for which this is a useful improvement. This patch has high inherent complexity and risk, though, since the details of managing file cleanup are non-trivial, and it needs to touch a number of stable code paths that haven't been modified in a long time. I'd like to review this patch, but I'm a bit too busy with other 1.6 development and design tasks to devote adequate review time right now. I may have some time to review this in a couple of weeks, though. If you'd like to make this easier to review, I have a couple of suggestions (perhaps off-base, since I haven't considered them in detail yet):
The feature flag definitely makes sense - I'll include that. I think the feature is a little risky to use without cleanup at all, so I'd feel more comfortable merging them both at once.
@mccheah, to clarify: I was suggesting merging the changes to ContextCleaner ahead of this change; I agree that the cleanup changes are a blocker to merging this.
Force-pushed from a6d5896 to d01cd4d.
@JoshRosen sorry I lagged a bit on this. I refactored the commit structure to clean things up, but I think we want the ContextCleaner interface refactors to be their own JIRA and PR, as you suggested. The thing is, I don't think dependent pull requests on the same branch are really a thing on GitHub, so to speak. (This is different on GitHub as opposed to something like Gerrit, for instance.)
Ok, we need to merge #8981 before moving forward with this one.
Oh, also: spilling is now behind a feature flag.
Test build #43239 has finished for PR 8438 at commit
Force-pushed from d01cd4d to d87b554.
Test build #43333 has finished for PR 8438 at commit
…tCleaner extends … — preparing the way for SPARK-10250, which will introduce other objects that will be cleaned via detecting their being cleaned up.
Force-pushed from d87b554 to 860811d.
Test build #43400 timed out for PR 8438 at commit
Jenkins, retest this please
I get the feeling that the spilling tests I wrote are really, really slow, so I may need to adjust them.
Test build #43422 timed out for PR 8438 at commit
@mccheah thanks for submitting the patch. Unfortunately we haven't had the bandwidth to review a feature as large as this one, and it affects only … Given that this patch has been idle for more than 2 months, I would recommend that we close this issue.
Sorry, yeah, I haven't had the bandwidth to look at this further. I agree that groupByKey should be avoided anyway, so it might not be worthwhile to pursue this further.
This takes the Python implementation of group-by-key and brings Scala up to parity, making Scala's group-by-key resilient to a single group being too large to fit in memory. It does so by using an ExternalList data structure to hold the groups, where an ExternalList can spill to disk if it grows too large.
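To make the idea concrete, here is a minimal, hypothetical sketch of what a spill-to-disk list can look like. The class and method names are illustrative only (they are not the ones in this patch), the threshold here counts elements rather than estimated bytes, and real Spark code would additionally coordinate with the shuffle memory manager and clean up its files via the ContextCleaner discussed above:

```scala
import java.io.{File, FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}
import scala.collection.mutable.ArrayBuffer

// Illustrative sketch: a list that spills full batches to temp files once the
// in-memory buffer reaches `spillThreshold` elements.
class SpillableList[T](spillThreshold: Int) extends Iterable[T] {
  private val inMemory = new ArrayBuffer[T]
  private val spillFiles = new ArrayBuffer[File]

  def append(elem: T): Unit = {
    inMemory += elem
    if (inMemory.size >= spillThreshold) spill()
  }

  // Serialize the current in-memory batch to a temp file, then clear it.
  private def spill(): Unit = {
    val file = File.createTempFile("spillable-list", ".bin")
    file.deleteOnExit()
    val out = new ObjectOutputStream(new FileOutputStream(file))
    try inMemory.foreach(e => out.writeObject(e.asInstanceOf[AnyRef]))
    finally out.close()
    spillFiles += file
    inMemory.clear()
  }

  // Replay spilled batches in spill order, then the in-memory tail. Each
  // spilled file holds exactly `spillThreshold` elements by construction.
  def iterator: Iterator[T] = {
    val spilled = spillFiles.iterator.flatMap { file =>
      val in = new ObjectInputStream(new FileInputStream(file))
      try List.fill(spillThreshold)(in.readObject().asInstanceOf[T])
      finally in.close()
    }
    spilled ++ inMemory.iterator
  }
}
```

Appending more elements than the threshold then transparently round-trips part of the data through disk while iteration still yields everything in insertion order.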
First, the performance testing in cases where a single group would be too big to fit in memory:
I tried a few versions of this. In my first implementation, I wrote the ExternalList class and simply replaced the CompactBuffer usage in PairRDDFunctions.groupByKey with external lists. I then ran the ExternalListSuite, which performs the group-by-key operation with some parameters modified. With 7,200,000 unique integers bucketed into 5 buckets, the trials yielded:
89.333 ms
89.992 ms
104.026 ms
I then switched to the current implementation. It matches exactly what Python does with an ExternalSorter, so in essence this is a sort-based group-by. It wasn't clear to me why a sort-based group-by would be better... until I saw the numbers, for the same RDD and buckets:
49632 ms
53615 ms
54340 ms
Therefore I went with the current implementation. It's not immediately clear to me, however, how the Python implementation (and this implementation, for that matter) gets this particular speedup from using ExternalSorter. It would be great to get some feedback on why that is the case.
Some caveats / things to note that I am concerned about: