-
Notifications
You must be signed in to change notification settings - Fork 28.9k
Fix JIRA-983 and support exteranl sort for sortByKey #931
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Merged build triggered. |
|
Merged build started. |
|
Merged build finished. |
|
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15322/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like your IDE changed the style of the comments here. Please leave them as they were originally. Our style in Spark is not the default Scala one, it's this:
/**
* aaa
* bbb
*/
|
Also FYI |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be ): ExternalAppendOnlyMap
|
Hey @xiajunluan, this is a good start, but I made some comments throughout. There are a few other question though:
|
|
Hi @mateiz
|
|
Merged build triggered. |
|
Merged build started. |
|
Merged build finished. |
|
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15445/ |
|
Looks like Jenkins is complaining about a line longer than 100 characters |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minKeyHash is no longer used (github won't let me comment a few lines above)
|
Build finished. |
|
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16364/ |
|
Jenkins, retest this please. |
|
Build triggered. |
|
Build started. |
|
Build finished. |
|
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16400/ |
|
Jenkins, test this please. |
1 similar comment
|
Jenkins, test this please. |
|
@xiajunluan would you have time to update this in the next few days? It's pretty close but there were those two small issues Andrew pointed out as well as a compile error. This would be great to get into 1.1. |
|
Jenkins, test this please. @xiajunluan actually I think the main issue now is that this isn't merging cleanly. |
|
@pwendell @xiajunluan I think I'm going to send a new PR based on this because I want to use some of the changes to ExternalAppendOnlyMap in sort-based shuffle. I also noticed an issue in this one. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems unnecessary to have a combiner here: if there are multiple key-value pairs with the same key, this requires them to all fit in memory. Instead we should have an option for the ExternalAppendOnlyMap to not attempt to combine them. I'll work on this in my PR.
(Squashed version of Andrew Xia's pull request apache#931) Conflicts: core/src/main/scala/org/apache/spark/rdd/OrderedRDDFunctions.scala core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala
|
Hey, so I rebased this PR and made it mergeable in my own branch, https://github.com/mateiz/spark/tree/spark-931. However, in doing this I realized that there might be some problems here that are fundamental. The main issue is that AppendOnlyMap, and ExternalAppendOnlyMap, require there's only one value for each key. The in-memory AOM will be very inefficient otherwise, and the EAOM depends on it. This means that for sort, we have to create (Key, ArrayBuffer[Value]) pairs, which will consume more memory by default than our in-memory sort, and will make us crash if there are too many identical values (something we do today but which may happen sooner here). Thus it seems that long-term we need a very different solution here, basically an external merge-sort. A second, possibly less serious issue is that the changes to EAOM to use comparator.compare instead of hash code equality make it less efficient in the default hashing-based case, because instead of saving one key's hash code in an Int and reusing it to compare with other keys in various places, we always recompute it when we compare each pair of elements. For these reasons I'd actually hold off on merging this (even my merged version) until we implement an external merge-sort as part of sort-based shuffle. Then we can use that data structure here. |
|
QA tests have started for PR 931. This patch DID NOT merge cleanly! |
|
QA results for PR 931: |
|
@xiajunluan we can now do this using the ExternalSorter added in #1499: see the new PR at #1677. Would you mind closing this old one? The new PR avoids some of the problems I mentioned above with each key having too many values. |
This patch simply uses the ExternalSorter class from sort-based shuffle. Closes apache#931 and Closes apache#1090 Author: Matei Zaharia <[email protected]> Closes apache#1677 from mateiz/spark-983 and squashes the following commits: 96b3fda [Matei Zaharia] SPARK-983. Support external sorting in sortByKey()
Change class ExternalAppendOnlyMap and make it support customized comparator function(not only sorted by hashCode).