-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-983] Support external sorting #1090
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Can one of the admins verify this patch? |
|
Jenkins, ok to test |
|
Merged build triggered. |
|
Merged build started. |
|
Merged build finished. |
|
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15798/ |
|
Merged build triggered. |
|
Merged build started. |
|
Merged build finished. All automated tests passed. |
|
All automated tests passed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we use the same parameter for sorting, it might make sense to call this something else, since this isn't exactly a shuffle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that all DiskBuffers share the same values for these variables. It might make sense to declare them once in the parent class rather than for each DiskBuffer
|
Hi @msiddalingaiah, thanks for adding this much-needed functionality. I haven't looked too closely into the details, but it seems that this shares much of the logic (and code, in some cases) with the ExternalAppendOnlyMap. It would be good if this is integrated somehow with the map. One possibility is to just use the ExternalAppendOnlyMap as your underlying buffer, and use the array index as the key and the actual value as the combiner. I haven't explored fully myself whether this is possible with the current code, but it would be super cool if it works out. If there is no easy way to do this, we should at least abstract out the common logic as helper methods in Utils.scala or something. |
|
Thanks. It's not clear to me if it can use ExternalAppendInlyMap as the underlying buffer either. There was some discussion about how to handle memory management in Jira, There was no concensus at the time, so there was duplication. Some code can be factored into a common class, I chose not to change too much at once. I'm tied up in the near future. When does this have to be resolved? |
|
This is the PR which uses |
|
@jerryshao @andrewor14 @xiajunluan Please advise. |
|
Hey @msiddalingaiah - in some cases more than one person will submit solutions for a patch. The assignments on JIRA are just tentative, we don't consider them an exclusive reservation. In this case @aarondav assigned this task to you on May 27th. Then someone else submitted a fix on May 31st. Your patch showed up on June 14th. When this happens we just try to take the best patch out there and in this case #931 is in better shape than this patch. However, we're happy to give both people credit for working on the feature when we make the Spark credits if there is overlap. Thanks for your time working on this! |
This patch simply uses the ExternalSorter class from sort-based shuffle. Closes apache#931 and Closes apache#1090 Author: Matei Zaharia <[email protected]> Closes apache#1677 from mateiz/spark-983 and squashes the following commits: 96b3fda [Matei Zaharia] SPARK-983. Support external sorting in sortByKey()
* [CARMEL-6299] Expose stage/task retry count to Carmel Overview
This satisfies SPARK-983 Support external sorting for RDD#sortByKey()
It also adds a general sortPartitions() method to RDD.