@msiddalingaiah
Contributor

This satisfies SPARK-983 Support external sorting for RDD#sortByKey()
It also adds a general sortPartitions() method to RDD.
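
For context, "external sorting" generally means sorting data that may not fit in memory by spilling sorted runs to disk and then merging them. The following is a minimal, illustrative sketch of that general technique in plain Python. It is not the code in this patch and does not use any Spark API; the names `external_sort_by_key`, `_spill`, `_read_run`, and the `max_in_memory` parameter are hypothetical and chosen only for this example.

```python
import heapq
import os
import pickle
import tempfile


def _spill(sorted_run):
    """Write one already-sorted run to a temp file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".run")
    with os.fdopen(fd, "wb") as f:
        for record in sorted_run:
            pickle.dump(record, f)
    return path


def _read_run(path):
    """Lazily yield records from a spilled run, then delete the file."""
    try:
        with open(path, "rb") as f:
            while True:
                try:
                    yield pickle.load(f)
                except EOFError:
                    break
    finally:
        os.remove(path)


def external_sort_by_key(records, max_in_memory=10000):
    """Sort (key, value) pairs by key, spilling every max_in_memory records.

    Hypothetical helper for illustration only.
    """
    runs, buffer = [], []
    for record in records:
        buffer.append(record)
        if len(buffer) >= max_in_memory:
            # Sort the full buffer and spill it to disk as one run.
            buffer.sort(key=lambda kv: kv[0])
            runs.append(_spill(buffer))
            buffer = []
    # The last partial buffer stays in memory and is merged as a final run.
    buffer.sort(key=lambda kv: kv[0])
    streams = [_read_run(p) for p in runs] + [iter(buffer)]
    # k-way merge of the sorted runs; yields records lazily.
    return heapq.merge(*streams, key=lambda kv: kv[0])
```

Calling `list(external_sort_by_key(pairs, max_in_memory=2))` on a small list of pairs forces several spills and returns the pairs ordered by key. The merge is lazy, so only one record per run needs to be resident at a time during the final pass.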

@AmplabJenkins

Can one of the admins verify this patch?

@aarondav
Contributor

Jenkins, ok to test

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15798/

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15799/

Contributor

If we use the same parameter for sorting, it might make sense to call this something else, since this isn't exactly a shuffle.

Contributor

It seems that all DiskBuffers share the same values for these variables. It might make sense to declare them once in the parent class rather than in each DiskBuffer.

@andrewor14
Contributor

Hi @msiddalingaiah, thanks for adding this much-needed functionality. I haven't looked too closely into the details, but it seems that this shares much of the logic (and code, in some cases) with the ExternalAppendOnlyMap. It would be good if this is integrated somehow with the map.

One possibility is to just use the ExternalAppendOnlyMap as your underlying buffer, and use the array index as the key and the actual value as the combiner. I haven't explored fully myself whether this is possible with the current code, but it would be super cool if it works out. If there is no easy way to do this, we should at least abstract out the common logic as helper methods in Utils.scala or something.

@msiddalingaiah
Contributor Author

Thanks. It's not clear to me either whether it can use ExternalAppendOnlyMap as the underlying buffer. There was some discussion in JIRA about how to handle memory management. There was no consensus at the time, so there was duplication.

Some code can be factored into a common class, but I chose not to change too much at once.

I'm tied up in the near future. When does this have to be resolved?

@jerryshao
Contributor

This is the PR which uses ExternalAppendOnlyMap to do external sort. I think it would be nice to use ExternalAppendOnlyMap as you guys mentioned before.

@msiddalingaiah
Contributor Author

@jerryshao @andrewor14 @xiajunluan
I'm confused. Does the PR mentioned above also address SPARK-983?
SPARK-983 was assigned to me some time ago.

Please advise.

@pwendell
Contributor

Hey @msiddalingaiah - in some cases more than one person will submit a patch for the same issue. The assignments on JIRA are just tentative; we don't consider them an exclusive reservation. In this case @aarondav assigned this task to you on May 27th. Then someone else submitted a fix on May 31st. Your patch showed up on June 14th.

When this happens we just try to take the best patch out there, and in this case #931 is in better shape than this patch. However, if there is overlap, we're happy to give both people credit for working on the feature when we compile the Spark credits.

Thanks for your time working on this!

@asfgit closed this in 72e3369 on Aug 1, 2014
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
This patch simply uses the ExternalSorter class from sort-based shuffle.

Closes apache#931 and Closes apache#1090

Author: Matei Zaharia

Closes apache#1677 from mateiz/spark-983 and squashes the following commits:

96b3fda [Matei Zaharia] SPARK-983. Support external sorting in sortByKey()
wangyum pushed a commit that referenced this pull request May 26, 2023
* [CARMEL-6299] Expose stage/task retry count to Carmel Overview