Conversation

@xiajunluan
Contributor

Change the ExternalAppendOnlyMap class so that it supports a customized comparator function (not only sorting by hashCode).

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15322/

Contributor

It looks like your IDE changed the style of the comments here. Please leave them as they were originally. Our style in Spark is not the default Scala one; it's this:

/**
 * aaa
 * bbb
 */

@mateiz
Contributor

mateiz commented Jun 2, 2014

Also, FYI, sbt scalastyle is complaining about some issues: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15322/console

Contributor

Should be ): ExternalAppendOnlyMap

@mateiz
Contributor

mateiz commented Jun 3, 2014

Hey @xiajunluan, this is a good start, but I made some comments throughout. There are a few other questions, though:

  • Performance: have you benchmarked this against the old version for non-sorting use cases? We need to make sure the pluggable Comparator doesn't cause a regression there.
  • Long-term it would be good to spill values even within a key for sort, i.e. rather than using an ArrayBuffer as a combiner, just insert the values individually. But this probably can't be done easily in this patch.
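
To make the comparator question concrete, here is a minimal, hypothetical Scala sketch of the pluggable-comparator idea (the class name and signatures are illustrative, not Spark's actual API): a user-supplied key ordering is used when present, and the buffer falls back to hash-code order otherwise.

```scala
import scala.collection.mutable

// Hypothetical sketch only: names and signatures are illustrative, not
// Spark's. A user-supplied Ordering sorts keys for sortByKey; when none is
// given, we fall back to hash-code order, the map's original behavior.
class SortableBuffer[K, V](keyOrdering: Option[Ordering[K]]) {
  private val data = mutable.ArrayBuffer[(K, V)]()

  def insert(k: K, v: V): Unit = data += ((k, v))

  def sortedIterator: Iterator[(K, V)] = {
    // Default: order keys by hash code so spill files stay merge-compatible.
    val ord: Ordering[K] = keyOrdering.getOrElse(Ordering.by((k: K) => k.hashCode))
    data.sortBy(_._1)(ord).iterator
  }
}
```

Benchmarking the default path of a sketch like this against a hard-coded hashCode comparison is essentially what the first bullet asks for.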

@xiajunluan xiajunluan closed this Jun 4, 2014
@xiajunluan xiajunluan reopened this Jun 4, 2014
@xiajunluan
Contributor Author

Hi @mateiz

  1. I will measure the performance influence after I add the pluggable comparator.
  2. I agree with you. If we just implement sortByKey, we should not use a combiner (it is for the combineByKey-related APIs), since that requires first aggregating the values and then, after sorting, unfolding the values for each key. In this patch, I would like to reuse the current classes and fix this bug quickly. Long-term, I think we should write separate AppendOnlyMap and ExternalAppendOnlyMap classes for sortByKey that omit functions such as createCombiner, mergeValue, etc. I will try to design those classes later.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15445/

@mateiz
Contributor

mateiz commented Jun 6, 2014

Looks like Jenkins is complaining about a line longer than 100 characters.

Contributor

minKeyHash is no longer used (GitHub won't let me comment a few lines above)

@AmplabJenkins

Build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16364/

@pwendell
Contributor

pwendell commented Jul 8, 2014

Jenkins, retest this please.

@AmplabJenkins

Build triggered.

@AmplabJenkins

Build started.

@AmplabJenkins

Build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16400/

@pwendell
Contributor

Jenkins, test this please.

@pwendell
Contributor

Jenkins, test this please.

@mateiz
Contributor

mateiz commented Jul 15, 2014

@xiajunluan would you have time to update this in the next few days? It's pretty close, but there were those two small issues Andrew pointed out, as well as a compile error. This would be great to get into 1.1.

@pwendell
Contributor

Jenkins, test this please. @xiajunluan actually I think the main issue now is that this isn't merging cleanly.

@mateiz
Contributor

mateiz commented Jul 17, 2014

@pwendell @xiajunluan I think I'm going to send a new PR based on this because I want to use some of the changes to ExternalAppendOnlyMap in sort-based shuffle. I also noticed an issue in this one.

Contributor

It seems unnecessary to have a combiner here: if there are multiple key-value pairs with the same key, this requires them all to fit in memory. Instead we should have an option for the ExternalAppendOnlyMap to not attempt to combine them. I'll work on this in my PR.
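
The difference can be sketched in a few lines of hypothetical Scala (illustrative names, not the real ExternalAppendOnlyMap API): combining buffers every value for a key into one in-memory ArrayBuffer, while the non-combining option keeps each pair independent so it can be spilled on its own.

```scala
import scala.collection.mutable

// Hypothetical sketch: contrast combining vs. not combining duplicate keys.
object CombineModes {
  // Combining: one entry per key; all of a key's values share one
  // ArrayBuffer, so they must fit in memory together.
  def combined[K, V](pairs: Seq[(K, V)]): Map[K, Seq[V]] = {
    val m = mutable.LinkedHashMap[K, mutable.ArrayBuffer[V]]()
    for ((k, v) <- pairs) m.getOrElseUpdate(k, mutable.ArrayBuffer()) += v
    m.map { case (k, vs) => (k, vs.toSeq) }.toMap
  }

  // Non-combining: duplicate keys remain separate pairs, so a spill
  // threshold can trigger even between two values of the same key.
  def uncombined[K, V](pairs: Seq[(K, V)]): Seq[(K, V)] = pairs
}
```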

mateiz pushed a commit to mateiz/spark that referenced this pull request Jul 17, 2014
(Squashed version of Andrew Xia's pull request apache#931)

Conflicts:
	core/src/main/scala/org/apache/spark/rdd/OrderedRDDFunctions.scala
	core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala
@mateiz
Contributor

mateiz commented Jul 17, 2014

Hey, so I rebased this PR and made it mergeable in my own branch, https://github.com/mateiz/spark/tree/spark-931. However, in doing this I realized that there might be some problems here that are fundamental.

The main issue is that AppendOnlyMap and ExternalAppendOnlyMap require that there be only one value for each key. The in-memory AOM will be very inefficient otherwise, and the EAOM depends on it. This means that for sort, we have to create (Key, ArrayBuffer[Value]) pairs, which will consume more memory by default than our in-memory sort, and will make us crash if there are too many identical values (a risk we have today, but one that may be hit sooner here). Thus it seems that long-term we need a very different solution here, basically an external merge-sort.
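
An external merge-sort of the kind described could look roughly like this hypothetical Scala sketch: sort fixed-size runs (which a real implementation would spill to disk; they are kept in memory here) and then k-way merge them with a priority queue.

```scala
import scala.collection.mutable

// Hypothetical sketch of an external merge-sort; a real implementation
// would write each sorted run to disk and stream it back during the merge.
object ExternalMergeSort {
  def sort[T](input: Iterator[T], runSize: Int)(implicit ord: Ordering[T]): Iterator[T] = {
    // Build sorted runs of at most runSize elements each.
    val runs = input.grouped(runSize).map(_.sorted.iterator).toSeq

    // K-way merge: a min-heap of (head element, remaining run) pairs.
    val heap = mutable.PriorityQueue.empty[(T, Iterator[T])](
      Ordering.by[(T, Iterator[T]), T](_._1)(ord.reverse))
    for (r <- runs if r.hasNext) heap.enqueue((r.next(), r))

    new Iterator[T] {
      def hasNext: Boolean = heap.nonEmpty
      def next(): T = {
        val (v, rest) = heap.dequeue()
        if (rest.hasNext) heap.enqueue((rest.next(), rest))
        v
      }
    }
  }
}
```

Note that in this shape, many values for one key no longer need to fit in memory together; they simply appear adjacently in the merged output.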

A second, possibly less serious issue is that the changes to EAOM to use comparator.compare instead of hash code equality make it less efficient in the default hashing-based case, because instead of saving one key's hash code in an Int and reusing it to compare with other keys in various places, we always recompute it when we compare each pair of elements.
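
The hashing-overhead point can be illustrated with a small hypothetical sketch (illustrative only, not Spark's internals): the old code effectively compares cached Ints, while a generic comparator recomputes hashCode on both sides of every comparison.

```scala
// Hypothetical sketch contrasting cached vs. recomputed hash comparisons.
final case class Entry[K](key: K, cachedHash: Int)

object HashCompare {
  // Old style: hashCode is computed once per key and stored; later
  // comparisons only touch the cached Int.
  def byCachedHash[K](a: Entry[K], b: Entry[K]): Int =
    java.lang.Integer.compare(a.cachedHash, b.cachedHash)

  // Generic-comparator style: hashCode is recomputed for both elements
  // on every comparison, which can be costly for e.g. long strings.
  def byRecomputedHash[K](a: K, b: K): Int =
    java.lang.Integer.compare(a.hashCode, b.hashCode)
}
```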

For these reasons I'd actually hold off on merging this (even my merged version) until we implement an external merge-sort as part of sort-based shuffle. Then we can use that data structure here.

@SparkQA

SparkQA commented Jul 17, 2014

QA tests have started for PR 931. This patch DID NOT merge cleanly!
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16767/consoleFull

@SparkQA

SparkQA commented Jul 17, 2014

QA results for PR 931:
- This patch FAILED unit tests.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16767/consoleFull

@mateiz
Contributor

mateiz commented Jul 31, 2014

@xiajunluan we can now do this using the ExternalSorter added in #1499: see the new PR at #1677. Would you mind closing this old one? The new PR avoids some of the problems I mentioned above with each key having too many values.

@asfgit asfgit closed this in 72e3369 Aug 1, 2014
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
This patch simply uses the ExternalSorter class from sort-based shuffle.

Closes apache#931 and Closes apache#1090

Author: Matei Zaharia <[email protected]>

Closes apache#1677 from mateiz/spark-983 and squashes the following commits:

96b3fda [Matei Zaharia] SPARK-983. Support external sorting in sortByKey()