Conversation

@xiajunluan
Contributor

Change the ExternalAppendOnlyMap class so that it supports a customized comparator function (not only sorting by hashCode).

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15322/

Contributor

It looks like your IDE changed the style of the comments here. Please leave them as they were originally. Our style in Spark is not the default Scala one; it's this:

/**
 * aaa
 * bbb
 */

@mateiz
Contributor

mateiz commented Jun 2, 2014

Also, FYI, sbt scalastyle is complaining about some issues: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15322/console

Contributor

Should be ): ExternalAppendOnlyMap

@mateiz
Contributor

mateiz commented Jun 3, 2014

Hey @xiajunluan, this is a good start, but I made some comments throughout. There are a few other questions, though:

  • Performance: have you benchmarked this against the old version for non-sorting use cases? We need to make sure the pluggable Comparator doesn't cause a regression there.
  • Long-term it would be good to spill values even within a key for sort, i.e. rather than using an ArrayBuffer as a combiner, just insert the values individually. But this probably can't be done easily in this patch.
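
To make the comparator question concrete, here is a minimal, hypothetical Scala sketch of the pluggable-comparator idea (the class name and signatures are illustrative, not Spark's actual API): a user-supplied key ordering is used when present, and the buffer falls back to hash-code order otherwise.

```scala
import scala.collection.mutable

// Hypothetical sketch only: names and signatures are illustrative, not
// Spark's. A user-supplied Ordering sorts keys for sortByKey; when none is
// given, we fall back to hash-code order, the map's original behavior.
class SortableBuffer[K, V](keyOrdering: Option[Ordering[K]]) {
  private val data = mutable.ArrayBuffer[(K, V)]()

  def insert(k: K, v: V): Unit = data += ((k, v))

  def sortedIterator: Iterator[(K, V)] = {
    // Default: order keys by hash code so spill files stay merge-compatible.
    val ord: Ordering[K] = keyOrdering.getOrElse(Ordering.by((k: K) => k.hashCode))
    data.sortBy(_._1)(ord).iterator
  }
}
```

Benchmarking the default path of a sketch like this against a hard-coded hashCode comparison is essentially what the first bullet asks for.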

@xiajunluan xiajunluan closed this Jun 4, 2014
@xiajunluan xiajunluan reopened this Jun 4, 2014
@xiajunluan
Contributor Author

Hi @mateiz

  1. I will measure the performance influence after I add the pluggable comparator.
  2. I agree with you. If we just implement sortByKey, we should not use a combiner (it is for the combineByKey-related APIs), since that requires first aggregating the values and then, after sorting, unfolding the values for each key. In this patch, I would like to reuse the current classes and fix this bug quickly. Long-term, I think we should write separate AppendOnlyMap and ExternalAppendOnlyMap classes for sortByKey that omit functions such as createCombiner, mergeValue, etc. I will try to design those classes later.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15445/

@mateiz
Contributor

mateiz commented Jun 6, 2014

Looks like Jenkins is complaining about a line longer than 100 characters.

Contributor

minKeyHash is no longer used (GitHub won't let me comment a few lines above)

@AmplabJenkins

Build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16364/

@pwendell
Contributor

pwendell commented Jul 8, 2014

Jenkins, retest this please.

@AmplabJenkins

Build triggered.

@AmplabJenkins

Build started.

@AmplabJenkins

Build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16400/

@pwendell
Contributor

Jenkins, test this please.

@pwendell
Contributor

Jenkins, test this please.

@mateiz
Contributor

mateiz commented Jul 15, 2014

@xiajunluan would you have time to update this in the next few days? It's pretty close, but there were those two small issues Andrew pointed out, as well as a compile error. This would be great to get into 1.1.

@pwendell
Contributor

Jenkins, test this please. @xiajunluan actually I think the main issue now is that this isn't merging cleanly.

@mateiz
Contributor

mateiz commented Jul 17, 2014

@pwendell @xiajunluan I think I'm going to send a new PR based on this because I want to use some of the changes to ExternalAppendOnlyMap in sort-based shuffle. I also noticed an issue in this one.

Contributor

It seems unnecessary to have a combiner here: if there are multiple key-value pairs with the same key, this requires them all to fit in memory. Instead we should have an option for the ExternalAppendOnlyMap to not attempt to combine them. I'll work on this in my PR.
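
The difference can be sketched in a few lines of hypothetical Scala (illustrative names, not the real ExternalAppendOnlyMap API): combining buffers every value for a key into one in-memory ArrayBuffer, while the non-combining option keeps each pair independent so it can be spilled on its own.

```scala
import scala.collection.mutable

// Hypothetical sketch: contrast combining vs. not combining duplicate keys.
object CombineModes {
  // Combining: one entry per key; all of a key's values share one
  // ArrayBuffer, so they must fit in memory together.
  def combined[K, V](pairs: Seq[(K, V)]): Map[K, Seq[V]] = {
    val m = mutable.LinkedHashMap[K, mutable.ArrayBuffer[V]]()
    for ((k, v) <- pairs) m.getOrElseUpdate(k, mutable.ArrayBuffer()) += v
    m.map { case (k, vs) => (k, vs.toSeq) }.toMap
  }

  // Non-combining: duplicate keys remain separate pairs, so a spill
  // threshold can trigger even between two values of the same key.
  def uncombined[K, V](pairs: Seq[(K, V)]): Seq[(K, V)] = pairs
}
```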

mateiz pushed a commit to mateiz/spark that referenced this pull request Jul 17, 2014
(Squashed version of Andrew Xia's pull request apache#931)

Conflicts:
	core/src/main/scala/org/apache/spark/rdd/OrderedRDDFunctions.scala
	core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala
@mateiz
Contributor

mateiz commented Jul 17, 2014

Hey, so I rebased this PR and made it mergeable in my own branch, https://github.com/mateiz/spark/tree/spark-931. However, in doing this I realized that there might be some problems here that are fundamental.

The main issue is that AppendOnlyMap and ExternalAppendOnlyMap require that there be only one value for each key. The in-memory AOM will be very inefficient otherwise, and the EAOM depends on it. This means that for sort, we have to create (Key, ArrayBuffer[Value]) pairs, which will consume more memory by default than our in-memory sort, and will make us crash if there are too many identical values (a risk we have today, but one that may be hit sooner here). Thus it seems that long-term we need a very different solution here, basically an external merge-sort.
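
An external merge-sort of the kind described could look roughly like this hypothetical Scala sketch: sort fixed-size runs (which a real implementation would spill to disk; they are kept in memory here) and then k-way merge them with a priority queue.

```scala
import scala.collection.mutable

// Hypothetical sketch of an external merge-sort; a real implementation
// would write each sorted run to disk and stream it back during the merge.
object ExternalMergeSort {
  def sort[T](input: Iterator[T], runSize: Int)(implicit ord: Ordering[T]): Iterator[T] = {
    // Build sorted runs of at most runSize elements each.
    val runs = input.grouped(runSize).map(_.sorted.iterator).toSeq

    // K-way merge: a min-heap of (head element, remaining run) pairs.
    val heap = mutable.PriorityQueue.empty[(T, Iterator[T])](
      Ordering.by[(T, Iterator[T]), T](_._1)(ord.reverse))
    for (r <- runs if r.hasNext) heap.enqueue((r.next(), r))

    new Iterator[T] {
      def hasNext: Boolean = heap.nonEmpty
      def next(): T = {
        val (v, rest) = heap.dequeue()
        if (rest.hasNext) heap.enqueue((rest.next(), rest))
        v
      }
    }
  }
}
```

Note that in this shape, many values for one key no longer need to fit in memory together; they simply appear adjacently in the merged output.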

A second, possibly less serious issue is that the changes to EAOM to use comparator.compare instead of hash code equality make it less efficient in the default hashing-based case, because instead of saving one key's hash code in an Int and reusing it to compare with other keys in various places, we always recompute it when we compare each pair of elements.
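
The hashing-overhead point can be illustrated with a small hypothetical sketch (illustrative only, not Spark's internals): the old code effectively compares cached Ints, while a generic comparator recomputes hashCode on both sides of every comparison.

```scala
// Hypothetical sketch contrasting cached vs. recomputed hash comparisons.
final case class Entry[K](key: K, cachedHash: Int)

object HashCompare {
  // Old style: hashCode is computed once per key and stored; later
  // comparisons only touch the cached Int.
  def byCachedHash[K](a: Entry[K], b: Entry[K]): Int =
    java.lang.Integer.compare(a.cachedHash, b.cachedHash)

  // Generic-comparator style: hashCode is recomputed for both elements
  // on every comparison, which can be costly for e.g. long strings.
  def byRecomputedHash[K](a: K, b: K): Int =
    java.lang.Integer.compare(a.hashCode, b.hashCode)
}
```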

For these reasons I'd actually hold off on merging this (even my merged version) until we implement an external merge-sort as part of sort-based shuffle. Then we can use that data structure here.

@SparkQA

SparkQA commented Jul 17, 2014

QA tests have started for PR 931. This patch DID NOT merge cleanly!
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16767/consoleFull

@SparkQA

SparkQA commented Jul 17, 2014

QA results for PR 931:
- This patch FAILED unit tests.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16767/consoleFull

@mateiz
Contributor

mateiz commented Jul 31, 2014

@xiajunluan we can now do this using the ExternalSorter added in #1499: see the new PR at #1677. Would you mind closing this old one? The new PR avoids some of the problems I mentioned above with each key having too many values.

@asfgit asfgit closed this in 72e3369 Aug 1, 2014
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
This patch simply uses the ExternalSorter class from sort-based shuffle.

Closes apache#931 and Closes apache#1090

Author: Matei Zaharia <[email protected]>

Closes apache#1677 from mateiz/spark-983 and squashes the following commits:

96b3fda [Matei Zaharia] SPARK-983. Support external sorting in sortByKey()