Conversation

@ilganeli

I've updated the Spark Programming Guide to add a section on the shuffle operation providing some background on what it does. I've also addressed some of its performance impacts.

I've included documentation to address the following issues:
https://issues.apache.org/jira/browse/SPARK-5836
https://issues.apache.org/jira/browse/SPARK-3441
https://issues.apache.org/jira/browse/SPARK-5750

https://issues.apache.org/jira/browse/SPARK-4227 is related but can be addressed in a separate PR since it involves updates to the Spark Configuration Guide.

[SPARK-5750][SPARK-3441][SPARK-5836] Added documentation explaining the shuffle operation and included errata from a number of other JIRAs
@ilganeli ilganeli changed the title [SPARK-5750][SPARK-3441][SPARK-5836] Added documentation explaining shuffle [Docs][SPARK-5750][SPARK-3441][SPARK-5836] Added documentation explaining shuffle Mar 17, 2015
@SparkQA

SparkQA commented Mar 17, 2015

Test build #28732 has started for PR 5074 at commit 75ef67b.

  • This patch merges cleanly.

@ilganeli ilganeli changed the title [Docs][SPARK-5750][SPARK-3441][SPARK-5836] Added documentation explaining shuffle [SPARK-5750][SPARK-3441][SPARK-5836][CORE] Added documentation explaining shuffle Mar 17, 2015
@SparkQA

SparkQA commented Mar 17, 2015

Test build #28732 has finished for PR 5074 at commit 75ef67b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28732/

Contributor

I think a better word is "undefined", rather than "random".

@SparkQA

SparkQA commented Mar 18, 2015

Test build #28799 has started for PR 5074 at commit 159dd1c.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Mar 18, 2015

Test build #28799 has finished for PR 5074 at commit 159dd1c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28799/

Member

I think a shuffle is not just about collecting data by key. For example a repartitioning can cause a shuffle. Personally I'd say that some operations need to redistribute data so that it is grouped differently into partitions, which typically means rearranging and copying data across executors or machines.
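For instance (a minimal sketch, not from the PR, assuming an existing SparkContext sc), even a keyless repartitioning forces a shuffle:

```scala
// Repartitioning redistributes data across partitions, which requires a
// shuffle even though no keys are involved.
val rdd = sc.parallelize(1 to 1000, numSlices = 4)
val repartitioned = rdd.repartition(8)

// The lineage includes a ShuffledRDD, confirming data moved across executors.
println(repartitioned.toDebugString)
```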

@ilganeli
Author

Thanks for the review, Sean - I've incorporated your recommendations and added a short discussion on sort-based shuffle.

@SparkQA

SparkQA commented Mar 19, 2015

Test build #28882 has started for PR 5074 at commit a8adb57.

  • This patch merges cleanly.

Contributor

I would replace uses of "operation" that don't refer to actions or transformations to avoid confusion.

@sryza
Contributor

sryza commented Mar 19, 2015

The doc here should be broken up across multiple lines. The programming guide is kind of inconsistent on this, but I think the move is towards keeping lines under 100 characters.

Contributor

A few factual edits:

  • Spark always writes all shuffle data to disk on the map side.
  • A hash table is only used on the map side for reduceByKey and aggregateByKey, and only on the reduce side for the ByKey operations.
  • sortByKey can no longer OOM.

Also, I would avoid mentioning hash-based shuffle at all because sort-based shuffle is now pretty much what we expect everybody to use.

Fixed a few more small items.
@ilganeli
Author

Done.

@SparkQA

SparkQA commented Mar 25, 2015

Test build #29178 has started for PR 5074 at commit 85f9c6e.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Mar 25, 2015

Test build #29170 has finished for PR 5074 at commit 349d1fa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29170/

@SparkQA

SparkQA commented Mar 25, 2015

Test build #29178 has finished for PR 5074 at commit 85f9c6e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29178/

Member

To be concrete and finish this off, I'd propose keeping only the following text. @sryza @ilganeli

Although the set of elements in each partition of newly shuffled data will be deterministic, the ordering of these elements is not. If one desires predictably ordered data following shuffle operations, sortBy can be used to perform a global sort.
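For example (an illustrative sketch; words is a hypothetical RDD[String]):

```scala
// Element order within shuffled partitions is not guaranteed, but sortBy
// imposes a total order on the result.
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)
val ordered = counts.sortBy(_._2, ascending = false) // globally sorted by count
```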

Author

Sean, I'd suggest this text:

Although the set of elements in each partition of newly shuffled data will be deterministic, the ordering of these elements is not. If one desires predictably ordered data following shuffle operations, sortBy can be used to perform a global sort. A similar operation, repartitionAndSortWithinPartitions, coupled with mapPartitions, may be used to enact a Hadoop-style shuffle.

The reason is that this would address https://issues.apache.org/jira/browse/SPARK-3441 .
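To illustrate the suggested text (a sketch only; records and the partition count are hypothetical, not from the PR):

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

// Partition by key and sort by key within each partition in a single
// shuffle, much as Hadoop's shuffle phase does.
val records: RDD[(String, Int)] = sc.parallelize(Seq(("b", 2), ("a", 1), ("a", 3)))
val shuffled = records.repartitionAndSortWithinPartitions(new HashPartitioner(4))
```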

Member

This doesn't sort across partitions though, right? And why is a mapPartitions needed then?

Author

I went and re-read the JIRA in question. I think Sandy was simply pointing out that the above could be used as a replacement for groupBy and that repartitionAndSortWithinPartitions functions as a Hadoop-style shuffle. I agree that it's not needed here.

Contributor

Maybe you're both already on the same page about this, but I think the relevant question being addressed might be "how do you control (or make deterministic) the ordering of elements within a key". Currently the only way to do this is with a repartitionAndSortWithinPartitions followed by a mapPartitions that groups the sorted elements that make up the partition. Talking about this could be TMI though.
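Concretely, that pattern might look like the following sketch (the partitioner and names are illustrative, not from the PR):

```scala
import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD
import scala.collection.mutable.ArrayBuffer

// Partition on the natural key only, so a (key, value) composite key still
// sends all records for a given key to the same partition.
class NaturalKeyPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = key match {
    case (k, _) => ((k.hashCode % numPartitions) + numPartitions) % numPartitions
  }
}

def sortedValuesPerKey(pairs: RDD[(String, Int)]): RDD[(String, Seq[Int])] =
  pairs
    .map { case (k, v) => ((k, v), ()) } // composite key carries the value
    .repartitionAndSortWithinPartitions(new NaturalKeyPartitioner(4))
    .mapPartitions { iter =>
      // Records arrive sorted by (key, value), so each key's records are
      // contiguous; group them without buffering the whole partition.
      new Iterator[(String, Seq[Int])] {
        private val buffered = iter.buffered
        def hasNext: Boolean = buffered.hasNext
        def next(): (String, Seq[Int]) = {
          val key = buffered.head._1._1
          val values = ArrayBuffer[Int]()
          while (buffered.hasNext && buffered.head._1._1 == key)
            values += buffered.next()._1._2
          (key, values.toSeq)
        }
      }
    }
```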

Member

But that does not give an RDD where everything in partition n sorts before partition n+1, right? I get that the question is more about determinism, but mentioning it in the same breath as sortBy feels confusing.

Contributor

The line that sets it up is talking about ordering within a partition. sortBy is mentioned afterward, and global sort is specifically called out. If you read the paragraph start to finish, does it seem unclear?

Member

Fair point, yeah. This could be a dumb question, but do we know that the partitions will always appear in the same order? I suppose it depends on the partitioner, but even for HashPartitioner -- do we know that, say, everything hashing to 0 occurs in partition 0 every time? If this ordering of partitions is deterministic for any reasonable case, then I get it at last, yeah.

If that understanding is correct, then, here's another pass at the paragraph:

Although the set of elements in each partition of newly shuffled data will be deterministic, and so is the ordering of partitions themselves, the ordering of these elements within each partition is not. If one desires predictably ordered data following shuffle operations, then it's possible to use:

  • mapPartitions to sort each partition using, for example, .sorted
  • repartitionAndSortWithinPartitions to efficiently sort partitions while simultaneously repartitioning
  • sortBy to make a globally ordered RDD
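FWIW the first option is a one-liner (a sketch; pairs is a hypothetical pair RDD, and note it materializes each partition in memory to sort it):

```scala
// Sort within each partition only; partition boundaries are unchanged.
val perPartitionSorted = pairs.mapPartitions(iter => iter.toArray.sorted.iterator)
```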

Author

Sean - I like this final wording, I've added this in the latest. Thanks.

Contributor

@srowen that's exactly how HashPartitioner works. As long as the partition function isn't using System.currentTimeMillis or something, the ordering of partitions is deterministic.
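For reference, a quick check of that determinism (HashPartitioner is essentially a non-negative key.hashCode modulo numPartitions):

```scala
import org.apache.spark.HashPartitioner

val p = new HashPartitioner(4)
// The same key maps to the same partition on every call and every run,
// because String.hashCode is itself deterministic.
assert(p.getPartition("user-42") == p.getPartition("user-42"))
```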

@SparkQA

SparkQA commented Mar 26, 2015

Test build #29238 has started for PR 5074 at commit 2c5df08.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Mar 26, 2015

Test build #29238 has finished for PR 5074 at commit 2c5df08.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29238/

Removed extraneous reference to repartitionAndSort
@SparkQA

SparkQA commented Mar 26, 2015

Test build #29253 has started for PR 5074 at commit 7a0b96f.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Mar 26, 2015

Test build #29253 has finished for PR 5074 at commit 7a0b96f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29253/

@srowen
Member

srowen commented Mar 26, 2015

Looks good. I'll pause a little longer for final comments from @sryza.

Reworded discussion on sorting within partitions.
@SparkQA

SparkQA commented Mar 27, 2015

Test build #29309 has started for PR 5074 at commit 6178e24.

  • This patch merges cleanly.

Contributor

Is this true for reduceByKey, though? It's certainly true for groupByKey, but reduceByKey and combineByKey will perform map-side combining so it's not strictly true that all values for a key must be co-located before computing the new reduced value.
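To make the distinction concrete (a sketch; pairs is a hypothetical RDD[(String, Int)]):

```scala
// reduceByKey combines values on the map side first, so only one partial
// sum per key per map partition crosses the network during the shuffle.
val sums = pairs.reduceByKey(_ + _)

// groupByKey does no map-side combining: every (key, value) record is
// shuffled, and all values for a key really are co-located afterward.
val groups = pairs.groupByKey()
```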

Contributor

I looked at the discussion up-thread and noticed that this originally referred to groupByKey() but was changed because we're trying to discourage users from using that operation. The previous sentence is clear, because it says "organize all the data for a single reduceByKey reduce task", which is true because the "data" here can refer to partially-combined results rather than input records. This current sentence was what tripped me up, since only groupByKey requires that "all values" for any key be co-located. But maybe I'm just overthinking this, since I suppose that "values" could also refer to partially-combined values.

I'm fine with this as long as it's not confusing to anyone else.

Member

That's a reasonable point. I think it slightly outweighs the earlier concern about even mentioning groupByKey. @ilganeli would you mind making this final change? to, say, "a single groupByKey task"

Member

My suggestion is not great. Leave the reduceByKey example. I think the final sentence can be "... and then bring together values across partitions to compute the final result for each key - ..." I can just tap this in since I am pretty certain everyone is happy at this point.

@SparkQA

SparkQA commented Mar 27, 2015

Test build #29309 has finished for PR 5074 at commit 6178e24.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29309/

@srowen
Member

srowen commented Mar 29, 2015

Since this is looking good to me and @ilganeli has been patient, I'd like to just merge this and make that last edit manually today.

asfgit pushed a commit that referenced this pull request Mar 30, 2015
[SPARK-5750][SPARK-3441][SPARK-5836][CORE] Added documentation explaining shuffle

I've updated the Spark Programming Guide to add a section on the shuffle operation providing some background on what it does. I've also addressed some of its performance impacts.

I've included documentation to address the following issues:
https://issues.apache.org/jira/browse/SPARK-5836
https://issues.apache.org/jira/browse/SPARK-3441
https://issues.apache.org/jira/browse/SPARK-5750

https://issues.apache.org/jira/browse/SPARK-4227 is related but can be addressed in a separate PR since it involves updates to the Spark Configuration Guide.

Author: Ilya Ganelin <[email protected]>
Author: Ilya Ganelin <[email protected]>

Closes #5074 from ilganeli/SPARK-5750 and squashes the following commits:

6178e24 [Ilya Ganelin] Update programming-guide.md
7a0b96f [Ilya Ganelin] Update programming-guide.md
2c5df08 [Ilya Ganelin] Merge branch 'SPARK-5750' of github.com:ilganeli/spark into SPARK-5750
dffbd2d [Ilya Ganelin] [SPARK-5750] Slight wording update
1ff4eb4 [Ilya Ganelin] Merge remote-tracking branch 'upstream/master' into SPARK-5750
85f9c6e [Ilya Ganelin] Update programming-guide.md
349d1fa [Ilya Ganelin] Added cross linkf or configuration page
eeb5a7a [Ilya Ganelin] [SPARK-5750] Added some minor fixes
dd5cc9d [Ilya Ganelin] [SPARK-5750] Fixed some factual inaccuracies with regards to shuffle internals.
a8adb57 [Ilya Ganelin] [SPARK-5750] Incoporated feedback from Sean Owen
9954bbe [Ilya Ganelin] Merge remote-tracking branch 'upstream/master' into SPARK-5750
159dd1c [Ilya Ganelin] [SPARK-5750] Style fixes from rxin.
75ef67b [Ilya Ganelin] [SPARK-5750][SPARK-3441][SPARK-5836] Added documentation explaining the shuffle operation and included errata from a number of other JIRAs

(cherry picked from commit 4bdfb7b)
Signed-off-by: Sean Owen <[email protected]>
@asfgit asfgit closed this in 4bdfb7b Mar 30, 2015
Contributor

@ilganeli @srowen @JoshRosen Can you explain why this was added? Shuffle files are cleared automatically when the driver garbage-collects the shuffle object, which triggers messages to all the executors to delete all files related to that shuffle. This was added in Spark 1.0. Has there been any change in this behavior since then that justifies this statement?

If this statement is true, then this is a major issue for long-running jobs like Spark Streaming jobs. If it is not true, then it should not have been added and should be fixed promptly. Users have been personally asking me about this, saying they are not upgrading to Spark 1.3 because they believe this behavior is a regression.

Contributor

+1. I had a couple of people ask me this during Spark Summit. I was investigating this myself today.

Member

@tdas This bit is a response to https://issues.apache.org/jira/browse/SPARK-5836. You would likely know better, and if GC of shuffle-related objects should trigger cleanup of shuffle files, then phew, at least there is some mechanism for that. I know people occasionally ask about problems with too many shuffle files lying around, but that doesn't mean the mechanism doesn't work. I think the best short-term change is just to update this statement if you're pretty confident this mechanism works.

Contributor

That clarifies things for me. I know that there has been some concern about shuffle files filling up disk, but as of now that can happen for one or more of the following reasons:

  1. GC does not kick in for a long time (very high driver memory). The solution may often be to call GC periodically (see the sketch below).
  2. Nothing goes out of scope, and so nothing is GCed.
  3. There are some issues reported with shuffle files not being cleaned up in Mesos.

The 3rd one is a bug and we will fix it. The first two should be clarified in the docs. That is better than this current very scary description.
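A sketch of the periodic-GC workaround from point 1 (illustrative only; the interval is an arbitrary assumption, and this runs on the driver):

```scala
import java.util.concurrent.{Executors, TimeUnit}

// Periodically trigger a driver-side GC so that unreferenced shuffle objects
// are collected, which in turn tells executors to delete their shuffle files.
val gcScheduler = Executors.newSingleThreadScheduledExecutor()
gcScheduler.scheduleAtFixedRate(new Runnable {
  override def run(): Unit = System.gc()
}, 30, 30, TimeUnit.MINUTES)
```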

In docs/programming-guide.md
#5074 (comment):

+organize the data, and a set of reduce tasks to aggregate it. This nomenclature comes from
+MapReduce and does not directly relate to Spark's map and reduce operations.
+
+Internally, results from individual map tasks are kept in memory until they can't fit. Then, these
+are sorted based on the target partition and written to a single file. On the reduce side, tasks
+read the relevant sorted blocks.
+
+Certain shuffle operations can consume significant amounts of heap memory since they employ
+in-memory data structures to organize records before or after transferring them. Specifically,
+reduceByKey and aggregateByKey create these structures on the map side and 'ByKey operations
+generate these on the reduce side. When data does not fit in memory Spark will spill these tables
+to disk, incurring the additional overhead of disk I/O and increased garbage collection.
+
+Shuffle also generates a large number of intermediate files on disk. As of Spark 1.3, these files
+are not cleaned up from Spark's temporary storage until Spark is stopped, which means that
+long-running Spark jobs may consume available disk space. This is done so the shuffle doesn't need


Member

@tdas #6901 WDYT?
