[SPARK-3172] [SPARK-3577] improve shuffle spill metrics #2504
Conversation
QA tests have started for PR 2504 at commit
QA tests have finished for PR 2504 at commit
Test FAILed.
retest this please
QA tests have started for PR 2504 at commit
QA tests have finished for PR 2504 at commit
Test FAILed.
Hey @sryza I haven't looked in detail at the changes but I have a few high-level comments. We only ever spill during a shuffle read or shuffle write, so I think it actually makes sense to just add a field in each of ShuffleReadMetrics and ShuffleWriteMetrics. As for the columns, I think it's OK to add more for these particular metrics. Eventually we'll have a mechanism for the user to toggle which columns he/she is interested in (through a series of checkboxes or something), and the default set could be a subset of all the columns. If you're worried about having too many columns, we could group the corresponding columns in […].
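For illustration only, the "just add a field" shape might look roughly like this - the class and field names below are hypothetical sketches, not the actual Spark code:

```scala
// Hypothetical sketch: spill counters added directly to each metrics class,
// one set for the read side and one for the write side.
class ShuffleReadMetrics {
  var remoteBytesRead: Long = 0L
  var spillBytesWritten: Long = 0L // spilled while aggregating fetched data
  var spillWriteTime: Long = 0L    // nanoseconds spent writing spill files
}

class ShuffleWriteMetrics {
  var shuffleBytesWritten: Long = 0L
  var spillBytesWritten: Long = 0L // spilled while producing map output (sort-based shuffle)
  var spillWriteTime: Long = 0L
}
```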
I considered that approach as well, but found that this one sat more elegantly with the metrics-collecting code. DiskObjectWriter, which is used both when spilling and when writing out final shuffle data, accepts a WriteMetrics (née ShuffleWriteMetrics) object and increments it as it writes. Having a more complex ShuffleWriteMetrics object would require pushing knowledge about the purpose of the write down into DiskObjectWriter so it could increment the appropriate fields. Alternatively, we could make a distinction between the metrics-collecting objects that DiskObjectWriter takes and the ShuffleWriteMetrics/ShuffleReadMetrics that go into the TaskMetrics, but that seems to me like a layer of complexity worth avoiding if we can.
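A minimal sketch of the shape being described, using simplified, hypothetical names rather than the actual Spark classes:

```scala
// Simplified illustration: the writer increments whatever metrics object it is
// handed, so it never needs to know whether it is writing spill data or final
// shuffle output - the caller decides that by choosing which object to pass in.
class WriteMetrics {
  var bytesWritten: Long = 0L
  var writeTime: Long = 0L // nanoseconds
}

class DiskObjectWriter(metrics: WriteMetrics) {
  def write(recordBytes: Array[Byte]): Unit = {
    val start = System.nanoTime()
    // ... serialize and append recordBytes to the underlying file stream ...
    metrics.bytesWritten += recordBytes.length
    metrics.writeTime += System.nanoTime() - start
  }
}

// Final shuffle output and spill files would then differ only in the metrics object used:
//   new DiskObjectWriter(shuffleWriteMetrics)       // normal shuffle write
//   new DiskObjectWriter(shuffleWriteSpillMetrics)  // spill during the write
```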
@andrewor14 @sryza what happened with this patch? These metrics would be nice to have.
@sryza @andrewor14, @shivaram just sent me a pointer to this -- just wanted to comment that this should be totally OK to add to the UI now that we have functionality to show/hide columns (#2867), because these could be hidden by default and included in the "Show Additional Metrics" list. |
@sryza I'm wondering about SPARK-3172 -- is that a common source of confusion / debugging difficulty? Just wondering whether it's worthwhile to fix that, since this patch would be simpler if you could just have a single set of spill metrics for the task.
I think SPARK-3172 is probably the more useful portion of this patch. Ultimately, users want to be able to trace their shuffle resource consumption back to the transformations they used. A single shuffle spill metric makes it difficult to determine which of the stage boundaries bordering a task (and from there, the transformation that triggered the stage boundary) is the culprit.
If we're doing map-side combining, this should be for the shuffle write, right? (Is there a way to distinguish between these cases?)
Ah yeah, looks like this is a copy/paste mistake. In this path, I think we're always on the map side, so the call should just be getOrCreateShuffleWriteSpillMetrics.
Ok that makes sense... now my concern is whether it's possible to do that without changing the developer API (see my in-line comment). A few other things:
Force-pushed from c854514 to b38fe51
Test build #27713 has started for PR 2504 at commit
I rewrote this patch with more of an aim to maintain compatibility than the last version. This is still not completely done and likely will not pass tests. I'm hoping to get validation on the general approach before polishing.
Test build #27713 has finished for PR 2504 at commit
Test FAILed.
@sryza is the idea here to maintain backward compatibility by re-using the ShuffleWriteMetrics class for spill metrics? I preferred the old approach, where you defined a DiskWriteMetrics class, and then ShuffleReadMetrics includes "val spillMetrics = new DiskWriteMetrics", and shuffle write metrics includes two DiskWriteMetrics (one for spill and one for normal write). To make everything backward compatible, I might just write the JsonMetrics in the same way as before for ShuffleWriteMetrics. I'm just concerned this approach is pretty hard to understand for a new person looking at this. Happy for others to weigh in though.
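Sketched out, the structure described here might look something like the following (names are illustrative and may not match the earlier revision of the patch exactly):

```scala
// Hypothetical sketch of the earlier approach: one small DiskWriteMetrics class,
// embedded once in the read metrics and twice in the write metrics.
class DiskWriteMetrics {
  var bytesWritten: Long = 0L
  var writeTime: Long = 0L
}

class ShuffleReadMetrics {
  var remoteBytesRead: Long = 0L
  val spillMetrics = new DiskWriteMetrics // spills while aggregating fetched data
}

class ShuffleWriteMetrics {
  val writeMetrics = new DiskWriteMetrics // the normal shuffle output write
  val spillMetrics = new DiskWriteMetrics // spills during sort-based shuffle write
}
```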
Both approaches are fine with me. I've had some discussions with @pwendell recently where he's stressed the importance of not breaking developer APIs. Patrick, care to weigh in?
Minor side comment: Can we update the title of this PR to something like […]? Right now this PR shows up weird on the PR dashboard.
Opening shuffle files can be very significant when the disk is contended, especially when using ext3. While writing data to a file can avoid hitting disk (and instead hit the buffer cache), opening a file always involves writing some metadata about the file to disk, so the open time can be a very significant portion of the shuffle write time. In one job I ran recently, the time to write shuffle data to the file was only 4ms for each task, but the time to open the file was about 100x as long (~400ms). When we add metrics about spilled data (#2504), we should ensure that the file open time is also included there.

Author: Kay Ousterhout <[email protected]>

Closes #4550 from kayousterhout/SPARK-3570 and squashes the following commits:

ea3a4ae [Kay Ousterhout] Added comment about excluded open time
fdc5185 [Kay Ousterhout] Improved comment
42b7e43 [Kay Ousterhout] Fixed parens for nanotime
2423555 [Kay Ousterhout] [SPARK-3570] Include time to open files in shuffle write time.

(cherry picked from commit d8ccf65)
Signed-off-by: Andrew Or <[email protected]>
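The gist of that commit, as a self-contained sketch (class, method, and variable names here are illustrative, not the actual patch):

```scala
// Simplified sketch: include the time spent opening the file in the recorded
// write time, since opening touches disk metadata even when the data writes
// themselves only hit the buffer cache.
import java.io.{File, FileOutputStream}

class WriteMetrics { var writeTimeNanos: Long = 0L }

def openForShuffleWrite(file: File, metrics: WriteMetrics): FileOutputStream = {
  val openStart = System.nanoTime()
  val stream = new FileOutputStream(file, /* append = */ true)
  metrics.writeTimeNanos += System.nanoTime() - openStart
  stream
}
```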
This has gone stale, so I'd like to close this issue pending further discussion.
The posted patch addresses both SPARK-3172 and SPARK-3577. It renames ShuffleWriteMetrics to WriteMetrics and uses it for tracking all three of shuffle write, spilling on the fetch side, and spilling on the write side (which only occurs during sort-based shuffle).
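As a rough sketch of that layout (field names below are illustrative, not necessarily the ones in the patch):

```scala
// Hypothetical sketch: one WriteMetrics class reused for all three kinds of
// disk writes that a task can perform during a shuffle.
class WriteMetrics {
  var bytesWritten: Long = 0L
  var writeTime: Long = 0L
}

class TaskMetrics {
  var shuffleWriteMetrics: Option[WriteMetrics] = None      // final shuffle output
  var shuffleWriteSpillMetrics: Option[WriteMetrics] = None // spills on the write side (sort-based shuffle only)
  var shuffleReadSpillMetrics: Option[WriteMetrics] = None  // spills on the fetch/aggregation side
}
```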
I'll fix and add tests if people think restructuring the metrics in this way makes sense.
I'm a little unsure about the name shuffleReadSpillMetrics, as spilling happens during aggregation, not read, but I had trouble coming up with something better.
I'm also unsure about which columns would be most useful to display in the UI - I remember some pushback on adding new columns. Ultimately these metrics will be most helpful if they can inform users whether, and by how much, they need to increase the number of partitions or increase spark.shuffle.memoryFraction. Reporting spill time tells users whether spilling is significantly impacting performance. Reporting spilled memory size can help them understand how much needs to change to avoid spilling.
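For context, the two remedies mentioned above, roughly as a user might apply them (illustrative values and a hypothetical input path; spark.shuffle.memoryFraction is a Spark 1.x setting with a default of 0.2):

```scala
// Illustrative only: how a user might react to large spill metrics.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

val conf = new SparkConf()
  .setAppName("spill-tuning-example")
  // Give shuffle aggregation more of the heap (Spark 1.x setting; default 0.2).
  .set("spark.shuffle.memoryFraction", "0.4")
val sc = new SparkContext(conf)

// ...or shuffle into more partitions so each one's aggregation fits in memory:
val counts = sc.textFile("hdfs:///some/input") // hypothetical path
  .map(line => (line, 1))
  .reduceByKey(_ + _, 400)                     // 400 reduce partitions instead of the default
```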
@pwendell any thoughts on this?