
Conversation

@sryza
Contributor

@sryza sryza commented Sep 23, 2014

The posted patch addresses both SPARK-3172 and SPARK-3577. It renames ShuffleWriteMetrics to WriteMetrics and uses it for tracking all three of shuffle write, spilling on the fetch side, and spilling on the write side (which only occurs during sort-based shuffle).
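
Roughly, the restructuring looks something like the sketch below (field and method names here are illustrative rather than copied from the diff):

```scala
// Sketch only: one small metrics class reused for every kind of disk write a task does.
class WriteMetrics extends Serializable {
  var bytesWritten: Long = 0L
  var writeTime: Long = 0L   // nanoseconds spent writing

  def incBytesWritten(n: Long): Unit = { bytesWritten += n }
  def incWriteTime(ns: Long): Unit = { writeTime += ns }
}

class TaskMetrics extends Serializable {
  // One WriteMetrics instance per kind of disk write:
  var shuffleWriteMetrics: Option[WriteMetrics] = None       // final shuffle output
  var shuffleWriteSpillMetrics: Option[WriteMetrics] = None  // spills during sort-based shuffle write
  var shuffleReadSpillMetrics: Option[WriteMetrics] = None   // spills during fetch-side aggregation
}
```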

I'll fix and add tests if people think restructuring the metrics in this way makes sense.

I'm a little unsure about the name shuffleReadSpillMetrics, as spilling happens during aggregation, not read, but I had trouble coming up with something better.

I'm also unsure what the most useful columns to display in the UI would be - I remember some pushback on adding new columns. Ultimately these metrics will be most helpful if they can tell users whether, and by how much, they need to increase the number of partitions or increase spark.shuffle.memoryFraction. Reporting spill time tells users whether spilling is significantly impacting performance. Reporting memory size can help them understand how much needs to change to avoid spilling.

@pwendell any thoughts on this?

@SparkQA

SparkQA commented Sep 23, 2014

QA tests have started for PR 2504 at commit c854514.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 23, 2014

QA tests have finished for PR 2504 at commit c854514.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class WriteMetrics extends Serializable

@SparkQA

SparkQA commented Sep 23, 2014

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20689/

@andrewor14
Contributor

retest this please

@SparkQA

SparkQA commented Sep 25, 2014

QA tests have started for PR 2504 at commit c854514.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 25, 2014

QA tests have finished for PR 2504 at commit c854514.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20815/

@andrewor14
Contributor

Hey @sryza, I haven't looked at the changes in detail, but I have a few high-level comments. We only ever spill during a shuffle read or a shuffle write, so I think it actually makes sense to just add fields to each of ShuffleWriteMetrics and ShuffleReadMetrics that report the number of bytes spilled ("Bytes Spilled") and the time spent spilling ("Spill Time" or something). The existing code groups the bytes spilled from both shuffle read and shuffle write into ShuffleWriteMetrics, which I think is slightly incorrect.
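
Concretely, something along these lines (a sketch; the spill fields are illustrative additions, not existing API):

```scala
// Sketch of the alternative: keep the existing shuffle metrics classes and add spill fields to each.
class ShuffleWriteMetrics extends Serializable {
  var shuffleBytesWritten: Long = 0L
  var shuffleWriteTime: Long = 0L
  var bytesSpilled: Long = 0L   // new: spills that happen while producing shuffle output
  var spillTime: Long = 0L      // new: time spent writing those spills
}

class ShuffleReadMetrics extends Serializable {
  var remoteBytesRead: Long = 0L
  var bytesSpilled: Long = 0L   // new: spills during fetch-side aggregation
  var spillTime: Long = 0L
}
```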

As for the columns, I think it's OK to add more for these particular metrics. Eventually we'll have a mechanism for the user to toggle which columns he/she is interested in (through a series of checkboxes or something), and the default set could be a subset of all the columns. If you're worried about having too many columns, we could group the corresponding columns in ShuffleReadMetrics and ShuffleWriteMetrics and separate them with a slash or something.

@sryza
Contributor Author

sryza commented Sep 25, 2014

I considered that approach as well, but found that this one sat more elegantly with the metrics-collecting code.

DiskObjectWriter, which is used both when spilling and when writing out final shuffle data, accepts a WriteMetrics (née ShuffleWriteMetrics) object and increments it as it writes. Having a more complex ShuffleWriteMetrics object would require pushing knowledge of the purpose of the write down into DiskObjectWriter so it could increment the appropriate fields. Alternatively, we could distinguish between the metric-collecting objects that DiskObjectWriter takes and the ShuffleWriteMetrics/ShuffleReadMetrics that go into the TaskMetrics, but that seems to me like a layer of complexity worth avoiding if we can.
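
For illustration, assuming the WriteMetrics shape sketched earlier, the writer side looks roughly like this (class and method names are illustrative, not the real Spark API):

```scala
import java.io.{File, FileOutputStream}

// Sketch: the writer has no idea whether it is producing a spill file or final
// shuffle output; it just increments whatever metrics object it was handed.
class DiskObjectWriter(file: File, writeMetrics: WriteMetrics) {
  private val out = new FileOutputStream(file, true)  // append mode

  def write(bytes: Array[Byte]): Unit = {
    val start = System.nanoTime()
    out.write(bytes)
    writeMetrics.incWriteTime(System.nanoTime() - start)
    writeMetrics.incBytesWritten(bytes.length)
  }

  def close(): Unit = out.close()
}

// The caller decides which metrics the writes count against, e.g.
//   new DiskObjectWriter(spillFile, shuffleReadSpillMetrics)
//   new DiskObjectWriter(outputFile, shuffleWriteMetrics)
```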

@hammer
Contributor

hammer commented Nov 10, 2014

@andrewor14 @sryza what happened with this patch? These metrics would be nice to have.

@kayousterhout
Contributor

@sryza @andrewor14, @shivaram just sent me a pointer to this -- just wanted to comment that this should be totally OK to add to the UI now that we have functionality to show/hide columns (#2867), because these could be hidden by default and included in the "Show Additional Metrics" list.

kayousterhout added a commit to kayousterhout/spark-1 that referenced this pull request Feb 12, 2015
Opening shuffle files can be very significant when the disk is
contended, especially when using ext3. While writing data to
a file can avoid hitting disk (and instead hit the buffer
cache), opening a file always involves writing some metadata
about the file to disk, so the open time can be a very significant
portion of the shuffle write time. In one recent job, the time to
write shuffle data to the file was only 4ms for each task, but
the time to open the file was about 100x as long (~400ms).

When we add metrics about spilled data (apache#2504), we should ensure
that the file open time is also included there.
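
A minimal sketch of what including the open time might look like, reusing the WriteMetrics shape sketched earlier (names are illustrative, not the actual patch):

```scala
import java.io.{File, FileOutputStream}

object OpenTimeSketch {
  // Time the file open and charge it to the same write metrics as the rest of the write.
  def openForWrite(file: File, metrics: WriteMetrics): FileOutputStream = {
    val start = System.nanoTime()
    val stream = new FileOutputStream(file, true)  // opening can hit disk to write file metadata
    metrics.incWriteTime(System.nanoTime() - start)
    stream
  }
}
```
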
@kayousterhout
Contributor

@sryza I'm wondering about SPARK-3172 -- is that a common source of confusion / debugging difficulty? Just wondering whether it's worthwhile to fix that, since this patch would be simpler if you could just have a single set of spill metrics for the task.

@sryza
Contributor Author

sryza commented Feb 12, 2015

I think SPARK-3172 is probably the more useful portion of this patch. Ultimately, users want to be able to trace their shuffle resource consumption back to the transformations they used. A single shuffle spill metric makes it difficult to determine which of the stage boundaries bordering a task (and from there, the transformation that triggered the stage boundary) is the culprit.

Contributor

If we're doing map-side combining, this should be for the shuffle write, right? (Is there a way to distinguish between these cases?)

Contributor Author

Ah yeah, looks like this is a copy/paste mistake. In this path, I think we're always on the map side, so the call should just be getOrCreateShuffleWriteSpillMetrics.

@kayousterhout
Contributor

OK, that makes sense... now my concern is whether it's possible to do that without changing the developer API (see my in-line comment).

A few other things:
- Can you fix the import ordering? :)
- It looks like, in ExternalSorter, right now you just keep track of the spilled bytes. Can you add the spill time there too? (See the sketch below.)
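
A rough idea of what that could look like, reusing the DiskObjectWriter sketched above (illustrative, not the actual ExternalSorter code):

```scala
object SpillTimingSketch {
  // Time the whole spill; the caller adds the returned value to the task's spill metrics.
  def spillToDisk(records: Iterator[Array[Byte]], writer: DiskObjectWriter): Long = {
    val start = System.nanoTime()
    records.foreach(writer.write)   // the writer already counts the bytes spilled
    writer.close()
    System.nanoTime() - start       // elapsed nanoseconds for this spill
  }
}
```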

@SparkQA

SparkQA commented Feb 19, 2015

Test build #27713 has started for PR 2504 at commit b38fe51.

  • This patch merges cleanly.

@sryza
Contributor Author

sryza commented Feb 19, 2015

I rewrote this patch with more of an aim to maintain compatibility than the last version. It is still not completely done and likely will not pass tests. I'm hoping to get validation on the general approach before polishing.

@SparkQA

SparkQA commented Feb 19, 2015

Test build #27713 has finished for PR 2504 at commit b38fe51.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27713/

@kayousterhout
Contributor

@sryza is the idea here to maintain backward compatibility by re-using the ShuffleWriteMetrics class for spill metrics?

I preferred the old approach, where you defined a DiskWriteMetrics class, ShuffleReadMetrics included a "val spillMetrics = new DiskWriteMetrics", and ShuffleWriteMetrics included two DiskWriteMetrics (one for spills and one for the normal write). To keep everything backward compatible, I might just write out the JSON metrics in the same way as before for ShuffleWriteMetrics. I'm just concerned this approach is pretty hard to understand for a new person looking at this. Happy for others to weigh in though.
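
A sketch of that nested layout (illustrative names, not the actual classes):

```scala
// One small reusable class, nested inside the existing shuffle metrics classes.
class DiskWriteMetrics extends Serializable {
  var bytesWritten: Long = 0L
  var writeTime: Long = 0L
}

class ShuffleReadMetrics extends Serializable {
  var remoteBytesRead: Long = 0L
  val spillMetrics = new DiskWriteMetrics   // spills during fetch-side aggregation
}

class ShuffleWriteMetrics extends Serializable {
  val writeMetrics = new DiskWriteMetrics   // the normal shuffle output write
  val spillMetrics = new DiskWriteMetrics   // spills during sort-based shuffle write
}
```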

@sryza
Contributor Author

sryza commented Feb 23, 2015

Both approaches are fine with me. I've had some discussions with @pwendell recently where he's stressed the importance of not breaking developer APIs. Patrick, care to weigh in?

@nchammas
Contributor

Minor side comment: Can we update the title of this PR to something like:

[SPARK-3172] [SPARK-3577] Concise description of changes goes here

Right now this PR shows up weird on the PR dashboard.

@sryza sryza changed the title SPARK-3172 and SPARK-3577 [SPARK-3172] [SPARK-3577] improve shuffle spill metrics Mar 9, 2015
asfgit pushed a commit that referenced this pull request Mar 24, 2015
Opening shuffle files can be very significant when the disk is
contended, especially when using ext3. While writing data to
a file can avoid hitting disk (and instead hit the buffer
cache), opening a file always involves writing some metadata
about the file to disk, so the open time can be a very significant
portion of the shuffle write time. In one job I ran recently, the time to
write shuffle data to the file was only 4ms for each task, but
the time to open the file was about 100x as long (~400ms).

When we add metrics about spilled data (#2504), we should ensure
that the file open time is also included there.

Author: Kay Ousterhout <[email protected]>

Closes #4550 from kayousterhout/SPARK-3570 and squashes the following commits:

ea3a4ae [Kay Ousterhout] Added comment about excluded open time
fdc5185 [Kay Ousterhout] Improved comment
42b7e43 [Kay Ousterhout] Fixed parens for nanotime
2423555 [Kay Ousterhout] [SPARK-3570] Include time to open files in shuffle write time.

(cherry picked from commit d8ccf65)
Signed-off-by: Andrew Or <[email protected]>
@pwendell
Contributor

pwendell commented Jun 4, 2015

This has gone stale so I'd like to close this issue pending further discussion.

@sryza sryza closed this Jun 4, 2015