[SPARK-3172] [SPARK-3577] improve shuffle spill metrics #2504
Conversation
QA tests have started for PR 2504 at commit
QA tests have finished for PR 2504 at commit
Test FAILed.
retest this please
QA tests have started for PR 2504 at commit
QA tests have finished for PR 2504 at commit
Test FAILed.
Hey @sryza I haven't looked in detail at the changes but I have a few high-level comments. We only ever spill during a shuffle read or shuffle write, so I think it actually makes sense to just add a field in each of ShuffleReadMetrics and ShuffleWriteMetrics. As for the columns, I think it's OK to add more for these particular metrics. Eventually we'll have a mechanism for the user to toggle which columns he/she is interested in (through a series of checkboxes or something), and the default set could be a subset of all the columns. If you're worried about having too many columns, we could group the corresponding columns in […].
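For illustration only, the "just add a field" shape might look roughly like this - the class and field names below are hypothetical sketches, not the actual Spark code:

```scala
// Hypothetical sketch: spill counters added directly to each metrics class,
// one set for the read side and one for the write side.
class ShuffleReadMetrics {
  var remoteBytesRead: Long = 0L
  var spillBytesWritten: Long = 0L // spilled while aggregating fetched data
  var spillWriteTime: Long = 0L    // nanoseconds spent writing spill files
}

class ShuffleWriteMetrics {
  var shuffleBytesWritten: Long = 0L
  var spillBytesWritten: Long = 0L // spilled while producing map output (sort-based shuffle)
  var spillWriteTime: Long = 0L
}
```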
I considered that approach as well, but found that this one sat more elegantly with the metrics-collecting code. DiskObjectWriter, which is used both when spilling and when writing out final shuffle data, accepts a WriteMetrics (née ShuffleWriteMetrics) object and increments it as it writes. Having a more complex ShuffleWriteMetrics object would require pushing knowledge about the purpose of the write down into DiskObjectWriter so it could increment the appropriate fields. Alternatively, we could make a distinction between the metrics-collecting objects that DiskObjectWriter takes and the ShuffleWriteMetrics/ShuffleReadMetrics that go into the TaskMetrics, but that seems to me like a layer of complexity worth avoiding if we can.
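A minimal sketch of the shape being described, using simplified, hypothetical names rather than the actual Spark classes:

```scala
// Simplified illustration: the writer increments whatever metrics object it is
// handed, so it never needs to know whether it is writing spill data or final
// shuffle output - the caller decides that by choosing which object to pass in.
class WriteMetrics {
  var bytesWritten: Long = 0L
  var writeTime: Long = 0L // nanoseconds
}

class DiskObjectWriter(metrics: WriteMetrics) {
  def write(recordBytes: Array[Byte]): Unit = {
    val start = System.nanoTime()
    // ... serialize and append recordBytes to the underlying file stream ...
    metrics.bytesWritten += recordBytes.length
    metrics.writeTime += System.nanoTime() - start
  }
}

// Final shuffle output and spill files would then differ only in the metrics object used:
//   new DiskObjectWriter(shuffleWriteMetrics)       // normal shuffle write
//   new DiskObjectWriter(shuffleWriteSpillMetrics)  // spill during the write
```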
@andrewor14 @sryza what happened with this patch? These metrics would be nice to have.
@sryza @andrewor14, @shivaram just sent me a pointer to this -- just wanted to comment that this should be totally OK to add to the UI now that we have functionality to show/hide columns (#2867), because these could be hidden by default and included in the "Show Additional Metrics" list. |
@sryza I'm wondering about SPARK-3172 -- is that a common source of confusion / debugging difficulty? Just wondering whether it's worthwhile to fix that, since this patch would be simpler if you could just have a single set of spill metrics for the task.
I think SPARK-3172 is probably the more useful portion of this patch. Ultimately, users want to be able to trace their shuffle resource consumption back to the transformations they used. A single shuffle spill metric makes it difficult to determine which of the stage boundaries bordering a task (and from there, the transformation that triggered the stage boundary) is the culprit.
If we're doing map-side combining, this should be for the shuffle write, right? (Is there a way to distinguish between these cases?)
Ah yeah, looks like this is a copy/paste mistake. In this path, I think we're always on the map side, so the call should just be getOrCreateShuffleWriteSpillMetrics.
Ok that makes sense... now my concern is whether it's possible to do that without changing the developer API (see my in-line comment). A few other things:
Force-pushed from c854514 to b38fe51
Test build #27713 has started for PR 2504 at commit
I rewrote this patch with more of an aim to maintain compatibility than the last version. This is still not completely done and likely will not pass tests. I'm hoping to get validation on the general approach before polishing.
Test build #27713 has finished for PR 2504 at commit
Test FAILed.
@sryza is the idea here to maintain backward compatibility by re-using the ShuffleWriteMetrics class for spill metrics? I preferred the old approach, where you defined a DiskWriteMetrics class, and then ShuffleReadMetrics includes "val spillMetrics = new DiskWriteMetrics", and shuffle write metrics includes two DiskWriteMetrics (one for spill and one for normal write). To make everything backward compatible, I might just write the JsonMetrics in the same way as before for ShuffleWriteMetrics. I'm just concerned this approach is pretty hard to understand for a new person looking at this. Happy for others to weigh in though.
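Sketched out, the structure described here might look something like the following (names are illustrative and may not match the earlier revision of the patch exactly):

```scala
// Hypothetical sketch of the earlier approach: one small DiskWriteMetrics class,
// embedded once in the read metrics and twice in the write metrics.
class DiskWriteMetrics {
  var bytesWritten: Long = 0L
  var writeTime: Long = 0L
}

class ShuffleReadMetrics {
  var remoteBytesRead: Long = 0L
  val spillMetrics = new DiskWriteMetrics // spills while aggregating fetched data
}

class ShuffleWriteMetrics {
  val writeMetrics = new DiskWriteMetrics // the normal shuffle output write
  val spillMetrics = new DiskWriteMetrics // spills during sort-based shuffle write
}
```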
Both approaches are fine with me. I've had some discussions with @pwendell recently where he's stressed the importance of not breaking developer APIs. Patrick, care to weigh in?
Minor side comment: Can we update the title of this PR to something like […]? Right now this PR shows up weird on the PR dashboard.
Opening shuffle files can be very significant when the disk is contended, especially when using ext3. While writing data to a file can avoid hitting disk (and instead hit the buffer cache), opening a file always involves writing some metadata about the file to disk, so the open time can be a very significant portion of the shuffle write time. In one job I ran recently, the time to write shuffle data to the file was only 4ms for each task, but the time to open the file was about 100x as long (~400ms). When we add metrics about spilled data (#2504), we should ensure that the file open time is also included there.

Author: Kay Ousterhout <[email protected]>

Closes #4550 from kayousterhout/SPARK-3570 and squashes the following commits:

ea3a4ae [Kay Ousterhout] Added comment about excluded open time
fdc5185 [Kay Ousterhout] Improved comment
42b7e43 [Kay Ousterhout] Fixed parens for nanotime
2423555 [Kay Ousterhout] [SPARK-3570] Include time to open files in shuffle write time.

(cherry picked from commit d8ccf65)
Signed-off-by: Andrew Or <[email protected]>
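The gist of that commit, as a self-contained sketch (class, method, and variable names here are illustrative, not the actual patch):

```scala
// Simplified sketch: include the time spent opening the file in the recorded
// write time, since opening touches disk metadata even when the data writes
// themselves only hit the buffer cache.
import java.io.{File, FileOutputStream}

class WriteMetrics { var writeTimeNanos: Long = 0L }

def openForShuffleWrite(file: File, metrics: WriteMetrics): FileOutputStream = {
  val openStart = System.nanoTime()
  val stream = new FileOutputStream(file, /* append = */ true)
  metrics.writeTimeNanos += System.nanoTime() - openStart
  stream
}
```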
This has gone stale, so I'd like to close this issue pending further discussion.
The posted patch addresses both SPARK-3172 and SPARK-3577. It renames ShuffleWriteMetrics to WriteMetrics and uses it for tracking all three of shuffle write, spilling on the fetch side, and spilling on the write side (which only occurs during sort-based shuffle).
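As a rough sketch of that layout (field names below are illustrative, not necessarily the ones in the patch):

```scala
// Hypothetical sketch: one WriteMetrics class reused for all three kinds of
// disk writes that a task can perform during a shuffle.
class WriteMetrics {
  var bytesWritten: Long = 0L
  var writeTime: Long = 0L
}

class TaskMetrics {
  var shuffleWriteMetrics: Option[WriteMetrics] = None      // final shuffle output
  var shuffleWriteSpillMetrics: Option[WriteMetrics] = None // spills on the write side (sort-based shuffle only)
  var shuffleReadSpillMetrics: Option[WriteMetrics] = None  // spills on the fetch/aggregation side
}
```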
I'll fix and add tests if people think restructuring the metrics in this way makes sense.
I'm a little unsure about the name shuffleReadSpillMetrics, as spilling happens during aggregation, not read, but I had trouble coming up with something better.
I'm also unsure about which columns would be most useful to display in the UI - I remember some pushback on adding new columns. Ultimately these metrics will be most helpful if they can inform users whether, and by how much, they need to increase the number of partitions or increase spark.shuffle.memoryFraction. Reporting spill time tells users whether spilling is significantly impacting performance. Reporting spilled memory size can help them understand how much needs to change to avoid spilling.
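For context, the two remedies mentioned above, roughly as a user might apply them (illustrative values and a hypothetical input path; spark.shuffle.memoryFraction is a Spark 1.x setting with a default of 0.2):

```scala
// Illustrative only: how a user might react to large spill metrics.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

val conf = new SparkConf()
  .setAppName("spill-tuning-example")
  // Give shuffle aggregation more of the heap (Spark 1.x setting; default 0.2).
  .set("spark.shuffle.memoryFraction", "0.4")
val sc = new SparkContext(conf)

// ...or shuffle into more partitions so each one's aggregation fits in memory:
val counts = sc.textFile("hdfs:///some/input") // hypothetical path
  .map(line => (line, 1))
  .reduceByKey(_ + _, 400)                     // 400 reduce partitions instead of the default
```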
@pwendell any thoughts on this?