[WIP][SPARK-19937] Collect metrics of block sizes when shuffle. #17276
Conversation
Force-pushed from d69f8f9 to 648ceaa.
Test build #74449 has started for PR 17276 at commit

Test build #74448 has started for PR 17276 at commit
Force-pushed from 27d9ed5 to e2e56d3.
Test build #74515 has finished for PR 17276 at commit

Test build #74509 has finished for PR 17276 at commit

Test build #74605 has finished for PR 17276 at commit

Test build #74607 has finished for PR 17276 at commit

Test build #74665 has finished for PR 17276 at commit
@squito @kayousterhout
Test build #74917 has finished for PR 17276 at commit

Test build #74939 has started for PR 17276 at commit

Test build #74931 has finished for PR 17276 at commit

Test build #74938 has finished for PR 17276 at commit
no worries, I'm just not sure when to look again, with all the notifications from your commits. Committers tend to think that something is ready to review if it's passing tests, so it's helpful to add those labels if that's not the case.
You are such a kind person. Thanks a lot again.
Test build #75098 has finished for PR 17276 at commit

Test build #75145 has finished for PR 17276 at commit

Test build #75146 has finished for PR 17276 at commit

Test build #75163 has finished for PR 17276 at commit

Test build #75220 has finished for PR 17276 at commit
@squito
Think about this scenario: the maximum is small, but thousands of blocks are underestimated; that is why I also added some logging for debugging in
Test build #75229 has finished for PR 17276 at commit
    underestimatedBlocksSize += partitionLengths[i];
  }
}
writeMetrics.incUnderestimatedBlocksSize(underestimatedBlocksSize);
This will essentially be the sum of every block above the average size - how is this supposed to be leveraged?
For example:
1, 2, 3, 4, 5, 6 => 15
1, 2, 3, 4, 5, 10 => 15
(This ended up being a degenerate example - but in general, I am curious what the value of this metric is.)
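A quick sketch of the arithmetic behind the degenerate example above (the helper name is illustrative, not the PR's actual code): summing every block strictly above the average gives 15 in both cases.

// Illustrative only: sum of partition lengths strictly above the average.
object UnderestimatedSumExample {
  def sumAboveAverage(lengths: Seq[Long]): Long = {
    val avg = lengths.sum.toDouble / lengths.size
    lengths.filter(_ > avg).sum
  }

  def main(args: Array[String]): Unit = {
    println(sumAboveAverage(Seq(1L, 2L, 3L, 4L, 5L, 6L)))   // avg 3.5  -> 4 + 5 + 6 = 15
    println(sumAboveAverage(Seq(1L, 2L, 3L, 4L, 5L, 10L)))  // avg ~4.2 -> 5 + 10    = 15
  }
}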
      taskContext.taskAttemptId(), hc.getAvgSize(),
      underestimatedBlocksNum, underestimatedBlocksSize, distributionStr);
  }
}
We need to handle the case of mapStatus not being HighlyCompressedMapStatus as well.
In CompressedMapStatus, the block sizes are accurate, so I would hesitate to add that log.
The value is not accurate - it is a log-1.1 'compression' which converts the long to a byte and caps the value at 255.
So there are two errors introduced: it over-estimates the actual block size when the compressed value is < 255 [1] (which is something this PR currently ignores), and when the block size goes above roughly 34,000 MB it under-estimates the block size (which is higher than what Spark currently supports due to the 2G limitation).
[1] I did not realize it always over-estimates; if the current PR is targeting only blocks which are under-estimated, I would agree that not handling CompressedMapStatus for the time being might be ok - though it would be good to add a comment to that effect explaining why we don't need to handle it.
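For context, a minimal sketch of the log-1.1 encoding described above, assuming ceil-based rounding and a 255 cap (modelled on the map-status size compression, but simplified here; treat the exact constants and rounding as assumptions):

object SizeEncodingSketch {
  // Encode a block size into one byte: ceil(log_1.1(size)), capped at 255.
  def compress(size: Long): Byte = {
    if (size == 0) 0
    else if (size <= 1L) 1
    else math.min(255, math.ceil(math.log(size) / math.log(1.1)).toInt).toByte
  }

  // Decode: 1.1^encoded. Because of the ceil above, the decoded value is >= the
  // original size whenever the encoding is below the cap, i.e. it over-estimates.
  def decompress(encoded: Byte): Long = {
    if (encoded == 0) 0L else math.pow(1.1, encoded & 0xFF).toLong
  }

  def main(args: Array[String]): Unit = {
    val tenMb = 10L * 1024 * 1024
    println(decompress(compress(tenMb)))          // slightly above 10 MB (over-estimate)
    println(decompress(255.toByte) / (1L << 20))  // the cap: roughly 34,000 MB
  }
}

Any block larger than the cap decodes to that same ~34,000 MB value, which is where the under-estimation of very large blocks comes from.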
      taskContext.taskAttemptId(), hc.getAvgSize(),
      underestimatedBlocksNum, underestimatedBlocksSize, distributionStr);
  }
}
This computation seems repeated - we should refactor it out into a method of its own and not duplicate it across classes.
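For instance, the shared computation could be pulled into a small helper along these lines (purely a hypothetical sketch; the names are illustrative, not the PR's actual API):

// Hypothetical helper so UnsafeShuffleWriter, SortShuffleWriter, etc. can share the logic.
object ShuffleBlockSizeStats {
  /** Returns (count, total size) of blocks whose real length exceeds the reported average. */
  def underestimated(partitionLengths: Array[Long], avgSize: Long): (Int, Long) = {
    var num = 0
    var size = 0L
    partitionLengths.foreach { len =>
      if (len > avgSize) {
        num += 1
        size += len
      }
    }
    (num, size)
  }
}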
| s" (0, 0.25, 0.5, 0.75, 1.0) is $distributionStr.") | ||
| case None => // no-op | ||
| } | ||
| } |
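As a side note on the distribution string logged in the snippet above, here is one hypothetical way a quantile summary at (0, 0.25, 0.5, 0.75, 1.0) could be computed from the partition lengths (names and formatting are assumptions, not the PR's code):

// Illustrative only: nearest-rank quantiles of the block sizes.
object BlockSizeDistributionSketch {
  def distributionStr(lengths: Array[Long]): String = {
    val sorted = lengths.sorted
    Seq(0.0, 0.25, 0.5, 0.75, 1.0)
      .map(q => sorted((q * (sorted.length - 1)).round.toInt))
      .mkString(", ")
  }

  def main(args: Array[String]): Unit = {
    println(distributionStr(Array(1L, 2L, 3L, 4L, 5L, 10L))) // "1, 2, 4, 5, 10"
  }
}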
Isn't this similar to what is in core/src/main/java/org/apache/spark/shuffle/sort/UnsafeShuffleWriter.java, etc. above? Or is it different?
The code looks the same, but it is written differently (and is more expensive here).
private[spark] def incRemoteBlocksFetched(v: Long): Unit = _remoteBlocksFetched.add(v)
private[spark] def incLocalBlocksFetched(v: Long): Unit = _localBlocksFetched.add(v)
private[spark] def incRemoteBytesRead(v: Long): Unit = _remoteBytesRead.add(v)
private[spark] def incRemoteBytesReadToMem(v: Long): Unit = _remoteBytesReadToMem.add(v)
The way it seems to be coded up, this will end up being everything fetched from shuffle - and we can already infer that: remote bytes read + local bytes read.
Or did I miss something here?
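To illustrate the concern: as long as every remote block is fetched into memory, the new counter duplicates the existing one, and the shuffle total is already derivable (a toy sketch with illustrative names, not the PR's code):

// Toy model of the reviewer's point.
final case class ReadMetricsSketch(
    remoteBytesRead: Long,
    localBytesRead: Long,
    remoteBytesReadToMem: Long) {
  def totalBytesRead: Long = remoteBytesRead + localBytesRead
}

object ReadMetricsSketchDemo {
  def main(args: Array[String]): Unit = {
    // If all remote fetches land in memory, the new counter equals remoteBytesRead.
    val m = ReadMetricsSketch(remoteBytesRead = 100, localBytesRead = 40, remoteBytesReadToMem = 100)
    println(m.remoteBytesReadToMem == m.remoteBytesRead) // true
    println(m.totalBytesRead)                            // 140
  }
}

The separate counter only becomes informative once some remote bytes can be fetched to disk instead of memory, presumably the point of the follow-up work in SPARK-19659.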
@mridulm
The value of
Basically, the metrics are for evaluating stability and performance during shuffle read. I want to achieve that smaller
Test build #75238 has finished for PR 17276 at commit

Test build #75239 has finished for PR 17276 at commit
@jinxing64 I am unclear about the intention, by the way - do you expect shuffle reads to be informed by metrics from the mapper side? I probably got that wrong.
@mridulm
Test build #76454 has finished for PR 17276 at commit
What changes were proposed in this pull request?
Metrics of block sizes (during shuffle) should be collected for later analysis. This is helpful for analysis when data skew or OOM happens (even though maxBytesInFlight is set).
This is a preparation for SPARK-19659.
How was this patch tested?
Unit tests in HistoryServerSuite and JsonProtocolSuite.