[SPARK-27009][TEST] Add Standard Deviation to benchmark results #23914
Conversation
Tagging people who have modified this file before: @MaxGekk @wangyum @gengliangwang @dongjoon-hyun. Thanks!
Thank you for your first contribution, @yifeih. BTW,
We're writing some microbenchmark tests to get ready for some refactoring changes that we're making to the Spark shuffle code, in order to plan for an API that allows for resilient external shuffle. We're planning on running these benchmarks for each PR change that we make, comparing each PR's benchmark numbers to the previous PR's. Having standard deviation would help us see whether we unexpectedly introduced any increase in variance at each stage, and whether the changes in the average times from one PR to the next are significant or not. More info on the shuffle change that we're planning for: Let me know if that makes sense or not, thanks!
Maybe more context: eventually, the goal is to merge the refactoring changes for SPARK-25299 upstream. The benchmarks can be part of the merge upstream, but don't necessarily have to be. Their biggest purpose is to catch performance changes early on during the development process.
If these benchmarks are run on hosted systems such as Jenkins or CircleCI, it's useful to know whether there was significant variance in the benchmark execution due to the unstable environment.
I think it is good to show the standard deviation of benchmark results. In that case, should we update all the checked-in benchmark results?
Oh, I didn't know that those files were checked in. @gengliangwang, would you happen to know how those files were originally generated? We should probably run them in the same environment that they were originally generated in.
Got it, @yifeih and @mccheah. To @gengliangwang: it's okay to leave them for now; we will run them all with the 3.0.0+ releases. To @yifeih: the original intention was to keep track of all benchmarks on AWS EC2.
ok to test |
| "Per Row(ns)", "Relative") | ||
| out.println("-" * 96) | ||
| out.printf("%-40s %16s %12s %13s %10s %13s\n", name + ":", "Best/Avg Time(ms)", "Rate(M/s)", | ||
| "Per Row(ns)", "Relative", "Stdev (ms)") |
Currently, this adds the new value at the end. Can we move this to the Best/Avg Time(ms) group? For example, Best/Avg/Stdev Time(ms)?
Limiting: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative Stdev (ms)
--------------------------------------------------------------------------------------------------
Top-level column 231 / 240 4.3 230.7 1.0X 11
Nested column 1833 / 1957 0.5 1833.1 0.1X 68
I guess Best/Avg/Stdev (ms) will be enough because we use Per Row(ns) already.
@dongjoon-hyun I thought about this, but then the readability of the numbers might be worse.
How about making each of them a single column? E.g.
Best Time(ms) Avg Time(ms) Stdev Time(ms)
I don't have a strong preference here.
If we're going to add it, it doesn't make sense to do it separately at the end. I think best, avg, and stdev should be their own columns now.
I got it, @srowen.
Yup, I can separate it and place it after "avg" and before "rate".
OK, it looks like this now:
[info] agg w/o group: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] agg w/o group wholestage off 43309 43591 283 48.4 20.7 1.0X
[info] agg w/o group wholestage on 1032 1111 111 2032.4 0.5 42.0X
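For reference, a header printf along these lines would be consistent with that output; this is a sketch only, where the column widths and the separator length are guesses rather than the exact values from the PR diff, and out and name are the same variables as in the snippet earlier in this thread:
out.printf("%-40s %14s %14s %11s %12s %13s %10s\n", name + ":",
  "Best Time(ms)", "Avg Time(ms)", "Stdev(ms)", "Rate(M/s)", "Per Row(ns)", "Relative")
out.println("-" * 120)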
Test build #102852 has finished for PR 23914 at commit
retest this please.
val best = runTimes.min
val avg = runTimes.sum / runTimes.size
Result(avg / 1000000.0, num / (best / 1000.0), best / 1000000.0)
val stdev = math.sqrt(runTimes.map(time => math.pow(time - avg, 2)).sum / runTimes.size)
Not that it really matters, but (time - avg) * (time - avg) is fine here and faster than pow.
Super nit, but I'd suggest it's more reasonable to use the sample rather than the population stdev: divide by runTimes.size - 1. I suppose this means also checking that there are at least 2 runs.
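For reference (not from the PR itself), the two estimators under discussion, where x̄ is the mean of the n run times, are:
\sigma_{\text{population}} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2},
\qquad
s_{\text{sample}} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}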
Ah ok, I agree with you on both. If there aren't enough runs, should we just put "N/A" then?
You can assume (or assert) that there was at least 1 benchmarking run, or none of the metrics mean anything. (maybe it's already asserted)
While the sample stdev is not really defined for 1 run, "0" is fine.
I don't think it's asserted anywhere. I'll add it.
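Putting the thread's suggestions together, a minimal sketch of the computation could look like the following, assuming runTimes is the sequence of per-run timings in nanoseconds from the diff above; the assert message and the single-run fallback to 0 are assumptions based on this discussion, not the final patch:
assert(runTimes.nonEmpty, "benchmark must have at least one run")
val best = runTimes.min
val avg = runTimes.sum / runTimes.size
// Sample standard deviation: divide by (n - 1), and report 0 when there is only one run.
val stdev = if (runTimes.size > 1) {
  math.sqrt(runTimes.map { time =>
    val diff = (time - avg).toDouble  // (time - avg) * (time - avg) avoids math.pow
    diff * diff
  }.sum / (runTimes.size - 1))
} else 0.0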
| "Per Row(ns)", "Relative") | ||
| out.println("-" * 96) | ||
| out.printf("%-40s %16s %12s %13s %10s %13s\n", name + ":", "Best/Avg Time(ms)", "Rate(M/s)", | ||
| "Per Row(ns)", "Relative", "Stdev (ms)") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're going to add it, it doesn't make sense to do it separately at the end. I think best, avg, and stdev should be their own columns now.
Test build #102857 has finished for PR 23914 at commit
Test build #102879 has finished for PR 23914 at commit
Test build #102878 has finished for PR 23914 at commit
dongjoon-hyun left a comment:
+1, LGTM. And, I checked that all review comments are addressed.
Thank you so much, @yifeih , @srowen , @mccheah , @gengliangwang !
Merged to master.
What changes were proposed in this pull request?
Add standard deviation to the stats taken during benchmark testing.
How was this patch tested?
Manually ran a few benchmark tests locally and visually inspected the output.