[SPARK-27009][TEST] Add Standard Deviation to benchmark results #23914
Conversation
Tagging people who have modified this file before: @MaxGekk @wangyum @gengliangwang @dongjoon-hyun. Thanks!
Thank you for your first contribution, @yifeih. BTW,
We're writing some microbenchmark tests to get ready for some refactoring changes that we're making to the Spark shuffle code, in order to plan for an API that allows for resilient external shuffle. We're planning on running these benchmarks for each PR change that we make, comparing each PR's benchmark numbers to the previous PR's. Having standard deviation would help us see whether we unexpectedly introduced any increase in variance at each stage, and whether the changes in the average times from one PR to the next are significant or not. More info on the shuffle change that we're planning for: Let me know if that makes sense or not, thanks!
Maybe more context: eventually, the goal is to merge the refactoring changes for SPARK-25299 upstream. The benchmarks can be part of the merge upstream, but don't necessarily have to be. Their biggest purpose is to catch performance changes early on during the development process.
If these benchmarks are run on hosted systems such as Jenkins or CircleCI, it's useful to know whether there was significant variance in the benchmark execution due to the unstable environment.
I think it is good to show the standard deviation of benchmark results. In that case, should we update all the checked-in benchmark results?
Oh, I didn't know that those files were checked in. @gengliangwang, would you happen to know how those files were originally generated? We should probably run them in the same environment that they were originally generated in.
Got it, @yifeih and @mccheah. To @gengliangwang: it's okay to leave them for now; we will run them all with the 3.0.0+ releases. To @yifeih: the original intention was to keep track of all benchmarks on AWS EC2.
ok to test |
| "Per Row(ns)", "Relative") | ||
| out.println("-" * 96) | ||
| out.printf("%-40s %16s %12s %13s %10s %13s\n", name + ":", "Best/Avg Time(ms)", "Rate(M/s)", | ||
| "Per Row(ns)", "Relative", "Stdev (ms)") |
Currently, this adds the new value at the end. Can we move this to the Best/Avg Time(ms) group? For example, Best/Avg/Stdev Time(ms)?
Limiting: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative Stdev (ms)
--------------------------------------------------------------------------------------------------
Top-level column 231 / 240 4.3 230.7 1.0X 11
Nested column 1833 / 1957 0.5 1833.1 0.1X 68
I guess Best/Avg/Stdev (ms) will be enough because we use Per Row(ns) already.
@dongjoon-hyun I thought about this, but then the readability of the numbers might be worse.
How about making each of them a single column? E.g.
Best Time(ms) Avg Time(ms) Stdev Time(ms)
I don't have a strong preference here.
If we're going to add it, it doesn't make sense to do it separately at the end. I think best, avg, and stdev should be their own columns now.
I got it, @srowen.
Yup, I can separate it and place it after "avg" and before "rate".
OK, it looks like this now:
[info] agg w/o group: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] agg w/o group wholestage off 43309 43591 283 48.4 20.7 1.0X
[info] agg w/o group wholestage on 1032 1111 111 2032.4 0.5 42.0X
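For reference, a header printf along these lines would be consistent with that output; this is a sketch only, where the column widths and the separator length are guesses rather than the exact values from the PR diff, and out and name are the same variables as in the snippet earlier in this thread:
out.printf("%-40s %14s %14s %11s %12s %13s %10s\n", name + ":",
  "Best Time(ms)", "Avg Time(ms)", "Stdev(ms)", "Rate(M/s)", "Per Row(ns)", "Relative")
out.println("-" * 120)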
Test build #102852 has finished for PR 23914 at commit
retest this please.
val best = runTimes.min
val avg = runTimes.sum / runTimes.size
Result(avg / 1000000.0, num / (best / 1000.0), best / 1000000.0)
val stdev = math.sqrt(runTimes.map(time => math.pow(time - avg, 2)).sum / runTimes.size)
Not that it really matters, but (time - avg) * (time - avg) is fine here and faster than pow.
Super nit, but I'd suggest it's more reasonable to use the sample rather than the population stdev: divide by runTimes.size - 1. I suppose this means also checking that there are at least 2 runs.
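For reference (not from the PR itself), the two estimators under discussion, where x̄ is the mean of the n run times, are:
\sigma_{\text{population}} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2},
\qquad
s_{\text{sample}} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}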
Ah ok, I agree with you on both. If there aren't enough runs, should we just put "N/A" then?
You can assume (or assert) that there was at least 1 benchmarking run, or none of the metrics mean anything. (maybe it's already asserted)
While the sample stdev is not really defined for 1 run, "0" is fine.
I don't think it's asserted anywhere. I'll add it.
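Putting the thread's suggestions together, a minimal sketch of the computation could look like the following, assuming runTimes is the sequence of per-run timings in nanoseconds from the diff above; the assert message and the single-run fallback to 0 are assumptions based on this discussion, not the final patch:
assert(runTimes.nonEmpty, "benchmark must have at least one run")
val best = runTimes.min
val avg = runTimes.sum / runTimes.size
// Sample standard deviation: divide by (n - 1), and report 0 when there is only one run.
val stdev = if (runTimes.size > 1) {
  math.sqrt(runTimes.map { time =>
    val diff = (time - avg).toDouble  // (time - avg) * (time - avg) avoids math.pow
    diff * diff
  }.sum / (runTimes.size - 1))
} else 0.0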
| "Per Row(ns)", "Relative") | ||
| out.println("-" * 96) | ||
| out.printf("%-40s %16s %12s %13s %10s %13s\n", name + ":", "Best/Avg Time(ms)", "Rate(M/s)", | ||
| "Per Row(ns)", "Relative", "Stdev (ms)") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're going to add it, it doesn't make sense to do it separately at the end. I think best, avg, and stdev should be their own columns now.
Test build #102857 has finished for PR 23914 at commit
Test build #102879 has finished for PR 23914 at commit
Test build #102878 has finished for PR 23914 at commit
dongjoon-hyun left a comment:
+1, LGTM. And, I checked that all review comments are addressed.
Thank you so much, @yifeih , @srowen , @mccheah , @gengliangwang !
Merged to master.
What changes were proposed in this pull request?
Add standard deviation to the stats taken during benchmark testing.
How was this patch tested?
Manually ran a few benchmark tests locally and visually inspected the output.