Conversation

@yifeih (Contributor) commented Feb 27, 2019

What changes were proposed in this pull request?

Add standard deviation to the stats taken during benchmark testing.

How was this patch tested?

Manually ran a few benchmark tests locally and visually inspected the output.

@yifeih (Contributor, author) commented Feb 27, 2019

Tagging people who have modified this file before: @MaxGekk @wangyum @gengliangwang @dongjoon-hyun thanks!

@dongjoon-hyun (Member) commented:

Thank you for your first contribution, @yifeih. BTW, standard deviation is a useful value in general, but I'm wondering whether it's useful here in the benchmark rows. What is your purpose for this?

@yifeih (Contributor, author) commented Feb 28, 2019

We're writing some microbenchmark tests to prepare for refactoring changes to the Spark shuffle code, in order to plan for an API that allows for resilient external shuffle. We plan to run these benchmarks for each PR we make, comparing each PR's benchmark numbers against the previous PR's. Having the standard deviation would help us see whether we unexpectedly introduced any increase in variance at each stage, and whether the changes in the average times from one PR to the next are significant.
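As a hypothetical illustration (not code from this PR), a rough Welch-style z-score shows how the mean, sample stdev, and run count from two PRs' benchmark runs could be combined to judge significance; the run counts and numbers below are made up:

def approxZScore(mean1: Double, stdev1: Double, n1: Int,
                 mean2: Double, stdev2: Double, n2: Int): Double = {
  // Standard error of the difference between two independent sample means.
  val standardError = math.sqrt(stdev1 * stdev1 / n1 + stdev2 * stdev2 / n2)
  (mean2 - mean1) / standardError
}

// |z| well above ~2 suggests the shift in average time is unlikely to be
// run-to-run noise at this level of variance.
val z = approxZScore(mean1 = 43591, stdev1 = 283, n1 = 5,
                     mean2 = 44912, stdev2 = 310, n2 = 5)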

More info on the shuffle change that we're planning for:
Jira ticket here: https://issues.apache.org/jira/browse/SPARK-25299
Benchmarks PR here: palantir#498
Project plan here: https://docs.google.com/document/d/1NQW1XgJ6bwktjq5iPyxnvasV9g-XsauiRRayQcGLiik/edit#

Let me know if that makes sense or not, thanks!

@yifeih (Contributor, author) commented Feb 28, 2019

Some more context: we're making the changes in the palantir/spark repo for now because it's already integrated with CircleCI's containerized build system, which makes testing easier. Everything else in that PR is related to the specific shuffle code that we plan to refactor for SPARK-25299, but I thought this was something I could pull out, since it seemed generally useful to have.

Eventually, the goal is to merge the refactoring changes for SPARK-25299 upstream. The benchmarks can be part of that merge, but don't necessarily have to be; their biggest purpose is to catch performance changes early in the development process.

@mccheah (Contributor) commented Feb 28, 2019

If these benchmarks are run on hosted systems such as Jenkins or CircleCI, it's useful to know whether there was significant variance in the benchmark execution due to an unstable environment.

@gengliangwang (Member) commented:

I think it is good to show the standard deviation of benchmark results. In that case, should we update all of the existing benchmark result files?

$ find . -name "*Benchmark-results.txt"
./core/benchmarks/XORShiftRandomBenchmark-results.txt
./core/benchmarks/KryoSerializerBenchmark-results.txt
./core/benchmarks/KryoBenchmark-results.txt
./mllib/benchmarks/UDTSerializationBenchmark-results.txt
./external/avro/benchmarks/AvroReadBenchmark-results.txt
./external/avro/benchmarks/AvroWriteBenchmark-results.txt
./sql/core/benchmarks/MiscBenchmark-results.txt
./sql/core/benchmarks/ExternalAppendOnlyUnsafeRowArrayBenchmark-results.txt
./sql/core/benchmarks/JoinBenchmark-results.txt
./sql/core/benchmarks/WideSchemaBenchmark-results.txt
./sql/core/benchmarks/BloomFilterBenchmark-results.txt
./sql/core/benchmarks/InExpressionBenchmark-results.txt
./sql/core/benchmarks/CompressionSchemeBenchmark-results.txt
./sql/core/benchmarks/SortBenchmark-results.txt
./sql/core/benchmarks/DateTimeBenchmark-results.txt
./sql/core/benchmarks/UnsafeArrayDataBenchmark-results.txt
./sql/core/benchmarks/RangeBenchmark-results.txt
./sql/core/benchmarks/DatasetBenchmark-results.txt
./sql/core/benchmarks/BuiltInDataSourceWriteBenchmark-results.txt
./sql/core/benchmarks/DataSourceReadBenchmark-results.txt
./sql/core/benchmarks/CSVBenchmark-results.txt
./sql/core/benchmarks/PrimitiveArrayBenchmark-results.txt
./sql/core/benchmarks/HashedRelationMetricsBenchmark-results.txt
./sql/core/benchmarks/WideTableBenchmark-results.txt
./sql/core/benchmarks/AggregateBenchmark-results.txt
./sql/core/benchmarks/ColumnarBatchBenchmark-results.txt
./sql/core/benchmarks/JSONBenchmark-results.txt
./sql/core/benchmarks/FilterPushdownBenchmark-results.txt
./sql/catalyst/benchmarks/HashByteArrayBenchmark-results.txt
./sql/catalyst/benchmarks/HashBenchmark-results.txt
./sql/catalyst/benchmarks/UnsafeProjectionBenchmark-results.txt
./sql/hive/benchmarks/ObjectHashAggregateExecBenchmark-results.txt
./sql/hive/benchmarks/OrcReadBenchmark-results.txt

@yifeih (Contributor, author) commented Feb 28, 2019

Oh, I didn't know those files were checked in. @gengliangwang, would you happen to know how they were originally generated? We should probably run them in the same environment they were originally generated in.

@dongjoon-hyun (Member) commented Feb 28, 2019

Got it, @yifeih and @mccheah .

To @gengliangwang: it's okay to leave them for now. We will run them all with the 3.0.0+ releases.

To @yifeih: the original intention was to keep track of all benchmarks on an AWS EC2 r3.xlarge with the instance store disk. But in the community, we usually focus on the ratio during reviews.

@dongjoon-hyun (Member) commented:

ok to test

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-27009] Add Standard Deviation to Benchmark tests [SPARK-27009][TEST] Add Standard Deviation to benchmark results Feb 28, 2019

The printf under review (the new "Stdev (ms)" column is appended at the end):

out.printf("%-40s %16s %12s %13s %10s %13s\n", name + ":", "Best/Avg Time(ms)", "Rate(M/s)",
  "Per Row(ns)", "Relative", "Stdev (ms)")
out.println("-" * 96)
@dongjoon-hyun (Member):

Currently, this adds the new value at the end. Can we move it into the Best/Avg Time(ms) group? For example, Best/Avg/Stdev Time(ms)?

Limiting:                    Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative    Stdev (ms)
--------------------------------------------------------------------------------------------------
Top-level column                   231 /  240          4.3         230.7       1.0X            11
Nested column                     1833 / 1957          0.5        1833.1       0.1X            68

@dongjoon-hyun (Member):

I guess Best/Avg/Stdev (ms) will be enough because we use Per Row(ns) already.

@gengliangwang (Member) commented Feb 28, 2019:

@dongjoon-hyun I thought about this, but then the readability of the numbers might be worse. How about making each of them its own column? E.g.:
Best Time(ms) Avg Time(ms) Stdev Time(ms)
I don't have a strong preference here.

@srowen (Member):

If we're going to add it, it doesn't make sense to do it separately at the end. I think best, avg, and stdev should be their own columns now.

@dongjoon-hyun (Member):

I got it, @srowen .

@yifeih (Contributor, author):

Yup, I can separate it and place it after "avg" and before "rate".

@yifeih (Contributor, author):

OK, it looks like this now:

[info] agg w/o group:                            Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] agg w/o group wholestage off                      43309          43591         283         48.4          20.7       1.0X
[info] agg w/o group wholestage on                        1032           1111         111       2032.4           0.5      42.0X
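
For reference, a minimal sketch of the header printf behind this layout (the width specifiers and setup values are assumptions, not the exact committed code):

val out = System.out                 // the benchmark's output PrintStream
val name = "agg w/o group"           // example benchmark name
out.printf("%-40s %14s %14s %11s %12s %13s %10s\n",
  name + ":", "Best Time(ms)", "Avg Time(ms)", "Stdev(ms)",
  "Rate(M/s)", "Per Row(ns)", "Relative")
out.println("-" * 120)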

@SparkQA commented Feb 28, 2019

Test build #102852 has finished for PR 23914 at commit 85e0487.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang (Member) commented:

retest this please.

The computation under review (the stdev line is the addition):

val best = runTimes.min
val avg = runTimes.sum / runTimes.size
Result(avg / 1000000.0, num / (best / 1000.0), best / 1000000.0)
val stdev = math.sqrt(runTimes.map(time => math.pow(time - avg, 2)).sum / runTimes.size)
@srowen (Member):

Not that it really matters, but (time - avg) * (time - avg) is fine here and faster than pow.
Super nit but I'd suggest it's more reasonable to use the sample rather than population stdev: divide by runTimes.size - 1. I suppose this means also checking that there are at least 2 runs.

@yifeih (Contributor, author):

Ah ok, I agree with you on both. If there aren't enough runs, should we just put "N/A" then?

@srowen (Member):

You can assume (or assert) that there was at least 1 benchmarking run, or none of the metrics mean anything. (maybe it's already asserted)

While the sample stdev is not really defined for 1 run, "0" is fine.

@yifeih (Contributor, author):

I don't think it's asserted anywhere. I'll add it.
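
Putting the thread together, a sketch of what the revised computation might look like (assuming runTimes: Seq[Long] holds per-run times in nanoseconds, as in the surrounding code; an illustration, not the exact committed diff):

assert(runTimes.nonEmpty, "benchmark should have run at least once")
val best = runTimes.min
val avg = runTimes.sum / runTimes.size
// Sample stdev: squared differences without math.pow, divided by n - 1;
// defined as 0 for a single run, per the discussion above.
val stdev = if (runTimes.size > 1) {
  val sumOfSquares = runTimes.map { time =>
    val diff = (time - avg).toDouble
    diff * diff
  }.sum
  math.sqrt(sumOfSquares / (runTimes.size - 1))
} else 0.0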

@SparkQA commented Feb 28, 2019

Test build #102857 has finished for PR 23914 at commit 85e0487.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 1, 2019

Test build #102879 has finished for PR 23914 at commit 47990f4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 1, 2019

Test build #102878 has finished for PR 23914 at commit ef0066d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member) left a review:

+1, LGTM. I checked that all review comments are addressed.

Thank you so much, @yifeih , @srowen , @mccheah , @gengliangwang !

Merged to master.
