We are still seeing unexpected results in the pystats diffs.
@markshannon suggested I look at a recent PR that adds a globals-to-constants pass, where there should be some changes, but not at the level we are seeing. The original stats diff for that PR didn't include the per-benchmark results, so I re-ran it.
These two sets of results (Mark's run, and my later run of the same commits) are in strong agreement, so there doesn't seem to be anything attributable to randomness or things that change between runs. I also ruled out problems with summation (i.e. the totals across all benchmarks not being equal to the sum of all benchmarks). I also don't think there is cross-benchmark contamination -- each benchmark is run with a separate invocation of `pyperformance`, and the `/tmp/py_stats` directory is empty in between (I added some asserts to the run to confirm this).
Drilling down on the numbers, the most changed uop in terms of execution count is `TO_BOOL_ALWAYS_TRUE`:
Name | Base Count | Head Count | Change |
---|---|---|---|
`TO_BOOL_ALWAYS_TRUE` | 12,145,706 | 30,824,186 | 153.8% |
This difference is entirely attributable to two benchmarks:
Benchmark | Base | Head |
---|---|---|
go | 5,840 | 129,400 |
pycparser | 11,120,400 | 29,675,320 |
The `go` one is nice to work with because it has no dependencies. Running that benchmark 10 times against the head and base branches produces these numbers exactly every time, so I don't think there is anything non-deterministic in the benchmark.
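Concretely, the determinism check amounts to a loop like this (my own sketch, not the actual harness; it assumes pystats dumps land in `/tmp/py_stats` as `key : value` lines, and the key substrings are guesses):

```python
import os
import shutil
import subprocess

PYTHON = "/path/to/cpython/python"  # hypothetical path to an --enable-pystats build
STATS_DIR = "/tmp/py_stats"


def summed_counter(substring):
    """Sum every pystats 'key : value' line whose key contains `substring`."""
    total = 0
    for name in os.listdir(STATS_DIR):
        with open(os.path.join(STATS_DIR, name)) as f:
            for line in f:
                key, sep, value = line.rpartition(":")
                if sep and substring.lower() in key.lower():
                    total += int(value)
    return total


seen = set()
for _ in range(10):
    shutil.rmtree(STATS_DIR, ignore_errors=True)
    os.makedirs(STATS_DIR)
    subprocess.run([PYTHON, "-m", "pyperformance", "run", "-b", "go"], check=True)
    seen.add((summed_counter("JUMP_BACKWARD"), summed_counter("optimization attempt")))

# The same pair of numbers comes out on every single run.
assert len(seen) == 1, seen
```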
The other thing that I think @markshannon mentioned should be completely unchanged by the PR is the number of optimization attempts. Yet many more benchmarks contribute to a change there:
Benchmark | Base | Head |
---|---|---|
async_generators | 1060 | 1260 |
asyncio_websockets | 420 | 480 |
concurrent_imap | 4462 | 4465 |
dask | 4274 | 4249 |
deltablue | 440 | 18900 |
docutils | 11920 | 11980 |
genshi | 35560 | 35640 |
go | 860 | 74920 |
html5lib | 1020 | 1040 |
mypy2 | 16536 | 16597 |
pycparser | 1200 | 3320 |
regex_v8 | 1560 | 2340 |
sqlglot | 3280 | 3320 |
sqlglot_optimize | 5160 | 5220 |
sqlglot_parse | 380 | 440 |
sqlglot_transpile | 1340 | 1400 |
sympy | 13798 | 13903 |
tornado_http | 1080 | 1140 |
typing_runtime_protocols | 700 | 780 |
Again, looking at the `go` benchmark, I can reproduce these numbers exactly locally in isolation.
Since "optimization attempts" are counted in "JUMP_BACKWARD" (when reaching a threshold), I also compared that, and I get the following Tier 1 counts for "JUMP_BACKWARD":
| | Base | Head |
|---|---|---|
| Optimization attempts | 860 | 74,920 |
| `JUMP_BACKWARD` | 14,860 | 28,402,880 |
These numbers are not proportional (optimization attempts go up roughly 87x, while `JUMP_BACKWARD` goes up roughly 1,900x), but they do at least move in the same direction.
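If I understand the Tier 1 to Tier 2 warm-up correctly, that loose coupling is expected: each `JUMP_BACKWARD` site counts down to a threshold, and a failed optimization attempt resets the counter with exponential backoff, so attempts grow far more slowly than raw executions. A conceptual sketch (not CPython source; the threshold and backoff values are made up):

```python
class BackedgeSite:
    """Toy model of one JUMP_BACKWARD instruction's warm-up counter."""

    INITIAL_THRESHOLD = 16  # illustrative only, not CPython's real value

    def __init__(self, stats):
        self.stats = stats
        self.backoff = self.INITIAL_THRESHOLD
        self.countdown = self.INITIAL_THRESHOLD

    def jump_backward(self):
        self.stats["JUMP_BACKWARD"] += 1
        self.countdown -= 1
        if self.countdown <= 0:
            self.stats["Optimization attempts"] += 1
            if not self.try_optimize():
                # Failed attempt: back off exponentially before trying again,
                # so attempts stay well below the raw execution count.
                self.backoff *= 2
                self.countdown = self.backoff

    def try_optimize(self):
        return False  # stand-in for projecting/entering a Tier 2 trace
```

Under that kind of scheme the two counters should move together without being proportional, which matches the table -- but it still doesn't explain why `JUMP_BACKWARD` itself moves at all.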
I did confirm the obvious: the benchmark is doing the same amount of work and running the same number of times in both cases (just by adding `print`s and counting).
I'm completely stumped as to why that PR changes the number of `JUMP_BACKWARD` executions and thus the number of optimization attempts -- it doesn't seem like that should be affected at all. But it does seem like that could be the cause of a lot of changes "downstream".
I've created a gist to reproduce this that may be helpful. Given a path to a CPython checkout with an `--enable-pystats` build, it runs the `go` benchmark and reports the number of optimization attempts and executions of `JUMP_BACKWARD`.
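A driver of that shape might look roughly like this (my sketch, not the gist itself; the pystats key substrings are assumptions):

```python
#!/usr/bin/env python3
"""Sketch of such a driver: given a CPython checkout built with
--enable-pystats, run the `go` benchmark and report optimization attempts
and JUMP_BACKWARD executions.  Key names below are guesses."""

import collections
import os
import shutil
import subprocess
import sys

STATS_DIR = "/tmp/py_stats"


def collect_totals():
    """Sum the 'key : value' lines across all per-process pystats dumps."""
    totals = collections.Counter()
    for name in os.listdir(STATS_DIR):
        with open(os.path.join(STATS_DIR, name)) as f:
            for line in f:
                key, sep, value = line.rpartition(":")
                if sep:
                    try:
                        totals[key.strip()] += int(value)
                    except ValueError:
                        pass  # skip any non-numeric lines
    return totals


def main(cpython_dir):
    python = os.path.join(cpython_dir, "python")
    shutil.rmtree(STATS_DIR, ignore_errors=True)
    os.makedirs(STATS_DIR)

    subprocess.run([python, "-m", "pyperformance", "run", "-b", "go"], check=True)

    for key, value in sorted(collect_totals().items()):
        lowered = key.lower()
        if "jump_backward" in lowered or "optimization attempt" in lowered:
            print(f"{key}: {value:,}")


if __name__ == "__main__":
    main(sys.argv[1])
```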