LIBGAV1 perf regression trunk vs. Clang 9 #43884

@adibiagio

Bugzilla Link 44539
Resolution FIXED
Resolved on Feb 19, 2020 05:09
Version trunk
OS Windows NT
Blocks #43756 #43900
CC @weiguozhi,@topperc,@davidbolvansky,@efriedma-quic,@gregbedwell,@zmodem,@LebedevRI,@RKSimon,@nikic,@rotateright
Fixed by commit(s) 5eb19bf

Extended Description

This is related to bug 44411.

There is a significant performance regression in the LIBGAV1 benchmark when it is built with Clang trunk instead of Clang 9.


All builds use: -O3 -march=znver1

Numbers are FPS (frames per second); more is better.

single thread -- 2000 frames
========
                         |  GCC 7.4 |  CLANG 9.x  |  CLANG master
chimera_8b_1080p.ivf     |  22.77   |  21.86      |  18.71
chimera_10b_1080p.ivf    |  11.31   |  12.68      |  11.67   
summer_nature_1080p.ivf  |  21.10   |  21.02      |  18.29
summer_nature_4K.ivf     |   4.74   |   4.57      |   3.94


multi threaded (8) -- no frame limit
========
                         |  GCC 7.4 |  CLANG 9.x  |  CLANG master
chimera_8b_1080p.ivf     |  43.51   |  42.76      |  34.80
chimera_10b_1080p.ivf    |  16.18   |  18.98      |  17.12
summer_nature_1080p.ivf  |  64.22   |  63.66      |  53.89
summer_nature_4K.ivf     |  17.57   |  17.17      |  14.70


multi threaded (16) -- no frame limit
========
                         |  GCC 7.4 |  CLANG 9.x  |  CLANG master
chimera_8b_1080p.ivf     |  43.40   |  43.05      |  38.73
chimera_10b_1080p.ivf    |  16.54   |  19.68      |  18.67
summer_nature_1080p.ivf  |  62.72   |  62.20      |  54.96
summer_nature_4K.ivf     |  19.31   |  19.11      |  17.13

Single-threaded execution is roughly 8% to 14% slower on master than on Clang 9.x, depending on the clip; the worst case is chimera_8b_1080p.ivf at ~14%.
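
The slowdown can be checked directly from the single-thread table. Below is a minimal sketch of that arithmetic; the names `SlowdownPercent`, `Row`, `kSingleThread` and `PrintSingleThreadSlowdowns` are made up for illustration, and only the FPS figures come from the table.

```cpp
#include <cassert>
#include <cstdio>

// Percentage by which `after_fps` is slower than `before_fps`.
// FPS is "higher is better", so a positive result is a regression.
double SlowdownPercent(double before_fps, double after_fps) {
  assert(before_fps > 0.0);
  return (before_fps - after_fps) / before_fps * 100.0;
}

// Hypothetical helper type: one row of the single-thread table.
struct Row {
  const char* clip;
  double clang9_fps;
  double master_fps;
};

// Single-thread figures copied from the table above.
const Row kSingleThread[] = {
    {"chimera_8b_1080p.ivf", 21.86, 18.71},
    {"chimera_10b_1080p.ivf", 12.68, 11.67},
    {"summer_nature_1080p.ivf", 21.02, 18.29},
    {"summer_nature_4K.ivf", 4.57, 3.94},
};

// Prints one "clip  slowdown%" line per row.
void PrintSingleThreadSlowdowns() {
  for (const Row& r : kSingleThread) {
    std::printf("%-24s %5.1f%%\n", r.clip,
                SlowdownPercent(r.clang9_fps, r.master_fps));
  }
}
```

Calling `PrintSingleThreadSlowdowns()` from `main` gives roughly 14.4%, 8.0%, 13.0% and 13.8% for the four clips.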

Later I will post a full description of the underlying issue that caused this perf regression.

tl;dr: the performance degradation in libgav1 is caused by poor decisions made by the "x86 cmov converter" pass. In particular, a bunch of CMOVs in a hot loop are now expanded into if-then blocks, which is sub-optimal here. The Clang 9 compiler did not expand those CMOVs, and that was the correct decision.

Disabling that pass fully recovers the lost performance. For example, decoding "chimera_8b_1080p.ivf" with a single thread then averages 22.14 fps, compared with 18.71 fps on master with the pass enabled and 21.86 fps on Clang 9.x.
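
For readers unfamiliar with the pattern, here is a hypothetical example of the kind of loop where the CMOV-versus-branch choice matters. It is NOT taken from libgav1, and `ArgMax` is a name made up for this sketch. The conditional select feeds the next iteration through `best`, so a compiler may lower it either to a CMOV or to a compare-and-branch; which one it picks depends on the compiler version and target.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical hot loop (not libgav1 code): index of the first maximum.
// The select below is loop-carried: `best` decides which element the
// next iteration compares against. On x86 this select is a candidate
// for a CMOV, or for an if-then block if the back end converts it.
std::size_t ArgMax(const std::int32_t* data, std::size_t n) {
  assert(n > 0);
  std::size_t best = 0;
  for (std::size_t i = 1; i < n; ++i) {
    best = (data[i] > data[best]) ? i : best;  // conditional select
  }
  return best;
}
```

One way to compare the two code shapes is to compile with `clang++ -O3 -march=znver1 -S` and then again with `-mllvm -x86-cmov-converter=false` added, assuming the hidden LLVM option that toggles this pass is spelled that way in your build, and diff the assembly of the loop.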

As I wrote, I plan to post all my findings in a follow-up comment.

NOTE: this is unlikely to be AMD-specific. For example, I can reproduce the poor CMOV expansions when generating code for Skylake.
