Skip to content
This repository was archived by the owner on Feb 25, 2025. It is now read-only.

Conversation

@ColdPaleLight
Copy link
Member

@ColdPaleLight ColdPaleLight commented Jul 26, 2022

benchmarks for display list builder/display list.
fixes flutter/flutter#108265

In the process of writing the benchmark, I found that generating the rtree itself is not expensive. However, the call matrix().mapRect(&bounds) is very expensive.

Also I think we should use the second set of data (main + PR #34835) as the base data, because its function is correct.

cc @flar

1. main

Run on (10 X 2400 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB (x10)
  L1 Instruction 128 KiB (x10)
  L2 Unified 4096 KiB (x5)
Load Average: 3.48, 3.72, 4.67
***WARNING*** Library was built as DEBUG. Timings may be affected.
----------------------------------------------------------------------------------------------------------
Benchmark                                                                Time             CPU   Iterations
----------------------------------------------------------------------------------------------------------
BM_DisplayListBuiderDefault/kDefault                                 0.023 ms        0.023 ms        29784
BM_DisplayListBuiderDefault/kBounds                                  0.102 ms        0.101 ms         6934
BM_DisplayListBuiderDefault/kRtree                                   0.183 ms        0.182 ms         3822
BM_DisplayListBuiderDefault/kBoundsAndRtree                          0.253 ms        0.252 ms         2685
BM_DisplayListBuiderWithScaleAndTranslate/kDefault                   0.024 ms        0.024 ms        29750
BM_DisplayListBuiderWithScaleAndTranslate/kBounds                    0.109 ms        0.109 ms         6391
BM_DisplayListBuiderWithScaleAndTranslate/kRtree                     0.188 ms        0.188 ms         3727
BM_DisplayListBuiderWithScaleAndTranslate/kBoundsAndRtree            0.271 ms        0.271 ms         2568
BM_DisplayListBuiderWithPerspective/kDefault                         0.025 ms        0.025 ms        28018
BM_DisplayListBuiderWithPerspective/kBounds                          0.737 ms        0.736 ms          909
BM_DisplayListBuiderWithPerspective/kRtree                           0.814 ms        0.814 ms          856
BM_DisplayListBuiderWithPerspective/kBoundsAndRtree                   1.52 ms         1.52 ms          457
BM_DisplayListBuiderWithClipRect/kDefault                            0.023 ms        0.023 ms        30626
BM_DisplayListBuiderWithClipRect/kBounds                             0.099 ms        0.099 ms         7006
BM_DisplayListBuiderWithClipRect/kRtree                              0.178 ms        0.178 ms         3783
BM_DisplayListBuiderWithClipRect/kBoundsAndRtree                     0.253 ms        0.253 ms         2777
BM_DisplayListBuiderWithSaveLayer/kDefault                           0.049 ms        0.049 ms        13875
BM_DisplayListBuiderWithSaveLayer/kBounds                            0.201 ms        0.201 ms         3491
BM_DisplayListBuiderWithSaveLayer/kRtree                             0.277 ms        0.276 ms         2561
BM_DisplayListBuiderWithSaveLayer/kBoundsAndRtree                    0.422 ms        0.422 ms         1656
BM_DisplayListBuiderWithSaveLayerAndImageFilter/kDefault             0.073 ms        0.072 ms         9593
BM_DisplayListBuiderWithSaveLayerAndImageFilter/kBounds              0.359 ms        0.358 ms         1950
BM_DisplayListBuiderWithSaveLayerAndImageFilter/kRtree               0.439 ms        0.438 ms         1617
BM_DisplayListBuiderWithSaveLayerAndImageFilter/kBoundsAndRtree      0.718 ms        0.716 ms          968

2. main + PR #34365

Run on (10 X 2400 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB (x10)
  L1 Instruction 128 KiB (x10)
  L2 Unified 4096 KiB (x5)
Load Average: 4.76, 6.12, 7.81
***WARNING*** Library was built as DEBUG. Timings may be affected.
----------------------------------------------------------------------------------------------------------
Benchmark                                                                Time             CPU   Iterations
----------------------------------------------------------------------------------------------------------
BM_DisplayListBuiderDefault/kDefault                                 0.114 ms        0.114 ms         6239
BM_DisplayListBuiderDefault/kBounds                                  0.115 ms        0.114 ms         6253
BM_DisplayListBuiderDefault/kRtree                                   0.188 ms        0.187 ms         3786
BM_DisplayListBuiderDefault/kBoundsAndRtree                          0.187 ms        0.187 ms         3721
BM_DisplayListBuiderWithScaleAndTranslate/kDefault                   0.129 ms        0.129 ms         5418
BM_DisplayListBuiderWithScaleAndTranslate/kBounds                    0.128 ms        0.128 ms         5332
BM_DisplayListBuiderWithScaleAndTranslate/kRtree                     0.200 ms        0.200 ms         3427
BM_DisplayListBuiderWithScaleAndTranslate/kBoundsAndRtree            0.199 ms        0.199 ms         3457
BM_DisplayListBuiderWithPerspective/kDefault                         0.781 ms        0.780 ms          910
BM_DisplayListBuiderWithPerspective/kBounds                          0.782 ms        0.780 ms          909
BM_DisplayListBuiderWithPerspective/kRtree                           0.847 ms        0.846 ms          828
BM_DisplayListBuiderWithPerspective/kBoundsAndRtree                  0.844 ms        0.843 ms          803
BM_DisplayListBuiderWithClipRect/kDefault                            0.115 ms        0.115 ms         5987
BM_DisplayListBuiderWithClipRect/kBounds                             0.115 ms        0.115 ms         6128
BM_DisplayListBuiderWithClipRect/kRtree                              0.188 ms        0.187 ms         3788
BM_DisplayListBuiderWithClipRect/kBoundsAndRtree                     0.188 ms        0.187 ms         3772
BM_DisplayListBuiderWithSaveLayer/kDefault                           0.191 ms        0.191 ms         3735
BM_DisplayListBuiderWithSaveLayer/kBounds                            0.191 ms        0.190 ms         3714
BM_DisplayListBuiderWithSaveLayer/kRtree                             0.263 ms        0.263 ms         2662
BM_DisplayListBuiderWithSaveLayer/kBoundsAndRtree                    0.263 ms        0.262 ms         2687
BM_DisplayListBuiderWithSaveLayerAndImageFilter/kDefault             0.467 ms        0.467 ms         1529
BM_DisplayListBuiderWithSaveLayerAndImageFilter/kBounds              0.465 ms        0.464 ms         1512
BM_DisplayListBuiderWithSaveLayerAndImageFilter/kRtree               0.542 ms        0.541 ms         1303
BM_DisplayListBuiderWithSaveLayerAndImageFilter/kBoundsAndRtree      0.538 ms        0.537 ms         1307

Pre-launch Checklist

  • I read the [Contributor Guide] and followed the process outlined there for submitting PRs.
  • I read the [Tree Hygiene] wiki page, which explains my responsibilities.
  • I read and followed the [Flutter Style Guide] and the [C++, Objective-C, Java style guides].
  • I listed at least one issue that this PR fixes in the description above.
  • I added new tests to check the change I am making or feature I am adding, or Hixie said the PR is test-exempt. See [testing the engine] for instructions on
    writing and running engine tests.
  • I updated/added relevant documentation (doc comments with ///).
  • I signed the [CLA].
  • All existing and new tests are passing.

@ColdPaleLight ColdPaleLight requested a review from flar July 26, 2022 12:48
@flar
Copy link
Contributor

flar commented Jul 26, 2022

I think we need to tailor how we apply all of the bits and pieces. Running all variants of drawRect/drawAtlas/etc. is probably fine, but on each build out we run all of the transform, all of the clip, and all of the saveLayer calls together. Plus, we have pretty much all of the attributes set every which way. I think a good reason that the mapRect call was having such an impact is that we probably left the DL with a nasty perspective matrix for all the rest of the ops to be built on, but perspective is such a corner case in practice.

One technique might be to break out the big array into smaller groups and then we can mix and match then in more controlled ways:

  • One vector for all of the rendering ops
  • One vector for transform variations
  • One vector for clip variations
  • Perhaps different versions of the transform/clip/attribute/save vectors for benchmarking just a few options vs the full testing done for the correctness testing.
  • "allGroups" (maybe with a better name) would then be a concatenation of the "every variation" vectors in each category.
  • benchmarking would mainly work with "allRenderingOps" vector and then have a few variations of the other state to combine that vector with

We then could have (the following is a full list, but we might want to start with a subset, see below):

  • time for unmodified render ops construction
  • time for render ops under (?translate?), scale, (?rotate?), perspective matrices
  • time for render ops under clipRect/(?RR/Path?)
  • with and without saveLayer/restore
  • perhaps with non-Layer-save/restore sprinkled about ridiculously (we may want to optimize those out eventually)
  • time for render ops with a couple of stroke variations (PathEffect very rarely used, not worth benchmarking?)
  • time for render ops with an ImageFilter (most typically a Blur, others are rarely used)
  • possibly combine some of those latter categories with a simple scale, but not a whole suite of transform types
  • (any other attributes that will complicate construction?)

For starters I'd like to see a simple small number of benchmarks with the most common variations:

  • identity (most ui.Picture objects don't have a DPI scale built in)
  • scale/translate (most LayerTrees have a DPI scale at the root level and then a translate here and there for layout)
  • (maybe perspective for shock value)
  • clipRect (I think clipRRect is also used in a few cases, but probably doesn't add much time)
  • (bunch of ops) then saveLayer then (bunch of ops) then restore
  • saveLayer suite as above, but with blur image filter on saveLayer

I'm not sure what to do about attributes for now as I do want to know how long they take, but I don't want to have lots of random settings through most of the benchmarks. The reason they should be benchmarked is that they can do deep copies of complex attributes and it would be good to know how much that costs us in the long run, but I don't think that is urgent at this time, something we can add over time.

@ColdPaleLight
Copy link
Member Author

@flar
I tweak the code as you suggested and updated the benchmarks data on the top.

@flar
Copy link
Contributor

flar commented Jul 27, 2022

(Comment updated with new numbers on 7/29/22)

While the results graphed and discussed here represent considerations for #34365, they actually point out the value of the benchmarks being added in this PR.

It looks like the new code can construct DL+bounds (the most common case by far) between 10-20% slower than the previous code, possibly because of having to maintain all of the extra rectangles. DL + RTree is similarly a little slower but the gap is 5-10%.

Also, the timing with the ImageFilter saveLayer is much worse - 20-30%.

Where the new code shines, though, is in computing both bounds and RTree on a new DL which is quite a bit faster across the board - a 25-45% savings.

Graph of 2 pass vs 1 pass, construct + bounds

DL construct + Bounds new
DL impact construct + bounds new

Graph of 2 pass vs 1 pass, construct + rtree

DL Construct + RTree new
DL impact construct + rtree new

Graph of 2 pass vs 1 pass, construct + bounds + rtree

DL Construct + bounds + rtree new
DL impact construct + bounds + rtree new

@flar
Copy link
Contributor

flar commented Jul 27, 2022

The data originally graphed in this comment has been updated and the new graphs are now included in the previous comment.

Another thing to consider is that this didn't help with the case of constructing a DL and then computing an RTree without also computing Bounds. You didn't collect benchmark data for that case, but I inferred it using "result2 - result1 + result0" and came up with the following graph.

Graph of obsolete data, do not look ;)

DL Construct + RTree

The only target we currently have for computing an RTree will be when DL gets used to collect pieces of scenes that are in front of, behind, and between external embedded platform views. In those cases, I don't think we would compute the bounds, just the RTree.

But, future work might benefit from the fact that the trifecta of "construct + bounds + rtree" is faster now if we add fast-rtree-culling to the dispatch of a DisplayList.

@flar
Copy link
Contributor

flar commented Jul 27, 2022

Those last 2 comments are really about the benefit of the work done in #34365

This PR, which simply adds the benchmark will be important regardless of that other work, so I'll review it with intent to land it as work progresses on the subsequent developments of DL/bounds/RTree...

@ColdPaleLight ColdPaleLight requested a review from flar July 29, 2022 07:45
@ColdPaleLight
Copy link
Member Author

I updated the benchmarks data on the top again.

Copy link
Contributor

@flar flar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything LGTM, autosubmit-ing now to start generating data.

@flar flar added the autosubmit Merge PR when tree becomes green via auto submit App label Jul 29, 2022
@auto-submit auto-submit bot merged commit 56c6c18 into flutter:main Jul 29, 2022
engine-flutter-autoroll added a commit to engine-flutter-autoroll/flutter that referenced this pull request Jul 30, 2022
@flar
Copy link
Contributor

flar commented Aug 2, 2022

I was having trouble finding the results in Skia perf until I realized that the test name was misspelled:

https://flutter-engine-perf.skia.org/e/?queries=executable%3D._display_list_builder_benchmarks

Fix: #35104

emilyabest pushed a commit to emilyabest/engine that referenced this pull request Aug 12, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

autosubmit Merge PR when tree becomes green via auto submit App

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Create benchmark for DisplayList operations

2 participants