This repository was archived by the owner on Feb 25, 2025. It is now read-only.

Conversation

@flar (Contributor) commented Jan 29, 2020

These changes attempt to avoid a condition we are hitting mainly on iOS where we get hung up on one frame and then enter a period of time where all frames are blocked by OpenGL waiting for the next vsync. To avoid this, the changes here attempt to reschedule the render pass for a future vsync time interval (based on the last layer_tree generated, or guessing at a delay if it has to - more work could be done to make that heuristic more clever).

Issue: flutter/flutter#75540

Before this change, the flutter_gallery_ios__transition_perf benchmark was posting 16+ms timings for the rasterization time of a lot of its frames. For example:

    "average_frame_rasterizer_time_millis": 6.832514690982775,
    "90th_percentile_frame_rasterizer_time_millis": 16.652,
    "99th_percentile_frame_rasterizer_time_millis": 17.377,
    "worst_frame_rasterizer_time_millis": 19.322,

After the changes, the timings for average and 90% frames are much better. For example:

    "average_frame_rasterizer_time_millis": 2.412388779527557,
    "90th_percentile_frame_rasterizer_time_millis": 3.704,
    "99th_percentile_frame_rasterizer_time_millis": 16.282,
    "worst_frame_rasterizer_time_millis": 18.165,

With this fix, hopefully we can expect more consistent results from our benchmark output.

This is currently a WIP as much more testing is needed in many more cases, but I wanted to get the diffs out there to solicit feedback on the basic approach before I get too far with the testing.

@flar added the "Work in progress (WIP) Not ready (yet) for review!" label Jan 29, 2020
@liyuqian (Contributor) left a comment

The metrics look promising but I'm not 100% sure how the code works. When this PR is out of WIP, can you please add more comments, and maybe attach the traces before and after this PR for comparison?

float frame_device_pixel_ratio() const { return frame_device_pixel_ratio_; }

void RecordBuildTime(fml::TimePoint begin_start);
void RecordBuildTime(fml::TimePoint begin_start, fml::TimePoint begin_target);
Contributor:

Typo? "begin_start -> build_start, begin_target -> build_target"

Contributor Author (@flar):

These times are measured based on the upcall to Window._beginFrame, so they document the timeline of the "begin" part of the frame, hence their names. I just needed to add a "target" time to the existing mix and chose names that blend in, but I agree that the names are somewhat strained.

In our posted benchmark statistics we have the concept of "build" vs "rasterize" times and so your suggestion is in keeping with that nomenclature.

We also have the concept of "Produce" (meaning upcall to Dart to build a scene) vs "Consume" (meaning transfer a LayerTree to the GPU thread to be rendered) in the Engine "Pipeline" object.

I do like build_start, build_target, build_completion, build_duration as names better than what is there, but is there some other terminology in play in any of these systems we should consider before we make a change?
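
For concreteness, the renamed overloads would look something like this (illustrative only - the diff as posted still uses the "begin" names):

    void RecordBuildTime(fml::TimePoint build_start);
    void RecordBuildTime(fml::TimePoint build_start, fml::TimePoint build_target);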


{
  std::scoped_lock lock(queue_mutex_);
  if (queue_.size() == 0) {
Contributor:

I wonder why queue_.size() == 0 didn't have to be handled previously? Wouldn't queue_.pop_front() crash when the queue is empty?

Contributor Author (@flar):

We always had a 1:1 correspondence between putting a LayerTree on the queue and taking it off, but I wanted to eliminate the queue, so now the task can be scheduled when there is something in the queue, yet by the time it actually runs that entry may already have been removed.

When this fix goes beyond the WIP stage, I'll see if I can throttle the task posting so we only have at most one outstanding task and only when we have a waiting LayerTree to process, but it's a lot easier to just let the posted tasks go away on their own.

Minimally, this might need to change to if (<nextTreePointer> == null) when I delete the queue, whether or not I make the task posting more conditional.
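
For illustration, the consumer-side guard could end up looking roughly like this once the queue becomes a single pointer (a sketch only; next_tree_ and Rasterize are assumed names, not code in this diff):

    // Hypothetical consumer task once the queue is replaced by one pointer.
    void ConsumeWaitingTree() {
      std::unique_ptr<LayerTree> tree;
      {
        std::scoped_lock lock(queue_mutex_);
        if (next_tree_ == nullptr) {
          // A stale posted task woke up after its tree was already consumed
          // (or replaced); nothing to do.
          return;
        }
        tree = std::move(next_tree_);
      }
      Rasterize(std::move(tree));  // hand the tree off to be rendered
    }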

void ProducerCommit(ResourcePtr resource, size_t trace_id) {
  {
    std::scoped_lock lock(queue_mutex_);
    if (queue_.size() > 0) {
Contributor:

Is this limiting the queue to a size of at most 1? If it never holds more than one entry, why do we need a queue?

Contributor Author (@flar):

Correct. I don't see the need to have more than one LayerTree waiting for us to process and stacking them up in a queue just provides more ways to fall behind in the GPU thread. This is a minimal impact way of pretending we don't have a queue. A larger change to switch to just a pointer will happen if we believe that this approach is correct.
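
Concretely, the "pretend we don't have a queue" behavior amounts to something like the following sketch (not the literal diff; the consumer-scheduling details are elided):

    void ProducerCommit(ResourcePtr resource, size_t trace_id) {
      {
        std::scoped_lock lock(queue_mutex_);
        if (queue_.size() > 0) {
          // Anything still waiting is older than the tree being committed,
          // so drop it rather than letting the GPU thread fall further behind.
          queue_.clear();
        }
        queue_.emplace_back(std::move(resource), trace_id);
      }
      // Schedule the consumer side as before (details elided).
    }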

Contributor Author (@flar), Jan 29, 2020:

I also need to find out how this impacts other teams that might be using our code to do more than just the buffering we do on iOS. In particular, I heard of a team that was thinking that their scrolling might get smoother if they switched to quadruple buffering. They may actually be happier with this form of throttling the queue, or not. If they really do want to quadruple buffer, then we might need to keep a queue and the throttling to a size of 1 might be iOS specific.

I also haven't tested the impact of eliminating the queue on Android which seems less bothered by presenting more than one frame per vsync interval. Does this help or hurt on Android (rhetorical question for myself to investigate)?

while (queue_.size() > depth_) {
queue_.pop_back();
if (queue_.size() > 0) {
FML_LOG(INFO) << "ignoring CommitFront in favor of existing queue entry";
Contributor:

Is there any consequence or downside of throwing away the resource when the queue is nonempty?

Contributor Author (@flar):

This is in the ProduceToFront (of queue) call. It is called when we go to rasterize a LayerTree and are told to resubmit it instead. When we had a queue, that LayerTree came off the front of the queue and so that's where it should go back on. In the new system I'm assuming that something in the queue will be newer than what we had just been processing and so I just let it fall on the floor.

These two methods to add something to the queue could all be replaced by a single version of "ProduceCommit" that just compares the target times of the incoming LayerTree with the target time of the one that is already waiting and keeps only the newest version.
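
A sketch of that single "keep only the newest" commit might look like this (hypothetical; next_tree_ and the target-time accessor are assumed names and not part of the current diff):

    // Keep only the newest LayerTree, regardless of which end submitted it.
    void CommitLayerTree(std::unique_ptr<LayerTree> tree) {
      std::scoped_lock lock(queue_mutex_);
      if (next_tree_ == nullptr ||
          tree->target_time() >= next_tree_->target_time()) {
        next_tree_ = std::move(tree);  // the newer frame wins
      }
      // Otherwise an even newer tree is already waiting; drop the incoming one.
    }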

Contributor Author (@flar):

I'll add that the only source of "kResubmit" in the code I could find came from the iOS platform view code and looked to be related to a change in threading resulting from a change in the structure of a frame (due to a change in the embedding?)...

shell/platform/darwin/ios/framework/Source/FlutterPlatformViews.mm: return PostPrerollResult::kResubmitFrame;

});
}

fml::TimePoint Rasterizer::DetermineRasterizationTime(LayerTree& layer_tree) {
Contributor:

Is this the time to start rasterization, and thus usually the time to finish build?

Contributor Author (@flar):

This implements the "only one per vsync interval" part of the solution. It is computing "when do we want to start this rasterization" based on whether we have already presented a LayerTree in this interval or not. If it is in the future, we will want to postpone our current task until this calculated time.

}
return now + (target - start);
}
return start;
Contributor:

I don't fully understand why this function needs to work in this way. Returning build start time as rasterization time looks strange.

Contributor Author (@flar):

I should add comments, but these computations may change over time under this WIP.

If the GPU thread was directly informed of the latest reported vsync times from the vsync_waiter, then it could make better guesses as to when to postpone the current task. It doesn't get that information because sometimes the vsync is paused (when we get too far behind), and the info is delivered to another thread through a separate callback mechanism, so getting that information to the rasterizer involves a communication channel that I haven't identified (or created?) yet. For now, it will base its decision on the posted LayerTree that it is planning to handle.

This particular computation can be beefed up in the future to make better estimates. It doesn't fully prevent the condition in every case, but it keeps us from entering it when we are only one frame behind (possibly more), and it also keeps us from perpetuating the condition for more than a single frame when we are multiple frames behind.

Here is the current heuristic...

If we start processing a LayerTree and a new one gets posted while we are processing it, then the time we finished rendering the current one will be after the "begin" time of the new one. That seems to work pretty well to detect the condition.

If our last rasterization end time is before the target time of this LayerTree, that is the ideal case: we are only 1 frame behind and we can schedule for the target time of that frame, which is the best estimate we have of when the next vsync interval starts. Note that because we will start processing the LayerTree exactly on the leading edge of the next vsync interval, we don't give the produce end of the pipeline any time to compute a more recent frame. Further on in this comment I talk about future changes to the production end that could help alleviate this lag, but for now the lag is no worse than when we get blocked in the kernel anyway.

If we are later than even the target time of the LayerTree we are processing, then we need to pick some point in the future that is after the current vsync interval, for which we have no direct start/target times. I could linearly extrapolate the vsync intervals from the historic information in the LayerTree at hand, but for now I chose "now + frame_duration" (as computed from the target minus start times in the LayerTree). This could actually be better than computing the next real vsync time, because when that vsync interval comes around we might have a chance to quickly compute a new scene and post it before the GPU thread task wakes up - which means we render a more "recent" scene. If we start processing immediately at the next vsync time, then we will be rendering in parallel with a scene build and we'll end up presenting an out-of-date scene just as a fresh scene becomes available - at which point we can't render that scene until the next interval.
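
Put into code, the heuristic above looks roughly like this (a sketch; last_frame_end_ and the build_start/build_target accessors are assumed names rather than the exact ones in the diff):

    fml::TimePoint Rasterizer::DetermineRasterizationTime(LayerTree& layer_tree) {
      const fml::TimePoint now = fml::TimePoint::Now();
      const fml::TimePoint start = layer_tree.build_start();    // vsync "begin" time
      const fml::TimePoint target = layer_tree.build_target();  // vsync target time
      if (last_frame_end_ <= start) {
        // We finished the previous frame before this one was begun, so we are
        // not doubling up within a vsync interval: rasterize right away.
        return start;
      }
      if (last_frame_end_ <= target) {
        // Only one frame behind: postpone to the target time, our best
        // estimate of the start of the next vsync interval.
        return target;
      }
      // More than one frame behind: guess one frame duration from now.
      return now + (target - start);
    }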

Eventually I want to instrument the code that fields a vsync notification so that, if it detects that we are already rasterizing a LayerTree, it either punts and waits for the next vsync notification to sync up, or just goes ahead and computes the next frame. I originally tried that strategy to see if it would help, but it didn't solve the problem. It might serve a role in the current "fix", though. To be clear, I am thinking of the following in the vsync callback:

  if (<gpu runner is rasterizing>) {
    duration = target - start;
    start = target;
    target += duration;
  }

That code would compute "frame+1" and start that processing now rather than wait for the next vsync notification. This gives us headroom in case the build times are taking longer than a single frame to compute (which likely correlates with the rasterization times going long as well).

Eventually we'd want some really sophisticated prediction algorithms that try to figure out "If I start computing a scene now, how long do I think it will take, how long do I think we'd take to render it, and what time do I actually think it will show up on the screen?", but even the baby step I just outlined would be one step better than our current system of always rendering based on the reported times. Also, such a heuristic calculation might get caught off guard if its predictions suddenly change - after all, who knows how much complexity got added or removed at the app level? Nothing in this code has that information. But at least it can decide not to compute a scene that it can show won't get rendered until after the reported target time.

Also, a more sophisticated prediction algorithm would come in 2 parts - the build part that tries to target the sum of "build + raster" timing, and then when the task is posted to the GPU runner it will have to compute when to start processing the frame so that it posts approximately when it was computed for. There is a lot of magic to consider, and be worried about, there...

@flar changed the title from "Only rasterizer more than one layer tree per vsync frame" to "Only rasterize one layer tree per vsync frame" Jan 29, 2020
@flar (Contributor, Author) commented Jan 29, 2020

(liyuqian's comments on the code made me realize I should have added a summary of the fix to explain what you are seeing here, apologies.)

Here is an overview of the fix...

In breaking down how we handle frames I discovered a couple of things:

  • The problem I'm addressing is that we get into a state where each frame might only need 1-2ms of processing to rasterize, but the rasterizer thread immediately gets blocked on its first GPU operation until the next vsync interval. We don't waste energy in that state, but it ruins our accounting of time spent doing actual rasterization, and we lose flexibility in reacting to changes in the scene if we are stuck in a kernel wait for close to 16ms at a time.

  • We know that pipeline depth=1 prevents the condition that I'm trying to solve, so I used that as a guide/barometer on what properties of this frame-pumping "pipeline" (plus associated mechanisms) could be responsible.

  • The condition always happened right after the GPU thread rasterized more than one scene in a given reported vsync interval so I wanted to prevent that condition specifically. Indeed, while other parts of the fix may have statistically made the condition less likely, it never truly went away until I coded up the GPU postponement mechanism, and the "double context-present" action was always present before the condition even when I was throttling other parts of the pump - verifying the hypothesis.

  • The other parts of the fix are also helpful to the main part of the fix in a few ways...

  • Recording the target time of a frame in the layer_tree helps the GPU thread figure out "am I in the same vsync interval as my previous render completion?" It also helps with deciding when to schedule the next invocation.

  • A queue of LayerTree objects doesn't make much sense. I don't think we get a benefit of rendering LayerTree(frame - 2) when LayerTree(frame - 1) is already computed. If we receive a new LayerTree, that is the next one to process whether we already have one waiting or not. Additionally, if we push a LayerTree back to the front of the queue because we got a "kResubmit" condition from trying to render it, then it doesn't make sense to reprocess it if we already have a new one - just switch to the new one instead. Chinmay and I verified that the resubmit comes from the need to switch threads when embedded and so it shouldn't matter which frame we process in that case, we just want to make sure that one is available for rendering.

  • I can eliminate the queue entirely, but my first pass at seeing if reducing it to a "queue of 1" was localized to just clearing it out when we fill it as a "proof of concept" hack. If the fixes here pass muster, then I'll replace it in a non-WIP proposal with just a "nextTree" pointer or something similar.

I'll respond to liyuqian's comments individually now...

@liyuqian (Contributor):
@chinmaygarde @dnfield: are we double-buffered on iOS? This PR seems to suggest that we only have one buffer on iOS so rendering the second frame in one vsync would cause it to wait.

@flar: what would happen if you simply set pipeline depth to 1 (see, e.g., #3909)?

@dnfield (Contributor) commented Jan 30, 2020

@flar - is this an accurate representation of the problem?

  1. We try to queue up unrasterized layers. In theory, this helps us be ready to render next frames more quickly.
  2. Unfortunately, actually rasterizing the layer can cause a wait for next vsync on iOS - in other words, we really can only rasterize one layer per vsync, because otherwise we end up getting stuck in the GL driver and actually making things worse.

If that's correct, I have the following questions:

  • What about other platforms? This change affects all platforms, so it really needs some good tests there to prove that we're not bringing everyone down to a lowest common denominator.
  • What about Metal? Metal APIs should give us some more knobs to turn. If we make this change for a deprecated API that we're looking to move away from, would our efforts be better spent on finding the right Metal setup and finishing the iOS Metal migration?
  • If this is true of OpenGL on other platforms, what about Vulkan?

@dnfield (Contributor) commented Jan 30, 2020

@liyuqian I believe we are double buffered at least when there is a BDF in the layer tree. But it seems like the problem is that the iOS GL driver might be angry at us for trying to make it do more than it thinks is appropriate per vsync.

@flar (Contributor, Author) commented Jan 30, 2020

@dnfield BDF has us insert a layer, but it doesn't change the way that the frames are pumped. I have more to say about the terms "double buffering" and "triple buffering" because I don't think we are using them properly, but I do consider the saveLayer used when there is a BDF to not be "double buffering". (Basically, "buffering" in the double-vs-triple sense describes an interface between the various agents involved in moving the graphical data along. Because the same thread constructs the saveLayer, fills it, and renders it into the scene, it doesn't affect our "N buffering" designation.)

@liyuqian iOS appears to let us render 2 frames in one vsync, but then we always get hit with a block on a following frame. So maybe there are 2 renderable frames, but when we make use of them we end up in a bad state, so I'm preventing that. That is partially why I think more discussion may be warranted: while limiting ourselves to 1 per frame does eliminate the problem, there is obviously some wiggle room that we don't yet understand and perhaps could make use of if we knew what its parameters and limits were.

@liyuqian - depth=1 is one fix for this problem, but it prevents us from "building ahead" so I'm attempting to preserve "build-ahead" while also preventing the blocking state.

All - yes, indeed, if you read my statements above, this may be an iOS-only solution, but I'm not going to go through the trouble of adding the flags necessary to limit it at this stage of the WIP. First, let's discuss how these changes to the rendering model affect how we do work before we deal with limiting their scope to just the problem areas.

I'm debating a brain dump in github comments vs. writing a doc about some of these issues. For instance, my description above of what "double buffering" really is, in my answer to Dan, would take 3 or 4 paragraphs with diagrams. And then we get to the question of how to apply that to our architecture. (And I'm still trying to get it all straight in my head first as well.)

@flar (Contributor, Author) commented Jan 30, 2020

I also want to add that underneath this all, whether or not we can find a way to render more than one frame per interval on iOS...

I don't see a need to have a queue of LayerTrees when the ones that are in the queue are obviously obsoleted by the one we are about to put there. It may be that some platforms don't mind if we spend time rendering some extra old LayerTrees to catch up, but we shouldn't be doing that in the first place. Unfortunately, just gutting the queue doesn't completely solve our problem on iOS.

@dnfield (Contributor) commented Jan 30, 2020

I would find it helpful to have a design doc on this, which we could comment on a bit more coherently and discuss the various implications/tradeoffs :)

@chinmaygarde (Member):
This WIP PR seems stalled. Can we close it?

@flar (Contributor, Author) commented Feb 5, 2021

There was other work being done by @iskakaushik along these lines that was hoping to separate the production and rendering of frames. This implementation may be obsoleted by that work?

@flar (Contributor, Author) commented Feb 5, 2021

We still suffer from this problem today. I run into it all the time when running the macrobenchmarks -> ImageFiltered Transform Animation page. If I switch from one mode to another I can get stuck in this 16ms/frame mode. I have to unselect the mode and reselect it to get down to the actual performance metric. It's very repeatable.

We've sat on that issue for quite a long time and we need to get to the bottom of it and solve it.

I notice that there is no official issue for the problem so I filed one: flutter/flutter#75540

@flutter-dashboard:
It looks like this pull request may not have tests. Please make sure to add tests before merging. If you need an exemption to this rule, contact Hixie on the #hackers channel in Chat.

If you are not sure if you need tests, consider this rule of thumb: the purpose of a test is to make sure someone doesn't accidentally revert the fix. Ask yourself, is there anything in your PR that you feel it is important we not accidentally revert back to how it was before your fix?

Reviewers: Read the Tree Hygiene page and make sure this patch meets those guidelines before LGTMing.

@flar (Contributor, Author) commented Feb 17, 2021

This PR may have been obsoleted by #24387

That PR originally had to be reverted due to failures in some CL tests, so I'm waiting for that PR to be fully rolled up the chain before closing this one.

@cbracken removed their request for review February 23, 2021 21:10
@flar (Contributor, Author) commented Jul 14, 2021

This PR is so far behind the current state of the engine that it doesn't really even serve as a good example of a prototype. Closing. The related PR should be the one to watch moving forward and the issue is definitely on the engine radar.

@flar closed this Jul 14, 2021