This repository was archived by the owner on Feb 25, 2025. It is now read-only.

Conversation

@flar (Contributor) commented Jan 29, 2020

These changes attempt to avoid a condition we are hitting mainly on iOS where we get hung up on one frame and then enter a period of time where all frames are blocked by OpenGL waiting for the next vsync. To avoid this, the changes here attempt to reschedule the render pass for a future vsync time interval (based on the last layer_tree generated, or guessing at a delay if it has to - more work could be done to make that heuristic more clever).

Issue: flutter/flutter#75540

Before this change, the flutter_gallery_ios__transition_perf benchmark was posting 16+ms timings for the rasterization time of a lot of its frames. For example:

    "average_frame_rasterizer_time_millis": 6.832514690982775,
    "90th_percentile_frame_rasterizer_time_millis": 16.652,
    "99th_percentile_frame_rasterizer_time_millis": 17.377,
    "worst_frame_rasterizer_time_millis": 19.322,

After the changes, the timings for average and 90% frames are much better. For example:

    "average_frame_rasterizer_time_millis": 2.412388779527557,
    "90th_percentile_frame_rasterizer_time_millis": 3.704,
    "99th_percentile_frame_rasterizer_time_millis": 16.282,
    "worst_frame_rasterizer_time_millis": 18.165,

With this fix, hopefully we can expect more consistent results from our benchmark output.

This is currently a WIP as much more testing is needed in many more cases, but I wanted to get the diffs out there to solicit feedback on the basic approach before I get too far with the testing.

@flar added the "Work in progress (WIP) Not ready (yet) for review!" label Jan 29, 2020
@liyuqian (Contributor) left a comment

The metrics look promising but I'm not 100% sure how the code works. When this PR is out of WIP, can you please add more comments, and maybe attach the traces before and after this PR for comparison?

float frame_device_pixel_ratio() const { return frame_device_pixel_ratio_; }

void RecordBuildTime(fml::TimePoint begin_start);
void RecordBuildTime(fml::TimePoint begin_start, fml::TimePoint begin_target);
Contributor:

Typo? "begin_start -> build_start, begin_target -> build_target"

Contributor Author (@flar):

These times are measured based on the upcall to Window._beginFrame, so they document the timeline of the "begin" part of the frame, hence their names. I just needed to add a "target" time to the existing mix and chose names that blend in, but I agree that the names are somewhat strained.

In our posted benchmark statistics we have the concept of "build" vs "rasterize" times and so your suggestion is in keeping with that nomenclature.

We also have the concept of "Produce" (meaning upcall to Dart to build a scene) vs "Consume" (meaning transfer a LayerTree to the GPU thread to be rendered) in the Engine "Pipeline" object.

I do like build_start, build_target, build_completion, build_duration as names better than what is there, but is there some other terminology in play in any of these systems we should consider before we make a change?
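
For concreteness, the renamed overloads would look something like this (illustrative only - the diff as posted still uses the "begin" names):

    void RecordBuildTime(fml::TimePoint build_start);
    void RecordBuildTime(fml::TimePoint build_start, fml::TimePoint build_target);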


{
  std::scoped_lock lock(queue_mutex_);
  if (queue_.size() == 0) {
Contributor:

I wonder why queue_.size() == 0 didn't have to be handled previously? Wouldn't queue_.pop_front() crash when the queue is empty?

Contributor Author (@flar):

We always had a 1:1 correspondence between putting a LayerTree on the queue and taking it off, but I wanted to eliminate the queue, so now the task can be scheduled when there is something in the queue, yet by the time it actually runs that entry may already have been removed.

When this fix goes beyond the WIP stage, I'll see if I can throttle the task posting so we only have at most one outstanding task and only when we have a waiting LayerTree to process, but it's a lot easier to just let the posted tasks go away on their own.

Minimally, this might need to change to if (<nextTreePointer> == null) when I delete the queue, whether or not I make the task posting more conditional.
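
For illustration, the consumer-side guard could end up looking roughly like this once the queue becomes a single pointer (a sketch only; next_tree_ and Rasterize are assumed names, not code in this diff):

    // Hypothetical consumer task once the queue is replaced by one pointer.
    void ConsumeWaitingTree() {
      std::unique_ptr<LayerTree> tree;
      {
        std::scoped_lock lock(queue_mutex_);
        if (next_tree_ == nullptr) {
          // A stale posted task woke up after its tree was already consumed
          // (or replaced); nothing to do.
          return;
        }
        tree = std::move(next_tree_);
      }
      Rasterize(std::move(tree));  // hand the tree off to be rendered
    }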

void ProducerCommit(ResourcePtr resource, size_t trace_id) {
  {
    std::scoped_lock lock(queue_mutex_);
    if (queue_.size() > 0) {
Contributor:

Is this limiting the queue to a size of at most 1? If it never holds more than one entry, why do we need a queue?

Contributor Author (@flar):

Correct. I don't see the need to have more than one LayerTree waiting for us to process and stacking them up in a queue just provides more ways to fall behind in the GPU thread. This is a minimal impact way of pretending we don't have a queue. A larger change to switch to just a pointer will happen if we believe that this approach is correct.
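
Concretely, the "pretend we don't have a queue" behavior amounts to something like the following sketch (not the literal diff; the consumer-scheduling details are elided):

    void ProducerCommit(ResourcePtr resource, size_t trace_id) {
      {
        std::scoped_lock lock(queue_mutex_);
        if (queue_.size() > 0) {
          // Anything still waiting is older than the tree being committed,
          // so drop it rather than letting the GPU thread fall further behind.
          queue_.clear();
        }
        queue_.emplace_back(std::move(resource), trace_id);
      }
      // Schedule the consumer side as before (details elided).
    }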

Contributor Author (@flar), Jan 29, 2020:

I also need to find out how this impacts other teams that might be using our code to do more than just the buffering we do on iOS. In particular, I heard of a team that was thinking that their scrolling might get smoother if they switched to quadruple buffering. They may actually be happier with this form of throttling the queue, or not. If they really do want to quadruple buffer, then we might need to keep a queue and the throttling to a size of 1 might be iOS specific.

I also haven't tested the impact of eliminating the queue on Android which seems less bothered by presenting more than one frame per vsync interval. Does this help or hurt on Android (rhetorical question for myself to investigate)?

while (queue_.size() > depth_) {
queue_.pop_back();
if (queue_.size() > 0) {
FML_LOG(INFO) << "ignoring CommitFront in favor of existing queue entry";
Contributor:

Is there any consequence or downside of throwing away the resource when the queue is nonempty?

Contributor Author (@flar):

This is in the ProduceToFront (of queue) call. It is called when we go to rasterize a LayerTree and are told to resubmit it instead. When we had a queue, that LayerTree came off the front of the queue and so that's where it should go back on. In the new system I'm assuming that something in the queue will be newer than what we had just been processing and so I just let it fall on the floor.

These two methods to add something to the queue could all be replaced by a single version of "ProduceCommit" that just compares the target times of the incoming LayerTree with the target time of the one that is already waiting and keeps only the newest version.
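
A sketch of that single "keep only the newest" commit might look like this (hypothetical; next_tree_ and the target-time accessor are assumed names and not part of the current diff):

    // Keep only the newest LayerTree, regardless of which end submitted it.
    void CommitLayerTree(std::unique_ptr<LayerTree> tree) {
      std::scoped_lock lock(queue_mutex_);
      if (next_tree_ == nullptr ||
          tree->target_time() >= next_tree_->target_time()) {
        next_tree_ = std::move(tree);  // the newer frame wins
      }
      // Otherwise an even newer tree is already waiting; drop the incoming one.
    }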

Contributor Author (@flar):

I'll add that the only source of "kResubmit" in the code I could find came from the iOS platform view code and looked to be related to a change in threading resulting from a change in the structure of a frame (due to a change in the embedding?)...

shell/platform/darwin/ios/framework/Source/FlutterPlatformViews.mm: return PostPrerollResult::kResubmitFrame;

});
}

fml::TimePoint Rasterizer::DetermineRasterizationTime(LayerTree& layer_tree) {
Contributor:

Is this the time to start rasterization, and thus usually the time to finish build?

Contributor Author (@flar):

This implements the "only one per vsync interval" part of the solution. It is computing "when do we want to start this rasterization" based on whether we have already presented a LayerTree in this interval or not. If it is in the future, we will want to postpone our current task until this calculated time.

}
return now + (target - start);
}
return start;
Contributor:

I don't fully understand why this function needs to work in this way. Returning build start time as rasterization time looks strange.

Contributor Author (@flar):

I should add comments, but these computations may change over time under this WIP.

If the GPU thread was directly informed of the latest reported vsync times from the vsync_waiter, then it could make better guesses as to when to postpone the current task. It doesn't get that information because sometimes the vsync is paused (when we get too far behind), and the info is delivered to another thread through a separate callback mechanism, so getting that information to the rasterizer involves a communication channel that I haven't identified (or created?) yet. For now, it will base its decision on the posted LayerTree that it is planning to handle.

This particular computation can be beefed up in the future to make better estimates. It doesn't fully prevent the condition in every case, but it keeps us from entering it when we are only one frame behind (possibly more), and it also keeps us from perpetuating the condition for more than a single frame when we are multiple frames behind.

Here is the current heuristic...

If we start processing a LayerTree and a new one gets posted while we are processing it, then the time we finished rendering the current one will be after the "begin" time of the new one. That seems to work pretty well to detect the condition.

If our last rasterization end time is before the target time of this LayerTree, that is the ideal case: we are only 1 frame behind and we can schedule for the target time of that frame, which is the best estimate we have of when the next vsync interval starts. Note that because we will start processing the LayerTree exactly on the leading edge of the next vsync interval, we don't give the produce end of the pipeline any time to compute a more recent frame. Further on in this comment I talk about future changes to the production end that could help alleviate this lag, but for now the lag is no worse than when we get blocked in the kernel anyway.

If we are later than even the target time of the LayerTree we are processing, then we need to pick some point in the future that is after the current vsync interval, for which we have no direct start/target times. I could linearly extrapolate the vsync intervals from the historic information in the LayerTree at hand, but for now I chose "now + frame_duration" (as computed from the target minus start times in the LayerTree). This could actually be better than computing the next real vsync time, because when that vsync interval comes around we might have a chance to quickly compute a new scene and post it before the GPU thread task wakes up - which means we render a more "recent" scene. If we start processing immediately at the next vsync time, then we will be rendering in parallel with a scene build and we'll end up presenting an out-of-date scene just as a fresh scene becomes available - at which point we can't render that scene until the next interval.
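
Put into code, the heuristic above looks roughly like this (a sketch; last_frame_end_ and the build_start/build_target accessors are assumed names rather than the exact ones in the diff):

    fml::TimePoint Rasterizer::DetermineRasterizationTime(LayerTree& layer_tree) {
      const fml::TimePoint now = fml::TimePoint::Now();
      const fml::TimePoint start = layer_tree.build_start();    // vsync "begin" time
      const fml::TimePoint target = layer_tree.build_target();  // vsync target time
      if (last_frame_end_ <= start) {
        // We finished the previous frame before this one was begun, so we are
        // not doubling up within a vsync interval: rasterize right away.
        return start;
      }
      if (last_frame_end_ <= target) {
        // Only one frame behind: postpone to the target time, our best
        // estimate of the start of the next vsync interval.
        return target;
      }
      // More than one frame behind: guess one frame duration from now.
      return now + (target - start);
    }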

Eventually I want to instrument the code that fields a vsync notification so that, if it detects that we are already rasterizing a LayerTree, it either punts and waits for the next vsync notification to sync up, or just goes ahead and computes the next frame. I originally tried that strategy to see if it would help, but it didn't solve the problem. It might serve a role in the current "fix", though. To be clear, I am thinking of the following in the vsync callback:

  if (<gpu runner is rasterizing>) {
    duration = target - start;
    start = target;
    target += duration;
  }

That code would compute "frame+1" and start that processing now rather than wait for the next vsync notification. This gives us headroom in case the build times are taking longer than a single frame to compute (which likely correlates with the rasterization times going long as well).

Eventually we'd want some really sophisticated prediction algorithms that try to figure out "If I start computing a scene now, how long do I think it will take, how long do I think we'd take to render it, and what time do I actually think it will show up on the screen?", but even the baby step I just outlined would be one step better than our current system of always rendering based on the reported times. Also, such a heuristic calculation might get caught off guard if its predictions suddenly change - after all, who knows how much complexity got added or removed at the app level? Nothing in this code has that information. But at least it can decide not to compute a scene that it can show won't get rendered until after the reported target time.

Also, a more sophisticated prediction algorithm would come in 2 parts - the build part that tries to target the sum of "build + raster" timing, and then when the task is posted to the GPU runner it will have to compute when to start processing the frame so that it posts approximately when it was computed for. There is a lot of magic to consider, and be worried about, there...

@flar changed the title from "Only rasterizer more than one layer tree per vsync frame" to "Only rasterize one layer tree per vsync frame" Jan 29, 2020
@flar (Contributor, Author) commented Jan 29, 2020

(liyuqian's comments on the code made me realize I should have added a summary of the fix to explain what you are seeing here, apologies.)

Here is an overview of the fix...

In breaking down how we handle frames I discovered a couple of things:

  • The problem I'm addressing is that we get into a state where each frame might only need 1-2ms of processing to rasterize, but the rasterizer thread immediately gets blocked on its first GPU operation until the next vsync interval. We don't waste energy in that state, but it ruins our accounting of time spent doing actual rasterization, and we lose flexibility in reacting to changes in the scene if we are stuck in a kernel wait for close to 16ms at a time.

  • We know that pipeline depth=1 prevents the condition that I'm trying to solve, so I used that as a guide/barometer on what properties of this frame-pumping "pipeline" (plus associated mechanisms) could be responsible.

  • The condition always happened right after the GPU thread rasterized more than one scene in a given reported vsync interval so I wanted to prevent that condition specifically. Indeed, while other parts of the fix may have statistically made the condition less likely, it never truly went away until I coded up the GPU postponement mechanism, and the "double context-present" action was always present before the condition even when I was throttling other parts of the pump - verifying the hypothesis.

  • The other parts of the fix are also helpful to the main part of the fix in a few ways...

  • Recording the target time of a frame in the layer_tree helps the GPU thread figure out "am I in the same vsync interval as my previous render completion?" It also helps with deciding when to schedule the next invocation.

  • A queue of LayerTree objects doesn't make much sense. I don't think we get a benefit of rendering LayerTree(frame - 2) when LayerTree(frame - 1) is already computed. If we receive a new LayerTree, that is the next one to process whether we already have one waiting or not. Additionally, if we push a LayerTree back to the front of the queue because we got a "kResubmit" condition from trying to render it, then it doesn't make sense to reprocess it if we already have a new one - just switch to the new one instead. Chinmay and I verified that the resubmit comes from the need to switch threads when embedded and so it shouldn't matter which frame we process in that case, we just want to make sure that one is available for rendering.

  • I can eliminate the queue entirely, but my first pass at seeing if reducing it to a "queue of 1" was localized to just clearing it out when we fill it as a "proof of concept" hack. If the fixes here pass muster, then I'll replace it in a non-WIP proposal with just a "nextTree" pointer or something similar.

I'll respond to liyuqian's comments individually now...

@liyuqian (Contributor):
@chinmaygarde @dnfield: are we double-buffered on iOS? This PR seems to suggest that we only have one buffer on iOS so rendering the second frame in one vsync would cause it to wait.

@flar: what would happen if you simply set pipeline depth to 1 (see, e.g., #3909)?

@dnfield (Contributor) commented Jan 30, 2020

@flar - is this an accurate representation of the problem?

  1. We try to queue up unrasterized layers. In theory, this helps us be ready to render next frames more quickly.
  2. Unfortunately, actually rasterizing the layer can cause a wait for next vsync on iOS - in other words, we really can only rasterize one layer per vsync, because otherwise we end up getting stuck in the GL driver and actually making things worse.

If that's correct, I have the following questions:

  • What about other platforms? This change affects all platforms, so it really needs some good tests there to prove that we're not bringing everyone down to a lowest common denominator.
  • What about Metal? Metal APIs should give us some more knobs to turn. If we make this change for a deprecated API that we're looking to move away from, would our efforts be better spent on finding the right Metal setup and finishing the iOS Metal migration?
  • If this is true of OpenGL on other platforms, what about Vulkan?

@dnfield (Contributor) commented Jan 30, 2020

@liyuqian I believe we are double buffered at least when there is a BDF in the layer tree. But it seems like the problem is that the iOS GL driver might be angry at us for trying to make it do more than it thinks is appropriate per vsync.

@flar (Contributor, Author) commented Jan 30, 2020

@dnfield BDF has us insert a layer, but it doesn't change the way that the frames are pumped. I have more to say about the terms "double buffering" and "triple buffering" because I don't think we are using them properly, but I do consider the saveLayer used when there is a BDF to not be "double buffering". (Basically, "buffering" in the double-vs-triple sense describes an interface between the various agents involved in moving the graphical data along. Because the same thread constructs the saveLayer, fills it, and renders it into the scene, it doesn't affect our "N buffering" designation.)

@liyuqian iOS appears to let us render 2 frames in one vsync, but then we always get hit with a block on a following frame. So maybe there are 2 renderable frames, but when we make use of them we end up in a bad state, so I'm preventing that. That is partially why I think more discussion may be warranted: while limiting ourselves to 1 per frame does eliminate the problem, there is obviously some wiggle room that we don't yet understand and perhaps could make use of if we knew what its parameters and limits were.

@liyuqian - depth=1 is one fix for this problem, but it prevents us from "building ahead" so I'm attempting to preserve "build-ahead" while also preventing the blocking state.

All - yes, indeed, if you read my statements above, this may be an iOS-only solution, but I'm not going to go through the trouble of adding the flags necessary to limit it at this stage of the WIP. First, let's discuss how these changes to the rendering model affect how we do work before we deal with limiting their scope to just the problem areas.

I'm debating a brain dump in github comments vs. writing a doc about some of these issues. For instance, my description above of what "double buffering" really is, in my answer to Dan, would take 3 or 4 paragraphs with diagrams. And then we get to the question of how to apply that to our architecture. (And I'm still trying to get it all straight in my head first as well.)

@flar (Contributor, Author) commented Jan 30, 2020

I also want to add that underneath this all, whether or not we can find a way to render more than one frame per interval on iOS...

I don't see a need to have a queue of LayerTrees when the ones that are in the queue are obviously obsoleted by the one we are about to put there. It may be that some platforms don't mind if we spend time rendering some extra old LayerTrees to catch up, but we shouldn't be doing that in the first place. Unfortunately, just gutting the queue doesn't completely solve our problem on iOS.

@dnfield (Contributor) commented Jan 30, 2020

I would find it helpful to have a design doc on this, which we could comment on a bit more coherently and discuss the various implications/tradeoffs :)

@chinmaygarde (Member):
This WIP PR seems stalled. Can we close it?

@flar (Contributor, Author) commented Feb 5, 2021

There was other work being done by @iskakaushik along these lines that was hoping to separate the production and rendering of frames. This implementation may be obsoleted by that work?

@flar (Contributor, Author) commented Feb 5, 2021

We still suffer from this problem today. I run into it all the time when running the macrobenchmarks -> ImageFiltered Transform Animation page. If I switch from one mode to another I can get stuck in this 16ms/frame mode. I have to unselect the mode and reselect it to get down to the actual performance metric. It's very repeatable.

We've sat on that issue for quite a long time and we need to get to the bottom of it and solve it.

I notice that there is no official issue for the problem so I filed one: flutter/flutter#75540

@flutter-dashboard:
It looks like this pull request may not have tests. Please make sure to add tests before merging. If you need an exemption to this rule, contact Hixie on the #hackers channel in Chat.

If you are not sure if you need tests, consider this rule of thumb: the purpose of a test is to make sure someone doesn't accidentally revert the fix. Ask yourself, is there anything in your PR that you feel it is important we not accidentally revert back to how it was before your fix?

Reviewers: Read the Tree Hygiene page and make sure this patch meets those guidelines before LGTMing.

@flar (Contributor, Author) commented Feb 17, 2021

This PR may have been obsoleted by #24387

That PR originally had to be reverted due to failures in some CL tests, so I'm waiting for that PR to be fully rolled up the chain before closing this one.

@cbracken removed their request for review February 23, 2021 21:10
@flar (Contributor, Author) commented Jul 14, 2021

This PR is so far behind the current state of the engine that it doesn't really even serve as a good example of a prototype. Closing. The related PR should be the one to watch moving forward and the issue is definitely on the engine radar.

@flar closed this Jul 14, 2021