-
Notifications
You must be signed in to change notification settings - Fork 6k
Add synchronization mechanism to avoid deadlock caused by thread merging #30314
Conversation
fml/raster_thread_merger.cc
Outdated
| } | ||
| // the critical section of thead merging | ||
| { | ||
| ScopedCriticalSection lock(critical_section_.get()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than this, could we just flush the task runners to make sure nothing else has added itself behind our back? Am I missing something here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your rapid reply. :)
The platform thread needs to wait for the raster thread to complete some tasks synchronously, for example, platform_view->CreateRenderingSurface. During the synchronous call, if the two threads are merged, this means the task queue of the raster thread also needs to be processed by the platform thread. But, the platform thread is already locked by itself and never gets a task from the task queue of the raster thread.
So, I added a synchronization mechanism here to avoid deadlock.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example:
void TaskRunner::RunNowOrPostTask(fml::RefPtr<fml::TaskRunner> runner,
const fml::closure& task) {
FML_DCHECK(runner);
if (runner->RunsTasksOnCurrentThread()) { /// -------->>> A
task();
} else {
runner->PostTask(std::move(task)); /// ----------->> B
}
}void PlatformView::NotifyCreated() {
...
fml::ManualResetWaitableEvent latch;
fml::TaskRunner::RunNowOrPostTask(
task_runners_.GetRasterTaskRunner(),
[platform_view, &surface, &latch]() {
surface = platform_view->CreateRenderingSurface();
if (surface && !surface->IsValid()) {
surface.reset();
}
latch.Signal();
});
latch.Wait(); ////------>>>> C
...
}when the platform thread executes after A point and before B point, the raster thread preempts the CPU and obtains the right to execute, and starts thread merging (Assuming the raster task queue is empty now). When the platform thread gets the CPU again, the task continues to be put into the raster task queue, waiting for the platform thread to be processed later. But the platform thread will block at the C point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if we put a mutex around RunNowOrPostTask that was shared with the thread merger, would that simplify things at all?
IOW, before A lock a mutex that MergeThreads also has to lock to make progress, and as long as you use RunNowOrPostTask (which we do) you should be ok - if you don't, you need to opt in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The above is only part of the problem. Please consider another scenario:
After the platform thread posts the sync task to the raster task queue, the raster thread gets the CPU and starts thread merging immediately, and no chance to get the previous sync task to execute.
For this scenario, we maybe need to flush the raster task runner before merging to make sure the sync task the platform thread post has been executed.
However, due to the possible order dependency between tasks, we cannot simply flush all the main task queues. Maybe we can add a new queue type for these sync tasks in TaskSource similar to MicroTasks but with a higher priority, or set a tag for these sync tasks. We try to flush all tasks in the sync task queue before the raster merging thread.
(The result of these sync tasks, eg. platform_view->CreateRenderingSurface, are created in the raster thread, but are used in the platform thread after thread merging.)
@dnfield @chinmaygarde @iskakaushik @cyanglaz Any advice? Thanks.
|
|
||
| namespace fml { | ||
|
|
||
| class CriticalSection : public fml::RefCountedThreadSafe<CriticalSection> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this just a scoped_lock or lock_guard around a std::shared_ptr<std::mutex>?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I didn't find a suitable lock at the beginning, so I added one. :) It may be redundant.
If you all think this solution is work, I will refactor the code and improve the test case. thanks.
|
It looks like this pull request may not have tests. Please make sure to add tests before merging. If you need an exemption to this rule, contact Hixie on the #hackers channel in Chat. If you are not sure if you need tests, consider this rule of thumb: the purpose of a test is to make sure someone doesn't accidentally revert the fix. Ask yourself, is there anything in your PR that you feel it is important we not accidentally revert back to how it was before your fix? Reviewers: Read the Tree Hygiene page and make sure this patch meets those guidelines before LGTMing. |
0b909cc to
3a0f6c7
Compare
|
@dnfield @chinmaygarde @iskakaushik @cyanglaz please have another look, thanks. |
65f7fec to
1f9df55
Compare
| // TODO(0xZOne): Maybe we should only flush expired sync tasks posted by the | ||
| // platform thread. | ||
| if (MessageLoop::IsInitializedForCurrentThread()) { | ||
| if (!IsOnPlatformThread()) { | ||
| MessageLoop::GetCurrent().RunExpiredTasksNow(); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@iskakaushik @cyanglaz - do we want to specialize this to the PlatformThread?
Do we not also need to flush any raster tasks?
Should we just be flushing both queues first?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry about the delay. I just cycled back to this today.
We actually cannot flush raster tasks before thread merging. Here's why:
Let's say we have 2 tasks in the raster task queue waiting to be executed. Frame 1 and Frame 2.
Each task represents a frame that will be drawn on the screen. (For example, a scroll animation or a new page transition animation)
Now frame 1 gave us the signal that a thread merge needs to happen (there is a platform view appearing in frame 1).
What we do today is that we drop frame 1 after pre-roll (that's when we know a platform view is in the layer_tree), create a new copy of frame 1, merge the thread, run frame 1 and frame 2 on platform thread.
So we cannot just flush frame 1 and frame 2 on the raster thread because they have to be run on the platform thread. Even when frame 2 doesn't need to run on platform thread (like how the example in the issue is), we still need to run frame 1 on the platform thread.
fml/task_runner.h
Outdated
| /// Similar to |RunNowOrPostTask|, but it needs to synchronize with thread | ||
| /// merging operation. | ||
| static void RunNowOrPostSyncTask(fml::RefPtr<fml::TaskRunner> runner, | ||
| const fml::closure& task); | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would avoid adding this new method - someone is likely to use the wrong one not realizing they were supposed to use this one because of thread merging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here add a lock to work around data race caused by the raster thread performing thread merging in the process of the platform thread post sync task to the raster thread.
Considering that most scenarios where RunNowOrPostTask is called do not require the lock and should not suffer a performance penalty, finally I choose to add this new method. It would be best if there was a workaround that could take care of both.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @dnfield , this method can be misused. Actually this was discussed before.
#28159 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @dnfield , this method can be misused. Actually this was discussed before. #28159 (comment)
It seems a good choice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
void TaskRunner::RunNowOrPostTaskSync(fml::RefPtr<fml::TaskRunner> runner,
const fml::closure& task) {
FML_DCHECK(runner);
std::scoped_lock lock(g_thread_merging_lock);
if (runner->RunsTasksOnCurrentThread()) {
task();
} else {
fml::AutoResetWaitableEvent latch;
runner->PostTask([&] {
task();
latch.Signal();
});
latch.Wait();
}
}What about this sync method? It encapsulates the latch synchronization operation into the function. It may mitigate the probability of misuse. After all, it is quite different from RunNowOrPostTask. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the bigger problem is how do I know when I have to use this method, and why should I ever choose RunNowOrPostTask over this one?
/cc @chinmaygarde
1f9df55 to
328208b
Compare
328208b to
da35807
Compare
chinmaygarde
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't fully dug into the specifics of this patch but we should not have global locks in the base task runner interfaces.
|
|
||
| namespace fml { | ||
|
|
||
| std::mutex g_thread_merging_lock; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't have global locks when working with non-singleton instances.
|
|
||
| /// To avoid data races between the platform thread posting sync tasks to the | ||
| /// raster thread and the raster thread merging. | ||
| extern std::mutex g_thread_merging_lock; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When adding a mutex, I would ask myself what the critical section is. From the documentation, it is not clear what that is.
Thank you for your reply. I know using a global lock here is indeed not an elegant solution, but I haven't thought of a better one for a while. If we don't do thread merging, there will be no deadlock. But can we? :) I have done tests and didn't notice any obvious issues related to platform view after disabling thread merging on android. engine/pull/31393 @chinmaygarde Could you explain why thread merging is needed? Thanks. |
|
Is it to solve the out-of-sync issue between Flutter UI and platform view? |
|
@cyanglaz (who self assigned the linked issue) or @iskakaushik Can you provide an alternative that does require global locks? |
|
@chinmaygarde Sorry I didn't know that I was still assigned to the issue. I don't have bandwidth to take a look at this issue at the moment so I unassigned myself. That being said, I took a look at the PR (sorry for the delay), left a comment following up Dan's comment. I also would prefer a global solution in the thread merger instead of just fixing for a single instance, but I don't think we can just flush the tasks before merging. The first task in the queue is where we know the merge needs to happen, so they can't be flushed on the raster thread. See above for detailed comment. |
|
Based on the discussion here, we need a solution that doesn't introduce global locks. This needs more nuanced design work that we we currently don't have the cycles to undertake. Perhaps a localized fixed is the way forward in the meantime. |
To avoid deadlock caused by thread merging:
Fixes issue: #94524
/cc @dnfield
Pre-launch Checklist
writing and running engine tests.
///).If you need help, consider asking for advice on the #hackers-new channel on Discord.