Add synchronization mechanism to avoid deadlock caused by thread merging #30314

0xZOne · 2021-12-14T11:43:07Z

To avoid deadlock caused by thread merging:

Add a synchronization mechanism to ensure that the raster thread does not perform thread merging while the platform thread is posting a synchronous task to the raster thread;
Before thread merging, flush the expired tasks to ensure that the synchronous task has already been executed.

Fixes issue: #94524

/cc @dnfield

Pre-launch Checklist

I read the Contributor Guide and followed the process outlined there for submitting PRs.
I read the Tree Hygiene wiki page, which explains my responsibilities.
I read and followed the Flutter Style Guide and the C++, Objective-C, Java style guides.
I listed at least one issue that this PR fixes in the description above.
I added new tests to check the change I am making or feature I am adding, or Hixie said the PR is test-exempt. See testing the engine for instructions on writing and running engine tests.
I updated/added relevant documentation (doc comments with ///).
I signed the CLA.
All existing and new tests are passing.

If you need help, consider asking for advice on the #hackers-new channel on Discord.

dnfield · 2021-12-14T19:10:28Z

fml/raster_thread_merger.cc

-  }
+  // the critical section of thread merging
+  {
+    ScopedCriticalSection lock(critical_section_.get());


Rather than this, could we just flush the task runners to make sure nothing else has added itself behind our back? Am I missing something here?

Thanks for your rapid reply. :)

The platform thread needs to wait for the raster thread to complete some tasks synchronously, for example, platform_view->CreateRenderingSurface. If the two threads are merged during such a synchronous call, the task queue of the raster thread must also be processed by the platform thread. But the platform thread is already blocked waiting for the task to finish, so it never takes a task from the raster thread's task queue.

So, I added a synchronization mechanism here to avoid deadlock.

For example:

void TaskRunner::RunNowOrPostTask(fml::RefPtr<fml::TaskRunner> runner,
                                  const fml::closure& task) {
  FML_DCHECK(runner);
  if (runner->RunsTasksOnCurrentThread()) {  /// -------->>> A
    task();
  } else {
    runner->PostTask(std::move(task));  /// ----------->> B
  }
}

void PlatformView::NotifyCreated() {
  ...
  fml::ManualResetWaitableEvent latch;
  fml::TaskRunner::RunNowOrPostTask(
      task_runners_.GetRasterTaskRunner(),
      [platform_view, &surface, &latch]() {
        surface = platform_view->CreateRenderingSurface();
        if (surface && !surface->IsValid()) {
          surface.reset();
        }
        latch.Signal();
      });
  latch.Wait();  ////------>>>> C
  ...
}

Suppose the platform thread has passed the check at point A (the threads are not merged yet) but has not yet reached point B. The raster thread then preempts it and performs the thread merge (assuming the raster task queue is empty at that moment). When the platform thread gets the CPU again, it still posts the task to the raster task queue, which after the merge can only be serviced by the platform thread. But the platform thread then blocks at point C, so the task never runs.

So if we put a mutex around RunNowOrPostTask that was shared with the thread merger, would that simplify things at all?

IOW, before A lock a mutex that MergeThreads also has to lock to make progress, and as long as you use RunNowOrPostTask (which we do) you should be ok - if you don't, you need to opt in.
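To make the shared-mutex idea concrete, here is a minimal sketch under stated assumptions. `MergeGate` and its methods are hypothetical names, not part of fml; the point is only that the "merged?" check and the post happen under the same mutex that the merge must take, so a merge cannot slip in between points A and B.

```cpp
#include <cassert>
#include <functional>
#include <mutex>

// Hypothetical helper, not part of fml. One mutex guards both the
// "are we merged?" decision and the merge itself.
class MergeGate {
 public:
  // Decides between running now and posting while holding the mutex,
  // so Merge() cannot run between the check and the post.
  void RunNowOrPost(const std::function<void()>& run_now,
                    const std::function<void()>& post) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (merged_) {
      run_now();
    } else {
      post();
    }
  }

  // Must take the same mutex to make progress.
  void Merge() {
    std::lock_guard<std::mutex> lock(mutex_);
    merged_ = true;
  }

  bool IsMerged() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return merged_;
  }

 private:
  mutable std::mutex mutex_;
  bool merged_ = false;
};
```

In this sketch the callbacks run while the mutex is held, so they must not call `Merge()` themselves (the mutex is not recursive). It only closes the window between the check and the post; it does not help if the merge happens after the task was posted but before it has run.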

The above is only part of the problem. Please consider another scenario:

After the platform thread posts the sync task to the raster task queue, the raster thread gets the CPU and starts thread merging immediately, without ever getting a chance to execute the previously posted sync task.

For this scenario, we may need to flush the raster task runner before merging to make sure that the sync task posted by the platform thread has been executed.

However, because tasks may depend on the order in which they run, we cannot simply flush all of the main task queues. Maybe we can add a new queue type for these sync tasks in TaskSource, similar to MicroTasks but with a higher priority, or tag these sync tasks. We would then flush all tasks in the sync task queue before the thread merge.

(The results of these sync tasks, e.g. platform_view->CreateRenderingSurface, are created on the raster thread but are used on the platform thread after thread merging.)
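A rough sketch of the separate sync queue idea, under loud assumptions: `ToyRasterQueue`, `PostSyncTask`, `FlushSyncTasks`, and `MergeAfterFlushingSyncTasks` are made-up names for illustration and do not exist in fml. The sketch is single-threaded and ignores locking; it only shows that sync tasks can be drained before the merge while ordinary (frame) tasks stay queued in order.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

using Task = std::function<void()>;

// Hypothetical toy queue, not fml::MessageLoopTaskQueues.
class ToyRasterQueue {
 public:
  void PostTask(Task task) { regular_.push_back(std::move(task)); }
  void PostSyncTask(Task task) { sync_.push_back(std::move(task)); }

  // Runs only the sync tasks; regular tasks keep their order and stay queued.
  void FlushSyncTasks() {
    while (!sync_.empty()) {
      Task task = std::move(sync_.front());
      sync_.pop_front();
      task();
    }
  }

  std::size_t PendingRegularTasks() const { return regular_.size(); }
  std::size_t PendingSyncTasks() const { return sync_.size(); }

 private:
  std::deque<Task> regular_;
  std::deque<Task> sync_;
};

// Hypothetical merge step: drain sync tasks first, then report "merged".
bool MergeAfterFlushingSyncTasks(ToyRasterQueue& queue) {
  queue.FlushSyncTasks();
  return true;
}
```

For example, posting one frame task and one sync task and then calling `MergeAfterFlushingSyncTasks` runs the sync task but leaves the frame task pending, to be picked up after the merge.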

@dnfield @chinmaygarde @iskakaushik @cyanglaz Any advice? Thanks.

chinmaygarde · 2021-12-14T19:18:31Z

fml/synchronization/critical_section.h

+
+namespace fml {
+
+class CriticalSection : public fml::RefCountedThreadSafe<CriticalSection> {


Isn't this just a scoped_lock or lock_guard around a std::shared_ptr<std::mutex>?

Yes, I didn't find a suitable lock at the beginning, so I added one. :) It may be redundant.

If you all think this solution works, I will refactor the code and improve the test cases. Thanks.
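For reference, a minimal sketch of what "a lock_guard around a std::shared_ptr<std::mutex>" could look like in place of a custom CriticalSection class. `ToyThreadMerger` is a hypothetical stand-in, not the real fml::RasterThreadMerger; the mutex is per instance and shared by whoever is handed the pointer, rather than being global.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <utility>

// Hypothetical stand-in for a thread merger. The mutex is owned jointly
// by the merger and by any code that posts sync tasks.
class ToyThreadMerger {
 public:
  explicit ToyThreadMerger(std::shared_ptr<std::mutex> merge_mutex)
      : merge_mutex_(std::move(merge_mutex)) {}

  // The critical section is the merge itself.
  void Merge() {
    std::lock_guard<std::mutex> lock(*merge_mutex_);
    merged_ = true;
  }

  bool IsMerged() const {
    std::lock_guard<std::mutex> lock(*merge_mutex_);
    return merged_;
  }

 private:
  std::shared_ptr<std::mutex> merge_mutex_;
  bool merged_ = false;
};
```

A caller that posts a sync task would hold a `std::lock_guard` on the same shared mutex for the duration of the post, which blocks `Merge()` until the guard is released.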

flutter-dashboard · 2021-12-21T11:13:41Z

It looks like this pull request may not have tests. Please make sure to add tests before merging. If you need an exemption to this rule, contact Hixie on the #hackers channel in Chat.

If you are not sure if you need tests, consider this rule of thumb: the purpose of a test is to make sure someone doesn't accidentally revert the fix. Ask yourself, is there anything in your PR that you feel it is important we not accidentally revert back to how it was before your fix?

Reviewers: Read the Tree Hygiene page and make sure this patch meets those guidelines before LGTMing.

0xZOne · 2021-12-24T04:40:14Z

@dnfield @chinmaygarde @iskakaushik @cyanglaz please have another look, thanks.

dnfield · 2022-01-11T00:20:23Z

fml/raster_thread_merger.cc

+  // TODO(0xZOne): Maybe we should only flush expired sync tasks posted by the
+  // platform thread.
+  if (MessageLoop::IsInitializedForCurrentThread()) {
+    if (!IsOnPlatformThread()) {
+      MessageLoop::GetCurrent().RunExpiredTasksNow();
+    }


@iskakaushik @cyanglaz - do we want to specialize this to the PlatformThread?

Do we not also need to flush any raster tasks?

Should we just be flushing both queues first?

Sorry about the delay. I just cycled back to this today.
We actually cannot flush raster tasks before thread merging. Here's why:

Let's say we have 2 tasks in the raster task queue waiting to be executed. Frame 1 and Frame 2.
Each task represents a frame that will be drawn on the screen. (For example, a scroll animation or a new page transition animation)

Now frame 1 gave us the signal that a thread merge needs to happen (there is a platform view appearing in frame 1).

What we do today is that we drop frame 1 after pre-roll (that's when we know a platform view is in the layer_tree), create a new copy of frame 1, merge the thread, run frame 1 and frame 2 on platform thread.

So we cannot just flush frame 1 and frame 2 on the raster thread, because they have to be run on the platform thread. Even when frame 2 doesn't need to run on the platform thread (as in the example in the issue), we still need to run frame 1 on the platform thread.
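The drop-and-resubmit behavior described above can be modeled with a toy loop. This is only an illustration under assumptions: `Frame`, `Thread`, `Rasterized`, and `DrainFrames` are hypothetical names, and the real rasterizer logic is considerably more involved.

```cpp
#include <cassert>
#include <deque>
#include <vector>

// Hypothetical model of a queued frame.
struct Frame {
  int id;
  bool has_platform_view;
};

enum class Thread { kRaster, kPlatform };

struct Rasterized {
  int frame_id;
  Thread thread;
};

// Drains the queue and records which thread each frame ends up drawn on.
std::vector<Rasterized> DrainFrames(std::deque<Frame> queue) {
  std::vector<Rasterized> out;
  bool merged = false;
  while (!queue.empty()) {
    Frame frame = queue.front();
    queue.pop_front();
    if (frame.has_platform_view && !merged) {
      // "Preroll" found a platform view: drop this attempt, merge the
      // threads, and resubmit a copy of the frame at the front.
      merged = true;
      queue.push_front(frame);
      continue;
    }
    out.push_back({frame.id, merged ? Thread::kPlatform : Thread::kRaster});
  }
  return out;
}
```

With frames 1 (platform view) and 2 (no platform view) queued, both end up on the platform thread in this model, which is why flushing them on the raster thread before the merge would be wrong.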

dnfield · 2022-01-11T00:21:31Z

fml/task_runner.h

+  /// Similar to |RunNowOrPostTask|, but it needs to synchronize with thread
+  /// merging operation.
+  static void RunNowOrPostSyncTask(fml::RefPtr<fml::TaskRunner> runner,
+                                   const fml::closure& task);
+


I would avoid adding this new method - someone is likely to use the wrong one not realizing they were supposed to use this one because of thread merging.

Here a lock is added to work around the data race that occurs when the raster thread performs thread merging while the platform thread is posting a sync task to the raster thread.

Most call sites of RunNowOrPostTask do not need the lock and should not pay a performance penalty for it, so in the end I chose to add this new method. It would be best if there were an approach that could address both concerns.

I agree with @dnfield , this method can be misused. Actually this was discussed before.
#28159 (comment)

I agree with @dnfield , this method can be misused. Actually this was discussed before. #28159 (comment)

It seems a good choice.

void TaskRunner::RunNowOrPostTaskSync(fml::RefPtr<fml::TaskRunner> runner,
                                      const fml::closure& task) {
  FML_DCHECK(runner);
  std::scoped_lock lock(g_thread_merging_lock);
  if (runner->RunsTasksOnCurrentThread()) {
    task();
  } else {
    fml::AutoResetWaitableEvent latch;
    runner->PostTask([&] {
      task();
      latch.Signal();
    });
    latch.Wait();
  }
}

@dnfield @ColdPaleLight

What about this sync method? It encapsulates the latch synchronization operation into the function. It may mitigate the probability of misuse. After all, it is quite different from RunNowOrPostTask. :)

I think the bigger problem is how do I know when I have to use this method, and why should I ever choose RunNowOrPostTask over this one?

/cc @chinmaygarde

chinmaygarde

I haven't fully dug into the specifics of this patch but we should not have global locks in the base task runner interfaces.

chinmaygarde · 2022-02-03T21:16:06Z

fml/task_runner.cc


 namespace fml {

+std::mutex g_thread_merging_lock;


We can't have global locks when working with non-singleton instances.

chinmaygarde · 2022-02-03T21:18:09Z

fml/task_runner.h


+/// To avoid data races between the platform thread posting sync tasks to the
+/// raster thread and the raster thread merging.
+extern std::mutex g_thread_merging_lock;


When adding a mutex, I would ask myself what the critical section is. From the documentation, it is not clear what that is.

0xZOne · 2022-02-11T07:53:43Z

I haven't fully dug into the specifics of this patch but we should not have global locks in the base task runner interfaces.

Thank you for your reply. I know using a global lock here is indeed not an elegant solution, but I haven't thought of a better one for a while.

If we don't do thread merging, there will be no deadlock. But can we? :)

I have run tests and didn't notice any obvious platform view issues after disabling thread merging on Android. engine/pull/31393

@chinmaygarde Could you explain why thread merging is needed? Thanks.

0xZOne · 2022-02-11T08:18:25Z

Is it to solve the out-of-sync issue between Flutter UI and platform view?

chinmaygarde · 2022-02-17T21:22:15Z

@cyanglaz (who self-assigned the linked issue) or @iskakaushik Can you provide an alternative that does not require global locks?

cyanglaz · 2022-02-17T21:59:34Z

@chinmaygarde Sorry I didn't know that I was still assigned to the issue. I don't have bandwidth to take a look at this issue at the moment so I unassigned myself.

That being said, I took a look at the PR (sorry for the delay) and left a comment following up on Dan's comment. I also would prefer a global solution in the thread merger instead of just fixing a single instance, but I don't think we can just flush the tasks before merging. The first task in the queue is where we learn that the merge needs to happen, so the tasks can't be flushed on the raster thread. See above for the detailed comment.

chinmaygarde · 2022-03-03T21:25:47Z

Based on the discussion here, we need a solution that doesn't introduce global locks. This needs more nuanced design work that we currently don't have the cycles to undertake. Perhaps a localized fix is the way forward in the meantime.

flutter-dashboard bot added the platform-android label Dec 14, 2021

0xZOne marked this pull request as draft December 14, 2021 11:46

0xZOne marked this pull request as ready for review December 14, 2021 11:57

dnfield reviewed Dec 14, 2021

dnfield requested review from chinmaygarde, cyanglaz and iskakaushik December 14, 2021 19:10

chinmaygarde reviewed Dec 14, 2021

0xZOne marked this pull request as draft December 15, 2021 01:59

0xZOne changed the title ~~Add synchronization mechanism to fix deadlock caused by thread merging~~ Add synchronization mechanism to avoid deadlock caused by thread merging Dec 15, 2021

0xZOne marked this pull request as ready for review December 15, 2021 04:39

flutter-dashboard bot added the platform-fuchsia label Dec 15, 2021

0xZOne requested review from chinmaygarde and dnfield December 15, 2021 04:39

0xZOne marked this pull request as draft December 15, 2021 11:00

0xZOne marked this pull request as ready for review December 21, 2021 11:13

flutter-dashboard bot added the embedder Related to the embedder API label Dec 21, 2021

flutter-dashboard bot added the needs tests label Dec 21, 2021

0xZOne marked this pull request as draft December 22, 2021 08:45

0xZOne force-pushed the task/platform-view-deadlock2 branch 2 times, most recently from 0b909cc to 3a0f6c7 Compare December 24, 2021 02:36

0xZOne marked this pull request as ready for review December 24, 2021 03:27

0xZOne mentioned this pull request Dec 24, 2021

Fix deadlock caused by thread merging #30067

Closed

0xZOne force-pushed the task/platform-view-deadlock2 branch from 65f7fec to 1f9df55 Compare December 24, 2021 10:50

ColdPaleLight added the customer: alibaba label Jan 6, 2022

dnfield reviewed Jan 11, 2022

0xZOne force-pushed the task/platform-view-deadlock2 branch from 1f9df55 to 328208b Compare January 14, 2022 05:50

0xZOne requested review from ColdPaleLight and dnfield January 14, 2022 05:53

0xZOne added 3 commits January 14, 2022 22:09

Fix platformview deadlock.

c0c591f

rename global lock

7973865

Add synchronous |PostTask| function.

da35807

0xZOne force-pushed the task/platform-view-deadlock2 branch from 328208b to da35807 Compare January 14, 2022 14:09

chinmaygarde requested review from gaaclarke and jason-simmons February 3, 2022 21:13

chinmaygarde suggested changes Feb 3, 2022

chinmaygarde closed this Mar 3, 2022

