Skip to content
This repository was archived by the owner on Feb 25, 2025. It is now read-only.

Conversation

@0xZOne
Copy link
Member

@0xZOne 0xZOne commented Dec 14, 2021

To avoid deadlock caused by thread merging:

  1. Add synchronization mechanism to ensure that during the platform thread posting synchronization messages to the raster thread, the raster thread will not perform the thread merging;
  2. Before the thread merging, flush the expired task to ensure that the synchronization message has been executed.

Fixes issue: #94524

/cc @dnfield

Pre-launch Checklist

  • I read the Contributor Guide and followed the process outlined there for submitting PRs.
  • I read the Tree Hygiene wiki page, which explains my responsibilities.
  • I read and followed the Flutter Style Guide and the C++, Objective-C, Java style guides.
  • I listed at least one issue that this PR fixes in the description above.
  • I added new tests to check the change I am making or feature I am adding, or Hixie said the PR is test-exempt. See testing the engine for instructions on
    writing and running engine tests.
  • I updated/added relevant documentation (doc comments with ///).
  • I signed the CLA.
  • All existing and new tests are passing.

If you need help, consider asking for advice on the #hackers-new channel on Discord.

@0xZOne 0xZOne marked this pull request as draft December 14, 2021 11:46
@0xZOne 0xZOne marked this pull request as ready for review December 14, 2021 11:57
}
// the critical section of thead merging
{
ScopedCriticalSection lock(critical_section_.get());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than this, could we just flush the task runners to make sure nothing else has added itself behind our back? Am I missing something here?

Copy link
Member Author

@0xZOne 0xZOne Dec 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your rapid reply. :)

The platform thread needs to wait for the raster thread to complete some tasks synchronously, for example, platform_view->CreateRenderingSurface. During the synchronous call, if the two threads are merged, this means the task queue of the raster thread also needs to be processed by the platform thread. But, the platform thread is already locked by itself and never gets a task from the task queue of the raster thread.

So, I added a synchronization mechanism here to avoid deadlock.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example:

void TaskRunner::RunNowOrPostTask(fml::RefPtr<fml::TaskRunner> runner,
                                  const fml::closure& task) {
  FML_DCHECK(runner);
  if (runner->RunsTasksOnCurrentThread()) { /// -------->>> A
    task();
  } else {
    runner->PostTask(std::move(task));   /// ----------->> B
  }
}
void PlatformView::NotifyCreated() {
  ...
  fml::ManualResetWaitableEvent latch;
  fml::TaskRunner::RunNowOrPostTask(
      task_runners_.GetRasterTaskRunner(),
      [platform_view, &surface, &latch]() {
        surface = platform_view->CreateRenderingSurface();
        if (surface && !surface->IsValid()) {
          surface.reset();
        }
        latch.Signal();
      });
  latch.Wait();   ////------>>>> C
  ...
}

when the platform thread executes after A point and before B point, the raster thread preempts the CPU and obtains the right to execute, and starts thread merging (Assuming the raster task queue is empty now). When the platform thread gets the CPU again, the task continues to be put into the raster task queue, waiting for the platform thread to be processed later. But the platform thread will block at the C point.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if we put a mutex around RunNowOrPostTask that was shared with the thread merger, would that simplify things at all?

IOW, before A lock a mutex that MergeThreads also has to lock to make progress, and as long as you use RunNowOrPostTask (which we do) you should be ok - if you don't, you need to opt in.

Copy link
Member Author

@0xZOne 0xZOne Dec 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The above is only part of the problem. Please consider another scenario:

After the platform thread posts the sync task to the raster task queue, the raster thread gets the CPU and starts thread merging immediately, and no chance to get the previous sync task to execute.

For this scenario, we maybe need to flush the raster task runner before merging to make sure the sync task the platform thread post has been executed.

However, due to the possible order dependency between tasks, we cannot simply flush all the main task queues. Maybe we can add a new queue type for these sync tasks in TaskSource similar to MicroTasks but with a higher priority, or set a tag for these sync tasks. We try to flush all tasks in the sync task queue before the raster merging thread.

(The result of these sync tasks, eg. platform_view->CreateRenderingSurface, are created in the raster thread, but are used in the platform thread after thread merging.)

@dnfield @chinmaygarde @iskakaushik @cyanglaz Any advice? Thanks.


namespace fml {

class CriticalSection : public fml::RefCountedThreadSafe<CriticalSection> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this just a scoped_lock or lock_guard around a std::shared_ptr<std::mutex>?

Copy link
Member Author

@0xZOne 0xZOne Dec 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I didn't find a suitable lock at the beginning, so I added one. :) It may be redundant.

If you all think this solution is work, I will refactor the code and improve the test case. thanks.

@0xZOne 0xZOne marked this pull request as draft December 15, 2021 01:59
@0xZOne 0xZOne changed the title Add synchronization mechanism to fix deadlock caused by thread merging Add synchronization mechanism to avoid deadlock caused by thread merging Dec 15, 2021
@0xZOne 0xZOne marked this pull request as ready for review December 15, 2021 04:39
@0xZOne 0xZOne marked this pull request as draft December 15, 2021 11:00
@0xZOne 0xZOne marked this pull request as ready for review December 21, 2021 11:13
@flutter-dashboard flutter-dashboard bot added the embedder Related to the embedder API label Dec 21, 2021
@flutter-dashboard
Copy link

It looks like this pull request may not have tests. Please make sure to add tests before merging. If you need an exemption to this rule, contact Hixie on the #hackers channel in Chat.

If you are not sure if you need tests, consider this rule of thumb: the purpose of a test is to make sure someone doesn't accidentally revert the fix. Ask yourself, is there anything in your PR that you feel it is important we not accidentally revert back to how it was before your fix?

Reviewers: Read the Tree Hygiene page and make sure this patch meets those guidelines before LGTMing.

@0xZOne 0xZOne marked this pull request as draft December 22, 2021 08:45
@0xZOne 0xZOne force-pushed the task/platform-view-deadlock2 branch 2 times, most recently from 0b909cc to 3a0f6c7 Compare December 24, 2021 02:36
@0xZOne 0xZOne marked this pull request as ready for review December 24, 2021 03:27
@0xZOne
Copy link
Member Author

0xZOne commented Dec 24, 2021

@dnfield @chinmaygarde @iskakaushik @cyanglaz please have another look, thanks.

Comment on lines +72 to +77
// TODO(0xZOne): Maybe we should only flush expired sync tasks posted by the
// platform thread.
if (MessageLoop::IsInitializedForCurrentThread()) {
if (!IsOnPlatformThread()) {
MessageLoop::GetCurrent().RunExpiredTasksNow();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@iskakaushik @cyanglaz - do we want to specialize this to the PlatformThread?

Do we not also need to flush any raster tasks?

Should we just be flushing both queues first?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry about the delay. I just cycled back to this today.
We actually cannot flush raster tasks before thread merging. Here's why:

Let's say we have 2 tasks in the raster task queue waiting to be executed. Frame 1 and Frame 2.
Each task represents a frame that will be drawn on the screen. (For example, a scroll animation or a new page transition animation)

Now frame 1 gave us the signal that a thread merge needs to happen (there is a platform view appearing in frame 1).

What we do today is that we drop frame 1 after pre-roll (that's when we know a platform view is in the layer_tree), create a new copy of frame 1, merge the thread, run frame 1 and frame 2 on platform thread.

So we cannot just flush frame 1 and frame 2 on the raster thread because they have to be run on the platform thread. Even when frame 2 doesn't need to run on platform thread (like how the example in the issue is), we still need to run frame 1 on the platform thread.

Comment on lines 69 to 73
/// Similar to |RunNowOrPostTask|, but it needs to synchronize with thread
/// merging operation.
static void RunNowOrPostSyncTask(fml::RefPtr<fml::TaskRunner> runner,
const fml::closure& task);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would avoid adding this new method - someone is likely to use the wrong one not realizing they were supposed to use this one because of thread merging.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here add a lock to work around data race caused by the raster thread performing thread merging in the process of the platform thread post sync task to the raster thread.

Considering that most scenarios where RunNowOrPostTask is called do not require the lock and should not suffer a performance penalty, finally I choose to add this new method. It would be best if there was a workaround that could take care of both.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @dnfield , this method can be misused. Actually this was discussed before.
#28159 (comment)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @dnfield , this method can be misused. Actually this was discussed before. #28159 (comment)

It seems a good choice.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

void TaskRunner::RunNowOrPostTaskSync(fml::RefPtr<fml::TaskRunner> runner,
                                      const fml::closure& task) {
  FML_DCHECK(runner);
  std::scoped_lock lock(g_thread_merging_lock);
  if (runner->RunsTasksOnCurrentThread()) {
    task();
  } else {
    fml::AutoResetWaitableEvent latch;
    runner->PostTask([&] {
      task();
      latch.Signal();
    });
    latch.Wait();
  }
}

@dnfield @ColdPaleLight

What about this sync method? It encapsulates the latch synchronization operation into the function. It may mitigate the probability of misuse. After all, it is quite different from RunNowOrPostTask. :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the bigger problem is how do I know when I have to use this method, and why should I ever choose RunNowOrPostTask over this one?

/cc @chinmaygarde

@0xZOne 0xZOne force-pushed the task/platform-view-deadlock2 branch from 1f9df55 to 328208b Compare January 14, 2022 05:50
@0xZOne 0xZOne force-pushed the task/platform-view-deadlock2 branch from 328208b to da35807 Compare January 14, 2022 14:09
Copy link
Member

@chinmaygarde chinmaygarde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't fully dug into the specifics of this patch but we should not have global locks in the base task runner interfaces.


namespace fml {

std::mutex g_thread_merging_lock;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't have global locks when working with non-singleton instances.


/// To avoid data races between the platform thread posting sync tasks to the
/// raster thread and the raster thread merging.
extern std::mutex g_thread_merging_lock;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When adding a mutex, I would ask myself what the critical section is. From the documentation, it is not clear what that is.

@0xZOne
Copy link
Member Author

0xZOne commented Feb 11, 2022

I haven't fully dug into the specifics of this patch but we should not have global locks in the base task runner interfaces.

Thank you for your reply. I know using a global lock here is indeed not an elegant solution, but I haven't thought of a better one for a while.

If we don't do thread merging, there will be no deadlock. But can we? :)

I have done tests and didn't notice any obvious issues related to platform view after disabling thread merging on android. engine/pull/31393

@chinmaygarde Could you explain why thread merging is needed? Thanks.

@0xZOne
Copy link
Member Author

0xZOne commented Feb 11, 2022

Is it to solve the out-of-sync issue between Flutter UI and platform view?

@chinmaygarde
Copy link
Member

@cyanglaz (who self assigned the linked issue) or @iskakaushik Can you provide an alternative that does require global locks?

@cyanglaz
Copy link
Contributor

cyanglaz commented Feb 17, 2022

@chinmaygarde Sorry I didn't know that I was still assigned to the issue. I don't have bandwidth to take a look at this issue at the moment so I unassigned myself.

That being said, I took a look at the PR (sorry for the delay), left a comment following up Dan's comment. I also would prefer a global solution in the thread merger instead of just fixing for a single instance, but I don't think we can just flush the tasks before merging. The first task in the queue is where we know the merge needs to happen, so they can't be flushed on the raster thread. See above for detailed comment.

@chinmaygarde
Copy link
Member

Based on the discussion here, we need a solution that doesn't introduce global locks. This needs more nuanced design work that we we currently don't have the cycles to undertake. Perhaps a localized fixed is the way forward in the meantime.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants