Add ARM "nonatomic swap" framework for workarounds, fix new case in timeslicing #12400
Codecov Report
@@ Coverage Diff @@
## master #12400 +/- ##
=======================================
Coverage 48.27% 48.27%
=======================================
Files 295 295
Lines 44281 44281
Branches 10601 10601
=======================================
Hits 21376 21376
Misses 18636 18636
Partials 4269 4269
On ARM, _Swap() isn't atomic and a hardware interrupt can land after the (irq_locked) caller has entered _Swap() but before the context switch actually happens. This will require some platform-specific workarounds in a few places in the scheduler. This commit is just the Kconfig and selection on ARM. Signed-off-by: Andy Ross <[email protected]>
This is a refactoring of the fix in commit 6c95daf to limit its application to affected platforms now that the root cause is understood. Note that the bug that fix was addressing was rare and seen only after multi-hour sessions on Michael Scott's test rig. So if something regresses, this is where to look! Signed-off-by: Andy Ross <[email protected]>
Timeslicing works by removing the _current thread from the run queue and re-adding it at the end of its priority. On systems with a _Swap() that can be preempted by a timer interrupt, that means it's possible for the timeslice to try to slice out a thread that had already pended itself! This behavior used to be benign (or at least undetectable) as the duplicated list operations were idempotent. But now the dlist code is stricter about correctness and has exposed the bug -- it will blow up if you try to remove an already-removed list node. Fix (on affected platforms) by stashing the _current pointer in _pend_current_thread(); the stash is then checked and cleared in the timer interrupt. If we discover we're trying to interrupt a thread that's already interrupted itself, we can safely exit z_time_slice() as a no-op. The timeslicing bookkeeping was already done for us underneath the pend code. Signed-off-by: Andy Ross <[email protected]>
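The stash-and-check idea in that commit message can be sketched roughly as below. This is a minimal standalone model, not the actual scheduler code: `struct thread`, `current_thread`, `pending_current`, `requeue_count`, `pend_current_thread()` and `time_slice()` are hypothetical stand-ins for `_current`, `_pend_current_thread()` and `z_time_slice()`, and the run queue is reduced to a single `queued` flag.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a kernel thread; only the field needed here. */
struct thread {
    bool queued; /* true while the thread sits in the run queue */
};

static struct thread *current_thread;  /* models _current */
static struct thread *pending_current; /* stash written by the pend path */
static int requeue_count;              /* counts real slice-outs (illustration only) */

/* Models _pend_current_thread(): dequeue the caller and remember it.
 * With a non-atomic swap, a timer interrupt may still arrive before
 * current_thread is updated by the context switch. */
static void pend_current_thread(void)
{
    current_thread->queued = false;
    pending_current = current_thread;
}

/* Models z_time_slice() running in the timer interrupt.
 * Returns true if the current thread was actually sliced out. */
static bool time_slice(void)
{
    if (pending_current != NULL && pending_current == current_thread) {
        /* The thread already pended itself: clear the stash and bail
         * out instead of removing it from the run queue a second time. */
        pending_current = NULL;
        return false;
    }
    pending_current = NULL;

    /* Normal case: remove the thread and re-add it at the end of its priority. */
    current_thread->queued = false;
    current_thread->queued = true;
    requeue_count++;
    return true;
}
```

In this model, a `time_slice()` call that lands between `pend_current_thread()` and the context switch leaves the already-dequeued thread alone, which is the double-removal the stricter dlist code would otherwise trip over. On platforms with an atomic swap the stash would presumably be compiled out under the new Kconfig option.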
61027d2 to 845e383
Sorry, pushed the same stale version I messed up before. Remembered to adjust the commit message a bit this time too.
pabigot left a comment
Matches the patches I used when testing PR #12248 so looks good to me.
This is a workaround framework for the issue detailed in #12342, stemming directly from code submitted in #12248. It adds a Kconfig option to flag the fact that _Swap() may be interrupted before the context switch code gets a chance to update the _current pointer, refactors one existing scheduler workaround to use it, and adds a new case in timeslicing that was exposed by the dlist changes in #12248.