Skip to content

Conversation

@andyross
Copy link
Contributor

@andyross andyross commented Jan 9, 2019

This is a workaround framework for the issue detailed in #12342, stemming directly from code submitted in #12248 . It adds a kconfig to flag the fact that _Swap() may be interrupted before the context switch code gets a chance to update the _current pointer, refactors one existing scheduler workaround to use it, and adds a new case in timeslicing that was exposed by the dlist changes in #12248.

@codecov-io
Copy link

codecov-io commented Jan 9, 2019

Codecov Report

Merging #12400 into master will not change coverage.
The diff coverage is 50%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #12400   +/-   ##
=======================================
  Coverage   48.27%   48.27%           
=======================================
  Files         295      295           
  Lines       44281    44281           
  Branches    10601    10601           
=======================================
  Hits        21376    21376           
  Misses      18636    18636           
  Partials     4269     4269
Impacted Files Coverage Δ
kernel/sched.c 92.83% <50%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 67af71e...845e383. Read the comment docs.

Andy Ross added 3 commits January 9, 2019 11:42
On ARM, _Swap() isn't atomic and a hardware interrupt can land after
the (irq_locked) caller has entered _Swap() but before the context
switch actually happens.  This will require some platform-specific
workarounds in a few places in the scheduler.

This commit is just the Kconfig and selection on ARM.

Signed-off-by: Andy Ross <[email protected]>
This is a refactoring of the fix in commit 6c95daf to limit its
application to affected platforms now that the root cause is
understood.

Note that the bug that fix was addressing was rare and seen only on
after multi-hour sessions on Michael Scott's test rig.  So if
something regresses, this is where to look!

Signed-off-by: Andy Ross <[email protected]>
Timeslicing works by removing the _current thread from the run queue
and re-adding it at the end of its priority.  On systems with a
_Swap() that can be preempted by a timer interrupt, that means it's
possible for the timeslice to try to slice out a thread that had
already pended itself!

This behavior used to be benign (or at least undetectable) as the
duplicated list operations were idempotent.  But now the dlist code is
stricter about correctness and has exposed the bug -- it will blow up
if you try to remove an already-removed list node.

Fix (on affected platforms) by stashing the _current pointer in
_pend_current_thread() that is checked and cleared in the timer
interrupt.  If we discover we're trying to interrupt a thread that's
already interrupted itself, we can safely exit z_time_slice() as a
noop.  The timeslicing bookeeping was already done for us underneath
the pend code.

Signed-off-by: Andy Ross <[email protected]>
@andyross
Copy link
Contributor Author

andyross commented Jan 9, 2019

Sorry, pushed the same stale version I messed up before. Remembered to adjust the commit message a bit this time too.

Copy link
Contributor

@pabigot pabigot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Matches the patches I used when testing PR #12248 so looks good to me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants