extend dlist to detect unlinked nodes and avoid double-remove bugs #12248

pabigot · 2018-12-30T16:50:27Z

This PR marks sys_dnode_t instances that are not in a list so that code does not depend on overloaded external state to assess this condition before potentially violating dlist API requirements.

A second commit uses that API to avoid a double-remove in timeout that could result in a corrupted list of timers, as an alternative solution to a commit in #12204 (for full context see this discussion if that doesn't disappear when the commit goes away.)

A third commit works around an unknown double-remove bug in the kernel poll infrastructure. Why the events are being removed multiple times should probably be investigated by a kernel maintainer.

kernel/timeout.c

andyross

Generally in favor, though it would be good to get a feel for how many list operations are being affected and what the performance and code size impact (sys_dlist_remove is generally inlined) is likely to be.

But yeah, we need to fix and not work around that spot in timer that's failing.

andyross · 2018-12-30T17:24:00Z

kernel/poll.c

This is a good catch, and it's surprising that it didn't cause failures already (I guess the existing tests would all exercise the double remove only in circumstances where the list hadn't otherwise changed and the action would be idempotent). I wonder idly if it's not a better idea to put that test into sys_dlist_remove() as long as we're eating the extra cycles...

sys_dlist_remove already documents that the node must be part of the list, and in most situations that's true (and known at the call-site). So I don't think there's reason to add a check in that function.

kernel/timeout.c

pabigot · 2019-01-03T18:41:17Z

Rebased on top of merged #12204, added commit to remove use of _INACTIVE as a flag value to detect timeouts that are not active.

andyross

One spot I think is wrong, a few other suggestions.

And I do still think we should pull up some size comparisons on stuff like tests/benchmarks/footprint to see how much this affects code size. I'd guess that this is a worthwhile tradeoff too, but it'd be nice to prove that.

andyross · 2019-01-03T19:16:21Z

kernel/sched.c

Nitpick on behalf of @ceolin : MISRA rules disallow implicit conversion to bool

But I think this is deeper. The old code would store the "was linked" state implicit in the timeout, but that's no longer ture, right? You don't know that a zero return here means "inactive", I think? Might need to be reworked to do a is_linked call first and then use that as the predicate for the if()?

Actually a quick grep shows that this one of two places actually using the return value of _abort_(thread_?)timeout(). Maybe the simpler thing to do would be to just remove that return value and/or swap it to a boolean that indicates "was active".

The old code returned zero on successful abort, and _INACTIVE on failure. Returning -EINVAL is more consistent with indicating that the abort failed. I really think this is doing the right thing.

I'll change the test to detect a negative result, though, instead of implicit bool although there's an awful lot of implicit bool in Zephyr.

andyross · 2019-01-03T19:20:04Z

kernel/include/timeout_q.h

Not particularly relevant, but this was always a bad API (a true return indicates a false state) that never got fixed up during all the refactoring. I wouldn't cry to see this patch change it to z_timeout_active() or whatever.

This was partly because of the legacy of _EXPIRED: timeouts could be in three states, not two. This function tests that the state is exactly what _INACTIVE indicated.

A true result indicates that the predicate identified by the function name is true. Why is that a bad API? We use sys_dlist_is_empty(), not sys_dlist_is_nonempty().

The whole timeout API uses prefix underscore. While I'd rather that wasn't the case (violates C reserved identifiers for the non-static functions) I wouldn't want to change just one function, nor change all of them, in this PR.

A true result indicates that the predicate identified by the function name is true

I don't think that not incorrect, exactly, but it's not innapropriate to argue against the ellision of unommitted negatives either.

No seriously, double negatives in API choices are bad, especially in predicates. The base adjective for that state is "active". True should mean "active", false should mean "inactive".

andyross · 2019-01-03T19:21:31Z

kernel/timeout.c

As mentioned elsewhere, probably best to just eliminate the return value here or make it a boolean to indicate "was active", as that's all existing callers care about.

The return indicates the attempt to abort the timeout failed because the timeout wasn't active (valid for abort). I believe this is the preferred way to indicate the result of the operation.

pabigot · 2019-01-03T20:26:11Z

@andyross Thanks for the review; I've responded to some of the comments explaining the rationale, and made the one change for implicit bool. Let me know if I'm convincing about keeping the existing API.

BTW: With this patch set mps2_an385 intermittently fails tests/kernel/mem_slab/mslab_threadsafe/kernel.memory_slab due to an attempt to remove an already-removed dnode. I'm still looking for that.

andrewboie · 2019-01-03T20:33:52Z

Good stuff here, although I share Andy's concerns about performance impact since dlists tend to be pretty hot code paths, perhaps consider ifdefing the extra checking around CONFIG_ASSERT or something like that?

pabigot · 2019-01-03T21:08:23Z

There are two concerns: impact on code size from the added null assignments, and impact on performance of the null assignments (and, rarely, checks; most of the time the linked state does not need to be checked).

I can run comparisons for code size, if somebody tells me what platform(s) and what test application(s) to use. What are the criteria for acceptable/unacceptable impact of the change?

Performance is a different issue. First, I'd need instruction on how that would be evaluated in a meaningful way.

Second, the changes here have two effects: bugs where lists would be corrupted by double removes are converted from undetected corruption to faults at point-of-use; and a way to reliably identify nodes that are not in a list, as is needed with timers. The first is a quality issue, and we could certainly turn it off in some cases. The second is a functional capability, which would break if it were optional.

Note that the Linux dlist implementation invalidates the pointers when nodes are removed. They seem to believe any performance impact is negligible relative to the value of detecting failures.

What do you want me to do?

andrewboie · 2019-01-03T21:48:51Z

Note that the Linux dlist implementation invalidates the pointers when nodes are removed. They seem to believe any performance impact is negligible relative to the value of detecting failures.

FWIW, Linux isn't targeting microcontrollers with memory on the order of 64K (or less)

way to reliably identify nodes that are not in a list, as is needed with timers.

Fair enough!

andyross

Removing -1, as I'm convinced the behavior looks correct. I'd still prefer to see _abort_timeout()'s signature matched to the needs to the needs of its actual callers instead of trying to return an error code (it's an internal function, not an API for users outside of the kernel -- hell, even within the kernel it's used only at the lowest level), but that can be fixed whenever.

pabigot · 2019-01-03T21:56:42Z

@andrewboie @andyross Thanks for the approvals. Please take the following into account before merging:

The intermittent failure of mps2_an385 on tests/kernel/mem_slab/mslab_threadsafe/kernel.memory_slabs is due to MPU or BUS errors derefencing invalidated node link pointers during a call to _priq_dumb_remove.

By instrumenting the code I can tell that thread->base.qnode_dlist (which is actually used as dnode) has non-null next and prev pointers prior to invoking sys_dlist_remove but at least one null next or prev pointer within sys_dlist_remove. (Neither is a pointer-to-self, either.)

This appears to be a bug identified by this PR, not one introduced by this PR, although the cause is unclear. Unlike the poll bug no workaround seems to be effective. I don't know how to debug this further.

andyross

Oh, wait. Sorry, I missed that warning earlier. Yeah, we for sure need to get any introduced failures addressed before this merges. If you want me to look at it, can you put together a reproducer and file it as a bug?

pabigot · 2019-01-03T22:13:27Z

Your approvals came in while I was doing final edits on the warning; glad you caught it.

https://github.com/pabigot/zephyr/commits/diag/20181230a has a diagnostic commit on top of this PR. The commit message describes how to reproduce the problem. (The 10563 error code indicates the failing call came from line 563 of the instrumented sched.c under branch condition 10000.)

Thanks for looking at this.

andyross · 2019-01-04T01:29:54Z

Spent a bit on this. Turns out you can make it happen just by putting the test suite in a while(1), which is helpful. It happens less if you disable memory protection (not just userspace, interestingly) but doesn't disappear.

Some work wrapping functions with macros so I can use LINE to get the caller (i.e. the poor man's stack trace) showed that the specific failure you're seeing is in the timeslicing code that is trying to move the active thread to the end of the run queue (which explains why this is the test that fails: it's somewhat inexplicably configured to timeslice at 1kHz!). But _current has nulls in the list entries when it SHOULD be in the run queue! Backing out from that to the very top of the timer ISR and I see the same thing, timeslice expiration or no (which makes the bug much easier to hit).

So something is either interrupting a spot where the list is in flux (which shouldn't happen, and AFAICT isn't) or something is scribbling on the list node, which just happens to live at the very beginning of the thread object... So... I added some dummy buffer fields in the thread struct in front of the list node to try to catch the overrun, and the system doesn't boot.

So at this point I'm reasonably certain there's something in the ARM code that isn't properly respecting the thread layout and writing over its first 8 bytes with nulls. Probably would be easy enough to find with a debugger and some scripted watchpoints, but I'm useless with tooling. Will play more tomorrow.

pabigot · 2019-01-04T06:09:06Z

@andyross Thanks for looking at this. Your description provided some helpful clues.

I was able to replicate the observation of _current having null links in the context of the timer ISR. In my case it occurred because the main thread had exited (removing itself from all run queues) but was still recorded as the value of _current. The tick ISR was still active and when it fired tried moved _current to the back of the queue, which caused havoc since it wasn't in a queue at that point. There might be a better fix, assuming this situation can ever occur in an unhacked kernel; nonetheless I resolved it by changing sliceable to ignore unrunnable threads.

After that red herring was properly tinned I finally found the real problem: mem_slab retains a queue of threads that want to allocate from it that should be scheduled on memory release, but the operation to pull threads from a wait queue didn't handle the case where the queue was empty, at which point it queued garbage. That's also been fixed.

Finally, I corrected the name of the link field in _thread_base so we no longer appear to be appending lists to lists.

The problematic test runs reliably now.

(Actually, as I think about it, the original CONTAINER_OF(NULL, struct, field) from an empty queue would have returned a null pointer and shouldn't have caused problems without the addition of padding to tje start of _thread_base, which instead made it return 0xfffffffc. So the test failure might really have been related to the first problem of switching to a thread that was blocked on the slab wait queue, if those threads were detected as sliceable. If so, maybe the behavior of _pend_current_thread needs to be examined.

Whatever, it works now)

pabigot · 2019-01-04T11:49:34Z

My last patch set solved the problem, by breaking timeslicing. The proper explanation and solution is in this commit:

kernel: sched: inhibit timeslice from unqueued thread

If timeslicing is enabled z_clock_announce will attempt to swap the
current thread for another. The implementation of z_time_slice
assumes that _current is always in the queue, and so moves it, which
relies on valid links in the dnode structure.

When the current thread is in the process of pending itself the thread
may have already been removed from the queue at the point the tick ISR
runs. This violates the requirement in _move_thread_to_end_of_prio_q
that the thread already be in the queue.

If the current thread is not queued treat it as unsliceable: it's in the
process of being swapped anyway.

I'm leaving this one in place for now:

kernel: sched: avoid potential null pointer dereference in wait queue

The existing code has a too-subtle reliance on the sys_dnode_t instance
being located at the start of its container: this is the only situation
where CONTAINER_OF(NULL, struct, tag) will be guaranteed to produce a
null pointer.

When this required layout is violated, e.g. during diagnostics, the
system fails horribly: for example an empty wait queue will result in an
attempt to operate on a thread that does not exist.

Update all _priq_foo_best implementations to explicitly check for a null
pointer rather than relying on the dnode being first in the thread
structure.

because the original code was way too subtle, wasting time for both @andyross and me. However, the checks cost 12 bytes to add under SDK 0.9.5, and 16 bytes under gcc 8.2.1, so you can tell me to remove it.

carlescufi · 2019-01-15T12:08:18Z

@pabigot please rebase

carlescufi · 2019-01-15T12:09:17Z

@andyross if you are happy with this please approve so we can move this forward

pabigot · 2019-01-15T12:23:40Z

@andrewboie @andyross in the churn of this PR the following additional commit got added after the original review process. If this is OK I'd like to leave it, if not I can pull it out and resubmit it in another PR that's blocked by the dlist features.

kernel: timeout: detect inactive timeouts using dnode linked state

Whether a timeout is linked into the timeout queue can be determined
from the corresponding sys_dnode_t linked state. This removes the need
to use a special flag value in dticks to determine that the timeout is
inactive.

Update _abort_timeout to return an error code, rather than the flag
value, when the timeout to be aborted was not active.

Remove the _INACTIVE flag value, and replace its external uses with an
internal API function that checks whether a timeout is inactive.

pabigot · 2019-01-15T12:33:57Z

@andrewboie @andyross @carlescufi Actually, hold on: to make things easier I'm going to close this one, and submit the one that has the combined dlist and timer updates for #12332 together.

pabigot · 2019-01-15T18:58:42Z

Reopening since Andy doesn't like the part of #12485 that motivated putting it in a separate PR. So this is back to being the original content, plus the removal of the magic _INACTIVE flag value.

Although sys_dnode_t and sys_dlist_t are aliases, their roles are different and they appear in different positions in dlist API calls. Signed-off-by: Peter A. Bigot <[email protected]>

CONTAINER_OF() on a NULL pointer returns some offset around NULL and not another NULL pointer. We have to check for that ourselves. This only worked because the dnode happened to be at the start of the struct. Signed-off-by: Peter A. Bigot <[email protected]>

The original implementation allows a list to be corrupted by list operations on the removed node. Existing code attempts to avoid this by using external state to determine whether a node is in a list, but this is fragile and fails when the state that holds the flag value is changed after the node is removed, e.g. in preparation for re-using the node. Follow Linux in invalidating the link pointers in a removed node. Add API so that detection of particpation in a list is available at the node abstraction. This solution relies on the following steady-state invariants: * A node (as opposed to a list) will never be adjacent to itself in a list; * The next and prev pointers of a node are always either both null or both non-null. Signed-off-by: Peter A. Bigot <[email protected]>

Use the new generic capability to detect unlinked sys_dnode_t instances. Signed-off-by: Peter A. Bigot <[email protected]>

k_poll events are registered in a linked list when their signal condition has been met. The code to clear event registration did not account for events that were not registered, resulting in double-removes that produced core dumps on native-posix sanitycheck. Signed-off-by: Peter A. Bigot <[email protected]>

Whether a timeout is linked into the timeout queue can be determined from the corresponding sys_dnode_t linked state. This removes the need to use a special flag value in dticks to determine that the timeout is inactive. Update _abort_timeout to return an error code, rather than the flag value, when the timeout to be aborted was not active. Remove the _INACTIVE flag value, and replace its external uses with an internal API function that checks whether a timeout is inactive. Signed-off-by: Peter A. Bigot <[email protected]>

pabigot · 2019-01-15T19:12:42Z

@andyross Reverted the change of the structure tag per #12485 (comment)

pabigot · 2019-01-18T11:20:38Z

@andyross this is back to the original submitted content, plus the removal of the inactive magic value and restoring the qnode_dlist spelling. Is it OK now?

carlescufi · 2019-01-22T18:46:58Z

@andyross can you please review the last commit in this series?

carlescufi · 2019-01-23T19:46:58Z

Thanks @andyross

pabigot requested review from andrewboie and andyross as code owners December 30, 2018 16:50

pabigot commented Dec 30, 2018

View reviewed changes

kernel/timeout.c Outdated Show resolved Hide resolved

andyross reviewed Dec 30, 2018

View reviewed changes

pabigot force-pushed the pr/20181230a branch 2 times, most recently from 5303d22 to 4833ddf Compare January 2, 2019 19:38

pabigot requested review from nniranjhana, pswarnak and spoorthik as code owners January 2, 2019 19:38

pabigot mentioned this pull request Jan 3, 2019

tests/kernel/timer/timer_api fails on nrf52_pca10040 board #11722

Closed

pabigot force-pushed the pr/20181230a branch from 4833ddf to 29ffd80 Compare January 3, 2019 18:40

andyross requested changes Jan 3, 2019

View reviewed changes

pabigot force-pushed the pr/20181230a branch 2 times, most recently from ef85219 to 7719457 Compare January 3, 2019 20:19

andrewboie approved these changes Jan 3, 2019

View reviewed changes

andyross approved these changes Jan 3, 2019

View reviewed changes

andyross requested changes Jan 3, 2019

View reviewed changes

pabigot force-pushed the pr/20181230a branch from 7719457 to 1cf6512 Compare January 4, 2019 05:53

pabigot force-pushed the pr/20181230a branch from 1cf6512 to 3fa973c Compare January 4, 2019 11:42

pabigot force-pushed the pr/20181230a branch from 3fa973c to 3f0724d Compare January 4, 2019 13:02

pizi-nordic self-requested a review January 11, 2019 11:48

andyross mentioned this pull request Jan 15, 2019

Add ARM "nonatomic swap" framework for workarounds, fix new case in timeslicing #12400

Merged

pabigot mentioned this pull request Jan 15, 2019

Undesirable behaviour of z_clock_announce() when ticks argument is greater by 1 #12332

Open

pabigot force-pushed the pr/20181230a branch from c68a2ee to fb0d742 Compare January 15, 2019 12:17

pabigot removed the DNM This PR should not be merged (Do Not Merge) label Jan 15, 2019

pabigot changed the title ~~[DNM] extend dlist to detect unlinked nodes and avoid double-remove bugs~~ extend dlist to detect unlinked nodes and avoid double-remove bugs Jan 15, 2019

pabigot closed this Jan 15, 2019

pabigot mentioned this pull request Jan 15, 2019

dlist API enhancements supporting timer cleanup and scheduling fix #12485

Closed

pabigot reopened this Jan 15, 2019

pabigot added 6 commits January 15, 2019 13:09

kernel: thread: correct type of dlist node

ee84022

Although sys_dnode_t and sys_dlist_t are aliases, their roles are different and they appear in different positions in dlist API calls. Signed-off-by: Peter A. Bigot <[email protected]>

kernel: timeout: remove local fix for double-remove

ba1f4a2

Use the new generic capability to detect unlinked sys_dnode_t instances. Signed-off-by: Peter A. Bigot <[email protected]>

pabigot force-pushed the pr/20181230a branch from fb0d742 to f364e2a Compare January 15, 2019 19:10

carlescufi added the dev-review To be discussed in dev-review meeting label Jan 22, 2019

andyross approved these changes Jan 23, 2019

View reviewed changes

carlescufi merged commit b4ece0a into zephyrproject-rtos:master Jan 23, 2019

pabigot deleted the pr/20181230a branch January 23, 2019 21:04

pabigot removed the dev-review To be discussed in dev-review meeting label Jan 24, 2019

extend dlist to detect unlinked nodes and avoid double-remove bugs #12248

extend dlist to detect unlinked nodes and avoid double-remove bugs #12248

Uh oh!

Conversation

pabigot commented Dec 30, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

andyross left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

pabigot commented Jan 3, 2019

Uh oh!

andyross left a comment

Choose a reason for hiding this comment

Uh oh!

andyross Jan 3, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

pabigot commented Jan 3, 2019

Uh oh!

andrewboie commented Jan 3, 2019

Uh oh!

pabigot commented Jan 3, 2019

Uh oh!

andrewboie commented Jan 3, 2019

Uh oh!

andyross left a comment

Choose a reason for hiding this comment

Uh oh!

pabigot commented Jan 3, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

andyross left a comment

Choose a reason for hiding this comment

Uh oh!

pabigot commented Jan 3, 2019

Uh oh!

andyross commented Jan 4, 2019

Uh oh!

pabigot commented Jan 4, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pabigot commented Jan 4, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

carlescufi commented Jan 15, 2019

Uh oh!

carlescufi commented Jan 15, 2019

Uh oh!

pabigot commented Jan 15, 2019

kernel: timeout: detect inactive timeouts using dnode linked state

Uh oh!

pabigot commented Jan 15, 2019

Uh oh!

pabigot commented Jan 15, 2019

Uh oh!

pabigot commented Jan 15, 2019

Uh oh!

pabigot commented Jan 18, 2019

Uh oh!

carlescufi commented Jan 22, 2019

pabigot commented Dec 30, 2018 •

edited

Loading

andyross Jan 3, 2019 •

edited

Loading

pabigot commented Jan 3, 2019 •

edited

Loading

pabigot commented Jan 4, 2019 •

edited

Loading

pabigot commented Jan 4, 2019 •

edited

Loading