-
Notifications
You must be signed in to change notification settings - Fork 8.2k
extend dlist to detect unlinked nodes and avoid double-remove bugs #12248
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
andyross
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally in favor, though it would be good to get a feel for how many list operations are being affected and what the performance and code size impact (sys_dlist_remove is generally inlined) is likely to be.
But yeah, we need to fix and not work around that spot in timer that's failing.
kernel/poll.c
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good catch, and it's surprising that it didn't cause failures already (I guess the existing tests would all exercise the double remove only in circumstances where the list hadn't otherwise changed and the action would be idempotent). I wonder idly if it's not a better idea to put that test into sys_dlist_remove() as long as we're eating the extra cycles...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sys_dlist_remove already documents that the node must be part of the list, and in most situations that's true (and known at the call-site). So I don't think there's reason to add a check in that function.
5303d22 to
4833ddf
Compare
|
Rebased on top of merged #12204, added commit to remove use of |
andyross
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One spot I think is wrong, a few other suggestions.
And I do still think we should pull up some size comparisons on stuff like tests/benchmarks/footprint to see how much this affects code size. I'd guess that this is a worthwhile tradeoff too, but it'd be nice to prove that.
kernel/sched.c
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick on behalf of @ceolin : MISRA rules disallow implicit conversion to bool
But I think this is deeper. The old code would store the "was linked" state implicit in the timeout, but that's no longer ture, right? You don't know that a zero return here means "inactive", I think? Might need to be reworked to do a is_linked call first and then use that as the predicate for the if()?
Actually a quick grep shows that this one of two places actually using the return value of _abort_(thread_?)timeout(). Maybe the simpler thing to do would be to just remove that return value and/or swap it to a boolean that indicates "was active".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The old code returned zero on successful abort, and _INACTIVE on failure. Returning -EINVAL is more consistent with indicating that the abort failed. I really think this is doing the right thing.
I'll change the test to detect a negative result, though, instead of implicit bool although there's an awful lot of implicit bool in Zephyr.
kernel/include/timeout_q.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not particularly relevant, but this was always a bad API (a true return indicates a false state) that never got fixed up during all the refactoring. I wouldn't cry to see this patch change it to z_timeout_active() or whatever.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was partly because of the legacy of _EXPIRED: timeouts could be in three states, not two. This function tests that the state is exactly what _INACTIVE indicated.
A true result indicates that the predicate identified by the function name is true. Why is that a bad API? We use sys_dlist_is_empty(), not sys_dlist_is_nonempty().
The whole timeout API uses prefix underscore. While I'd rather that wasn't the case (violates C reserved identifiers for the non-static functions) I wouldn't want to change just one function, nor change all of them, in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A true result indicates that the predicate identified by the function name is true
I don't think that not incorrect, exactly, but it's not innapropriate to argue against the ellision of unommitted negatives either.
No seriously, double negatives in API choices are bad, especially in predicates. The base adjective for that state is "active". True should mean "active", false should mean "inactive".
kernel/timeout.c
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned elsewhere, probably best to just eliminate the return value here or make it a boolean to indicate "was active", as that's all existing callers care about.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The return indicates the attempt to abort the timeout failed because the timeout wasn't active (valid for abort). I believe this is the preferred way to indicate the result of the operation.
ef85219 to
7719457
Compare
|
@andyross Thanks for the review; I've responded to some of the comments explaining the rationale, and made the one change for implicit bool. Let me know if I'm convincing about keeping the existing API. BTW: With this patch set mps2_an385 intermittently fails |
|
Good stuff here, although I share Andy's concerns about performance impact since dlists tend to be pretty hot code paths, perhaps consider ifdefing the extra checking around CONFIG_ASSERT or something like that? |
|
There are two concerns: impact on code size from the added null assignments, and impact on performance of the null assignments (and, rarely, checks; most of the time the linked state does not need to be checked). I can run comparisons for code size, if somebody tells me what platform(s) and what test application(s) to use. What are the criteria for acceptable/unacceptable impact of the change? Performance is a different issue. First, I'd need instruction on how that would be evaluated in a meaningful way. Second, the changes here have two effects: bugs where lists would be corrupted by double removes are converted from undetected corruption to faults at point-of-use; and a way to reliably identify nodes that are not in a list, as is needed with timers. The first is a quality issue, and we could certainly turn it off in some cases. The second is a functional capability, which would break if it were optional. Note that the Linux dlist implementation invalidates the pointers when nodes are removed. They seem to believe any performance impact is negligible relative to the value of detecting failures. What do you want me to do? |
FWIW, Linux isn't targeting microcontrollers with memory on the order of 64K (or less)
Fair enough! |
andyross
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removing -1, as I'm convinced the behavior looks correct. I'd still prefer to see _abort_timeout()'s signature matched to the needs to the needs of its actual callers instead of trying to return an error code (it's an internal function, not an API for users outside of the kernel -- hell, even within the kernel it's used only at the lowest level), but that can be fixed whenever.
|
@andrewboie @andyross Thanks for the approvals. Please take the following into account before merging: The intermittent failure of mps2_an385 on By instrumenting the code I can tell that This appears to be a bug identified by this PR, not one introduced by this PR, although the cause is unclear. Unlike the poll bug no workaround seems to be effective. I don't know how to debug this further. |
andyross
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, wait. Sorry, I missed that warning earlier. Yeah, we for sure need to get any introduced failures addressed before this merges. If you want me to look at it, can you put together a reproducer and file it as a bug?
|
Your approvals came in while I was doing final edits on the warning; glad you caught it. https://github.com/pabigot/zephyr/commits/diag/20181230a has a diagnostic commit on top of this PR. The commit message describes how to reproduce the problem. (The Thanks for looking at this. |
|
Spent a bit on this. Turns out you can make it happen just by putting the test suite in a while(1), which is helpful. It happens less if you disable memory protection (not just userspace, interestingly) but doesn't disappear. Some work wrapping functions with macros so I can use LINE to get the caller (i.e. the poor man's stack trace) showed that the specific failure you're seeing is in the timeslicing code that is trying to move the active thread to the end of the run queue (which explains why this is the test that fails: it's somewhat inexplicably configured to timeslice at 1kHz!). But _current has nulls in the list entries when it SHOULD be in the run queue! Backing out from that to the very top of the timer ISR and I see the same thing, timeslice expiration or no (which makes the bug much easier to hit). So something is either interrupting a spot where the list is in flux (which shouldn't happen, and AFAICT isn't) or something is scribbling on the list node, which just happens to live at the very beginning of the thread object... So... I added some dummy buffer fields in the thread struct in front of the list node to try to catch the overrun, and the system doesn't boot. So at this point I'm reasonably certain there's something in the ARM code that isn't properly respecting the thread layout and writing over its first 8 bytes with nulls. Probably would be easy enough to find with a debugger and some scripted watchpoints, but I'm useless with tooling. Will play more tomorrow. |
|
@andyross Thanks for looking at this. Your description provided some helpful clues. I was able to replicate the observation of After that red herring was properly tinned I finally found the real problem: mem_slab retains a queue of threads that want to allocate from it that should be scheduled on memory release, but the operation to pull threads from a wait queue didn't handle the case where the queue was empty, at which point it queued garbage. That's also been fixed. Finally, I corrected the name of the link field in The problematic test runs reliably now. (Actually, as I think about it, the original Whatever, it works now) |
|
My last patch set solved the problem, by breaking timeslicing. The proper explanation and solution is in this commit:
I'm leaving this one in place for now:
because the original code was way too subtle, wasting time for both @andyross and me. However, the checks cost 12 bytes to add under SDK 0.9.5, and 16 bytes under gcc 8.2.1, so you can tell me to remove it. |
|
@pabigot please rebase |
|
@andyross if you are happy with this please approve so we can move this forward |
|
@andrewboie @andyross in the churn of this PR the following additional commit got added after the original review process. If this is OK I'd like to leave it, if not I can pull it out and resubmit it in another PR that's blocked by the dlist features. kernel: timeout: detect inactive timeouts using dnode linked stateWhether a timeout is linked into the timeout queue can be determined Update _abort_timeout to return an error code, rather than the flag Remove the _INACTIVE flag value, and replace its external uses with an |
|
@andrewboie @andyross @carlescufi Actually, hold on: to make things easier I'm going to close this one, and submit the one that has the combined dlist and timer updates for #12332 together. |
|
Reopening since Andy doesn't like the part of #12485 that motivated putting it in a separate PR. So this is back to being the original content, plus the removal of the magic |
Although sys_dnode_t and sys_dlist_t are aliases, their roles are different and they appear in different positions in dlist API calls. Signed-off-by: Peter A. Bigot <[email protected]>
CONTAINER_OF() on a NULL pointer returns some offset around NULL and not another NULL pointer. We have to check for that ourselves. This only worked because the dnode happened to be at the start of the struct. Signed-off-by: Peter A. Bigot <[email protected]>
The original implementation allows a list to be corrupted by list operations on the removed node. Existing code attempts to avoid this by using external state to determine whether a node is in a list, but this is fragile and fails when the state that holds the flag value is changed after the node is removed, e.g. in preparation for re-using the node. Follow Linux in invalidating the link pointers in a removed node. Add API so that detection of particpation in a list is available at the node abstraction. This solution relies on the following steady-state invariants: * A node (as opposed to a list) will never be adjacent to itself in a list; * The next and prev pointers of a node are always either both null or both non-null. Signed-off-by: Peter A. Bigot <[email protected]>
Use the new generic capability to detect unlinked sys_dnode_t instances. Signed-off-by: Peter A. Bigot <[email protected]>
k_poll events are registered in a linked list when their signal condition has been met. The code to clear event registration did not account for events that were not registered, resulting in double-removes that produced core dumps on native-posix sanitycheck. Signed-off-by: Peter A. Bigot <[email protected]>
Whether a timeout is linked into the timeout queue can be determined from the corresponding sys_dnode_t linked state. This removes the need to use a special flag value in dticks to determine that the timeout is inactive. Update _abort_timeout to return an error code, rather than the flag value, when the timeout to be aborted was not active. Remove the _INACTIVE flag value, and replace its external uses with an internal API function that checks whether a timeout is inactive. Signed-off-by: Peter A. Bigot <[email protected]>
|
@andyross Reverted the change of the structure tag per #12485 (comment) |
|
@andyross this is back to the original submitted content, plus the removal of the inactive magic value and restoring the |
|
@andyross can you please review the last commit in this series? |
|
Thanks @andyross |
This PR marks
sys_dnode_tinstances that are not in a list so that code does not depend on overloaded external state to assess this condition before potentially violating dlist API requirements.A second commit uses that API to avoid a double-remove in timeout that could result in a corrupted list of timers, as an alternative solution to a commit in #12204 (for full context see this discussion if that doesn't disappear when the commit goes away.)
A third commit works around an unknown double-remove bug in the kernel poll infrastructure. Why the events are being removed multiple times should probably be investigated by a kernel maintainer.