Implement net_context locking #8674
Codecov Report
@@ Coverage Diff @@
## master #8674 +/- ##
==========================================
- Coverage 48.05% 48.03% -0.02%
==========================================
Files 281 281
Lines 43414 43477 +63
Branches 10404 10404
==========================================
+ Hits 20862 20885 +23
- Misses 18403 18434 +31
- Partials 4149 4158 +9
Continue to review full report at Codecov.
|
|
It would be interesting to get a few comments on how you arrived at this solution and which other alternatives (besides enforcing COOP priority) were considered. For example, why was the k_sched_lock() approach discounted? |
|
I'd also like to hear from @andrewboie whether this would address his concerns about the reentrancy/concurrency guarantees needed for userspace-vs-kernel syscalls (so that we avoid two layers of synchronization primitives). |
|
Finally, I'd like to hear from some kernel maintainers how mutexes and delayed work would interact. While there are many places where invalid concurrent access happens, the one I have experienced the most so far, and am thus most concerned with, is the interaction between tcp.c's delayed work handlers and the main code (e.g. handling of the un-acked packet list). |
Anas had a question in #8386 about moving to a preemptive model, in which case calling k_sched_lock() does not make much sense, as it disables preemption. Thus I started to experiment a bit with mutexes. |
It doesn't "disable preemption" in a general sense; it's effectively a rather coarse mutex. Fine-grained mutexes are "better", but they require more code to flip them back and forth, and potentially different mutexes here and there (== more RAM). I personally think that the finer-grained approach is the right one, but we need to think through the whole schedule to assess it. E.g. I'm talking about locking of the context->tcp structure, and #8131 seemingly talks about the pkt level altogether. So, looking forward to more RFC/WIP commits to this RFC/WIP PR. |
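To make the coarse-versus-fine trade-off above concrete, here is a minimal sketch. It is not Zephyr code: POSIX pthread mutexes stand in for the kernel primitives, and `struct demo_context`, `demo_global_lock`, and the `demo_*` functions are hypothetical names invented for illustration. The coarse variant serializes every context through one shared lock; the fine-grained variant embeds a mutex in each context, which costs extra RAM per context but lets unrelated contexts proceed independently.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a network context; NOT the real struct net_context. */
struct demo_context {
    pthread_mutex_t lock;   /* fine-grained: one mutex per context (extra RAM) */
    int unacked_pkts;       /* example of shared state that needs protection */
};

/* Coarse-grained: a single lock shared by every context. */
static pthread_mutex_t demo_global_lock = PTHREAD_MUTEX_INITIALIZER;

void demo_context_init(struct demo_context *ctx)
{
    pthread_mutex_init(&ctx->lock, NULL);
    ctx->unacked_pkts = 0;
}

/* Coarse variant: all contexts contend on the same lock. */
int demo_queue_coarse(struct demo_context *ctx)
{
    int n;

    pthread_mutex_lock(&demo_global_lock);
    n = ++ctx->unacked_pkts;
    pthread_mutex_unlock(&demo_global_lock);

    return n;
}

/* Fine-grained variant: only users of the same context contend. */
int demo_queue_fine(struct demo_context *ctx)
{
    int n;

    pthread_mutex_lock(&ctx->lock);
    n = ++ctx->unacked_pkts;
    pthread_mutex_unlock(&ctx->lock);

    return n;
}
```

A real design would pick one of the two schemes per resource; mixing them on the same field, as the two functions above would if called concurrently, does not give mutual exclusion.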
|
I'm assuming this works as expected (i.e. doesn't explode) if a thread calls net_context_put() while another net operation is in progress? |
We should not be locking the scheduler. |
pfl
left a comment
Locks and unlocks seem to be in balance here. Whether these changes make the stack much quicker remains to be seen.
subsys/net/ip/net_context.c
Outdated
Could we follow net_context_bind() below here, and unlock and return 0? Then at least these two functions would be consistent.
I would rather keep the out: label here so that the unlock is called at the end of the function. Note that the out: label is also jumped to from the offloading case a few lines above.
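The single-exit pattern described in this reply can be sketched as follows. This is an illustration only, not the code under review: pthread mutexes stand in for the kernel mutex, and `struct sketch_ctx`, `sketch_bind()`, and the chosen error code are hypothetical. Every path, including the early "offloading" exit, leaves through the same label, so the lock is released in exactly one place.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical context type used only for this sketch. */
struct sketch_ctx {
    pthread_mutex_t lock;
    bool offloaded;   /* pretend an offload driver handles this context */
    int port;         /* 0 means "not bound yet" */
};

int sketch_bind(struct sketch_ctx *ctx, int port)
{
    int ret = 0;

    pthread_mutex_lock(&ctx->lock);

    if (ctx->offloaded) {
        /* Offloading case: nothing more to do here, leave via the
         * common exit so the unlock below still runs.
         */
        goto out;
    }

    if (ctx->port != 0) {
        /* Already bound: report an error, again via the common exit. */
        ret = -EALREADY;
        goto out;
    }

    ctx->port = port;

out:
    /* The only unlock in the function; every path ends up here. */
    pthread_mutex_unlock(&ctx->lock);
    return ret;
}
```

Whether the label is called `out` or `unlock` does not change the behavior; the point is that no return statement bypasses the unlock.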
subsys/net/ip/net_context.c
Outdated
Ok, here a goto out is used, this is as good as the above. 'out' could be renamed 'unlock', as that is what it does (bikeshedding...).
As the "out:" label already existed in that function, I did not want to rename it just for this patch.
subsys/net/ip/tcp.c
Outdated
A goto unlock would describe the action better also here.
Same here: the "out:" label existed before this commit, so renaming it seemed a bit pointless.
Actually my comment was wrong, the "out:" label is new. The unlock label is actually more descriptive so I will change it.
Actually, I'd say the "out" name was better: it's the difference between interface and implementation. The high-level action is "out"; its current implementation is unlocking, but maybe later it'll update some stats counters, or do something else completely.
Are you now referring to this PR, saying that there should be more locks and unlocks, or are you making a general comment (with which I agree)? |
Note that the purpose was not to make the stack quicker (the locks might actually make it slower), but to protect the core net stack when it is accessed from different threads that run in different "modes" (preemptive vs. cooperative). |
|
Updated the code according to comments. |
db996d3 to d3e9a03
I'm referring to the problem in general, and suggest (in #8674 (comment)) that we should consider different strategies, their pros and cons, and then choose (seemingly) the best (or easiest to start with, while still adequate), while keeping some "plan B" in mind. |
|
is this PR going anywhere? still valid? if not, please close. |
|
oops |
This is experimental stuff, not sure yet whether this would be merged or not. |
If the net_context functions are accessed from preemptive priority, then we need to protect various internal resources. Signed-off-by: Jukka Rissanen <[email protected]>
There was a false timeout error because we did not check the return value correctly. This issue is seen now because code flow in core IP stack is happening in different order than before. Signed-off-by: Jukka Rissanen <[email protected]>
|
fixed the merge conflict |
|
I still think that this adds a bunch of bloat, and god knows whether it's really needed or not. In the name of going forward, let's go with it, but what about addressing my concern by defining separate ops like CONTEXT_LOCK()/CONTEXT_UNLOCK(), so we can define them to nothing when not needed? |
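The macro idea suggested here could look roughly like the sketch below. Only the names CONTEXT_LOCK() and CONTEXT_UNLOCK() come from the comment; the build switch `DEMO_CONTEXT_SYNC`, the helper `CONTEXT_LOCK_INIT()`, `struct macro_ctx`, and `macro_ctx_ref()` are assumptions made for illustration, and pthread mutexes again stand in for the kernel mutex. When the switch is off, both the mutex field and the lock calls disappear from the build.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical build switch; defaults to "on" so the sketch exercises locking. */
#ifndef DEMO_CONTEXT_SYNC
#define DEMO_CONTEXT_SYNC 1
#endif

/* Hypothetical context type used only for this sketch. */
struct macro_ctx {
#if DEMO_CONTEXT_SYNC
    pthread_mutex_t lock;   /* only costs RAM when locking is enabled */
#endif
    int refcount;
};

#if DEMO_CONTEXT_SYNC
#define CONTEXT_LOCK_INIT(ctx) pthread_mutex_init(&(ctx)->lock, NULL)
#define CONTEXT_LOCK(ctx)      pthread_mutex_lock(&(ctx)->lock)
#define CONTEXT_UNLOCK(ctx)    pthread_mutex_unlock(&(ctx)->lock)
#else
/* Locking compiled out: the macros expand to no-ops. */
#define CONTEXT_LOCK_INIT(ctx) ((void)(ctx))
#define CONTEXT_LOCK(ctx)      ((void)(ctx))
#define CONTEXT_UNLOCK(ctx)    ((void)(ctx))
#endif

/* Callers are written once and work with either configuration. */
int macro_ctx_ref(struct macro_ctx *ctx)
{
    int n;

    CONTEXT_LOCK(ctx);
    n = ++ctx->refcount;
    CONTEXT_UNLOCK(ctx);

    return n;
}
```

The trade-off is one more layer of indirection in the source in exchange for zero code and RAM overhead in builds that never access a context from more than one thread.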
tbursztyka
left a comment
PR #11775 was triggering the described bug all the time on Maxwell's TCP calibration step.
Once this PR was applied, it worked. So lgtm.
|
@dgloeck: Perhaps you could have a look at this too. |
Good to know you saw a case where this PR definitely helps. With my similar PR, #9819, I was never able to see its effect, apparently because other issues in the stack prevailed and my testing failed before I could reach the cases where it helps. We now have more people actively looking into the stack, so I guess it would be a good idea to get acks from them too regarding this PR. Otherwise, let's aim to merge it before the code freeze on Fri. |
This is an initial attempt to allow re-entrant access to a given net_context struct. It is still a work in progress and needs more testing. See also the related issues #8131 and #8386.