
Conversation

@jukkar
Member

@jukkar jukkar commented Jul 2, 2018

This is an initial attempt to allow re-entrant access to a given net_context struct. This is still work in progress and needs more testing. See also the related issues #8131 and #8386.

@jukkar jukkar added area: Networking DNM This PR should not be merged (Do Not Merge) labels Jul 2, 2018
@jukkar jukkar requested review from pfalcon and tbursztyka as code owners July 2, 2018 10:18
@codecov-io

codecov-io commented Jul 2, 2018

Codecov Report

Merging #8674 into master will decrease coverage by 0.01%.
The diff coverage is 62%.


@@            Coverage Diff             @@
##           master    #8674      +/-   ##
==========================================
- Coverage   48.05%   48.03%   -0.02%     
==========================================
  Files         281      281              
  Lines       43414    43477      +63     
  Branches    10404    10404              
==========================================
+ Hits        20862    20885      +23     
- Misses      18403    18434      +31     
- Partials     4149     4158       +9
Impacted Files Coverage Δ
include/net/net_context.h 78.94% <ø> (ø) ⬆️
subsys/net/ip/tcp.c 54.34% <12.5%> (-2.65%) ⬇️
subsys/net/ip/net_context.c 62.19% <71.42%> (+2.35%) ⬆️
drivers/net/loopback.c 68% <0%> (-12%) ⬇️
subsys/net/l2/dummy/dummy.c 94.73% <0%> (-5.27%) ⬇️
subsys/net/ip/net_pkt.c 67.16% <0%> (-0.61%) ⬇️
subsys/net/ip/net_if.c 62.57% <0%> (-0.32%) ⬇️
kernel/timeout.c 87.61% <0%> (+0.95%) ⬆️
... and 2 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 523acef...64062a2. Read the comment docs.

@pfalcon
Contributor

pfalcon commented Jul 2, 2018

It would be interesting to get a few comments on how you arrived at this solution and which other alternatives (besides enforcing COOP prio) were considered. For example, why was the k_sched_lock() approach discounted?

@pfalcon
Contributor

pfalcon commented Jul 2, 2018

I'd also like to hear from @andrewboie whether this would address his concerns about the reentrancy/concurrency guarantees needed for userspace-vs-kernel syscalls (so that we avoid two layers of synchronization primitives).

@pfalcon
Contributor

pfalcon commented Jul 2, 2018

Finally, I'd like to hear from some kernel maintainers how mutexes and delayed work would interact. While there are many places where invalid concurrent access happens, the one I have experienced the most so far, and am thus most concerned with, is the interaction between tcp.c's delayed work handlers and the main code (e.g. handling of the un-acked packet list).

@jukkar
Member Author

jukkar commented Jul 2, 2018

For example, why was the k_sched_lock() approach discounted?

Anas had a question in #8386 about moving to a preemptive model, in which case calling k_sched_lock() does not make much sense, as it disables preemption. Thus I started to experiment a bit with mutexes.

@pfalcon
Contributor

pfalcon commented Jul 2, 2018

in which case calling k_sched_lock() does not make much sense, as it disables preemption

It doesn't "disable preemption" in a general sense; it's effectively a rather coarse mutex. Fine-grained mutexes are "better", but they require more code to flip them back and forth, and potentially different mutexes here and there (== require more RAM).

I personally think that the finer-grained approach is the right one, but we need to think out the whole scheme to assess it. E.g. I'm talking about locking of the context->tcp structure, while #8131 seemingly talks about the pkt level. So, looking forward to more RFC/WIP commits to this RFC/WIP PR.
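
For illustration, a minimal sketch contrasting the two approaches under discussion (the struct, field, and function names are made up; this is not code from this PR):

    #include <kernel.h>

    /* Hypothetical object standing in for struct net_context; the real
     * member names in the PR may differ. k_mutex_init(&ctx->lock) is
     * assumed to have been called once when the object was created. */
    struct my_ctx {
            struct k_mutex lock;
            int state;
    };

    /* Coarse: lock the scheduler around the critical section. No other
     * thread runs at all until k_sched_unlock(), even threads that never
     * touch the network stack. */
    static void update_coarse(struct my_ctx *ctx, int new_state)
    {
            k_sched_lock();
            ctx->state = new_state;
            k_sched_unlock();
    }

    /* Fine-grained: a mutex owned by the object itself. Only threads
     * contending for this particular context block; everything else keeps
     * running. Costs one struct k_mutex per object plus the lock/unlock
     * calls sprinkled through the code, which is the RAM/code-size
     * trade-off mentioned above. */
    static void update_fine(struct my_ctx *ctx, int new_state)
    {
            k_mutex_lock(&ctx->lock, K_FOREVER);
            ctx->state = new_state;
            k_mutex_unlock(&ctx->lock);
    }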

@andrewboie
Contributor

I'm assuming this works as expected (i.e. doesn't explode) if a thread calls net_context_put() while another net operation is in progress?
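
For context, one common pattern that makes that combination safe is a reference count protected by the same per-context mutex, sketched below with hypothetical names; this is not a claim about how net_context_put() is actually implemented:

    #include <stdbool.h>
    #include <kernel.h>

    /* Hypothetical container; not the real struct net_context. */
    struct ref_ctx {
            struct k_mutex lock;    /* initialized once with k_mutex_init() */
            int refcount;           /* number of current users */
            bool in_use;            /* slot allocation flag */
    };

    static void ref_ctx_get(struct ref_ctx *ctx)
    {
            k_mutex_lock(&ctx->lock, K_FOREVER);
            ctx->refcount++;
            k_mutex_unlock(&ctx->lock);
    }

    /* "put" releases the slot only after the last user is done, so an
     * operation still holding a reference cannot have the context torn
     * down underneath it. */
    static void ref_ctx_put(struct ref_ctx *ctx)
    {
            k_mutex_lock(&ctx->lock, K_FOREVER);
            if (--ctx->refcount == 0) {
                    ctx->in_use = false;
            }
            k_mutex_unlock(&ctx->lock);
    }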

@andrewboie
Contributor

It doesn't "disable preemption" in a general sense; it's effectively a rather coarse mutex

We should not be locking the scheduler.
That prevents other threads from running even if they have nothing to do with the network stack. It's too coarse.

@jukkar jukkar requested a review from pfl August 2, 2018 08:14
Contributor

@pfl pfl left a comment

Locks and unlocks seem to be in balance here. Whether these changes make the stack much quicker remains to be seen.

Contributor

Could we follow net_context_bind() below here, and unlock and return 0? Then the functions would be consistent, at least for these two functions.

Member Author

I would rather keep the out: label here so that the unlock is called at the end of the function. Note that the out: label is also jumped to from the offloading case a few lines above.
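
For readers following along, the pattern being discussed looks roughly like this (a sketch with hypothetical names, not the actual net_context code):

    #include <errno.h>
    #include <stdbool.h>
    #include <kernel.h>

    /* Every exit path funnels through the single "out" label, so the
     * mutex is released exactly once regardless of which branch ran. */
    static int do_context_op(struct k_mutex *lock, bool offloaded)
    {
            int ret;

            k_mutex_lock(lock, K_FOREVER);

            if (offloaded) {
                    ret = 0;        /* offloading case uses the same label */
                    goto out;
            }

            ret = -ENOTSUP;         /* placeholder for the real work */
    out:
            k_mutex_unlock(lock);
            return ret;
    }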

Contributor

Ok, here a goto out is used; this is as good as the above. 'out' could be renamed 'unlock', as that is what it does (bikeshedding...).

Member Author

As the "out:" label already existed in that function, I did not want to rename it just for this patch.

Contributor

A goto unlock would describe the action better here as well.

Member Author

Same here: the "out:" label existed before this commit, so renaming it seemed a bit pointless.

Member Author

Actually my comment was wrong, the "out:" label is new. The unlock label is actually more descriptive so I will change it.

Contributor

Actually my comment was wrong, the "out:" label is new. The unlock label is actually more descriptive so I will change it.

Actually, I'd say the "out" name was better: it's the difference between interface and implementation. The high-level action is "out"; its current implementation is unlocking, but maybe later it'll update some stats counters, or do something else completely.

@pfl
Contributor

pfl commented Aug 6, 2018

Fine-grained mutexes are "better", but they require more code to flip them back and forth, and potentially different mutexes here and there (== require more RAM).

Are you now referring to this PR saying that there should be more locks and unlocks, or are you making a general comment (to which I agree)?

@jukkar jukkar changed the title [WIP] Implement net_context locking Implement net_context locking Aug 6, 2018
@jukkar jukkar removed the DNM This PR should not be merged (Do Not Merge) label Aug 6, 2018
@jukkar
Member Author

jukkar commented Aug 6, 2018

Whether these changes make the stack much quicker remains to be seen.

Note that the purpose was not to make the stack quicker; these changes might actually make it a bit slower. The purpose is to protect the net core stack from concurrent access by threads that run in different "modes" (preemptive vs. cooperative).

@jukkar
Member Author

jukkar commented Aug 6, 2018

Updated the code according to comments.

@jukkar jukkar force-pushed the net-locking branch 2 times, most recently from db996d3 to d3e9a03, August 6, 2018 10:27
@pfalcon
Contributor

pfalcon commented Aug 7, 2018

@pfl

Fine-grained mutexes are "better", but they require more code to flip them back and forth, and potentially different mutexes here and there (== require more RAM).

Are you now referring to this PR saying that there should be more locks and unlocks, or are you making a general comment (to which I agree)?

I'm referring to the problem in general, and I suggest (in #8674 (comment)) that we should consider the different strategies, their pros and cons, and then choose the (seemingly) best one (or the easiest to start with, while still adequate), while keeping some "plan B" in mind.

@nashif
Member

nashif commented Dec 3, 2018

is this PR going anywhere? still valid? if not, please close.

@nashif nashif closed this Dec 3, 2018
@nashif nashif reopened this Dec 3, 2018
@nashif
Member

nashif commented Dec 3, 2018

oops

@jukkar
Member Author

jukkar commented Dec 4, 2018

is this PR going anywhere? still valid? if not, please close.

This is experimental stuff; I'm not sure yet whether this will be merged or not.

If the net_context functions are accessed from preemptive priority,
then we need to protect various internal resources.

Signed-off-by: Jukka Rissanen <[email protected]>
There was a false timeout error because we did not check the
return value correctly. This issue is seen now because the code
flow in the core IP stack happens in a different order than
before.

Signed-off-by: Jukka Rissanen <[email protected]>
@jukkar
Member Author

jukkar commented Dec 12, 2018

fixed the merge conflict

@pfalcon
Contributor

pfalcon commented Dec 12, 2018

I still think that this adds a bunch of bloat which may or may not really be needed. In the name of going forward, let's go with it, but what about addressing my concern by defining separate ops like CONTEXT_LOCK()/CONTEXT_UNLOCK(), so that we can define them as no-ops when not needed?
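
A sketch of what that could look like; the macro and Kconfig names here are hypothetical, and a struct k_mutex member called lock on the context is assumed:

    /* Compile the locking away entirely when it is not wanted. */
    #if defined(CONFIG_NET_CONTEXT_LOCKING)   /* hypothetical Kconfig symbol */
    #define CONTEXT_LOCK(ctx)    k_mutex_lock(&(ctx)->lock, K_FOREVER)
    #define CONTEXT_UNLOCK(ctx)  k_mutex_unlock(&(ctx)->lock)
    #else
    #define CONTEXT_LOCK(ctx)    do { } while (0)
    #define CONTEXT_UNLOCK(ctx)  do { } while (0)
    #endif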

Contributor

@tbursztyka tbursztyka left a comment

PR #11775 was triggering the described bug all the time on Maxwell's TCP calibration step.

Once this PR was applied, it worked. So lgtm.

@pfalcon pfalcon added this to the v1.14.0 milestone Jan 30, 2019
@pfalcon
Contributor

pfalcon commented Jan 30, 2019

@dgloeck: Perhaps you could have a look at this too.

@pfalcon
Contributor

pfalcon commented Jan 30, 2019

@tbursztyka

Once this PR was applied, it worked. So lgtm.

Good to know you saw a case where this PR definitely helps. With my similar PR, #9819, I was never able to see its effect, apparently because other issues in the stack prevailed, so my testing failed before I could reach the cases where this helps.

We now have more people actively looking into the stack, so I guess it would be a good idea to get acks from them too re: this PR. Otherwise, let's target merging it before the code freeze on Friday.

@pfalcon pfalcon added the priority: high High impact/importance bug label Jan 31, 2019
@jukkar jukkar merged commit 34b07b9 into zephyrproject-rtos:master Jan 31, 2019
@jukkar jukkar deleted the net-locking branch January 31, 2019 09:20