
Conversation

@jukkar
Member

@jukkar jukkar commented Jul 2, 2018

This is an initial attempt to allow re-entrant access to a given net_context struct. This is still work in progress and needs more testing. See also the related issues #8131 and #8386.

@jukkar jukkar added area: Networking DNM This PR should not be merged (Do Not Merge) labels Jul 2, 2018
@jukkar jukkar requested review from pfalcon and tbursztyka as code owners July 2, 2018 10:18
@codecov-io

codecov-io commented Jul 2, 2018

Codecov Report

Merging #8674 into master will decrease coverage by 0.01%.
The diff coverage is 62%.


@@            Coverage Diff             @@
##           master    #8674      +/-   ##
==========================================
- Coverage   48.05%   48.03%   -0.02%     
==========================================
  Files         281      281              
  Lines       43414    43477      +63     
  Branches    10404    10404              
==========================================
+ Hits        20862    20885      +23     
- Misses      18403    18434      +31     
- Partials     4149     4158       +9
Impacted Files Coverage Δ
include/net/net_context.h 78.94% <ø> (ø) ⬆️
subsys/net/ip/tcp.c 54.34% <12.5%> (-2.65%) ⬇️
subsys/net/ip/net_context.c 62.19% <71.42%> (+2.35%) ⬆️
drivers/net/loopback.c 68% <0%> (-12%) ⬇️
subsys/net/l2/dummy/dummy.c 94.73% <0%> (-5.27%) ⬇️
subsys/net/ip/net_pkt.c 67.16% <0%> (-0.61%) ⬇️
subsys/net/ip/net_if.c 62.57% <0%> (-0.32%) ⬇️
kernel/timeout.c 87.61% <0%> (+0.95%) ⬆️
... and 2 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 523acef...64062a2. Read the comment docs.

@pfalcon
Contributor

pfalcon commented Jul 2, 2018

It would be interesting to get a few comments on how you arrived at this solution and which other alternatives (besides enforcing COOP prio) were considered. For example, why was the k_sched_lock() approach discounted?

@pfalcon
Contributor

pfalcon commented Jul 2, 2018

I'd also like to hear from @andrewboie whether this would address his concerns about the reentrancy/concurrency guarantees needed for userspace-vs-kernel syscalls (so that we avoid two layers of synchronization primitives).

@pfalcon
Contributor

pfalcon commented Jul 2, 2018

Finally, I'd like to hear from some kernel maintainers how mutexes and delayed work would interact. While there are many places where invalid concurrent access happens, the one I have experienced the most so far, and am thus most concerned with, is the interaction between tcp.c's delayed work handlers and the main code (e.g. handling of the un-acked packet list).

@jukkar
Member Author

jukkar commented Jul 2, 2018

For example, why was the k_sched_lock() approach discounted?

Anas had a question in #8386 about moving to a preemptive model, in which case calling k_sched_lock() does not make much sense, as it disables preemption. Thus I started to experiment a bit with mutexes.

@pfalcon
Contributor

pfalcon commented Jul 2, 2018

in which case calling k_sched_lock() does not make much sense, as it disables preemption

It doesn't "disable preemption" in a general sense; it's effectively a rather coarse mutex. Fine-grained mutexes are "better", but they require more code to flip them back and forth, and potentially different mutexes here and there (== require more RAM).

I personally think that the finer-grained approach is the right one, but we need to think out the whole scheme to assess it. E.g. I'm talking about locking of the context->tcp structure, while #8131 seemingly talks about the pkt level. So, looking forward to more RFC/WIP commits to this RFC/WIP PR.
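
For illustration, a minimal sketch contrasting the two approaches under discussion (the struct, field, and function names are made up; this is not code from this PR):

    #include <kernel.h>

    /* Hypothetical object standing in for struct net_context; the real
     * member names in the PR may differ. k_mutex_init(&ctx->lock) is
     * assumed to have been called once when the object was created. */
    struct my_ctx {
            struct k_mutex lock;
            int state;
    };

    /* Coarse: lock the scheduler around the critical section. No other
     * thread runs at all until k_sched_unlock(), even threads that never
     * touch the network stack. */
    static void update_coarse(struct my_ctx *ctx, int new_state)
    {
            k_sched_lock();
            ctx->state = new_state;
            k_sched_unlock();
    }

    /* Fine-grained: a mutex owned by the object itself. Only threads
     * contending for this particular context block; everything else keeps
     * running. Costs one struct k_mutex per object plus the lock/unlock
     * calls sprinkled through the code, which is the RAM/code-size
     * trade-off mentioned above. */
    static void update_fine(struct my_ctx *ctx, int new_state)
    {
            k_mutex_lock(&ctx->lock, K_FOREVER);
            ctx->state = new_state;
            k_mutex_unlock(&ctx->lock);
    }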

@andrewboie
Contributor

I'm assuming this works as expected (i.e. doesn't explode) if a thread calls net_context_put() while another net operation is in progress?
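
For context, one common pattern that makes that combination safe is a reference count protected by the same per-context mutex, sketched below with hypothetical names; this is not a claim about how net_context_put() is actually implemented:

    #include <stdbool.h>
    #include <kernel.h>

    /* Hypothetical container; not the real struct net_context. */
    struct ref_ctx {
            struct k_mutex lock;    /* initialized once with k_mutex_init() */
            int refcount;           /* number of current users */
            bool in_use;            /* slot allocation flag */
    };

    static void ref_ctx_get(struct ref_ctx *ctx)
    {
            k_mutex_lock(&ctx->lock, K_FOREVER);
            ctx->refcount++;
            k_mutex_unlock(&ctx->lock);
    }

    /* "put" releases the slot only after the last user is done, so an
     * operation still holding a reference cannot have the context torn
     * down underneath it. */
    static void ref_ctx_put(struct ref_ctx *ctx)
    {
            k_mutex_lock(&ctx->lock, K_FOREVER);
            if (--ctx->refcount == 0) {
                    ctx->in_use = false;
            }
            k_mutex_unlock(&ctx->lock);
    }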

@andrewboie
Contributor

It doesn't "disable preemption" in a general sense; it's effectively a rather coarse mutex

We should not be locking the scheduler.
That prevents other threads from running even if they have nothing to do with the network stack. It's too coarse.

@jukkar jukkar requested a review from pfl August 2, 2018 08:14
Contributor

@pfl pfl left a comment

Locks and unlocks seem to be in balance here. Whether these changes make the stack much quicker remains to be seen.

Contributor

Could we follow net_context_bind() below here, and unlock and return 0? Then the functions would be consistent, at least for these two functions.

Member Author

I would rather keep the out: label here so that the unlock is called at the end of the function. Note that the out: label is also jumped to from the offloading case a few lines above.
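
For readers following along, the pattern being discussed looks roughly like this (a sketch with hypothetical names, not the actual net_context code):

    #include <errno.h>
    #include <stdbool.h>
    #include <kernel.h>

    /* Every exit path funnels through the single "out" label, so the
     * mutex is released exactly once regardless of which branch ran. */
    static int do_context_op(struct k_mutex *lock, bool offloaded)
    {
            int ret;

            k_mutex_lock(lock, K_FOREVER);

            if (offloaded) {
                    ret = 0;        /* offloading case uses the same label */
                    goto out;
            }

            ret = -ENOTSUP;         /* placeholder for the real work */
    out:
            k_mutex_unlock(lock);
            return ret;
    }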

Contributor

Ok, here a goto out is used; this is as good as the above. 'out' could be renamed 'unlock', as that is what it does (bikeshedding...).

Member Author

As the "out:" label already existed in that function, I did not want to rename it just for this patch.

Contributor

A goto unlock would describe the action better here as well.

Member Author

Same here: the "out:" label existed before this commit, so renaming it seemed a bit pointless.

Member Author

Actually my comment was wrong, the "out:" label is new. The unlock label is actually more descriptive so I will change it.

Contributor

Actually my comment was wrong, the "out:" label is new. The unlock label is actually more descriptive so I will change it.

Actually, I'd say the "out" name was better: it's the difference between interface and implementation. The high-level action is "out"; its current implementation is unlocking, but maybe later it'll update some stats counters, or do something else completely.

@pfl
Contributor

pfl commented Aug 6, 2018

Fine-grained mutexes are "better", but they require more code to flip them back and forth, and potentially different mutexes here and there (== require more RAM).

Are you now referring to this PR saying that there should be more locks and unlocks, or are you making a general comment (to which I agree)?

@jukkar jukkar changed the title [WIP] Implement net_context locking Implement net_context locking Aug 6, 2018
@jukkar jukkar removed the DNM This PR should not be merged (Do Not Merge) label Aug 6, 2018
@jukkar
Member Author

jukkar commented Aug 6, 2018

Whether these changes make the stack much quicker remains to be seen.

Note that the purpose was not to make the stack quicker; these changes might actually make it a bit slower. The purpose is to protect the net core stack from concurrent access by threads that run in different "modes" (preemptive vs. cooperative).

@jukkar
Member Author

jukkar commented Aug 6, 2018

Updated the code according to comments.

@jukkar jukkar force-pushed the net-locking branch 2 times, most recently from db996d3 to d3e9a03, August 6, 2018 10:27
@pfalcon
Contributor

pfalcon commented Aug 7, 2018

@pfl

Fine-grained mutexes are "better", but they require more code to flip them back and forth, and potentially different mutexes here and there (== require more RAM).

Are you now referring to this PR saying that there should be more locks and unlocks, or are you making a general comment (to which I agree)?

I'm referring to the problem in general, and I suggest (in #8674 (comment)) that we should consider the different strategies, their pros and cons, and then choose the (seemingly) best one (or the easiest to start with, while still adequate), while keeping some "plan B" in mind.

@nashif
Member

nashif commented Dec 3, 2018

is this PR going anywhere? still valid? if not, please close.

@nashif nashif closed this Dec 3, 2018
@nashif nashif reopened this Dec 3, 2018
@nashif
Member

nashif commented Dec 3, 2018

oops

@jukkar
Member Author

jukkar commented Dec 4, 2018

is this PR going anywhere? still valid? if not, please close.

This is experimental stuff; I'm not sure yet whether this will be merged or not.

If the net_context functions are accessed from preemptive priority,
then we need to protect various internal resources.

Signed-off-by: Jukka Rissanen <[email protected]>
There was a false timeout error because we did not check the
return value correctly. This issue is seen now because the code
flow in the core IP stack happens in a different order than
before.

Signed-off-by: Jukka Rissanen <[email protected]>
@jukkar
Member Author

jukkar commented Dec 12, 2018

fixed the merge conflict

@pfalcon
Contributor

pfalcon commented Dec 12, 2018

I still think that this adds a bunch of bloat which may or may not really be needed. In the name of going forward, let's go with it, but what about addressing my concern by defining separate ops like CONTEXT_LOCK()/CONTEXT_UNLOCK(), so that we can define them as no-ops when not needed?
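
A sketch of what that could look like; the macro and Kconfig names here are hypothetical, and a struct k_mutex member called lock on the context is assumed:

    /* Compile the locking away entirely when it is not wanted. */
    #if defined(CONFIG_NET_CONTEXT_LOCKING)   /* hypothetical Kconfig symbol */
    #define CONTEXT_LOCK(ctx)    k_mutex_lock(&(ctx)->lock, K_FOREVER)
    #define CONTEXT_UNLOCK(ctx)  k_mutex_unlock(&(ctx)->lock)
    #else
    #define CONTEXT_LOCK(ctx)    do { } while (0)
    #define CONTEXT_UNLOCK(ctx)  do { } while (0)
    #endif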

Contributor

@tbursztyka tbursztyka left a comment

PR #11775 was triggering the described bug all the time on Maxwell's TCP calibration step.

Once this PR was applied, it worked. So lgtm.

@pfalcon pfalcon added this to the v1.14.0 milestone Jan 30, 2019
@pfalcon
Contributor

pfalcon commented Jan 30, 2019

@dgloeck: Perhaps you could have a look at this too.

@pfalcon
Contributor

pfalcon commented Jan 30, 2019

@tbursztyka

Once this PR was applied, it worked. So lgtm.

Good to know you saw a case where this PR definitely helps. With my similar PR, #9819, I was never able to see its effect, apparently because other issues in the stack prevailed, so my testing failed before I could reach the cases where this helps.

We now have more people actively looking into the stack, so I guess it would be a good idea to get acks from them too re: this PR. Otherwise, let's target merging it before the code freeze on Friday.

@pfalcon pfalcon added the priority: high High impact/importance bug label Jan 31, 2019
@jukkar jukkar merged commit 34b07b9 into zephyrproject-rtos:master Jan 31, 2019
@jukkar jukkar deleted the net-locking branch January 31, 2019 09:20