[RFC] Split large TCP segments #1330
Conversation
Devil's advocacy: this is a lot of code that, in a practical sense, mostly just duplicates functionality that already has to be implemented in the IP fragmentation layer. Many apps may have an argument that they don't need this (they split their writes already, or are willing to pay the somewhat higher packet loss rates involved in fragmentation), yet everyone has to pay for it. Might be worth making this a Kconfig option so someone who wants "lean TCP" can get it.
Currently:
But indeed, this PR is only applicable to TCP, and we probably do not need to have both IP and TCP fragmentation enabled at the same time. So we could have this one enabled via a Kconfig option (sketched below).
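A minimal sketch of what such gating could look like in net_context.c, assuming a hypothetical `CONFIG_NET_TCP_SPLIT_SEGMENTS` symbol and hypothetical helper names (none of these exist in the tree):

```c
/* Hypothetical sketch: gate the split path behind a Kconfig symbol so a
 * "lean TCP" build can compile it out. CONFIG_NET_TCP_SPLIT_SEGMENTS,
 * tcp_split_and_queue() and tcp_queue_as_is() are assumed names, not
 * existing Zephyr APIs. */
#include <net/net_context.h>
#include <net/net_pkt.h>

/* Hypothetical helpers standing in for the PR's machinery. */
static int tcp_split_and_queue(struct net_context *context,
			       struct net_pkt *pkt);
static int tcp_queue_as_is(struct net_context *context, struct net_pkt *pkt);

static int queue_tcp_data(struct net_context *context, struct net_pkt *pkt)
{
#if defined(CONFIG_NET_TCP_SPLIT_SEGMENTS)
	/* Split oversized packets into MSS-sized pieces before queuing. */
	return tcp_split_and_queue(context, pkt);
#else
	/* Lean build: queue as-is; the app must respect the MSS itself. */
	return tcp_queue_as_is(context, pkt);
#endif
}
```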
I'm a bit late to this, but the "net: pkt: Allocate ll header when cloning" patch could well go into 1.9, as it's a clear bugfix.
I don't think I saw many such apps; actually, I'm not sure I saw any at all (of the native API; BSD Sockets do split).

Well, I definitely tested it when I developed the sockets prototype in MicroPython. That testcase is now gone, because the sockets impl doesn't allow sending more data than a packet can fit (but as your patch shows, indeed, for TCP the MSS should be used, not the MTU). Just taking the HTTP server example and making it send not a few chars but some real page (say, 10K) should show the issue.

Btw, one of the examples I wanted to add for BSD Sockets was a really-working HTTP server, serving a real page like the above. But putting that under ApacheBench deadlocked it very fast. BSD Sockets currently send large data in parallel, without waiting. In other words, the Zephyr IP stack deadlocks when send buffers are exhausted. I'm now pondering what to do about it: burden the socket structure with a semaphore to implement sequential sending (sketched below), or try to debug this deadlock issue. As you understand, the latter can be a real Pandora's box.
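For reference, the "semaphore for sequential sending" idea might look roughly like this; a sketch only, with `struct sock_ctx` and `do_send()` as hypothetical names:

```c
/* Sketch of serializing sends with a per-socket semaphore so parallel
 * writers cannot exhaust the send buffers. The struct layout and
 * do_send() are hypothetical, not Zephyr API. */
#include <kernel.h>
#include <stddef.h>

struct sock_ctx {
	struct k_sem tx_lock;	/* allows one in-flight send at a time */
	/* ... rest of the per-socket state ... */
};

/* Hypothetical stand-in for the actual send path. */
static int do_send(struct sock_ctx *sc, const void *buf, size_t len);

static void sock_ctx_init(struct sock_ctx *sc)
{
	k_sem_init(&sc->tx_lock, 1, 1);	/* binary semaphore, initially free */
}

static int sock_send_serialized(struct sock_ctx *sc,
				const void *buf, size_t len)
{
	int ret;

	k_sem_take(&sc->tx_lock, K_FOREVER);	/* wait for previous send */
	ret = do_send(sc, buf, len);		/* hypothetical actual send */
	k_sem_give(&sc->tx_lock);

	return ret;
}
```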
And back to the patch: it shows well how the background has skewed. Ever since @tbursztyka's refactor, net_pkt was really a network packet. Now, with this patch, it suddenly becomes either a real network packet or a not-so-real network packet, more of a general-purpose network buffer. Confusion, extra code, ineffective resource allocation, and associated bugs will ensue. I still think the best way to handle this issue is simply to not let a user app put more data into a packet than it can hold. Then any user app should be prepared for a short write and send the rest in new packet(s) (see the sketch below). Simple. Easy. Effective.
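Concretely, the short-write contract could look like the toy function below, where `seg_buf`/`seg_cap` stand in for the payload area of a single outgoing segment (all names illustrative, not Zephyr API):

```c
/* Toy illustration of the "short write" contract: never accept more data
 * than fits in one segment, and return how much was actually taken so
 * the caller can loop. */
#include <stddef.h>
#include <string.h>

static size_t short_write(unsigned char *seg_buf, size_t seg_used,
			  size_t seg_cap, const void *data, size_t len)
{
	size_t room = seg_cap - seg_used;

	if (len > room) {
		len = room;	/* short write: take only what fits */
	}
	memcpy(seg_buf + seg_used, data, len);

	return len;		/* caller sends the rest in new packet(s) */
}
```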
Btw, @andyross' comment echoes this idea. There's no point in having fragmentation on 2 levels. We already have packet fragmentation; here it makes sense to try something else.
I do not understand this comment. The net_pkt is still a network packet; the PR does not change this in any way. Let's drop this one, I thought this is what you wanted because of https://jira.zephyrproject.org/browse/ZEP-1998, we can then close that Jira item too.
@jukkar: First of all, why close this? "net: pkt: Allocate ll header when cloning" is a bugfix, and "net: pkt: Allow cloning just the net_pkt without any data" is apparently useful; can you please resubmit them separately?
I would define a "valid network packet" as "a packet which can be sent through the interface it's destined to". The native Zephyr IP stack allows a user to create an invalid packet (with 10K, 100K, 1M of data attached). Previously, it was kind of a grey area what to do about that, kind of "the API doesn't care about such a case, it's the app's chore to not do that, even though the app doesn't have reasonable criteria for what max pkt size is safe". This patch kind of legitimizes invalid elephant packets, and throws additional resources at "fixing" them, while leaving many related questions open (or underspecified, at least per the description of this patch). E.g., what the "sent" callback will get as its arguments, how many times it will be fired, etc.
Sorry, but we can't close that ticket, as it's not resolved. Let's just not rush to resolve it with exactly this patch, and consider alternatives instead. And there are alternatives; e.g., I'm a proponent of the option that oversized packets should not be allowed to be created in the first place. And it's no longer just abstract thinking: it was implemented in the BSD Sockets layer, and it "works well" in some sense of that word, and Sockets are now in the mainline (the application side of this model is the usual send loop, sketched below). It's on my TODO to send an RFC to the mailing list noting that Sockets are now in the mainline and what implications that should have (e.g. should we push for resolving outstanding IP stack issues in "different" ways, or adopt ideas used in the Sockets implementation after all). I plan to send it after the 1.9 release, once more pressing issues are handled (like the Sockets performance regression due to the k_poll patch). So for now, I'd suggest to reopen this ticket, add an "[RFC]" prefix, and let it hang around, perhaps attracting more comments.
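For context, the application side of the short-write model is the classic POSIX-style send loop, e.g.:

```c
/* Classic send loop an app would use under the short-write model: keep
 * calling send() until everything has been accepted. POSIX-style sketch. */
#include <sys/types.h>
#include <sys/socket.h>

static int send_all(int sock, const char *buf, size_t len)
{
	while (len > 0) {
		ssize_t out = send(sock, buf, len, 0);

		if (out < 0) {
			return -1;	/* propagate the error to the caller */
		}
		buf += out;		/* advance past the accepted bytes */
		len -= out;
	}

	return 0;
}
```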
@jukkar: So, I hope you don't mind if I reopen this PR and mark it as RFC.
I started to use this PR in #980, as I have large files there that need to be sent over HTTP. There seems to be some issue in the frdm ethernet driver or somewhere in L2: without this segment-split support, either there is a bus fault or the device just restarts.
Perhaps "without controlling MTU size", and there was GH-3439 long ago warning of various issues due to this. I hope this PR is still just a one way to solve it, another way is used in BSD Sockets code. |
Instead of trying to send a packet that is larger than MSS, split it into suitable pieces and queue the pieces individually.

Jira: ZEP-1998

Signed-off-by: Jukka Rissanen <[email protected]>
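The shape of that approach, heavily simplified; `queue_segment()` is a hypothetical stand-in for the actual net_context.c machinery in this PR:

```c
/* Simplified shape of the commit above: walk the payload in MSS-sized
 * steps and queue each piece as its own segment. */
#include <stddef.h>

/* Hypothetical stand-in for the real queuing code. */
static int queue_segment(const unsigned char *data, size_t len);

static int split_and_queue(const unsigned char *data, size_t total,
			   size_t mss)
{
	size_t off = 0;

	while (off < total) {
		size_t chunk = (total - off < mss) ? (total - off) : mss;
		int ret = queue_segment(data + off, chunk);

		if (ret < 0) {
			return ret;	/* stop and report the first failure */
		}
		off += chunk;
	}

	return 0;
}
```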
Let's revisit this one again. I have been using this PR in #980, and it makes the application logic much simpler and less error prone (as the application does not need to send packets in a specific size). Anyway, this is now rebased against the latest master.
rest looks good to me
int fit_len, appdata_len;
new_len += frag->len;

you could remove this empty line
Yeah, let's do it. It was on my TODO for some time to write a mailing list RFC on the situation, so now I did:
This comment is exaggerating the case. What I am proposing here of course adds more code to net_context.c, but it makes the application logic easier and less error prone.
Hmm, that is reaching too far. I still would like to see this merged.
But why do we need this if we agreed to use the "short write" approach instead? What are the (remaining) use cases for this? And note that these approaches conflict: with the "short write" approach, you simply will never get an oversized packet, so there will be nothing to split. (#119 may not be fully there yet, i.e. it might not apply the size check on all paths, but the idea would be to elaborate it and make that the case.)
Closing this as things regarding TCP will be done differently. |
This is related to the https://jira.zephyrproject.org/browse/ZEP-1998 issue. When sending a large TCP segment, we should only send packets of at most the MTU size (for the TCP payload, this effectively means the MSS; see the arithmetic below).
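For the arithmetic behind MSS vs. MTU (this sketch assumes IPv4 and TCP without options):

```c
/* The TCP payload that fits in one link-layer frame is the MTU minus the
 * IP and TCP headers. Assumes IPv4 with no IP or TCP options. */
#define IPV4_HDR_LEN 20		/* bytes, no IP options */
#define TCP_HDR_LEN  20		/* bytes, no TCP options */

static inline int tcp_mss_from_mtu(int mtu)
{
	return mtu - IPV4_HDR_LEN - TCP_HDR_LEN;	/* e.g. 1500 -> 1460 */
}
```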
This PR seems to do something sane, but still requires more testing. @pfalcon, did you have some test for sending large packets?
The first two patches fix some issues with net_pkt_clone(); the last patch adds splitting functionality to net_context.c when sending TCP data. This is probably too risky to apply to v1.9 at this point, as it requires more testing.