Skip to content

Conversation

mwheinz
Copy link

@mwheinz mwheinz commented Sep 23, 2019

INTERNAL: STL-59403

The OFI (libfabric) MTL does not respect the maximum message size
parameter that OFI provides in the fi_info data.

This patch adds this missing max_msg_size field to the mca_ofi_module_t
structure and adds a length check to the low-level send routines.

Change-Id: I05aa71d332f2df897133b30c28bf37d98f061996
Signed-off-by: Michael Heinz [email protected]
Reviewed-by: Adam Goldman [email protected]
Reviewed-by: Brendan Cunningham [email protected]

INTERNAL: STL-59403

The OFI (libfabric) MTL does not respect the maximum message size
parameter that OFI provides in the fi_info data.

This patch adds this missing max_msg_size field to the mca_ofi_module_t
structure and adds a length check to the low-level send routines.

Change-Id: I05aa71d332f2df897133b30c28bf37d98f061996
Signed-off-by: Michael Heinz <[email protected]>
Reviewed-by: Adam Goldman <[email protected]>
Reviewed-by: Brendan Cunningham <[email protected]>
@mwheinz
Copy link
Author

mwheinz commented Sep 23, 2019

Double checked the commit before submitting the pull request. No whitespace issues.</throws salt over shoulder></knocks on wood>

@bwbarrett
Copy link
Member

I don't think this patch is what we want to have happen. This leads to a slightly more sane error message, but the OFI component doesn't do any fragmentation, so it still will be a fatal error message during runtime. Wouldn't a better solution be either (1) not select providers with a small max message size or (2) add fragmentation?

@mwheinz
Copy link
Author

mwheinz commented Sep 24, 2019

@bwbarrett,

I agree that this is a band-aid at best. My understanding is that (a) it is better than letting the current situation continue and (b) a real fix might require a significant design and coding effort. We should discuss this at the weekly meeting today.

@jsquyres
Copy link
Member

Ya, we discussed this exact issue on the RM call yesterday, and resolved to bring it up on the webex today (i.e.: this is a band-aid -- it resolves the "silent" part of the error, which is arguably the most nefarious part of the issue, but the more complete fix [i.e., sending arbitrary-sized messages] is more involved).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants