Skip to content

Conversation

@hjelmn
Copy link
Member

@hjelmn hjelmn commented Sep 26, 2019

OpenUCX broke the UCT API again in v1.8. This commit updates
btl/uct to fix compilation with current OpenUCX master
(future v1.8). Further changes will likely be needed for
the final release.

Signed-off-by: Nathan Hjelm [email protected]

@hjelmn hjelmn requested a review from yosefe September 26, 2019 16:10
@hjelmn
Copy link
Member Author

hjelmn commented Sep 26, 2019

@yosefe Any other unannounced (UCT) API changes in OpenUCX master? Please remember to keep me in the loop.

ucs_status_t ucs_status;
int rc;

ucs_status = uct_component_query (component, &attr);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to initialize field_mask and allocate md_resources..
@hjelmn seems like you didn't try to run it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seemed to run and found components but I was probably getting lucky :). Will fix.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity... Why not allocate the md_resources inside uct_component_query. Calling twice seems a little silly to me.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Copy link
Contributor

@yosefe yosefe Sep 27, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity... Why not allocate the md_resources inside uct_component_query. Calling twice seems a little silly to me.

If uct_component_query allocates resources, would need another API to release that memory, in the existing scheme the user is responsible for memory allocation and could use something like alloca() or std::vector which do not require explicit free

@ibm-ompi
Copy link

The IBM CI (GNU Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/006525567fe2cbca76d9636ccc30d45c

@ibm-ompi
Copy link

The IBM CI (XL Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/6811f1458827d0f76f362a9659c573b7

@hjelmn
Copy link
Member Author

hjelmn commented Sep 26, 2019

:bot:retest

@hjelmn hjelmn force-pushed the fix_btl_uct_from_yet_another_unannounced_api_break_in_the_openucx_uct_layer branch from c01e31f to f506aa0 Compare September 26, 2019 17:33
@hjelmn
Copy link
Member Author

hjelmn commented Sep 26, 2019

@yosefe Installed xpmem in a VM and the changes appear to be working with UCX master.

@gpaulsen
Copy link
Member

@yosefe Please re-review.

If it's possible it'd be nice to PR back to v4.0.x tonight.
(FYI: @hppritcha)

/* generate all suitable btl modules */
for (unsigned i = 0 ; i < resource_count ; ++i) {
rc = mca_btl_uct_component_process_uct_md (resources + i, allowed_ifaces);
rc = mca_btl_uct_component_process_uct_md (resources + i, allowed_ifaces);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wierd indentation

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.


/* UCT bandwidth is in bytes/sec, BTL is in MB/sec */
#if UCT_API > UCT_VERSION(1, 7)
module->super.btl_bandwidth = (uint32_t) (MCA_BTL_UCT_TL_ATTR(tl, 0).bandwidth.dedicated / 1048576.0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to take both dedicated and shared bw into account
dedicated + shared/ppn

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

OpenUCX broke the UCT API again in v1.8. This commit updates
btl/uct to fix compilation with current OpenUCX master
(future v1.8). Further changes will likely be needed for
the final release.

Signed-off-by: Nathan Hjelm <[email protected]>
@hjelmn hjelmn force-pushed the fix_btl_uct_from_yet_another_unannounced_api_break_in_the_openucx_uct_layer branch from f506aa0 to 0ef5634 Compare September 27, 2019 19:34
@hjelmn
Copy link
Member Author

hjelmn commented Sep 27, 2019

@yosefe Fixed the issues you identified and added another minor bug-fix.

This commit fixes a crash that can occur if a transport
is usable but doesn't have zero-copy support. In this
case do not attempt to use zero-copy and set the max
send size off the bcopy limit.

Signed-off-by: Nathan Hjelm <[email protected]>
@hjelmn hjelmn force-pushed the fix_btl_uct_from_yet_another_unannounced_api_break_in_the_openucx_uct_layer branch from 0ef5634 to 8473a66 Compare September 28, 2019 00:26
@hjelmn
Copy link
Member Author

hjelmn commented Sep 28, 2019

Should be good to go.

@hjelmn hjelmn merged commit b1ef5a4 into open-mpi:master Oct 17, 2019
@hppritcha
Copy link
Member

@gpaulsen we need to cherry-pick this over to v4.0.x

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants