Skip to content

Conversation

@tvegas1
Copy link
Contributor

@tvegas1 tvegas1 commented May 28, 2024

cherry-pick #12577

What

The UCX PML needs to differentiate between mpi serialized and single thread mode according to MPI standard.

Why

Using UCX with multiple threads, even when providing mutual exclusion, is not completely equivalent to only using UCX with one same thread.

To guarantee ordering and avoid corruption, some UCX transports need the cpu to flush stores to the devices, before another thread can progress the worker. For latency reasons, this needed flush is only performed in non single thread mode.

The UCX PML needs to differentiate between mpi serialized and single thread
mode according to MPI standard.

Using UCX with multiple threads, even when providing mutual exclusion, is not
completely equivalent to only using UCX with one same thread.

Signed-off-by: Thomas Vegas <[email protected]>
(cherry picked from commit b33b61a)
@github-actions github-actions bot added this to the v5.0.4 milestone May 28, 2024
@tvegas1
Copy link
Contributor Author

tvegas1 commented May 28, 2024

Shall we also backport it to v4.1.x?

case MPI_THREAD_FUNNELED:
case MPI_THREAD_SINGLE:
default:
return UCS_THREAD_MODE_SINGLE;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

default should be multi, or at least produce a warning

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. This should indeed be moved in UCX common (opal/mca/common/ucx/common_ucx_wpool.c seems like a good place) and reused in all the UCX (and maybe UCC) initializations.

For MPI:

  • ompi/mca/pml/ucx/pml_ucx.c
  • ompi/mca/coll/ucc/coll_ucc_module.c

For OSHMEM:

  • oshmem/mca/spml/ucx/spml_ucx.c
  • oshmem/mca/spml/ucx/spml_ucx_component.c
  • oshmem/mca/scoll/ucc/scoll_ucc_module.c

return OMPI_SUCCESS;
}

static ucs_thread_mode_t mca_pml_ucx_thread_mode(int ompi_mode)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move the function on opal/mca/common/ucx and reuse it also for osc/ucx

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems osc/ucx is in v4.1.x (thanks for noting @brminich), will add to patch when backporting

@tvegas1
Copy link
Contributor Author

tvegas1 commented May 28, 2024

Based on @yosefe and @bosilca comments, I will close this PR, address them with new PR on main branch, and backport the final patches in v5.0.x and v4.1.x.

I will move added function with added warning to opal/mca/common/ucx/common_ucx{.c,.h}. oshmem and ucc use their own enums so i don't think we can reuse the function.

Suggested files:

  • common_ucx_wpool.c: either single/multi: no change since it's one worker on one thread
  • osc/ucx: no ucp worker creation found: no change (actually change seems to be in v4.1.x)
  • coll/scoll ucc initializations: UCC has it's own defines and only uses single/multi: no change
  • oshmem/mca/spml/ucx/spml_ucx.c: will add serialized

Please let me know otherwise.

@tvegas1
Copy link
Contributor Author

tvegas1 commented Jun 7, 2024

superseded by #12608

@tvegas1 tvegas1 closed this Jun 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants