-
Notifications
You must be signed in to change notification settings - Fork 934
oshmem: Add symmetric remote key handling code #11893
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Hello! The Git Commit Checker CI bot found a few problems with this PR: 051134c: oshmem: Don't deduplicate rkeys of local node
3a1284a: oshmem: Add actual deduplication calls
f287afb: oshmem: Add storage structure
3d3faba: oshmem: Add symmetric key environment variable
Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks! |
| mem_map_params.flags = flags; | ||
|
|
||
| if (spml->symmetric_rkey) { | ||
| mem_map_params.flags |= UCP_MEM_MAP_SYMMETRIC_RKEY; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls add configure check - it's a new flag
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added configure and wrapper function to have flag in one place.
oshmem/mca/spml/ucx/spml_ucx.c
Outdated
| ucs_status_t status; | ||
|
|
||
| for (i = 0, end = store->count; i < end;) { | ||
| m = (i + end) / 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe use "bsearch" from glibc?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bsearch returns NULL when not found, but here instead we return the position where the item can be inserted (OSHMEM_ERR_NOT_FOUND), so i think is simpler as it is, wdyt?
oshmem/mca/spml/ucx/spml_ucx.c
Outdated
| store->size = size; | ||
| store->count = 0; | ||
| store->ucp_context = ucp_context; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: can align by =, since the first 2 rows are already aligned
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
oshmem/mca/spml/ucx/spml_ucx.h
Outdated
| extern ucp_rkey_h mca_spml_ucx_rkey_store_get(mca_spml_ucx_rkey_store_t *store, const ucp_rkey_h target); | ||
| extern void mca_spml_ucx_rkey_store_put(mca_spml_ucx_rkey_store_t *store, const ucp_rkey_h target); | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'd remove it and make these funcs static, as they are only used in a single file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
kept only first two which are needed for init
|
|
||
| if (mca_spml_ucx.symmetric_rkey) { | ||
| ret = mca_spml_ucx_rkey_store_create(&mca_spml_ucx.rkey_store, | ||
| mca_spml_ucx.ucp_context, 1000); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe make the size configurable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed, passing the added parameter symmetric_rkey_size
|
After review, please squash the "Address review commits" appropriately into the other commits on this PR; thanks. |
ca392e2 to
d151cf5
Compare
Dispatched various changes appropriately in existing commits and force-pushed branch. |
oshmem/mca/spml/ucx/spml_ucx.c
Outdated
| mca_spml_ucx_rkey_t *entry; | ||
| int ret, i; | ||
|
|
||
| if (store->size == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should not we check count here instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed for clarity as the check is mainly expected to handle when feature is not enabled.
d151cf5 to
b4320d4
Compare
b4320d4 to
17611b8
Compare
17611b8 to
1ca142e
Compare
oshmem/mca/spml/ucx/spml_ucx.h
Outdated
| mca_spml_ucx_rkey_store_t rkey_store; | ||
| bool symmetric_rkey; | ||
| int symmetric_rkey_size; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to maintain it per mca_spml_ucx_ctx_t (i.e. per worker)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am assuming we only care about first context (in practice only one shmem_ctx_create() call), and we do not want to create symmetric store for aux_ctx or other instances.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually, we want to create symmetric store for all contexts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
| mca_spml_ucx_rkey_store_destroy(&ctx->rkey_store); | ||
| ucp_worker_destroy(ctx->ucp_worker[0]); | ||
| free(ctx->ucp_worker); | ||
| free(ctx); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe add common function for this? also use it for line 531-533,540
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
c32cd16 to
3b46e7a
Compare
yosefe
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls squash and rebase as described in https://github.com/openucx/ucx/wiki/Guidance-for-contributors#squashing-a-pr-with-merge-commits-in-history
At very high scale, having each rank storing each other rank's remote keys for each segment can lead to high memory consumption. We activate symmetric remote key option to generate remote keys that will be deduplicated and then used interchangeably. Signed-off-by: Thomas Vegas <[email protected]>
At very high scale, having each rank storing each other rank's remote keys for each segment can lead to high memory consumption. We activate symmetric remote key option to generate remote keys that will be deduplicated and then used interchangeably.
This adds: