Description
Digging further through the UCX osc code, I noticed what appear to be non-atomic accesses to variables that may be accessed concurrently by multiple threads, without proper locking. The following two instances look suspicious in master and v4.0.x:
master:
mca_osc_ucx_component.num_incomplete_req_ops is incremented in ompi_osc_ucx_rget and ompi_osc_ucx_rput and decremented in req_completion. The counter seems to be only used in OMPI_OSC_UCX_REQUEST_ALLOC to trigger progress if deemed necessary. I'm not sure exactly what impact the race condition might have though.
v4.0.x:
Similarly, module->global_ops_num and module->per_target_ops_nums[target] are incremented and decremented in at least incr_and_check_ops_num() and ompi_osc_ucx_flush() without mutual exclusion. Again, these counters seem to be used only to trigger a flush if the number of outstanding ops grows too large (1M afaics), so the impact may be limited.
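For illustration, here is a minimal sketch of one way such a counter could be made safe, using C11 atomics and pthreads. This is not the Open MPI code: outstanding_ops, op_started, op_completed, worker, run_demo and FLUSH_THRESHOLD are all hypothetical names, the threshold value is only a placeholder, and the real code base may well prefer its own atomic wrappers or an existing lock instead.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a shared counter of outstanding operations. */
atomic_long outstanding_ops = 0;

/* Placeholder threshold (assumption, not taken from the actual code). */
#define FLUSH_THRESHOLD 1000000L

/* Called where an operation is issued. Returns 1 if the caller should
 * trigger progress or a flush because too many ops are outstanding. */
int op_started(void)
{
    /* atomic_fetch_add returns the value held before the increment.
     * Relaxed ordering only guarantees the update itself is atomic; it
     * does not order accesses to any other memory. */
    long before = atomic_fetch_add_explicit(&outstanding_ops, 1,
                                            memory_order_relaxed);
    return (before + 1) >= FLUSH_THRESHOLD;
}

/* Called from the completion path (possibly on another thread). */
void op_completed(void)
{
    atomic_fetch_sub_explicit(&outstanding_ops, 1, memory_order_relaxed);
}

/* Each worker issues and completes `iters` operations. */
void *worker(void *arg)
{
    long iters = *(long *)arg;
    for (long i = 0; i < iters; i++) {
        (void)op_started();
        op_completed();
    }
    return NULL;
}

/* Runs `nthreads` workers concurrently and returns the final counter
 * value, which is 0 because every increment is paired with a decrement
 * and no update can be lost. */
long run_demo(int nthreads, long iters)
{
    pthread_t tids[16];
    assert(nthreads > 0 && nthreads <= 16);
    for (int i = 0; i < nthreads; i++) {
        int rc = pthread_create(&tids[i], NULL, worker, &iters);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < nthreads; i++) {
        pthread_join(tids[i], NULL);
    }
    return atomic_load(&outstanding_ops);
}
```

With a plain long and ++/-- in place of the atomic operations, concurrent read-modify-write sequences could overwrite each other, so the counter could drift away from the true number of outstanding ops. Calling run_demo(4, 100000) with the atomic version leaves the counter at 0.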