
UCX osc thread-safety concerns #6941

@devreal


Digging further through the UCX osc code, I noticed what appear to be non-atomic accesses to variables that may be accessed concurrently by multiple threads without proper locking. The following two instances seem suspicious in master and v4.0.x:

master:
mca_osc_ucx_component.num_incomplete_req_ops is incremented in ompi_osc_ucx_rget and ompi_osc_ucx_rput and decremented in req_completion. The counter seems to be used only in OMPI_OSC_UCX_REQUEST_ALLOC to trigger progress if deemed necessary. I'm not sure exactly what impact the race condition might have, though.
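
To illustrate the pattern (this is a sketch, not the actual osc/ucx code), here is a shared counter that several threads increment and decrement, made safe with C11 atomics. The names incomplete_ops, worker, and run_counter_demo are hypothetical, and C11 <stdatomic.h> is only a stand-in for whatever atomic primitives the code base would prefer. With a plain int and ++/--, the final value may drift away from zero because concurrent read-modify-write sequences can lose updates.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a shared "incomplete request ops" counter. */
static atomic_int incomplete_ops;

struct worker_arg {
    int iterations;
};

/* Each thread mimics "start an op" (increment) followed by
 * "op completed" (decrement). Both are atomic read-modify-write
 * operations, so no update can be lost. */
static void *worker(void *p)
{
    const struct worker_arg *arg = p;
    for (int i = 0; i < arg->iterations; i++) {
        atomic_fetch_add_explicit(&incomplete_ops, 1, memory_order_relaxed);
        atomic_fetch_sub_explicit(&incomplete_ops, 1, memory_order_relaxed);
    }
    return NULL;
}

/* Runs nthreads workers and returns the final counter value,
 * or -1 if the arguments are invalid or a thread could not be created. */
int run_counter_demo(int nthreads, int iterations)
{
    enum { MAX_THREADS = 64 };
    pthread_t tids[MAX_THREADS];
    struct worker_arg arg = { iterations };
    int started = 0;

    if (nthreads < 1 || nthreads > MAX_THREADS) {
        return -1;
    }
    atomic_store(&incomplete_ops, 0);

    for (; started < nthreads; started++) {
        if (pthread_create(&tids[started], NULL, worker, &arg) != 0) {
            break;
        }
    }
    for (int i = 0; i < started; i++) {
        pthread_join(tids[i], NULL);
    }
    return started == nthreads ? atomic_load(&incomplete_ops) : -1;
}
```

Relaxed ordering is enough here under the assumption that the counter is only a heuristic for triggering progress and does not publish other data; if other state depended on it, stronger ordering would be needed.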

v4.0.x:
Similarly, module->global_ops_num and module->per_target_ops_nums[target] are incremented and decremented in at least incr_and_check_ops_num() and ompi_osc_ucx_flush() without mutual exclusion. Again, these counters seem to be used only to trigger a flush once the number of outstanding ops grows too large (1M as far as I can see), so the impact may be limited.
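
Because two related counters are updated together here, a mutex is one possible way to keep them consistent. The sketch below is only an assumption about how such bookkeeping could be guarded: struct ops_tracker and its functions are hypothetical and do not mirror the real module structure, and the flush behavior (subtracting a target's count from the global count) is assumed for illustration.

```c
#include <pthread.h>
#include <stdbool.h>

#define NUM_TARGETS 4 /* hypothetical number of targets for the sketch */

/* Hypothetical bookkeeping: one global counter plus one per target,
 * all protected by a single lock so they stay mutually consistent. */
struct ops_tracker {
    pthread_mutex_t lock;
    long global_ops;
    long per_target_ops[NUM_TARGETS];
};

void ops_tracker_init(struct ops_tracker *t)
{
    pthread_mutex_init(&t->lock, NULL);
    t->global_ops = 0;
    for (int i = 0; i < NUM_TARGETS; i++) {
        t->per_target_ops[i] = 0;
    }
}

/* Records one outstanding op for `target`. Returns true when the global
 * count has reached `threshold`, i.e. the caller should flush. */
bool ops_tracker_incr(struct ops_tracker *t, int target, long threshold)
{
    bool need_flush;

    pthread_mutex_lock(&t->lock);
    t->global_ops++;
    t->per_target_ops[target]++;
    need_flush = t->global_ops >= threshold;
    pthread_mutex_unlock(&t->lock);
    return need_flush;
}

/* Called after a flush of `target` completed: its outstanding ops are
 * removed from both counters in one critical section. */
void ops_tracker_flushed(struct ops_tracker *t, int target)
{
    pthread_mutex_lock(&t->lock);
    t->global_ops -= t->per_target_ops[target];
    t->per_target_ops[target] = 0;
    pthread_mutex_unlock(&t->lock);
}
```

The flush itself is deliberately left outside the critical section so that the lock is not held across a potentially long-running operation; whether that is acceptable depends on how the real flush path interacts with concurrent increments.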
