Memory patcher conflicts with OMPI and libfabric #8822

@rajachan

Not long ago we ported OMPI’s patcher code to libfabric so that its registration cache could get notifications about memory events. Libfabric also has a userfaultfd notifier, but it defaults to the patcher-based “memhooks” notifier because of the additional coverage it provides. Because patcher rewrites the intercepted calls with jump instructions, multiple patcher hooks cannot really be stacked on top of each other. When testing the OFI BTL with EFA, we observed silent data corruption with some benchmarks. We root-caused it to libfabric using stale registrations: it failed to invalidate cache entries once OMPI’s patcher had taken over the hooks.
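To make the failure mode concrete, here is a toy model of it, not the actual patcher code. It assumes the patched entry point behaves like a single slot that each installer overwrites, so whoever patches last receives the events. All `toy_*` names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model: the patched entry point is a single slot, loosely like a
 * jump written over a function prologue. Hypothetical names throughout. */
typedef void (*toy_unmap_hook_fn)(void *addr, size_t len);

static toy_unmap_hook_fn toy_patched_target = NULL;

/* Installing a patch overwrites the previous jump; nothing is chained. */
static void toy_patch_munmap(toy_unmap_hook_fn hook)
{
    toy_patched_target = hook;
}

/* One cached registration per library, reduced to a validity flag. */
static bool toy_libfabric_entry_valid = true;
static bool toy_ompi_entry_valid = true;

static void toy_libfabric_hook(void *addr, size_t len)
{
    (void)addr; (void)len;
    toy_libfabric_entry_valid = false;   /* invalidate on unmap */
}

static void toy_ompi_hook(void *addr, size_t len)
{
    (void)addr; (void)len;
    toy_ompi_entry_valid = false;        /* invalidate on unmap */
}

/* The "patched" munmap only reaches whichever hook was installed last. */
static void toy_munmap(void *addr, size_t len)
{
    if (toy_patched_target != NULL) {
        toy_patched_target(addr, len);
    }
}

/* Returns true when the first installer is left with a stale entry. */
static bool toy_stale_after_stacking(void)
{
    static char buf[64];

    toy_libfabric_entry_valid = true;
    toy_ompi_entry_valid = true;
    toy_patch_munmap(toy_libfabric_hook);  /* libfabric patches first */
    toy_patch_munmap(toy_ompi_hook);       /* OMPI takes over the hook */
    toy_munmap(buf, sizeof buf);
    return toy_libfabric_entry_valid && !toy_ompi_entry_valid;
}
```

In this model the second install silently replaces the first, which matches the stale-registration symptom described above.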

UCX has a mechanism to take in external events from applications, ucm_set_external_event(). OMPI uses this mechanism in the UCX PML to invoke ucm_vm_munmap() from an unmap callback based on its internal memory hook events. We can achieve a similar workflow with libfabric's FI_MR_MMU_NOTIFY mode. While this mode was designed with a different use case in mind (allowing registrations that are not backed by physical pages), we should be able to use it in conjunction with fi_mr_refresh() to take external events from applications like OMPI. To be specific, libfabric providers can make FI_MR_MMU_NOTIFY a soft requirement. If an application does not support it, the provider can continue using memhooks as is. If an application like OMPI does support that mode, the provider can rely on fi_mr_refresh() notifications in place of the internal monitor (or perhaps in addition to userfaultfd) to determine when to evict an entry from the cache.
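A rough sketch of the provider-side decision, as a toy cache rather than real libfabric code. It assumes the provider picks its event source once, based on whether the application advertised the notify mode, and that an external notification evicts every overlapping entry. The `toy_*` types and functions are hypothetical stand-ins, not libfabric APIs.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_CACHE_SLOTS 4

/* Which source the toy provider trusts for unmap events. */
enum toy_monitor { TOY_MONITOR_MEMHOOKS, TOY_MONITOR_EXTERNAL };

struct toy_mr_entry {
    uintptr_t base;
    size_t len;
    bool valid;
};

struct toy_mr_cache {
    enum toy_monitor monitor;
    struct toy_mr_entry entries[TOY_CACHE_SLOTS];
    size_t count;
};

/* Soft requirement: fall back to memhooks if the app did not opt in. */
static void toy_cache_init(struct toy_mr_cache *c, bool app_sets_mmu_notify)
{
    c->monitor = app_sets_mmu_notify ? TOY_MONITOR_EXTERNAL
                                     : TOY_MONITOR_MEMHOOKS;
    c->count = 0;
}

static bool toy_cache_insert(struct toy_mr_cache *c, uintptr_t base, size_t len)
{
    if (c->count == TOY_CACHE_SLOTS) {
        return false;
    }
    c->entries[c->count++] =
        (struct toy_mr_entry){ .base = base, .len = len, .valid = true };
    return true;
}

/* Stand-in for the provider side of a refresh notification: evict every
 * valid entry overlapping [base, base + len). Returns the number evicted.
 * Ignored when the internal memhooks monitor is the event source. */
static size_t toy_cache_external_unmap(struct toy_mr_cache *c,
                                       uintptr_t base, size_t len)
{
    size_t evicted = 0;

    if (c->monitor != TOY_MONITOR_EXTERNAL) {
        return 0;
    }
    for (size_t i = 0; i < c->count; i++) {
        struct toy_mr_entry *e = &c->entries[i];
        if (e->valid && base < e->base + e->len && e->base < base + len) {
            e->valid = false;
            evicted++;
        }
    }
    return evicted;
}
```

Whether a real provider would ignore external events in memhooks mode, or accept both, is a design choice this sketch does not settle.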

On the Open MPI side, the provider query logic will have to set FI_MR_MMU_NOTIFY mode, and the OFI components will have to call fi_mr_refresh() on unmap events. Rcache uses the OPAL memory hooks, so it registers a callback to react to memory events (mca_rcache_base_mem_cb). For what the OFI components need, we can register another OFI-specific callback from the common OFI code used by both the BTL and the MTL (which will eventually want to use a cache for CUDA buffers given the cost of querying CUDA buffer attributes). This OFI-specific callback can then pass the notification directly to libfabric via fi_mr_refresh(). This will all be provider-agnostic.
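A minimal sketch of what such a callback could look like, using stand-ins so it compiles on its own. The callback arguments (buffer, length, callback data, from-alloc flag) are assumed to resemble an OPAL release hook, and `toy_mr_refresh()` is a hypothetical stand-in that only mirrors my understanding of the fi_mr_refresh() argument shape (registration, iovec array, count, flags). Which registration(s) to refresh is left open here; the sketch assumes a single one carried in the callback data.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/* Hypothetical stand-in for a libfabric memory registration handle. */
struct toy_mr {
    struct iovec last_iov;   /* last range reported to the "provider" */
    size_t refresh_calls;    /* how many notifications were forwarded */
};

/* Stand-in for fi_mr_refresh(): records the range instead of talking to
 * a provider. The argument shape is an assumption, not the real API. */
static int toy_mr_refresh(struct toy_mr *mr, const struct iovec *iov,
                          size_t count, uint64_t flags)
{
    (void)flags;
    if (mr == NULL || iov == NULL || count != 1) {
        return -1;
    }
    mr->last_iov = iov[0];
    mr->refresh_calls++;
    return 0;
}

/* Hypothetical OFI-common release callback: translate the unmap event
 * into an iovec and hand it straight to the refresh call. */
static void toy_ofi_mem_release_cb(void *buf, size_t length,
                                   void *cbdata, bool from_alloc)
{
    struct toy_mr *mr = cbdata;
    struct iovec iov = { .iov_base = buf, .iov_len = length };

    (void)from_alloc;
    (void)toy_mr_refresh(mr, &iov, 1, 0);
}
```

In real code the callback would be registered once from the common OFI code, and the stand-in would be replaced by the actual libfabric call on the affected registrations.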

@open-mpi/ofi @hppritcha @shefty, thoughts?
