Description
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
main only
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Built from source with CUDA support, with the accelerator component built as a DSO:
./configure ... --with-cuda=/usr/local/cuda --with-cuda-libdir=/usr/local/cuda/lib64/stubs --enable-mca-dso=btl-smcuda,rcache-rgpusm,rcache-gpusm,accelerator-cuda ...
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.
9095457 3rd-party/openpmix (v1.1.3-3932-g9095457b)
4676a3cb8f7eabde919f19bf70b1d211a79c2b6d 3rd-party/prrte (psrvr-v2.0.0rc1-4715-g4676a3cb8f)
c1cfc910d92af43f8c27807a9a84c9c13f4fbc65 config/oac (heads/main)
Please describe the system on which you are running
- Operating system/version: Amazon Linux 2, RHEL 8/9, Ubuntu
- Computer hardware: EC2 hpc6a.48xlarge
- Network type: EFA
Details of the problem
When I run an application with a high rank count per node, e.g. --map-by ppr:96:node, I get a segfault:
[ip-172-31-16-16:73623] *** Process received signal ***
[ip-172-31-16-16:73623] Signal: Segmentation fault (11)
[ip-172-31-16-16:73623] Signal code: Address not mapped (1)
[ip-172-31-16-16:73623] Failing at address: 0x7f7f60fddb3f
[ip-172-31-16-16.us-east-2.compute.internal:73185] PMIX ERROR: PMIX_ERR_UNREACH in file base/ptl_base_connection_hdlr.c at line 396
prterun: pmix_list.c:62: pmix_list_item_destruct: Assertion `0 == item->pmix_list_item_refcount' failed.
dmesg shows that the segfault comes from CUDA:
[79590.726378] cuda00001400006[73833]: segfault at 7f7f60fddb3f ip 00007f7f61fa7407 sp 00007f7f60d23eb0 error 4 in libgcc_s-7-20180712.so.1[7f7f61f99000+15000]
[79590.734804] Code: bb 0c 00 00 00 e9 f2 fe ff ff 40 80 ff 08 75 9d 80 78 01 00 75 97 0f b6 78 02 48 83 c0 02 e9 17 fd ff ff 49 8b 85 98 00 00 00 <80> 38 48 0f 85 67 fe ff ff 48 ba c7 c0 0f 00 00 00 0f 05 48 39 50
I ran git bisect, which identified this change: https://github.com/open-mpi/ompi/pull/11617/files
I added a call after cuInit, and when the segfault happens I only see some ranks get past that point, so either
- cuInit panicked, or
- The accelerator component dlopen failed for some reason and never reached cuInit (not sure how this can happen)
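One way to tell these two cases apart outside of Open MPI is a small standalone probe that does the dlopen and the cuInit call as separate, individually reported steps. The following is only a sketch under assumptions, not what the accelerator component actually does: the function name probe_cu_init and the PROBE_* codes are hypothetical, libcuda.so.1 is assumed to be the driver library name on the node, and cuInit is declared by hand as returning an int instead of including cuda.h.

```c
#include <dlfcn.h>
#include <stdio.h>

/* Hypothetical return codes. They are negative so they cannot collide
 * with CUresult values, which are non-negative (0 is CUDA_SUCCESS). */
#define PROBE_DLOPEN_FAILED (-1)
#define PROBE_DLSYM_FAILED  (-2)

/* Hand-written stand-in for: CUresult cuInit(unsigned int Flags); */
typedef int (*cu_init_fn)(unsigned int);

/* Load `libname`, look up cuInit and call it, reporting each step. */
int probe_cu_init(const char *libname)
{
    void *handle = dlopen(libname, RTLD_NOW | RTLD_LOCAL);
    if (handle == NULL) {
        const char *err = dlerror();
        fprintf(stderr, "dlopen(%s) failed: %s\n", libname,
                err != NULL ? err : "unknown error");
        return PROBE_DLOPEN_FAILED;
    }

    /* POSIX idiom for storing the void* from dlsym in a function pointer. */
    cu_init_fn cu_init = NULL;
    *(void **)(&cu_init) = dlsym(handle, "cuInit");
    if (cu_init == NULL) {
        const char *err = dlerror();
        fprintf(stderr, "dlsym(cuInit) failed: %s\n",
                err != NULL ? err : "unknown error");
        dlclose(handle);
        return PROBE_DLSYM_FAILED;
    }

    int rc = cu_init(0);
    fprintf(stderr, "cuInit returned %d\n", rc);
    /* The handle is deliberately left open: the driver stays loaded. */
    return rc;
}
```

Calling probe_cu_init("libcuda.so.1") from a trivial main and launching that binary with the same --map-by ppr:96:node setting would show, per rank, whether the failure is reported at dlopen, at dlsym, or as an error code from cuInit. A crash inside cuInit itself would still kill the probe, but the missing "cuInit returned" line would then point at cuInit and not at dlopen. On older glibc versions the program needs to be linked with -ldl.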
Note: I can mitigate the issue by removing --enable-mca-dso=btl-smcuda,rcache-rgpusm,rcache-gpusm,accelerator-cuda. So it is likely related to DSO and dlopen.