Description
Running the OSU one-sided benchmarks with device-to-device (D2D) buffers produces errors. Creating this ticket to help track this down.
The test fails at different message sizes, but it fails consistently at 32k (see details below).
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
- openmpi 5.0.0rc11 + ofi: NIC selection update #11565 patch (also saw an issue with main at da6d715)
- rocm 5.3.0
- openmpi 5.0.0rc11 tag
7d4539377a95c9f00e8b6ad43b359850924cf016 3rd-party/openpmix (v4.2.4rc1-1-g7d453937)
20ee75229f68f3d73004a5432b90edeffb8d281e 3rd-party/prrte (v3.0.1rc2-9-g20ee75229f)
3b9e6e346d6d807ca820fbfb5e86734136f4889b config/oac (heads/main)
- osu-micro-benchmarks-7.0.1
Please describe the system on which you are running
- Operating system/version: linux
- Computer hardware:
- Network type:
Details of the problem
Things fail in different ways, but I could consistently reproduce one issue with the following setup.
frontier10235: $ env | grep OMPI_MCA
OMPI_MCA_opal_common_ofi_provider_include=shm
OMPI_MCA_pml=^ucx
OMPI_MCA_mtl=^ofi
OMPI_MCA_btl=^tcp,ofi,openib
frontier10235: $ mpirun --np 2 --map-by ppr:1:l3cache --bind-to core ./osu_put_bibw -m 1:32768 -d rocm D D
PID: 121613
PID: 121614
# OSU MPI_Put-ROCM Bi-directional Bandwidth Test v7.0
# Window creation: MPI_Win_allocate
# Synchronization: MPI_Win_post/start/complete/wait
# Rank 0 Memory on DEVICE (D) and Rank 1 Memory on DEVICE (D)
# Size Bandwidth (MB/s)
1 1.73
2 3.40
4 6.78
8 13.34
16 27.02
32 53.78
64 101.80
128 201.56
256 415.30
512 747.69
1024 1443.92
2048 2427.74
4096 3854.74
8192 5285.93
16384 6715.70
...CRASHES AT 32K... see backtrace below...
Thread 1 "osu_put_bibw" received signal SIGBUS, Bus error.
0x00007fffeab1650e in __cray_memcpy_ROME () from /opt/cray/pe/cce/15.0.0/cce/x86_64/lib/libu.so.1
(gdb) bt
#0 0x00007fffeab1650e in __cray_memcpy_ROME () from /opt/cray/pe/cce/15.0.0/cce/x86_64/lib/libu.so.1
#1 0x00007fffe11c582b in mca_smsc_xpmem_memmove (dst=0x7ff78e600000, src=0x7ff78e200000, size=32768)
at ../../../../../../../../source/openmpi-tag-v5.0.0rc11+local/opal/mca/smsc/xpmem/smsc_xpmem_module.c:259
#2 0x00007fffe11c58aa in mca_smsc_xpmem_copy_from (endpoint=0x9e6ea0, local_address=0x7ff78e600000,
remote_address=0x7ff7acc00000, size=32768, reg_handle=0xe36970)
at ../../../../../../../../source/openmpi-tag-v5.0.0rc11+local/opal/mca/smsc/xpmem/smsc_xpmem_module.c:291
#3 0x00007fffe11a05c2 in mca_btl_sm_get (btl=0x7fffe11d3750 <mca_btl_sm>, endpoint=0x9e5680,
local_address=0x7ff78e600000, remote_address=140701731913728, local_handle=0xe36870,
remote_handle=0xe36970, size=32768, flags=0, order=255,
cbfunc=0x7fffe11851d0 <am_rdma_rdma_complete>, cbcontext=0xe367e0, cbdata=0x0)
at ../../../../../../../../source/openmpi-tag-v5.0.0rc11+local/opal/mca/btl/sm/btl_sm_get.c:40
#4 0x00007fffe1184c05 in am_rdma_target_put (btl=0x7fffe11d3750 <mca_btl_sm>, endpoint=0x9e5680,
descriptor=0x7fffffff5788, segments=0x7fffffff5810, segment_count=1, target_address=0x7ff78e600000,
hdr=0x7ff797b5b030, operation=0x7fffffff5780)
at ../../../../../../../source/openmpi-tag-v5.0.0rc11+local/opal/mca/btl/base/btl_base_am_rdma.c:700
#5 0x00007fffe1185f9c in am_rdma_process_rdma (btl=0x7fffe11d3750 <mca_btl_sm>, desc=0x7fffffff57e8)
at ../../../../../../../source/openmpi-tag-v5.0.0rc11+local/opal/mca/btl/base/btl_base_am_rdma.c:976
#6 0x00007fffe119b365 in mca_btl_sm_poll_handle_frag (hdr=0x7ff797b5b000, endpoint=0x9e5680)
at ../../../../../../../../source/openmpi-tag-v5.0.0rc11+local/opal/mca/btl/sm/btl_sm_component.c:452
#7 0x00007fffe119c049 in mca_btl_sm_check_fboxes ()
at ../../../../../../../../source/openmpi-tag-v5.0.0rc11+local/opal/mca/btl/sm/btl_sm_fbox.h:283
#8 0x00007fffe119b143 in mca_btl_sm_component_progress ()
at ../../../../../../../../source/openmpi-tag-v5.0.0rc11+local/opal/mca/btl/sm/btl_sm_component.c:553
#9 0x00007fffe10e09d9 in opal_progress ()
at ../../../../../source/openmpi-tag-v5.0.0rc11+local/opal/runtime/opal_progress.c:224
#10 0x00007fffeb7c24c6 in ompi_osc_rdma_sync_rdma_complete (sync=0xe9b4c0)
at ../../../../../../../../source/openmpi-tag-v5.0.0rc11+local/ompi/mca/osc/rdma/osc_rdma.h:627
#11 0x00007fffeb7c2304 in ompi_osc_rdma_complete_atomic (win=0xe848c0)
at ../../../../../../../../source/openmpi-tag-v5.0.0rc11+local/ompi/mca/osc/rdma/osc_rdma_active_target.c:460
#12 0x00007fffeb5939e0 in PMPI_Win_complete (win=0xe848c0)
at ../../../../../../../source/openmpi-tag-v5.0.0rc11+local/ompi/mpi/c/win_complete.c:52
#13 0x0000000000209c38 in run_put_with_pscw ()
#14 0x0000000000208cd1 in main ()
(gdb)
The backtrace from ompi-main @ da6d715 looks a bit different, so adding it to the ticket for reference.
- openmpi main
shell:$ git submodule status
10fe4735ee374f5807c2160e61274c4aa53491ae 3rd-party/openpmix (v1.1.3-3847-g10fe4735)
d8bd12b3ffda4af6918d641f024a6b0118789700 3rd-party/prrte (psrvr-v2.0.0rc1-4624-gd8bd12b3ff)
c1cfc910d92af43f8c27807a9a84c9c13f4fbc65 config/oac (remotes/origin/HEAD)
(gdb) bt
#0 0x00007fffead3250e in __cray_memcpy_ROME ()
from /opt/cray/pe/cce/15.0.0/cce/x86_64/lib/libu.so.1
#1 0x00007fffe13e42fc in mca_smsc_xpmem_copy_from ()
from /sw/frontier/ompix/DEVELOP/cce/15.0.0/install/openmpi-br-main.tjn-20230426/lib/libopen-pal.so.0
#2 0x00007fffe13d1bbd in mca_btl_sm_get ()
from /sw/frontier/ompix/DEVELOP/cce/15.0.0/install/openmpi-br-main.tjn-20230426/lib/libopen-pal.so.0
#3 0x00007fffe13c3353 in am_rdma_process_rdma ()
from /sw/frontier/ompix/DEVELOP/cce/15.0.0/install/openmpi-br-main.tjn-20230426/lib/libopen-pal.so.0
#4 0x00007fffe13cf9d7 in mca_btl_sm_component_progress ()
from /sw/frontier/ompix/DEVELOP/cce/15.0.0/install/openmpi-br-main.tjn-20230426/lib/libopen-pal.so.0
#5 0x00007fffe133f37d in opal_progress ()
from /sw/frontier/ompix/DEVELOP/cce/15.0.0/install/openmpi-br-main.tjn-20230426/lib/libopen-pal.so.0
#6 0x00007fffeb7f6ca5 in ompi_osc_rdma_complete_atomic ()
from /sw/frontier/ompix/DEVELOP/cce/15.0.0/install/openmpi-br-main.tjn-20230426/lib/libmpi.so.0
#7 0x00007fffeb6b39b5 in PMPI_Win_complete ()
from /sw/frontier/ompix/DEVELOP/cce/15.0.0/install/openmpi-br-main.tjn-20230426/lib/libmpi.so.0
#8 0x000000000020a13d in run_put_with_pscw ()
#9 0x0000000000208cc1 in main ()
(gdb)
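For convenience, the reproduction setup described above can be collected into one small script. This is only a sketch: the MCA settings and the benchmark command line are copied from the transcript in this report, while the `RUN_REPRO` switch is a hypothetical variable introduced here so that the script prints the command by default instead of launching it. It assumes `mpirun` and `./osu_put_bibw` are available in the current directory and environment, as in the original run.

```shell
#!/bin/sh
# Sketch of the reproduction setup from this report.

# MCA settings copied from the `env | grep OMPI_MCA` output in the report.
export OMPI_MCA_opal_common_ofi_provider_include=shm
export OMPI_MCA_pml='^ucx'
export OMPI_MCA_mtl='^ofi'
export OMPI_MCA_btl='^tcp,ofi,openib'

# Benchmark command line, copied verbatim from the report.
REPRO_CMD='mpirun --np 2 --map-by ppr:1:l3cache --bind-to core ./osu_put_bibw -m 1:32768 -d rocm D D'

# RUN_REPRO is a hypothetical switch used only in this sketch:
# by default the command is printed; set RUN_REPRO=1 to execute it.
if [ "${RUN_REPRO:-0}" = "1" ]; then
    $REPRO_CMD
else
    echo "$REPRO_CMD"
fi
```

Running the script with `RUN_REPRO=1` should launch the same two-rank `osu_put_bibw` run shown in the transcript; without it, the script only echoes the command so the settings can be checked first.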