osu one-sided failure with vader and rocm #11618

@naughtont3

Description

Running the OSU one-sided benchmarks produced errors when testing device-to-device (D2D) transfers. Creating this ticket to help track this down.

The test fails at different message sizes, but it fails consistently at 32k (see details below).

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

7d4539377a95c9f00e8b6ad43b359850924cf016 3rd-party/openpmix (v4.2.4rc1-1-g7d453937)
20ee75229f68f3d73004a5432b90edeffb8d281e 3rd-party/prrte (v3.0.1rc2-9-g20ee75229f)
3b9e6e346d6d807ca820fbfb5e86734136f4889b config/oac (heads/main)
  • osu-micro-benchmarks-7.0.1

Please describe the system on which you are running

  • Operating system/version: linux
  • Computer hardware:
  • Network type:

Details of the problem

Things fail in different ways, but I could consistently reproduce one issue with the following setup.

frontier10235: $ env | grep OMPI_MCA
OMPI_MCA_opal_common_ofi_provider_include=shm
OMPI_MCA_pml=^ucx
OMPI_MCA_mtl=^ofi
OMPI_MCA_btl=^tcp,ofi,openib
frontier10235: $ mpirun --np 2 --map-by ppr:1:l3cache --bind-to core  ./osu_put_bibw -m 1:32768 -d rocm D D
PID: 121613
PID: 121614

# OSU MPI_Put-ROCM Bi-directional Bandwidth Test v7.0
# Window creation: MPI_Win_allocate
# Synchronization: MPI_Win_post/start/complete/wait
# Rank 0 Memory on DEVICE (D) and Rank 1 Memory on DEVICE (D)
# Size      Bandwidth (MB/s)
1                       1.73
2                       3.40
4                       6.78
8                      13.34
16                     27.02
32                     53.78
64                    101.80
128                   201.56
256                   415.30
512                   747.69
1024                 1443.92
2048                 2427.74
4096                 3854.74
8192                 5285.93
16384                6715.70

 ...CRASHES AT 32K... see backtrace below...
Thread 1 "osu_put_bibw" received signal SIGBUS, Bus error.
0x00007fffeab1650e in __cray_memcpy_ROME () from /opt/cray/pe/cce/15.0.0/cce/x86_64/lib/libu.so.1
(gdb) bt
#0  0x00007fffeab1650e in __cray_memcpy_ROME () from /opt/cray/pe/cce/15.0.0/cce/x86_64/lib/libu.so.1
#1  0x00007fffe11c582b in mca_smsc_xpmem_memmove (dst=0x7ff78e600000, src=0x7ff78e200000, size=32768)
    at ../../../../../../../../source/openmpi-tag-v5.0.0rc11+local/opal/mca/smsc/xpmem/smsc_xpmem_module.c:259
#2  0x00007fffe11c58aa in mca_smsc_xpmem_copy_from (endpoint=0x9e6ea0, local_address=0x7ff78e600000, 
    remote_address=0x7ff7acc00000, size=32768, reg_handle=0xe36970)
    at ../../../../../../../../source/openmpi-tag-v5.0.0rc11+local/opal/mca/smsc/xpmem/smsc_xpmem_module.c:291
#3  0x00007fffe11a05c2 in mca_btl_sm_get (btl=0x7fffe11d3750 <mca_btl_sm>, endpoint=0x9e5680, 
    local_address=0x7ff78e600000, remote_address=140701731913728, local_handle=0xe36870, 
    remote_handle=0xe36970, size=32768, flags=0, order=255, 
    cbfunc=0x7fffe11851d0 <am_rdma_rdma_complete>, cbcontext=0xe367e0, cbdata=0x0)
    at ../../../../../../../../source/openmpi-tag-v5.0.0rc11+local/opal/mca/btl/sm/btl_sm_get.c:40
#4  0x00007fffe1184c05 in am_rdma_target_put (btl=0x7fffe11d3750 <mca_btl_sm>, endpoint=0x9e5680, 
    descriptor=0x7fffffff5788, segments=0x7fffffff5810, segment_count=1, target_address=0x7ff78e600000, 
    hdr=0x7ff797b5b030, operation=0x7fffffff5780)
    at ../../../../../../../source/openmpi-tag-v5.0.0rc11+local/opal/mca/btl/base/btl_base_am_rdma.c:700
#5  0x00007fffe1185f9c in am_rdma_process_rdma (btl=0x7fffe11d3750 <mca_btl_sm>, desc=0x7fffffff57e8)
    at ../../../../../../../source/openmpi-tag-v5.0.0rc11+local/opal/mca/btl/base/btl_base_am_rdma.c:976
#6  0x00007fffe119b365 in mca_btl_sm_poll_handle_frag (hdr=0x7ff797b5b000, endpoint=0x9e5680)
    at ../../../../../../../../source/openmpi-tag-v5.0.0rc11+local/opal/mca/btl/sm/btl_sm_component.c:452
#7  0x00007fffe119c049 in mca_btl_sm_check_fboxes ()
    at ../../../../../../../../source/openmpi-tag-v5.0.0rc11+local/opal/mca/btl/sm/btl_sm_fbox.h:283
#8  0x00007fffe119b143 in mca_btl_sm_component_progress ()
    at ../../../../../../../../source/openmpi-tag-v5.0.0rc11+local/opal/mca/btl/sm/btl_sm_component.c:553
#9  0x00007fffe10e09d9 in opal_progress ()
    at ../../../../../source/openmpi-tag-v5.0.0rc11+local/opal/runtime/opal_progress.c:224
#10 0x00007fffeb7c24c6 in ompi_osc_rdma_sync_rdma_complete (sync=0xe9b4c0)
    at ../../../../../../../../source/openmpi-tag-v5.0.0rc11+local/ompi/mca/osc/rdma/osc_rdma.h:627
#11 0x00007fffeb7c2304 in ompi_osc_rdma_complete_atomic (win=0xe848c0)
    at ../../../../../../../../source/openmpi-tag-v5.0.0rc11+local/ompi/mca/osc/rdma/osc_rdma_active_target.c:460
#12 0x00007fffeb5939e0 in PMPI_Win_complete (win=0xe848c0)
    at ../../../../../../../source/openmpi-tag-v5.0.0rc11+local/ompi/mpi/c/win_complete.c:52
#13 0x0000000000209c38 in run_put_with_pscw ()
#14 0x0000000000208cd1 in main ()
(gdb) 
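A note for reading the backtrace above: gdb prints `remote_address` in decimal in frame #3 (`mca_btl_sm_get`) and in hex in frame #2 (`mca_smsc_xpmem_copy_from`). The small check below is only a reading aid, not part of the original report; it converts the decimal value to hex to confirm that both frames refer to the same remote address. The variable names are made up for illustration.

```shell
#!/bin/sh
# remote_address as printed by gdb in frame #3 (decimal) and frame #2 (hex).
frame3_remote=140701731913728
frame2_remote=0x7ff7acc00000

# Render the decimal value in hex so the two frames can be compared by eye.
frame3_hex=$(printf '0x%x' "$frame3_remote")
echo "frame #3 remote_address = $frame3_hex"

if [ "$frame3_hex" = "$frame2_remote" ]; then
    echo "frames #2 and #3 refer to the same remote address"
fi
```

Similarly, `dst=0x7ff78e600000` in frame #1 matches the `local_address` passed down from frames #2 and #3.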

The backtrace from ompi-main @ da6d715 looks a bit different, so I am including it in the ticket for reference.

  • openmpi main
 shell:$ git submodule status
 10fe4735ee374f5807c2160e61274c4aa53491ae 3rd-party/openpmix (v1.1.3-3847-g10fe4735)
 d8bd12b3ffda4af6918d641f024a6b0118789700 3rd-party/prrte (psrvr-v2.0.0rc1-4624-gd8bd12b3ff)
 c1cfc910d92af43f8c27807a9a84c9c13f4fbc65 config/oac (remotes/origin/HEAD)
(gdb) bt
#0  0x00007fffead3250e in __cray_memcpy_ROME ()
   from /opt/cray/pe/cce/15.0.0/cce/x86_64/lib/libu.so.1
#1  0x00007fffe13e42fc in mca_smsc_xpmem_copy_from ()
   from /sw/frontier/ompix/DEVELOP/cce/15.0.0/install/openmpi-br-main.tjn-20230426/lib/libopen-pal.so.0
#2  0x00007fffe13d1bbd in mca_btl_sm_get ()
   from /sw/frontier/ompix/DEVELOP/cce/15.0.0/install/openmpi-br-main.tjn-20230426/lib/libopen-pal.so.0
#3  0x00007fffe13c3353 in am_rdma_process_rdma ()
   from /sw/frontier/ompix/DEVELOP/cce/15.0.0/install/openmpi-br-main.tjn-20230426/lib/libopen-pal.so.0
#4  0x00007fffe13cf9d7 in mca_btl_sm_component_progress ()
   from /sw/frontier/ompix/DEVELOP/cce/15.0.0/install/openmpi-br-main.tjn-20230426/lib/libopen-pal.so.0
#5  0x00007fffe133f37d in opal_progress ()
   from /sw/frontier/ompix/DEVELOP/cce/15.0.0/install/openmpi-br-main.tjn-20230426/lib/libopen-pal.so.0
#6  0x00007fffeb7f6ca5 in ompi_osc_rdma_complete_atomic ()
   from /sw/frontier/ompix/DEVELOP/cce/15.0.0/install/openmpi-br-main.tjn-20230426/lib/libmpi.so.0
#7  0x00007fffeb6b39b5 in PMPI_Win_complete ()
   from /sw/frontier/ompix/DEVELOP/cce/15.0.0/install/openmpi-br-main.tjn-20230426/lib/libmpi.so.0
#8  0x000000000020a13d in run_put_with_pscw ()
#9  0x0000000000208cc1 in main ()
(gdb)
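To make the reproduction easier to re-run, the MCA settings and the `mpirun` command line from the details above can be collected into a small script. This is a sketch under assumptions: the `OSU_BIN` variable and its default path are hypothetical conveniences, and the guard around `mpirun` exists only so the script does not fail on a machine without Open MPI or the benchmark binary.

```shell
#!/bin/sh
# MCA settings copied from the `env | grep OMPI_MCA` output in this report.
export OMPI_MCA_opal_common_ofi_provider_include=shm
export OMPI_MCA_pml='^ucx'
export OMPI_MCA_mtl='^ofi'
export OMPI_MCA_btl='^tcp,ofi,openib'

# Hypothetical helper variable: path to the OSU benchmark binary.
# Override it, e.g. OSU_BIN=/path/to/osu_put_bibw, if it is not in the cwd.
OSU_BIN="${OSU_BIN:-./osu_put_bibw}"

if command -v mpirun >/dev/null 2>&1 && [ -x "$OSU_BIN" ]; then
    # Same command line as in the report: device-to-device, sizes 1..32768.
    mpirun --np 2 --map-by ppr:1:l3cache --bind-to core \
        "$OSU_BIN" -m 1:32768 -d rocm D D
else
    echo "mpirun or $OSU_BIN not found; only the environment was set" >&2
fi
```

The values are quoted because `^` is treated as a special character by some shells; the quoting does not change what Open MPI sees.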
