
v4.1.x memory alignment issue on AVX support #7954

@zhngaj

Description

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

v4.1.x branch

git clone --recursive https://github.com/open-mpi/ompi.git
git checkout v4.1.x
bd16024a0b (HEAD -> v4.1.x, origin/v4.1.x) Merge pull request #7946 from gpaulsen/topic/v4.1.x/README_for_SLURM_binding

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

./configure --prefix=/fsx/ompi/install --with-libfabric=/fsx/libfabric/install --enable-debug

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

There is no output from git submodule status after checking out the v4.1.x branch.
It does produce output on the master branch, though:

[ec2-user@ip-172-31-42-103 ompi]$ git submodule status
 4a43c39c89037f52b4e25927e58caf08f3707c33 opal/mca/hwloc/hwloc2/hwloc (hwloc-2.1.0rc2-53-g4a43c39c)
 b18f8ff3f62bd93f17e01c4230e7eca4a30aa204 opal/mca/pmix/pmix4x/openpmix (v1.1.3-2460-gb18f8ff3)
 17ace07e7d41f50d035dc42dfd6233f8802e4405 prrte (dev-30660-g17ace07e7d)

Please describe the system on which you are running

  • Operating system/version:
LSB_VERSION=base-4.0-amd64:base-4.0-noarch:core-4.0-amd64:core-4.0-noarch:printing-4.0-amd64:printing-4.0-noarch
NAME="Amazon Linux AMI"
VERSION="2018.03"
ID="amzn"
ID_LIKE="rhel fedora"
VERSION_ID="2018.03"
PRETTY_NAME="Amazon Linux AMI 2018.03"
ANSI_COLOR="0;33"
CPE_NAME="cpe:/o:amazon:linux:2018.03:ga"
HOME_URL="http://aws.amazon.com/amazon-linux-ami/"
Amazon Linux AMI release 2018.03
  • Computer hardware:
    Two c5n.18xlarge instances, each with 36 cores, Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz, and one EFA network interface.
  • Network type:
    EFA (Elastic Fabric Adapter) network interface

Details of the problem

  1. Built libfabric master (b01932dfb (HEAD -> master, origin/master, origin/HEAD) Merge pull request #6103 from wzamazon/efa_fix_readmsg)
./configure --prefix=/fsx/libfabric/install --enable-mrail --enable-tcp --enable-rxm --disable-rxd --disable-verbs --enable-efa=/usr --enable-debug
  2. Built Open MPI v4.1.x branch (bd16024 (HEAD -> v4.1.x, origin/v4.1.x) Merge pull request #7946 from gpaulsen/topic/v4.1.x/README_for_SLURM_binding) with libfabric
./configure --prefix=/fsx/ompi/install --with-libfabric=/fsx/libfabric/install --enable-debug
  3. Built Intel MPI Benchmarks 2019 Update 6 (https://github.com/intel/mpi-benchmarks/tree/IMB-v2019.6) with Open MPI

  4. Ran the IMB-EXT Accumulate test with 2 MPI processes and hit the segfault below.

[ec2-user@ip-172-31-9-184 ~]$ mpirun --prefix /fsx/ompi/install  -n 2 -N 1 --mca btl ofi --mca osc rdma --mca btl_ofi_provider_include efa --hostfile /fsx/hosts -x PATH -x LD_LIBRARY_PATH /fsx/SubspaceBenchmarks/spack/opt/spack/linux-amzn2018-x86_64/gcc-4.8.5/intel-mpi-benchmarks-2019.6-tnzpd3z7s4mgsvchsl2ofn2cgw3aonty/bin/IMB-EXT Accumulate  -npmin 2 -iter 200
Warning: Permanently added 'ip-172-31-13-230,172.31.13.230' (ECDSA) to the list of known hosts.
#------------------------------------------------------------
#    Intel(R) MPI Benchmarks 2019 Update 6, MPI-2 part
#------------------------------------------------------------
# Date                  : Tue Jul 21 19:06:54 2020
# Machine               : x86_64
# System                : Linux
# Release               : 4.14.165-103.209.amzn1.x86_64
# Version               : #1 SMP Sun Feb 9 00:23:26 UTC 2020
# MPI Version           : 3.1
# MPI Thread Environment:


# Calling sequence was:

# /fsx/SubspaceBenchmarks/spack/opt/spack/linux-amzn2018-x86_64/gcc-4.8.5/intel-mpi-benchmarks-2019.6-tnzpd3z7s4mgsvchsl2ofn2cgw3aonty/bin/IMB-EXT Accumulate -npmin 2 -iter 200

# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM
#
#

# List of Benchmarks to run:

# Accumulate

#-----------------------------------------------------------------------------
# Benchmarking Accumulate
# #processes = 2
#-----------------------------------------------------------------------------
#
#    MODE: AGGREGATE
#
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]      defects
            0          200         0.04         0.22         0.13         0.00
            4          200         0.85         0.89         0.87         0.00
            8          200         0.77         0.83         0.80         0.00
           16          200         0.81         0.85         0.83         0.00
[ip-172-31-13-230:12036] *** Process received signal ***
[ip-172-31-13-230:12036] Signal: Segmentation fault (11)
[ip-172-31-13-230:12036] Signal code:  (-6)
[ip-172-31-13-230:12036] Failing at address: 0x1f400002f04
[ip-172-31-13-230:12036] [ 0] /lib64/libpthread.so.0(+0xf600)[0x7f4b6ba25600]
[ip-172-31-13-230:12036] [ 1] /fsx/ompi/install/lib/openmpi/mca_op_avx.so(+0x3e738)[0x7f4b60f23738]
[ip-172-31-13-230:12036] [ 2] /fsx/ompi/install/lib/openmpi/mca_osc_rdma.so(+0x9f5d)[0x7f4b51bd1f5d]
[ip-172-31-13-230:12036] [ 3] /fsx/ompi/install/lib/openmpi/mca_osc_rdma.so(+0xc637)[0x7f4b51bd4637]
[ip-172-31-13-230:12036] [ 4] /fsx/ompi/install/lib/openmpi/mca_osc_rdma.so(+0xca19)[0x7f4b51bd4a19]
[ip-172-31-13-230:12036] [ 5] /fsx/ompi/install/lib/openmpi/mca_osc_rdma.so(+0xffbf)[0x7f4b51bd7fbf]
[ip-172-31-13-230:12036] [ 6] /fsx/ompi/install/lib/openmpi/mca_osc_rdma.so(ompi_osc_rdma_accumulate+0x10f)[0x7f4b51bd86ed]
[ip-172-31-13-230:12036] [ 7] /fsx/ompi/install/lib/libmpi.so.40(PMPI_Accumulate+0x421)[0x7f4b6c5a8967]
[ip-172-31-13-230:12036] [ 8] /fsx/SubspaceBenchmarks/spack/opt/spack/linux-amzn2018-x86_64/gcc-4.8.5/intel-mpi-benchmarks-2019.6-tnzpd3z7s4mgsvchsl2ofn2cgw3aonty/bin/IMB-EXT[0x43da5c]
[ip-172-31-13-230:12036] [ 9] /fsx/SubspaceBenchmarks/spack/opt/spack/linux-amzn2018-x86_64/gcc-4.8.5/intel-mpi-benchmarks-2019.6-tnzpd3z7s4mgsvchsl2ofn2cgw3aonty/bin/IMB-EXT[0x42c3aa]
[ip-172-31-13-230:12036] [10] /fsx/SubspaceBenchmarks/spack/opt/spack/linux-amzn2018-x86_64/gcc-4.8.5/intel-mpi-benchmarks-2019.6-tnzpd3z7s4mgsvchsl2ofn2cgw3aonty/bin/IMB-EXT[0x432d44]
[ip-172-31-13-230:12036] [11] /fsx/SubspaceBenchmarks/spack/opt/spack/linux-amzn2018-x86_64/gcc-4.8.5/intel-mpi-benchmarks-2019.6-tnzpd3z7s4mgsvchsl2ofn2cgw3aonty/bin/IMB-EXT[0x405513]
[ip-172-31-13-230:12036] [12] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f4b6b66a575]
[ip-172-31-13-230:12036] [13] /fsx/SubspaceBenchmarks/spack/opt/spack/linux-amzn2018-x86_64/gcc-4.8.5/intel-mpi-benchmarks-2019.6-tnzpd3z7s4mgsvchsl2ofn2cgw3aonty/bin/IMB-EXT[0x403d29]
[ip-172-31-13-230:12036] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 12036 on node ip-172-31-13-230 exited on signal 11 (Segmentation fault).

  5. Some notes

Checking the backtrace below, I found that the segfault is caused by the support Open MPI added for MPI_Op using AVX512, AVX2, and MMX (commit). When _mm256_load_ps (or _mm512_load_ps) is used to load single-precision floating-point elements, the address must be 32-byte (or 64-byte) aligned.
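
As a stand-alone illustration (not taken from the OMPI sources; the file name and variable names are mine), the sketch below shows the difference between the aligned and unaligned AVX load intrinsics on a pointer that is only 16-byte aligned, like the address in the backtrace:

/* Hypothetical demo: _mm256_load_ps requires a 32-byte-aligned pointer,
 * _mm256_loadu_ps does not.  Build with e.g. gcc -std=c11 -mavx align_demo.c */
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    float *base = aligned_alloc(32, 64 * sizeof(float));
    for (int i = 0; i < 64; ++i) base[i] = (float)i;

    float *p = base + 4;   /* (uintptr_t)p % 32 == 16, same alignment as in the backtrace */
    float out[8];

    _mm256_storeu_ps(out, _mm256_loadu_ps(p));   /* unaligned load: legal at any address */
    printf("loadu: %f\n", out[0]);

    _mm256_storeu_ps(out, _mm256_load_ps(p));    /* aligned load: with typical codegen this faults */
    printf("load:  %f\n", out[0]);

    free(base);
    return 0;
}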

IMB-EXT uses MPI_Alloc_mem and MPI_Win_create to allocate the source buffer and the window, and the returned address is not guaranteed to have that alignment, which leads to the segfault.
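
A minimal reproducer of that allocation pattern (hypothetical, not the IMB code itself) would look roughly like this; it prints the same modulo-32 check that the gdb session below performs:

#include <mpi.h>
#include <stdint.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    float *buf;
    MPI_Win win;
    /* Neither MPI_Alloc_mem nor MPI_Win_create guarantees 32/64-byte alignment. */
    MPI_Alloc_mem(64 * sizeof(float), MPI_INFO_NULL, &buf);
    MPI_Win_create(buf, 64 * sizeof(float), sizeof(float),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    /* A result of 16 means the buffer is only 16-byte aligned,
     * which _mm256_load_ps cannot handle. */
    printf("buf %% 32 = %lu\n", (unsigned long)((uintptr_t)buf % 32));

    MPI_Win_free(&win);
    MPI_Free_mem(buf);
    MPI_Finalize();
    return 0;
}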

When I checked out the commit right before the one above, the segfault disappeared. I also tried _mm256_loadu_ps (or _mm512_loadu_ps), which does not require aligned memory; the segfault disappeared as well.
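
For reference, a loadu-based two-buffer float add (again just a sketch of the approach, not the actual op_avx_functions.c code) avoids the alignment requirement entirely:

#include <immintrin.h>

/* Hypothetical: out[i] += in[i] using unaligned AVX loads/stores plus a scalar tail. */
static void add_float_avx_unaligned(const float *in, float *out, int count)
{
    int i = 0;
    for (; i + 8 <= count; i += 8) {
        __m256 a = _mm256_loadu_ps(in + i);    /* no alignment requirement */
        __m256 b = _mm256_loadu_ps(out + i);
        _mm256_storeu_ps(out + i, _mm256_add_ps(a, b));
    }
    for (; i < count; ++i)
        out[i] += in[i];
}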

Is this a real issue on the OMPI side, or are applications (IMB here) expected to provide pointers with the required alignment?

#0  0x00007f98385f8763 in _mm256_load_ps (__P=0xef70d0) at /usr/lib/gcc/x86_64-amazon-linux/7/include/avxintrin.h:873
873   return *(__m256 *)__P;
Missing separate debuginfos, use: debuginfo-install glibc-2.17-292.180.amzn1.x86_64 libibverbs-28.amzn0-1.amzn1.x86_64 libnl3-3.2.28-4.6.amzn1.x86_64 libpciaccess-0.13.1-4.1.11.amzn1.x86_64 zlib-1.2.8-7.18.amzn1.x86_64
(gdb) bt
#0  0x00007f98385f8763 in _mm256_load_ps (__P=0xef70d0) at /usr/lib/gcc/x86_64-amazon-linux/7/include/avxintrin.h:873
#1  ompi_op_avx_2buff_add_float_avx512 (_in=0xef70a0, _out=0xef70d0, count=0x7fff324b9b94, dtype=0x7fff324b9b88, module=0xbcfb30)
    at op_avx_functions.c:504
#2  0x00007f983fb80a17 in ompi_op_reduce (op=0x661180 <ompi_mpi_op_sum>, source=0xef70a0, target=0xef70d0, count=8,
    dtype=0x662e00 <ompi_mpi_float>) at ../../../ompi/op/op.h:581
#3  0x00007f983fb8114a in ompi_osc_base_sndrcv_op (origin=0xef70a0, origin_count=8, origin_dt=0x662e00 <ompi_mpi_float>,
    target=0xef70d0, target_count=8, target_dt=0x662e00 <ompi_mpi_float>, op=0x661180 <ompi_mpi_op_sum>)
    at base/osc_base_obj_convert.c:178
#4  0x00007f9824fd4293 in ompi_osc_rdma_gacc_local (source_buffer=0xef70a0, source_count=8,
    source_datatype=0x662e00 <ompi_mpi_float>, result_buffer=0x0, result_count=0, result_datatype=0x0, peer=0xef7fa0,
    target_address=15691984, target_handle=0x0, target_count=8, target_datatype=0x662e00 <ompi_mpi_float>,
    op=0x661180 <ompi_mpi_op_sum>, module=0xee9110, request=0x0, lock_acquired=true) at osc_rdma_accumulate.c:143
#5  0x00007f9824fd7f71 in ompi_osc_rdma_rget_accumulate_internal (sync=0xee9310, origin_addr=0xef70a0, origin_count=8,
    origin_datatype=0x662e00 <ompi_mpi_float>, result_addr=0x0, result_count=0, result_datatype=0x0, peer=0xef7fa0, target_rank=0,
    target_disp=0, target_count=8, target_datatype=0x662e00 <ompi_mpi_float>, op=0x661180 <ompi_mpi_op_sum>, request=0x0)
    at osc_rdma_accumulate.c:934
#6  0x00007f9824fd86ed in ompi_osc_rdma_accumulate (origin_addr=0xef70a0, origin_count=8,
    origin_datatype=0x662e00 <ompi_mpi_float>, target_rank=0, target_disp=0, target_count=8,
    target_datatype=0x662e00 <ompi_mpi_float>, op=0x661180 <ompi_mpi_op_sum>, win=0xeebed0) at osc_rdma_accumulate.c:1067
#7  0x00007f983fb2b967 in PMPI_Accumulate (origin_addr=0xef70a0, origin_count=8, origin_datatype=0x662e00 <ompi_mpi_float>,
    target_rank=0, target_disp=0, target_count=8, target_datatype=0x662e00 <ompi_mpi_float>, op=0x661180 <ompi_mpi_op_sum>,
    win=0xeebed0) at paccumulate.c:130
#8  0x000000000043da5c in IMB_accumulate ()
#9  0x000000000042c3aa in Bmark_descr::IMB_init_buffers_iter(comm_info*, iter_schedule*, Bench*, cmode*, int, int) ()
#10 0x0000000000432d44 in OriginalBenchmark<BenchmarkSuite<(benchmark_suite_t)4>, &IMB_accumulate>::run(scope_item const&) ()
#11 0x0000000000405513 in main ()
(gdb) f 0
#0  0x00007f98385f8763 in _mm256_load_ps (__P=0xef70d0) at /usr/lib/gcc/x86_64-amazon-linux/7/include/avxintrin.h:873
873   return *(__m256 *)__P;
(gdb) p 0xef70d0 % 32
$2 = 16
