Description
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
v4.1.0
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
From a source tarball, using the default configuration, built with GCC 4.8.5.
Please describe the system on which you are running
- Operating system/version: Amazon Linux
- Computer hardware: Intel Skylake
- Network type: EFA
Details of the problem
We noticed an OS-specific performance regression with LAMMPS (the in.chute.scaled case) in 4.1.0. Bisecting the commits, it appears to have been introduced by the AVX-based MPI_OP changes that were backported into this series. Specifically, the commit that switched the reduce ops to unaligned SSE memory access primitives seems to be the cause: #7957
That change was added to address the Accumulate issue (#7954), so it is a necessary correctness fix.
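To make the aligned vs. unaligned distinction concrete, here is a minimal sketch, assuming a float sum reduction written with SSE intrinsics from `<xmmintrin.h>`. The function name `sum_float_sse_unaligned` is hypothetical and this is not Open MPI's actual op code. The aligned intrinsics (`_mm_load_ps`/`_mm_store_ps`) require 16-byte-aligned addresses and can fault otherwise, while the unaligned ones (`_mm_loadu_ps`/`_mm_storeu_ps`) accept any address. That is one plausible reading of why moving to the unaligned primitives counts as a correctness fix: a reduce op may be handed buffers that start at arbitrary offsets.

```c
#include <stddef.h>
#include <xmmintrin.h> /* SSE: _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */

/* Hypothetical helper (not Open MPI code): inout[i] += in[i], in the style
 * of an MPI_SUM reduce op on floats. Unaligned loads/stores are used, so
 * neither pointer needs to be 16-byte aligned. */
void sum_float_sse_unaligned(const float *in, float *inout, size_t count)
{
    size_t i = 0;

    /* Vector loop: 4 floats (one 128-bit SSE register) per iteration. */
    for (; i + 4 <= count; i += 4) {
        __m128 a = _mm_loadu_ps(in + i);    /* any alignment is accepted */
        __m128 b = _mm_loadu_ps(inout + i);
        _mm_storeu_ps(inout + i, _mm_add_ps(a, b));
    }

    /* Scalar tail for the remaining 0-3 elements. */
    for (; i < count; ++i) {
        inout[i] += in[i];
    }
}
```

Calling it as `sum_float_sse_unaligned(in + 1, io + 1, 9)` works even though `in + 1` and `io + 1` are not 16-byte aligned; the aligned intrinsics would not be safe there. Whether the unaligned forms cost anything depends on the CPU and on the code the compiler emits for them, so this sketch only illustrates the change and does not by itself explain the regression.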
The PR that originally introduced the SSE-based MPI_OP was backported from master: #7935
Broadly, allreduce performance appears to have regressed in 4.1.0 compared to 4.0.5 in this environment because of these changes. We do not see this on Amazon Linux 2 (which ships a 7.x series GCC) or on Ubuntu 18, for instance.
We also tried #8322 just in case; it does not help either.
@bosilca does anything obvious stand out to you?