Skip to content

Conversation

@bosilca
Copy link
Member

@bosilca bosilca commented Dec 28, 2020

Allow more flexibility on the support of AVX* extensions.

  • Use cached variables to prevent OMPI from checking specific architecture features. As an example this will allow to disable AVX512 while allowing AVX2.
  • Protect the code against mismatched flags between configure and make.
  • Document the list of of vectorial functions we need and the corresponding level of AVX, SSE required.

Signed-off-by: George Bosilca [email protected]

@ggouaillardet
Copy link
Contributor

@bosilca though the changes look good to me, they are unfortunately insufficient in a brew environment.
The reason is -march=nehalem is used by the superenv (e.g. wrapper) and though all AVX flavors are supported at configure time, none of them are available at make time, and we end up in the case below

#if !defined(PREPEND)
#  if OMPI_MCA_OP_HAVE_AVX512 || OMPI_MCA_OP_HAVE_AVX2
#    error The configure step has detected possible support for AVX512 and/or AVX2 but the compiler flags during make are too restrictive. Please disable the AVX component by adding --enable-mca-no-build=op-avx to your configure step.

should OMPI_MCA_OP_HAVE_AVX be added to the test? (that would not "fix" brew though)

also, even if no AVX flavor is supported at make time, we could still build the SSE flavors (I guess that would be still better than skipping the op/avx component).
op/avx could then be renamed to op/x86_64 (I'm not that pedantic, so I am fine with the avx component name)

I also appended a third commit to give the option to skip avx/sse detection.

In the case of brew, adding ompi_cv_op_avx_check_avx512=no, ompi_cv_op_avx_check_avx2=no and ompi_cv_op_avx_check_avx=no to the environment before invoking configure is enough to successfully build Open MPI.
(note that even if no AVX flavor is supported, SSE flavors are, and the component can be built)

@bosilca
Copy link
Member Author

bosilca commented Dec 31, 2020

I need to check that we have fully SSE versions of the code, but right now I think we need some basic AVX support for the load/store instructions.

Thus, if we add OMPI_MCA_OP_HAVE_AVX to the test, we will build the AVX component in situations where the entire component is useless and mostly empty (because we will not generate any of the MPI_Op support functions).

@ggouaillardet
Copy link
Contributor

@bosilca I see what you mean, and I must admit I did not try to run the SSE only version of the op/avx component.

If only AVX (e.g. sandy bridge aka AVX1) is detected at configure time but is not available at make time,
the error message would be "This file should not be compiled in this conditions." which is less user friendly (e.g. the user should be advised to disable the op/avx component instead of sending us config.log

@ggouaillardet
Copy link
Contributor

@bosilca I checked again configure.m4 and there seems to be some confusion here.

  • AVX512 is correctly handled
  • The AVX2 really tests AVX ( _mm256_add_ps() and _mm256_loadu_si256() are AVX primitives, __m256i _mm256_add_epi32 (__m256i a, __m256i b) would be a AVX2 primitive not available in AVX.
  • The AVX really tests SSE (_mm_add_ps() is a SSE primitive, __m256i _mm256_add_epi32 (__m256i a, __m256i b) would be a AVX primitive.

For example, if you configure CFLAGS=-march=nehalem, configure currently states that AVX can be generated without any extra flags. This is wrong, -mavx is needed here otherwise make will fail because the __AVX__ macro is not defined (-march=nehalem is pre-AVX)

I guess you already know this, AVX is basically 256 bits floating point vectors (read no integer vectors), and AVX2 is basically any 256 bits vectors (read floating point and integer vectors).
From an Open MPI point of view, it might be too convoluted to provide optimal support for sandy bridge (AVX but not AVX2 yet) since only floating point is supported by AVX and integer should fall back to a SSE flavor.

Please let me know how you see this. If you agree to drop support for AVX without AVX2, I think I can help.

@bosilca
Copy link
Member Author

bosilca commented Jan 4, 2021

You are right, we are not really correctly testing for most of the vectorial support, not even for AVX512. More precisely, we check for some flavors of AVX512 but we don't have a single consistent check that reflects how we use the AVX512 instructions. What we should really check for _mm512_add_epi32, _mm512_add_epi8 and _mm256_mullo_epi64 to make sure that AVX512F, AVX512BW and AVX512DQ + AVX512VL are supported simultaneously.

I took a stab at this in the last commit. Looks like a step in the right direction, but still lacks the finesse to provide individual support for AVX512F, AVX512BW and AVX512DQ + AVX512VL.

@ggouaillardet
Copy link
Contributor

@bosilca I back-ported this PR to the v4.1.x branch and I was able to build Open MPI under brew without any workaround :-)

Since the op/avx component has caused some issues to package maintainers (brew and conda), maybe we should release 4.1.1 shortly after this PR is back ported.

@bosilca
Copy link
Member Author

bosilca commented Jan 4, 2021

Excellent, this is great news. Let me squash the PR in 2 commits and then we can move forward. Now that we have reached this positive outcome the RM for the 4.1 (@jsquyres) should make the call for a new stable in the 4.1 field.

@jjhursey
Copy link
Member

jjhursey commented Jan 4, 2021

bot:ibm:pgi:retest

@open-mpi open-mpi deleted a comment from ibm-ompi Jan 4, 2021
@open-mpi open-mpi deleted a comment from ibm-ompi Jan 4, 2021
@open-mpi open-mpi deleted a comment from ibm-ompi Jan 4, 2021
@open-mpi open-mpi deleted a comment from ibm-ompi Jan 4, 2021
@jsquyres
Copy link
Member

jsquyres commented Jan 4, 2021

@bosilca After re-reading #8306, I think the proposed solution was not to support having the user pass different flags during configure vs. make, but rather a) to fix the ordering of flags that we (Open MPI) use during configure and make, and b) use slightly different architecture/AVX flags than we previously used.

Is that right?

If so, can the description of this PR be updated to reflect that? Because right now the description implies that this PR will do something that we probably should not support (i.e., user passing different flags at configure vs. make).

@bosilca
Copy link
Member Author

bosilca commented Jan 4, 2021

This PR is more than the 2 updates you mentionned. @ggouaillardet added the capability to manually skip checks for particular AVX-related features, that would allow the package maintainers to hand-craft their install to avoid colliding with their target architecture. I think Gilles mentioned that this part is not necessary with the last set of commit, but as far as I see it is still in this PR.

@jsquyres
Copy link
Member

jsquyres commented Jan 4, 2021

This PR is more than the 2 updates you mentionned.

Fair enough. I just wanted to make sure that we're not actually moving to support using flag set A during configure and flag set B during make.

1. Consistent march flag order between configure and make.

2. op/avx: give the option to skip some tests

it is possible to skip some intrinsic tests by setting some environment variables to "no" before invoking configure:
 - ompi_cv_op_avx_check_avx512
 - ompi_cv_op_avx_check_avx2
 - ompi_cv_op_avx_check_avx
 - ompi_cv_op_avx_check_sse41
 - ompi_cv_op_avx_check_sse3

3. op/avx: update AVX512 flags

try
-mavx512f -mavx512bw -mavx512vl -mavx512dq
instead of
-march=skylake-avx512

since the former is less likely to conflict with user provided CFLAGS
(e.g. -march=...)

Thanks Bart Oldeman for pointing this.

4. op/avx: have the op/avx library depend on libmpi.so

Refs. open-mpi#8323

Signed-off-by: Gilles Gouaillardet <[email protected]>
Signed-off-by: George Bosilca <[email protected]>
1. Allow fallback to a lesser AVX support during make

Due to the fact that some distro restrict the compiule architecture
during make (while not setting any restrictions during configure) we
need to detect the target architecture also during make in order to
restrict the code we generate.

2. Add comments and better protect the arch specific code.

Identify all the vectorial functions used and clasify them according to
the neccesary hardware capabilities.
Use these requirements to protect the code for load and stores (the rest
of the code being automatically generated it is more difficult to
protect).

3. Correctly check for AVX* support.

Signed-off-by: George Bosilca <[email protected]>
@bosilca bosilca force-pushed the topic/portable_avx branch from 63f2951 to fcf2766 Compare January 4, 2021 18:44
@bosilca
Copy link
Member Author

bosilca commented Jan 4, 2021

This is now ready to go. @jsquyres your call if you want a subrelease on the 4.1 stable.

@jsquyres
Copy link
Member

jsquyres commented Jan 4, 2021

Yes, we will definitely want this on v4.1.x. Thanks!

@jsquyres
Copy link
Member

jsquyres commented Jan 4, 2021

bot:ompi:retest

The test now has the ability to add a shift to all or to any of the
input and output buffers to assess the impact of unaligned operations.

Signed-off-by: George Bosilca <[email protected]>
@isuruf
Copy link

isuruf commented Jan 7, 2021

I just wanted to make sure that we're not actually moving to support using flag set A during configure and flag set B during make.

As far as I can see, this is now supported in this PR. Maybe that part of the code should be removed?

@amckinstry
Copy link

So we have problems with the AVX code in v4.1.0 in Debian Unstable.
We see sporadic sigbus and other segfaults (depending on app, architecture) which go away if RDMAV_FORK_SAFE is set
(See https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=979041)

There are two suspects:
(a) The module is built with AVX operands on a build system that are not present on the runtime machine.
(b) All the apps that have shown the problem have been python apps. Is this relevant ? perhaps due to fork()ing ?

Is there something a general purpose build in a distro such as Debian should be doing to avoid (a) ?

@bosilca
Copy link
Member Author

bosilca commented Jan 8, 2021

I can't see anything in the Debian issue pointing to AVX, instead every discussion points to RDMAV_FORK_SAFE not being correctly set. That's an orthogonal discussion to this AVX PR, if needed it should have its own issue.

To answer your question (a), some functions in mca_op_avx.so are compiled with different flavors of AVX enabled (AVX & pre, AVX2 and AVX512) but these function are never called directly. Instead, they are called from a generic function (aka built with the default compiler options), that checks the necessary processor capabilities before calling these functions.

That being said, it is possible we missed some processor capabilities requirements before calling these functions, but we need more specific information to be able to understand what is going on.

@rajachan
Copy link
Member

rajachan commented Jan 8, 2021

@amckinstry I see in that thread the user is running with OFI MTL and the EFA OFI provider. Can you take a look at ofiwg/libfabric#6332?

@jsquyres
Copy link
Member

jsquyres commented Jan 8, 2021

@bosilca Any reason not to merge this PR? Are we waiting for anything here?

@bosilca
Copy link
Member Author

bosilca commented Jan 8, 2021

The regression on lammps has not yet been solved, and at this point we don't know if it is, or how it is related to AVX. Maybe we should wait until @rajachan get a better understanding of what is happening there (following the discussion in #8334).

@jsquyres
Copy link
Member

jsquyres commented Jan 8, 2021

Sure -- I understand #8334 is not yet resolved (the perf issue). But this PR is the correctness issue, and is separate from that, right?

@bosilca
Copy link
Member Author

bosilca commented Jan 8, 2021

I would think this PR is independent, but somehow the use of the AVX module has an undesirable side effect on lammps. Which means that maybe we still have some corner cases that are not yet correctly addressed.

@jsquyres
Copy link
Member

jsquyres commented Jan 9, 2021

I would think this PR is independent, but somehow the use of the AVX module has an undesirable side effect on lammps. Which means that maybe we still have some corner cases that are not yet correctly addressed.

Agreed. But AVX is already the default on master, so let's go ahead and get this correctness fix in, and we can work on the AVX performance fix independently.

@jsquyres jsquyres merged commit 8115bd2 into open-mpi:master Jan 10, 2021
@bosilca bosilca deleted the topic/portable_avx branch January 10, 2021 19:28
@leofang
Copy link

leofang commented Jan 11, 2021

Thanks for resolving it quickly @ggouaillardet @bosilca @jsquyres! Question: will this be backported to the v4.x branch?

leofang added a commit to regro-cf-autotick-bot/openmpi-feedstock that referenced this pull request Jan 11, 2021
leofang added a commit to regro-cf-autotick-bot/openmpi-feedstock that referenced this pull request Jan 11, 2021
@jsquyres
Copy link
Member

jsquyres commented Jan 11, 2021

Question: will this be backported to the v4.x branch?

Yes: #8361

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants