@bwbarrett

Backport of #7957 into the v4.1.x branch.

hppritcha and others added 30 commits February 3, 2020 13:25
…inel-linker-black-magic

v4.0.x: Make C and Fortran types for MPI sentinels agree in size
Both opal_hwloc_base_get_relative_locality() and _get_locality_string()
iterate over hwloc levels to build the proc locality information.
Unfortunately, NUMA nodes are not in those normal levels anymore since 2.0.
We have to explicitly look at the special NUMA level to get that locality info.

I am factorizing the core of the iterations inside dedicated "_by_depth"
functions and calling them again for the NUMA level at the end of the loops.

Thanks to Hatem Elshazly for reporting the NUMA communicator split failure
at https://www.mail-archive.com/[email protected]/msg33589.html

It looks like only the opal_hwloc_base_get_locality_string() part is needed
to fix that split, but there's no reason not to fix get_relative_locality()
as well.

Signed-off-by: Brice Goglin <[email protected]>
(cherry picked from commit ea80a20)
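
The "_by_depth" factoring described in this commit can be sketched without hwloc at all. The model below is an assumption-laden illustration, not the Open MPI code: `level_t`, `locality_by_level`, `relative_locality`, and the `LOC_*` flags are hypothetical names, and each object is reduced to a plain CPU bitmask. In real hwloc 2.x the NUMA level sits at the special depth `HWLOC_TYPE_DEPTH_NUMANODE`, which is why a loop over the normal depths alone never visits it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical locality bits; these are not the OPAL_PROC_ON_* values. */
enum { LOC_NUMA = 1 << 0, LOC_PACKAGE = 1 << 1, LOC_CORE = 1 << 2 };

/* Hypothetical stand-in for one topology level. */
typedef struct {
    unsigned flag;                /* locality bit this level contributes */
    size_t nobjs;                 /* number of objects at this level */
    const unsigned long *cpusets; /* one CPU bitmask per object */
} level_t;

/* Shared helper: does any object at this level contain CPUs of both procs? */
static unsigned locality_by_level(const level_t *lvl,
                                  unsigned long a, unsigned long b)
{
    for (size_t i = 0; i < lvl->nobjs; i++) {
        if ((lvl->cpusets[i] & a) && (lvl->cpusets[i] & b)) {
            return lvl->flag;
        }
    }
    return 0;
}

/* Walk the normal levels, then call the same helper once more for the
 * special NUMA level, which the loop over normal levels never reaches. */
unsigned relative_locality(const level_t *normal, size_t nnormal,
                           const level_t *numa,
                           unsigned long a, unsigned long b)
{
    unsigned loc = 0;
    assert(normal != NULL || nnormal == 0);
    for (size_t d = 0; d < nnormal; d++) {
        loc |= locality_by_level(&normal[d], a, b);
    }
    if (numa != NULL) {
        loc |= locality_by_level(numa, a, b);
    }
    return loc;
}
```

Passing `NULL` for `numa` reproduces the pre-fix behavior in this model: two processes on the same NUMA node are reported as sharing only the package.
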
not being defined.

related to open-mpi#7201

Signed-off-by: Howard Pritchard <[email protected]>
These -D's are for C compilation, not Fortran compilation.  Remove
this useless statement.

Signed-off-by: Jeff Squyres <[email protected]>
(cherry picked from commit f4a47a5)
Automake's Fortran compilation rules inexplicably use CPPFLAGS and
AM_CPPFLAGS.  Unfortunately, this can cause problems in some cases
(e.g., picking up already-installed mpi.mod in a system-default
include search path).

So in relevant module-using Fortran compilation Makefile.am's, zero
out CPPFLAGS and AM_CPPFLAGS.

This has a side-effect of requiring that we compile the one .c file in
the F08 library in a new, separate subdirectory (with its own
Makefile.am that does _not_ have CPPFLAGS/AM_CPPFLAGS zeroed out).

Signed-off-by: Jeff Squyres <[email protected]>
Signed-off-by: Gilles Gouaillardet <[email protected]>
(cherry picked from commit ab398f4)
- increase the maximum number of segments to allow applications to be
  launched on some Ubuntu configurations

Signed-off-by: Sergey Oblomov <[email protected]>
(cherry picked from commit f742f28)
…egments-v4.0

OSHMEM/SEGMENTS: increase max number of segments - v4.0
Signed-off-by: Geoffrey Paulsen <[email protected]>
Build was broken by mistake in commit d40662edc41a5a4d09ae690b640cfdeeb24e15a1

Fixes open-mpi#7362

Signed-off-by: Brice Goglin <[email protected]>
(cherry picked from commit 907ad85)
- pgcc18 defines __GNUC__ similar to Intel compilers. So we must
  check for pgi higher up, or else configury will mistake
  it for gcc.

Signed-off-by: Austen Lauria <[email protected]>
(cherry picked from commit 14785de)
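
The ordering issue in this commit applies to any compiler detection based on predefined macros. A minimal sketch, assuming the commonly documented vendor macros `__PGI`, `__INTEL_COMPILER`, and `__clang__`; the function names `compiler_family` and `compiler_is` are hypothetical and this is not the configure logic itself:

```c
#include <assert.h>
#include <string.h>

/* Order matters: compilers that also define __GNUC__ for compatibility
 * must be tested before the plain __GNUC__ check. */
const char *compiler_family(void)
{
#if defined(__PGI)
    return "pgi";
#elif defined(__INTEL_COMPILER)
    return "intel";
#elif defined(__clang__)
    return "clang";
#elif defined(__GNUC__)
    return "gnu";
#else
    return "unknown";
#endif
}

/* Returns 1 when this file was built by a compiler of the given family. */
int compiler_is(const char *family)
{
    assert(family != NULL);
    return strcmp(compiler_family(), family) == 0;
}
```

If the `__GNUC__` branch came first, a PGI or Intel build would be reported as "gnu", which is the misdetection the commit message describes.
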
This commit addresses two issues in osc/rdma:

 1) It is erroneous to attach regions that overlap. This was being
    allowed but the standard does not allow overlapping attachments.

 2) Overlapping registration regions (4k alignment of attachments)
    appear to be allowed. Add attachment bases to the bookkeeping
    structure so we can keep better track of what can be detached.

It is possible that the standard did not intend to allow #2. If that
is the case then #2 should fail in the same way as #1. There should
be no technical reason to disallow #2 at this time.

References open-mpi#7384

Signed-off-by: Nathan Hjelm <[email protected]>
(cherry picked from commit 6649aef)
Signed-off-by: Nathan Hjelm <[email protected]>
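
The difference between cases 1 and 2 above can be illustrated with two small helpers. This is a sketch under stated assumptions, not the osc/rdma code: `region_t`, `regions_overlap`, `registration_bounds`, and the 4096-byte `SKETCH_PAGE` constant are hypothetical, and address arithmetic is assumed not to wrap around.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed registration granularity (the "4k alignment" in the message). */
#define SKETCH_PAGE ((uintptr_t) 4096)

/* Hypothetical description of an attached region: [base, base + len). */
typedef struct {
    uintptr_t base;
    size_t len;
} region_t;

/* Two non-empty half-open ranges overlap when each starts before the
 * other ends. */
bool regions_overlap(region_t a, region_t b)
{
    assert(a.len > 0 && b.len > 0);
    return a.base < b.base + b.len && b.base < a.base + a.len;
}

/* Widen a region to page boundaries, as a registration would. */
region_t registration_bounds(region_t r)
{
    uintptr_t start = r.base & ~(SKETCH_PAGE - 1);
    uintptr_t end = (r.base + r.len + SKETCH_PAGE - 1) & ~(SKETCH_PAGE - 1);
    region_t out;
    out.base = start;
    out.len = (size_t) (end - start);
    return out;
}
```

Two attachments that sit side by side within one page do not overlap as attachments (case 1 does not apply), yet their page-aligned registration bounds coincide (case 2), which is why the attachment base has to be tracked separately from the registration.
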
This commit increases the osc_rdma_max_attach variable from 32
to 64. The new default is kept low due to the small number
of registration resources on some systems (Cray Aries). A
larger max attachment value can be set by the user on other
systems.

Signed-off-by: Nathan Hjelm <[email protected]>
(cherry picked from commit 54c8233)
Signed-off-by: Nathan Hjelm <[email protected]>
…ize zero

Signed-off-by: Joseph Schuchart <[email protected]>
(cherry picked from commit 06bbcf4)
…x_ci_for_release_branches_v4

Enabled Mellanox CI for release branches (changes for v4.0.x branch).
Correctly set baseptr in contiguous shared memory window with local size zero (v4.0.x)
This commit changes the behavior of the individual sharedfp component. If
the component cannot create either the data file or the metadata file during File_open,
it no longer raises an error at that point. This allows applications that do not use shared
file pointer operations to continue execution without any issue.

If, however, the user subsequently calls MPI_File_write_shared or a similar operation, an error
is raised.

Fixes issue open-mpi#7429

Signed-off-by: Edgar Gabriel <[email protected]>
(cherry picked from commit df6e3e5)
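
The deferred-error pattern described in this commit can be sketched as follows. All names here (`sharedfp_state_t`, `sketch_file_open`, `sketch_write_shared`, the `SKETCH_*` codes) are hypothetical and only model the control flow; they are not the sharedfp/individual interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { SKETCH_SUCCESS = 0, SKETCH_ERROR = -1 };

/* Hypothetical per-file state kept by the component. */
typedef struct {
    bool files_ready; /* false when data/metadata file creation failed */
} sharedfp_state_t;

/* Open records whether the helper files could be created but does not
 * fail because of them. */
int sketch_file_open(sharedfp_state_t *st, bool can_create_files)
{
    assert(st != NULL);
    st->files_ready = can_create_files;
    return SKETCH_SUCCESS;
}

/* Only an actual shared file pointer operation reports the failure. */
int sketch_write_shared(const sharedfp_state_t *st)
{
    assert(st != NULL);
    return st->files_ready ? SKETCH_SUCCESS : SKETCH_ERROR;
}
```

An application that never calls a shared file pointer operation never reaches `sketch_write_shared`, so it never sees the error.
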
The CI is triggered only upon a PR creation or by special PR comments.

Signed-off-by: Artem Ryabov <[email protected]>
- fix a typo `alloc_shared_contig` to `alloc_shared_noncontig`
- correct the value of `blocking_fence`

Signed-off-by: Tsubasa Yanagibashi <[email protected]>
(cherry picked from commit a07a83d)
…ummy-module-v4.0.x

sharedfp/individual: defer error when not being able to open datafile
ggouaillardet and others added 28 commits July 10, 2020 16:58
do not check some input parameters when an {in,out}degree is zero

Thanks Junchao Zhang for analyzing and reporting this issue.

Signed-off-by: Gilles Gouaillardet <[email protected]>
(cherry picked from commit 5655d64)
…n-cint-not-equal-to-finteger

v4.1.x: fortran.m4: disallow when sizeof(int) != sizeof(INTEGER)
mpi/c: fix param checks in [I]Neighbor_alltoall{v,w}
Signed-off-by: Artem Polyakov <[email protected]>
(cherry picked from commit c72f295)
Add logic to handle different architectural capabilities
Detect the compiler flags necessary to build specialized
versions of the MPI_OP. Once the different flavors (AVX512,
AVX2, AVX) are built, detect at runtime which is the best
match with the current processor capabilities.

Add validation checks for loadu 256 and 512 bits.
Add validation tests for MPI_Op.

Signed-off-by: Jeff Squyres <[email protected]>
Signed-off-by: Gilles Gouaillardet <[email protected]>
Signed-off-by: dongzhong <[email protected]>
Signed-off-by: George Bosilca <[email protected]>
(cherry picked from commit 14b3c70)
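
The runtime selection step can be sketched with the GCC/Clang builtin `__builtin_cpu_supports`. This is an assumption for illustration only: the commit does not say how the detection is implemented, and the names `op_flavor_t`, `best_op_flavor`, `sum_fn_t`, `sum_scalar`, and `select_sum` are hypothetical. On non-x86 targets the sketch falls back to the scalar flavor.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flavors, ordered from least to most capable. */
typedef enum {
    OP_FLAVOR_SCALAR = 0,
    OP_FLAVOR_AVX,
    OP_FLAVOR_AVX2,
    OP_FLAVOR_AVX512
} op_flavor_t;

/* Best instruction set the current processor reports. */
op_flavor_t best_op_flavor(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx512f")) return OP_FLAVOR_AVX512;
    if (__builtin_cpu_supports("avx2")) return OP_FLAVOR_AVX2;
    if (__builtin_cpu_supports("avx")) return OP_FLAVOR_AVX;
#endif
    return OP_FLAVOR_SCALAR;
}

typedef void (*sum_fn_t)(const int *in, int *inout, size_t count);

/* Portable fallback that is always built. */
void sum_scalar(const int *in, int *inout, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        inout[i] += in[i];
    }
}

/* built[f] is NULL when flavor f was not compiled. Pick the most capable
 * built flavor that the CPU supports, else the scalar one. */
sum_fn_t select_sum(const sum_fn_t built[4], op_flavor_t cpu)
{
    assert(built != NULL && built[OP_FLAVOR_SCALAR] != NULL);
    for (int f = (int) cpu; f > (int) OP_FLAVOR_SCALAR; f--) {
        if (built[f] != NULL) {
            return built[f];
        }
    }
    return built[OP_FLAVOR_SCALAR];
}
```

Separating "what was built" (the table) from "what the CPU supports" (the probe) mirrors the two steps in the commit message: compile-time flag detection decides which table entries exist, and the runtime check decides which one is used.
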
…ngth

v4.1.x: Add supports for MPI_OP using AVX512, AVX2 and MMX
v4.1.x: schizo/slurm: Fix binding detection
v4.1.x: schizo/jsm: Disable binding when direct launched
bugfix: provider selection did not differentiate between IPv4
and IPv6 addresses, which could leave some nodes unable
to communicate with each other. Add a check for address
format to provider selection to ensure that all nodes use the
same address format.

Signed-off-by: Nikola Dancejic <[email protected]>
(cherry picked from commit 7e46371)
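
A minimal sketch of the kind of check this commit describes, assuming a simplified candidate list. The types `addr_fmt_t` and `prov_t` and the function `select_provider` are hypothetical; libfabric itself reports the format in the `addr_format` field of `struct fi_info` (for example `FI_SOCKADDR_IN` or `FI_SOCKADDR_IN6`), which this model replaces with a two-value enum.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the provider's reported address format. */
typedef enum { ADDR_FMT_IPV4, ADDR_FMT_IPV6 } addr_fmt_t;

/* Hypothetical, reduced view of one provider candidate. */
typedef struct {
    const char *name;
    addr_fmt_t fmt;
} prov_t;

/* Return the first candidate that matches both the provider name and the
 * required address format, or NULL when none does. Matching on the name
 * alone could pick IPv4 on one node and IPv6 on another. */
const prov_t *select_provider(const prov_t *cands, size_t n,
                              const char *name, addr_fmt_t fmt)
{
    assert(name != NULL && (cands != NULL || n == 0));
    for (size_t i = 0; i < n; i++) {
        if (strcmp(cands[i].name, name) == 0 && cands[i].fmt == fmt) {
            return &cands[i];
        }
    }
    return NULL;
}
```
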
The missing include file causes an error when using an external version of LibEvent.

Signed-off-by: tomhers <[email protected]>
(cherry picked from commit 88f9d2c)
…r_SLURM_binding

Adding SLURM binding policy change to README
Signed-off-by: Joseph Schuchart <[email protected]>
(cherry picked from commit eebc451)
(v4.1.x) osc/rdma: fail query_btls if no endpoint for non-local peer is found
v4.1.x: common/ofi: added address format check to fix provider selection
If building Open MPI with sanitizers, e.g.,
$ configure CC=clang CFLAGS=-fsanitize=address ....
configure test programs are also built with the sanitizers and will
report errors, causing configure to fail.

Signed-off-by: Christoph Niethammer <[email protected]>
…de_file

v4.1.x: BTL/OFI: Fix missing include file.
…v4.1.x

v4.1: Fix memory leak in configure, which prevents leak sanitizer usage
The default algorithm selections were out of date and not performing
well. After gathering data from OMPI developers, new default algorithm
decisions were selected for:

    allgather
    allgatherv
    allreduce
    alltoall
    alltoallv
    barrier
    bcast
    gather
    reduce
    reduce_scatter_block
    reduce_scatter
    scatter

These results were gathered using the ompi-collectives-tuning package
and then averaged amongst the results gathered from multiple OMPI
developers on their clusters.

You can access the graphs and averaged data here:
https://drive.google.com/drive/folders/1MV5E9gN-5tootoWoh62aoXmN0jiWiqh3

Signed-off-by: William Zhang <[email protected]>
(cherry picked from commit ce40cfb)
coll/tuned: Change the default collective algorithm selection
The btl/ofi does not currently utilize the common ofi include/exclude
list. Added verification code similar to the mtl/ofi that will check if
the info object is in the include or exclude list. If it isn't in the
include list or is in the exclude list, validate_info will return
OPAL_ERROR. The btl/ofi will no longer pass a provider name as a hint
when calling getinfo, instead filtering the provider during
validate_info.

This patch also moves the is_in_list MTL function into common code and
adds additional debugging output to the BTL to match the MTL standard.

Signed-off-by: William Zhang <[email protected]>
(cherry picked from commit 9b8f463)
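
The include/exclude filtering described in this commit can be sketched as below. This is a simplified model, not the common OFI code: the lists are assumed to be comma-separated strings, a `NULL` list is assumed to mean "not set", and `sketch_is_in_list`, `sketch_validate_provider`, and the `SKETCH_*` return codes are hypothetical names standing in for `is_in_list`, `validate_info`, and `OPAL_ERROR`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum { SKETCH_OK = 0, SKETCH_ERR = -1 };

/* True when `name` exactly equals one comma-separated entry of `list`. */
bool sketch_is_in_list(const char *list, const char *name)
{
    assert(name != NULL);
    if (list == NULL) {
        return false;
    }
    size_t n = strlen(name);
    const char *p = list;
    while (*p != '\0') {
        const char *end = strchr(p, ',');
        size_t len = (end != NULL) ? (size_t) (end - p) : strlen(p);
        if (len == n && strncmp(p, name, n) == 0) {
            return true;
        }
        if (end == NULL) {
            break;
        }
        p = end + 1;
    }
    return false;
}

/* Reject a provider that is missing from a set include list or present
 * in a set exclude list. */
int sketch_validate_provider(const char *include, const char *exclude,
                             const char *prov)
{
    if (include != NULL && !sketch_is_in_list(include, prov)) {
        return SKETCH_ERR;
    }
    if (exclude != NULL && sketch_is_in_list(exclude, prov)) {
        return SKETCH_ERR;
    }
    return SKETCH_OK;
}
```

Filtering after the provider query, rather than passing a provider name as a hint, lets one list govern both the BTL and the MTL, which matches the intent stated in the commit message.
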
(`prte_hwloc_base_get_locality_string` never returns locality string with L0).

Signed-off-by: Mikhail Kurnosov <[email protected]>
(cherry picked from commit 4708458)
…strng

v4.1.x: opal/hwloc: fix a typo in parsing locality string: L0 changed to L1
v4.1.x: btl/ofi: Use common provider include/exclude list
Alter the test to validate misaligned data.

Fixes open-mpi#7954.

Signed-off-by: George Bosilca <[email protected]>
(cherry picked from commit b6d71aa)
Signed-off-by: Brian Barrett <[email protected]>
Signed-off-by: George Bosilca <[email protected]>
(cherry picked from commit c4e88a4)
Signed-off-by: Brian Barrett <[email protected]>
@bwbarrett bwbarrett closed this Aug 12, 2020
@bwbarrett bwbarrett deleted the backports/v4.1.x-7957 branch March 16, 2021 17:25