Skip to content

Conversation

@MaskRay
Copy link
Contributor

@MaskRay MaskRay commented Dec 1, 2019

Due to a GNU ld bug (https://sourceware.org/bugzilla/show_bug.cgi?id=25236),
the 5 common symbols are incorrectly versioned VER_NDX_LOCAL
because their definitions in Fortran sources have smaller st_size than
those in libmpi.so.

This makes the Fortran library not linkable with lld in distributions
that ship openmpi built with -Wl,--version-script
(https://bugs.llvm.org/show_bug.cgi?id=43748):

% mpifort -fuse-ld=lld /dev/null
ld.lld: error: corrupt input file: version definition index 0 for symbol
mpi_fortran_argv_null_ is out of bounds
>>> defined in /usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi_usempif08.so
...

Work around it by changing dimensions so that st_size of Fortran sources
are at least as large as those in libmpi.so (equal on 64-bit platforms).

This also fixes a minor issue that MPI_UNWEIGHTED and
MPI_WEIGHTS_EMPTY were not declared as arrays (not fully fixed by commit
107c007).

Fixes #7209

@ompiteam-bot
Copy link

Can one of the admins verify this patch?

@MaskRay MaskRay force-pushed the mpi_fortran_argv_null_ branch 2 times, most recently from dd34892 to f5da714 Compare December 1, 2019 01:24
@MaskRay
Copy link
Contributor Author

MaskRay commented Dec 1, 2019

I've verified this works for HEAD and openmpi 3.1.3

Working with the build system is not very pleasant. For my own notes,

To trigger re-link of ompi/mpi/fortran/use-mpi-f08/.libs/libmpi_usempif08.so:

rm ompi/mpi/fortran/use-mpi-f08/libmpi_usempif08.la
make -C ompi/mpi/fortran/use-mpi-f08 libmpi_usempif08.la

To regenerate mpif-c-constants-decl.h and mpif-c-constants-decl.h. The headers should be depended by many targets, but sadly the build system does not track them.

make -C ompi/include mpif-c-constants-decl.h

The definitions in Fortran sources come from use :: mpi_f08_types constructs. Those targets should depend on ompi/mpi/fortran/use-mpi-f08/mod/mpi_f08_types.mod but sadly they don't. Nor do I know how to rebuild .mod.

@ggouaillardet
Copy link
Contributor

did you try to rebuild a part of Open MPI from a subdirectory?

or did you simply run make in the top build directory and some stuff was not rebuilt as expected?
I am aware of some issues/features about the former, but would be surprised if the latter did not work.

@jsquyres
Copy link
Member

jsquyres commented Dec 3, 2019

ok to test

Copy link
Member

@jsquyres jsquyres left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's true that the Fortran stuff in the build system probably doesn't track all the sources' and targets' dependencies properly. We got spoiled by Automake handling all of that automatically for us in C/C++, and didn't really spend the time to do it properly in the Fortran interface. Plus, we rarely need to go update the Fortran interface, so no one really noticed / cared. 😦 PR's to fix this would be appreciated.

c_type => "char *",
c_name => "mpi_fortran_argv_null",
f_type => "character, dimension(1)",
f_type => "character, dimension(8)",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need checks here to adjust for 32 or 64 bits? I.e., do we know that these sizes will be the same as those in libmpi.*?

I ask because I don't fully grok the ld issue, nor the workaround -- i.e., why making the symbols bigger actually makes it better.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fortran object files and libmpi.so are used to link libmpi_usempif08.so. We need to make sure st_size (1 before this change) of common symbols in Fortran object files are not smaller than st_size in libmpi.so (4 on 32-bit platforms). Increasing it to 8 does not matter (just like st_size=1 does not matter before this change).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forgive me: I see what you are doing here, I still don't understand why, though. I know enough about linkers to know that every time I think I understand them, I find out that I don't know jack about linkers. 😦

Let's discuss one of these common symbol constants in detail (the others are all generally the same):

  1. In C, mpi_fortran_argv_null is declared in ompi/include/mpif-c-constants.h as char *mpi_fortran_argv_null_. In Fortran -- originally from this Perl script -- MPI_ARGV_NULL is declared as character, dimension(1). It's an array of size 1 so that it would match the dereferenced C type (i.e., (char*)). If we bump the Fortran array to size 8, do we need to make the C declaration be char mpi_fortran_argv_null_[8]? Or is that naieve, and it was never about matching the size of the dereferenced item -- since Fortran passes by reference, it was always about matching the size of the pointer to the item? That would seem weird, though...
  2. I'm unclear on why the symbol in Fortran has to be greater than or equal to the size of the C variable. Why doesn't it have to be the same size?

Also, what will this do in versions of ld that do not have this bug?

Finally, what will this do to our ABI? We unfortunately have strong ABI guarantees within a single release series (e.g., within the v4.0.x series).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we bump the Fortran array to size 8, do we need to make the C declaration be char mpi_fortran_argv_null_[8]?

MPI 3.1 defines types of these C/Fortran constants, but I am very unclear about how these constants are used. It does appear weird to me that in C, it is a pointer while in Fortran, it is a character array. Due to 2., we have to make st_size of common symbols consistent.

  1. I'm unclear on why the symbol in Fortran has to be greater than or equal to the size of the C variable. Why doesn't it have to be the same size?

ompi/mpi/fortran/base/gen-mpi-mangling.pl does not know the wordsize of the target platform. Ideally the same sizes are the best.

Also, what will this do in versions of ld that do not have this bug?

The exact behavior is described in https://sourceware.org/bugzilla/show_bug.cgi?id=25236

  1. The definition os mpi_fortran_argvs_null_ will be provided by libmpi.so. It seems I have a debate with Alan Modra in that bug as whether resolving the symbol to libmpi.so makes sense. Alan insists that GNU ld's behavior is correct. Though it is at odds with other symbols where definitions in object files always win.
  2. mpi_fortran_argvs_null_ gets an incorrect version VER_NDX_LOCAL.

Finally, what will this do to our ABI? We unfortunately have strong ABI guarantees within a single release series (e.g., within the v4.0.x series).

st_size is the only affected field. It may affect an ELF hack: copy relocations, which only appears in non-pic case. Fortunately, we probably don't have to consider copy relocations, because version symbols were not correctly assigned to these symbols, so it never worked.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay in replying / thanks for the reminder ping (your last post was right before I traveled for work, and that always makes a disaster of my INBOX ☹️).

MPI 3.1 defines types of these C/Fortran constants, but I am very unclear about how these constants are used.

Let me take a step back and explain what we're trying to do here.

  1. MPI-3.1 defines these constants that can be used as sentinel values, passed in via parameters to MPI functions. E.g., a Fortran app can call an MPI subroutine and pass MPI_ARGV_NULL as one of the parameters. This sentinel will then trigger some specific behavior in that MPI subroutine.
  2. Open MPI's base-level Fortran bindings for MPI subroutines are actually written in C. Meaning: we do some testing in configure to figure out which of the 4 common symbol nomenclatures are used for Fortran subroutines, and use some C preprocessor magic to make our C implementations of the Fortran bindings adhere to that nomenclature (sometimes through weak symbols).
    • The implication here is that Fortran user codes are directly calling C -- and not via ISO(C) binding. We do this because not all Fortran compilers that we have to support include support for ISO C binding. And even if they did, there's some parameters that do not translate to ISO C binding (specifically: Fortran LOGICAL). Ugh!
    • As such, all subroutine parameters are received in that C function as references -- i.e., pointers to that actual values.
    • Hence, if you pass a Fortran INTEGER parameter, we receive it in C as (int*) (assuming Fortran INTEGER and C int are the same size).
  3. Therefore, for sentinels like MPI_ARGV_NULL, the Fortran version of the symbol must be the correct type to pass through the strict Fortran interface.
    • All we care about in C is testing if a given parameter matches a sentinel symbol (i.e., to know if MPI_ARG_NULL was passed or not).

Is there a better way to do this?

The limiting factor here is that the Fortran bindings are written in C, so we need to be able to compare these parameters for equality with sentinel values in C. Some of these MPI routines are highly performance critical; making an additional call out to an actual Fortran routine to do the comparison is... undesirable (i.e., it would add overhead in a performance-sensitive section of code).

It does appear weird to me that in C, it is a pointer while in Fortran, it is a character array. Due to 2., we have to make st_size of common symbols consistent.

Ok. If the real issue is that somehow it has been working for years even though our sizes have been inconsistent between C and Fortran but something broke -- at least in one released version of the linker -- such that our sloppiness doesn't work, but it does work if the sizes are equal / we're not sloppy, then I would agree with you that this strategy is an improvement and we should do it.

ompi/mpi/fortran/base/gen-mpi-mangling.pl does not know the wordsize of the target platform. Ideally the same sizes are the best.

FWIW: we can fix that (i.e., to make the script know the wordsize of the target).

Also, what will this do in versions of ld that do not have this bug?

The exact behavior is described in https://sourceware.org/bugzilla/show_bug.cgi?id=25236

  1. The definition os mpi_fortran_argvs_null_ will be provided by libmpi.so. It seems I have a debate with Alan Modra in that bug as whether resolving the symbol to libmpi.so makes sense. Alan insists that GNU ld's behavior is correct. Though it is at odds with other symbols where definitions in object files always win.
  2. mpi_fortran_argvs_null_ gets an incorrect version VER_NDX_LOCAL.

I wonder if we're approaching this wrong. We shouldn't be trying to change the Fortran type of MPI_ARGV_NULL -- we should be changing the C type of the symbol that we're using to compare.

Specifically: Fortran is going to pass the address of the Fortran symbol MPI_ARGV_NULL to the MPI function. We have a corresponding C symbol, and we basically do this comparison in C to see if that sentinel value was passed:

extern char * mpi_fortran_argv_null_;
#define OMPI_IS_FORTRAN_ARGV_NULL(addr) \
       (addr == (void*) &mpi_fortran_argv_null_)

// This is the C function implementing the Fortran binding for MPI_COMM_SPAWN
void ompi_comm_spawn_f(char *command, char *argv, MPI_Fint *maxprocs,
                      MPI_Fint *info, MPI_Fint *root, MPI_Fint *comm,
                      MPI_Fint *intercomm, MPI_Fint *array_of_errcodes,
                      MPI_Fint *ierr, int cmd_len, int string_len)
{
    // ...
    if (OMPI_IS_FORTRAN_ARGV_NULL(argv)) {
        // ...do ARGV_NULL things...
    }
    // ...
}

With this example, I guess I'm still confused as to why the Fortran MPI_ARGV_NULL (which is of size 1) is not equal in size to the C mpi_fortran_argv_null_ (which is also of size 1). I'm also confused as to why this matters, because all we're doing is comparing the address of the parameter that is passed in (i.e., argv) to &mpi_fortran_argv_null_ -- the size of the dereferenced symbols shouldn't matter...?

What am I missing?

@MaskRay
Copy link
Contributor Author

MaskRay commented Dec 29, 2019

Ping:)

GNU ld 2.34 will not have the bug (https://sourceware.org/bugzilla/show_bug.cgi?id=25236). You may mention that in the commit message.

This works around the GNU ld<2.34 bug but I also consider it a general improvement. The merging rule between a common symbol and a shared symbol is complex. I noticed a problem in lld (unrelated to this report) and fixed it https://reviews.llvm.org/D71161

@MaskRay
Copy link
Contributor Author

MaskRay commented Dec 31, 2019

Let me take a step back and explain what we're trying to do here.

Thank you for your explanation:) I think it'd be nice to record it permanently somewhere (wiki or comments).
To be honest I don't use openmpi daily. I used it a few years ago for student cluster competition projects (LINPACK, licom, Quantum Espresso, OpenFOAM, NAMD, Palabos, etc), but now my career is proceeding in a very different direction (compiler, toolchain) ;-) When I received https://bugs.llvm.org/show_bug.cgi?id=43748 , it reminded me of the ASC/HPCAC-ISC old days.. So I told myself I should at least try investigating the problem more deeply. That was how I ended up sending a patch here.

Specifically: Fortran is going to pass the address of the Fortran symbol MPI_ARGV_NULL to the MPI function. We have a corresponding C symbol, and we basically do this comparison in C to see if that sentinel value was passed:

extern char * mpi_fortran_argv_null_;
#define OMPI_IS_FORTRAN_ARGV_NULL(addr)
(addr == (void*) &mpi_fortran_argv_null_)

I noticed this piece of code. Thank you bring up this point. Now I more firmly believe the approach used in this PR should solve the lld linking problem.

It does appear weird to me that in C, it is a pointer while in Fortran, it is a character array. Due to 2., we have to make st_size of common symbols consistent.

Ok. If the real issue is that somehow it has been working for years even though our sizes have been inconsistent between C and Fortran but something broke -- at least in one released version of the linker -- such that our sloppiness doesn't work, but it does work if the sizes are equal / we're not sloppy, then I would agree with you that this strategy is an improvement and we should do it.

So GNU ld has 2 problems. The first is a bug while the second is a minor issue.

  1. The symbols have larger st_size in libmpi.so (regular symbols) than st_size in Fortran object files (common symbols).
    This causes the symbols to have VER_NDX_LOCAL versions.
    The bug will be fixed in binutils 2.34.
    If st_size of the common symbol is equal to or larger than the st_size in libmpi.so, then this bug won't fire.
    (This is why I said larger st_size does not matter.)
  2. VER_NDX_LOCAL versioned symbols should behave as if they have the binding of STB_LOCAL.
    See https://github.com/llvm/llvm-project/blob/master/lld/ELF/Symbols.cpp#L264 for the symbol binding rule in lld.
    They should not be used to resolve undefined references in other executables/shared objects.
    lld correctly reports an error (https://bugs.llvm.org/show_bug.cgi?id=43748) but GNU ld appears to silently accept this. This problem has not been fixed in binutils.

Point 2 depends on point 1. If we fix or mitigate 1, then 2 will be nullified as well.

FWIW: we can fix that (i.e., to make the script know the wordsize of the target).

This will certainly be a better fix, but it requires a more intrusive change and I am afraid my openmpi-fu (TBH motivation) is not enough to achieve that.. See my reasoning in point 1.

extern char * mpi_fortran_argv_null_;
With this example, I guess I'm still confused as to why the Fortran MPI_ARGV_NULL (which is of size 1) is not equal in size to the C mpi_fortran_argv_null_ (which is also of size 1).

There may be a confusion about common symbols. A common symbol is a special case of a definition.

A reference to extern char * mpi_fortran_argv_null_; creates an undefined symbol (st_shndx=SHN_UNDEF) of st_size=0. It is not a common symbol.

The C definition (common symbol char * mpi_fortran_argv_null;) is in ompi/include/mpif-c-constants.h. There it is a common symbol (st_shndx=SHN_COMMON) and st_size=8.

@jsquyres
Copy link
Member

jsquyres commented Jan 3, 2020

Ah, my understanding is slowly getting there, but I have a few more questions:

  1. You keep mentioning st_size. Is this the linker terminology for the actual size of the symbol in the .so file?
  2. Ditto for VER_NDX_LOCAL -- does that refer to the linker's knowledge of where the symbol is actually located (e.g., in the currently-scoped .o file or something)?
  3. I still don't fully grok the "must be same size or bigger" issue -- I think you're saying that "same size is best, but if it's bigger, it's sloppy but it still will work". Is that right?

Let's distill this down to the smallest case:

In a Fortran file (ompi/include/mpif-f08-types.h), we have the following:

character, dimension(1), bind(C, name="mpi_fortran_argv_null_") :: MPI_ARGV_NULL

In a single C .h file (ompi/include/mpif-c-constants.h) which gets included in exactly one .c file (ompi/runtime/ompi_mpi_init.c, i.e., it gets instantiated exactly once), we have:

char * mpi_fortran_argv_null_;

(and similar for the other MPI sentinel / global constant values like this)
(the Perl script on this PR -- ompi/mpi/fortran/base/gen-mpi-mangling.pl -- is the one that generates this .h file)

This is the mismatch that we're talking about, right?

Your patch changes the Fortran declaration of ARGV_NULL from dimension(1) to dimension(8).

Is that because it has to match the size of C (char*)?

If so, should we really change the C to char mpi_fortran_argv_null_ -- i.e., a (char), not a (char*)? Then they should both be of size 1, right?

(ditto for UNWEIGHTED and WEIGHTS_EMPTY)

Finally, I note that there are 9 of these MPI Fortran/C sentinel values (see the rest of mpif-f08-types.h and mpif-c-constants.h), but your PR only updates 3 of them. Are the other 6 somehow not a problem?

@MaskRay
Copy link
Contributor Author

MaskRay commented Jan 28, 2020

I just realized I dropped the ball.

Ah, my understanding is slowly getting there, but I have a few more questions:

  1. You keep mentioning st_size. Is this the linker terminology for the actual size of the symbol in the .so file?

Yes. Though, this is a best-effort thing. It doesn't affect program behaviors, except when copy relocation is used.
The st_size field of a symbol in an executable decides how many bytes ld.so should copy at runtime.

Symbolizers (addr2line, llvm-symbolizer) use st_size to symbolize addresses.

  1. Ditto for VER_NDX_LOCAL -- does that refer to the linker's knowledge of where the symbol is actually located (e.g., in the currently-scoped .o file or something)?

It is only useful for a defined symbol. It essentially changes the output binding of the symbol to STB_LOCAL. (https://github.com/llvm/llvm-project/blob/master/lld/ELF/Symbols.cpp#L269).

I still don't fully grok the "must be same size or bigger" issue --

"must be same size or bigger" is to work around the GNU ld bug.

I think you're saying that "same size is best, but if it's bigger, it's sloppy but it still will work". Is that right?

Yes. "Same size is best" is to avoid user questions.

Let's distill this down to the smallest case:

If so, should we really change the C to char mpi_fortran_argv_null_ -- i.e., a (char), not a (char*)?Then they should both be of size 1, right?

I think char mpi_fortran_argv_null_ is preferable, if it works.

(ditto for UNWEIGHTED and WEIGHTS_EMPTY)

If st_size=1 works, let's do it.

Are the other 6 somehow not a problem?

They are not a problem. They don't trigger the GNU ld bug.

@jsquyres
Copy link
Member

jsquyres commented Feb 1, 2020

@MaskRay I just pushed a 2nd commit to your PR here with what I think should be the changes. Could you review and test? If it's correct, we can squash these two commits together and merge.

Thanks!

@MaskRay
Copy link
Contributor Author

MaskRay commented Feb 2, 2020

My /usr/local/bin/ld points to lld.

readelf -V ompi/mpi/fortran/use-mpi-f08/.libs/libmpi_usempif08.so.0.0.0 => no bogus 0 entries.

So this works! Thanks

@jsquyres jsquyres force-pushed the mpi_fortran_argv_null_ branch from a9a8836 to f391255 Compare February 2, 2020 12:17
jsquyres pushed a commit to MaskRay/ompi that referenced this pull request Feb 2, 2020
Fix the C types for the following:

* MPI_UNWEIGHTED
* MPI_WEIGHTS_EMPTY
* MPI_ARGV_NULL
* MPI_ARGVS_NULL
* MPI_ERRCODES_IGNORE

There is lengthy discussion on
open-mpi#7210 describing the issue; the
gist of it is that the C and Fortran types for several MPI global
sentenial values should agree (specifically: their sizes must(**)
agree).  We erroneously had several of these array-like sentinel
values be "array-like" values in C.  E.g., MPI_ERRCODES_IGNORE was an
(int *) in C while its corresponding Fortran type was "integer,
dimension(1)".  On a 64 bit platform, this resulted in C expecting the
symbol size to be sizeof(int*)==8 and sizeof(INTEGER,
DIMENSION(1))==4.

That is incorrect -- the corresponding C type needed to be (int).
Then both C and Fortran expect the size of the symbol to be the same.

(**) NOTE: This code has been wrong for years.  This mismatch of types
typically *worked* because, due to Fortran's call-by-reference
semantics, Open MPI was comparing the *addresses* of these instances,
not their *types* -- so even if C expected the size of the symbol to
be X and Fortran expected the size of the symbol to be Y, all we
really checked at run time was that the addresses of the symbols were
the same.  But it caused linker warning messages, and even cause
errors in some cases.

Specifically: due to a GNU ld bug
(https://sourceware.org/bugzilla/show_bug.cgi?id=25236), the 5 common
symbols are incorrectly versioned VER_NDX_LOCAL because their
definitions in Fortran sources have smaller st_size than those in
libmpi.so.

This makes the Fortran library not linkable with lld in distributions
that ship openmpi built with -Wl,--version-script
(https://bugs.llvm.org/show_bug.cgi?id=43748):

  % mpifort -fuse-ld=lld /dev/null
  ld.lld: error: corrupt input file: version definition index 0 for symbol
  mpi_fortran_argv_null_ is out of bounds
  >>> defined in /usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi_usempif08.so
  ...

If we fix the C and Fortran symbols to actually be the same size, the
problem goes away and the GNU ld bug does not come into play.

This commit also fixes a minor issue that MPI_UNWEIGHTED and
MPI_WEIGHTS_EMPTY were not declared as Fortran arrays (not fully fixed
by commit 107c007).

Fixes open-mpi#7209

Signed-off-by: Fangrui Song <[email protected]>
Signed-off-by: Jeff Squyres <[email protected]>
@jsquyres
Copy link
Member

jsquyres commented Feb 2, 2020

Ok, squashed commit pushed. Can you check over the commit message and make sure I got all of that correct? This is a complex enough issue that it's worth documenting properly in the git commit message (so that when someone looks at this commit 6+ months from now, they have a hope of understanding why this change was made).

jsquyres pushed a commit to MaskRay/ompi that referenced this pull request Feb 2, 2020
Fix the C types for the following:

* MPI_UNWEIGHTED
* MPI_WEIGHTS_EMPTY
* MPI_ARGV_NULL
* MPI_ARGVS_NULL
* MPI_ERRCODES_IGNORE

There is lengthy discussion on
open-mpi#7210 describing the issue; the
gist of it is that the C and Fortran types for several MPI global
sentenial values should agree (specifically: their sizes must(**)
agree).  We erroneously had several of these array-like sentinel
values be "array-like" values in C.  E.g., MPI_ERRCODES_IGNORE was an
(int *) in C while its corresponding Fortran type was "integer,
dimension(1)".  On a 64 bit platform, this resulted in C expecting the
symbol size to be sizeof(int*)==8 while Fortran expected the symbol
size to be sizeof(INTEGER, DIMENSION(1))==4.

That is incorrect -- the corresponding C type needed to be (int).
Then both C and Fortran expect the size of the symbol to be the same.

(**) NOTE: This code has been wrong for years.  This mismatch of types
typically *worked* because, due to Fortran's call-by-reference
semantics, Open MPI was comparing the *addresses* of these instances,
not their *types* -- so even if C expected the size of the symbol to
be X and Fortran expected the size of the symbol to be Y, all we
really checked at run time was that the addresses of the symbols were
the same.  But it caused linker warning messages, and even cause
errors in some cases.

Specifically: due to a GNU ld bug
(https://sourceware.org/bugzilla/show_bug.cgi?id=25236), the 5 common
symbols are incorrectly versioned VER_NDX_LOCAL because their
definitions in Fortran sources have smaller st_size than those in
libmpi.so.

This makes the Fortran library not linkable with lld in distributions
that ship openmpi built with -Wl,--version-script
(https://bugs.llvm.org/show_bug.cgi?id=43748):

  % mpifort -fuse-ld=lld /dev/null
  ld.lld: error: corrupt input file: version definition index 0 for symbol
  mpi_fortran_argv_null_ is out of bounds
  >>> defined in /usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi_usempif08.so
  ...

If we fix the C and Fortran symbols to actually be the same size, the
problem goes away and the GNU ld bug does not come into play.

This commit also fixes a minor issue that MPI_UNWEIGHTED and
MPI_WEIGHTS_EMPTY were not declared as Fortran arrays (not fully fixed
by commit 107c007).

Fixes open-mpi#7209

Signed-off-by: Fangrui Song <[email protected]>
Signed-off-by: Jeff Squyres <[email protected]>
@jsquyres jsquyres force-pushed the mpi_fortran_argv_null_ branch from f391255 to 94307d6 Compare February 2, 2020 12:19
Fix the C types for the following:

* MPI_UNWEIGHTED
* MPI_WEIGHTS_EMPTY
* MPI_ARGV_NULL
* MPI_ARGVS_NULL
* MPI_ERRCODES_IGNORE

There is lengthy discussion on
open-mpi#7210 describing the issue; the
gist of it is that the C and Fortran types for several MPI global
sentenial values should agree (specifically: their sizes must(**)
agree).  We erroneously had several of these array-like sentinel
values be "array-like" values in C.  E.g., MPI_ERRCODES_IGNORE was an
(int *) in C while its corresponding Fortran type was "integer,
dimension(1)".  On a 64 bit platform, this resulted in C expecting the
symbol size to be sizeof(int*)==8 while Fortran expected the symbol
size to be sizeof(INTEGER, DIMENSION(1))==4.

That is incorrect -- the corresponding C type needed to be (int).
Then both C and Fortran expect the size of the symbol to be the same.

(**) NOTE: This code has been wrong for years.  This mismatch of types
typically worked because, due to Fortran's call-by-reference
semantics, Open MPI was comparing the *addresses* of these instances,
not their *types* (or sizes) -- so even if C expected the size of the
symbol to be X and Fortran expected the size of the symbol to be Y
(where X!=Y), all we really checked at run time was that the addresses
of the symbols were the same.  But it caused linker warning messages,
and even caused errors in some cases.

Specifically: due to a GNU ld bug
(https://sourceware.org/bugzilla/show_bug.cgi?id=25236), the 5 common
symbols are incorrectly versioned VER_NDX_LOCAL because their
definitions in Fortran sources have smaller st_size than those in
libmpi.so.

This makes the Fortran library not linkable with lld in distributions
that ship openmpi built with -Wl,--version-script
(https://bugs.llvm.org/show_bug.cgi?id=43748):

  % mpifort -fuse-ld=lld /dev/null
  ld.lld: error: corrupt input file: version definition index 0 for symbol
  mpi_fortran_argv_null_ is out of bounds
  >>> defined in /usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi_usempif08.so
  ...

If we fix the C and Fortran symbols to actually be the same size, the
problem goes away and the GNU ld bug does not come into play.

This commit also fixes a minor issue that MPI_UNWEIGHTED and
MPI_WEIGHTS_EMPTY were not declared as Fortran arrays (not fully fixed
by commit 107c007).

Fixes open-mpi#7209

Signed-off-by: Fangrui Song <[email protected]>
Signed-off-by: Jeff Squyres <[email protected]>
@MaskRay
Copy link
Contributor Author

MaskRay commented Feb 2, 2020

Looks great. Thanks!

@jsquyres jsquyres added the bug label Feb 2, 2020
@jsquyres jsquyres merged commit dfbaf23 into open-mpi:master Feb 2, 2020
jsquyres pushed a commit to jsquyres/ompi that referenced this pull request Feb 2, 2020
Fix the C types for the following:

* MPI_UNWEIGHTED
* MPI_WEIGHTS_EMPTY
* MPI_ARGV_NULL
* MPI_ARGVS_NULL
* MPI_ERRCODES_IGNORE

There is lengthy discussion on
open-mpi#7210 describing the issue; the
gist of it is that the C and Fortran types for several MPI global
sentenial values should agree (specifically: their sizes must(**)
agree).  We erroneously had several of these array-like sentinel
values be "array-like" values in C.  E.g., MPI_ERRCODES_IGNORE was an
(int *) in C while its corresponding Fortran type was "integer,
dimension(1)".  On a 64 bit platform, this resulted in C expecting the
symbol size to be sizeof(int*)==8 while Fortran expected the symbol
size to be sizeof(INTEGER, DIMENSION(1))==4.

That is incorrect -- the corresponding C type needed to be (int).
Then both C and Fortran expect the size of the symbol to be the same.

(**) NOTE: This code has been wrong for years.  This mismatch of types
typically worked because, due to Fortran's call-by-reference
semantics, Open MPI was comparing the *addresses* of these instances,
not their *types* (or sizes) -- so even if C expected the size of the
symbol to be X and Fortran expected the size of the symbol to be Y
(where X!=Y), all we really checked at run time was that the addresses
of the symbols were the same.  But it caused linker warning messages,
and even caused errors in some cases.

Specifically: due to a GNU ld bug
(https://sourceware.org/bugzilla/show_bug.cgi?id=25236), the 5 common
symbols are incorrectly versioned VER_NDX_LOCAL because their
definitions in Fortran sources have smaller st_size than those in
libmpi.so.

This makes the Fortran library not linkable with lld in distributions
that ship openmpi built with -Wl,--version-script
(https://bugs.llvm.org/show_bug.cgi?id=43748):

  % mpifort -fuse-ld=lld /dev/null
  ld.lld: error: corrupt input file: version definition index 0 for symbol
  mpi_fortran_argv_null_ is out of bounds
  >>> defined in /usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi_usempif08.so
  ...

If we fix the C and Fortran symbols to actually be the same size, the
problem goes away and the GNU ld bug does not come into play.

This commit also fixes a minor issue that MPI_UNWEIGHTED and
MPI_WEIGHTS_EMPTY were not declared as Fortran arrays (not fully fixed
by commit 107c007).

Fixes open-mpi#7209

Signed-off-by: Fangrui Song <[email protected]>
Signed-off-by: Jeff Squyres <[email protected]>
(cherry picked from commit 5609268)
@MaskRay MaskRay deleted the mpi_fortran_argv_null_ branch February 2, 2020 23:57
cniethammer pushed a commit to cniethammer/ompi that referenced this pull request May 10, 2020
Fix the C types for the following:

* MPI_UNWEIGHTED
* MPI_WEIGHTS_EMPTY
* MPI_ARGV_NULL
* MPI_ARGVS_NULL
* MPI_ERRCODES_IGNORE

There is lengthy discussion on
open-mpi#7210 describing the issue; the
gist of it is that the C and Fortran types for several MPI global
sentenial values should agree (specifically: their sizes must(**)
agree).  We erroneously had several of these array-like sentinel
values be "array-like" values in C.  E.g., MPI_ERRCODES_IGNORE was an
(int *) in C while its corresponding Fortran type was "integer,
dimension(1)".  On a 64 bit platform, this resulted in C expecting the
symbol size to be sizeof(int*)==8 while Fortran expected the symbol
size to be sizeof(INTEGER, DIMENSION(1))==4.

That is incorrect -- the corresponding C type needed to be (int).
Then both C and Fortran expect the size of the symbol to be the same.

(**) NOTE: This code has been wrong for years.  This mismatch of types
typically worked because, due to Fortran's call-by-reference
semantics, Open MPI was comparing the *addresses* of these instances,
not their *types* (or sizes) -- so even if C expected the size of the
symbol to be X and Fortran expected the size of the symbol to be Y
(where X!=Y), all we really checked at run time was that the addresses
of the symbols were the same.  But it caused linker warning messages,
and even caused errors in some cases.

Specifically: due to a GNU ld bug
(https://sourceware.org/bugzilla/show_bug.cgi?id=25236), the 5 common
symbols are incorrectly versioned VER_NDX_LOCAL because their
definitions in Fortran sources have smaller st_size than those in
libmpi.so.

This makes the Fortran library not linkable with lld in distributions
that ship openmpi built with -Wl,--version-script
(https://bugs.llvm.org/show_bug.cgi?id=43748):

  % mpifort -fuse-ld=lld /dev/null
  ld.lld: error: corrupt input file: version definition index 0 for symbol
  mpi_fortran_argv_null_ is out of bounds
  >>> defined in /usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi_usempif08.so
  ...

If we fix the C and Fortran symbols to actually be the same size, the
problem goes away and the GNU ld bug does not come into play.

This commit also fixes a minor issue that MPI_UNWEIGHTED and
MPI_WEIGHTS_EMPTY were not declared as Fortran arrays (not fully fixed
by commit 107c007).

Fixes open-mpi#7209

Signed-off-by: Fangrui Song <[email protected]>
Signed-off-by: Jeff Squyres <[email protected]>
(cherry picked from commit 5609268)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

mpi_fortran_argv_null_ has incorrect st_size in ompi/mpi/fortran/base/gen-mpi-mangling.pl

4 participants