Skip to content

Conversation

@jsquyres
Copy link
Member

@jsquyres jsquyres commented Jul 10, 2020

NOTE: This is intentionally not a cherry pick from master. See below.

There is a problem with the mpi_f08 module when sizeof(int) !=
sizeof(INTEGER): the size of TYPE(MPI_Status) is too small. This
causes buffer overruns when Open MPI is configured with (for example)
sizeof(int)==4 and sizeof(INTEGER)==8, and then you call the mpi_f08
MPI_RECV subroutine. This will end up copying the resulting C
MPI_Status to the buffer pointing to the Fortran status, but the code
does not know if the Fortran status is an mpif.h status or a
TYPE(MPI_Status) -- it just blindly copies over as if the Fortran
status is an INTEGER array of length MPI_STATUS_SIZE. Unfortunately,
TYPE(MPI_Status) is actually smaller than this, so we overrun the
buffer. Hilarity ensues.

The simple fix for this is to make TYPE(MPI_Status) the same size as
INTEGER(MPI_STATUS_SIZE), but we can't do that here on the release
branch because it will break ABI.

This commit does the following:

  • checks to see if we're in a sizeof(int) != sizeof(INTEGER) scenario
  • if so, if the user has not specifically excluded building the
    mpi_f08 module, display a Giant Error Message (GEM) and abort
    configure.

This is unusual; we don't usually abort configure when feature XYZ
can't be built -- if the user didn't specifically ask for XYZ, we
just emit a notice that we won't build XYZ and continue.

This situation is a little different because we're on a release
branch: prior releases have built mpi_f08 by default -- even in this
"bad" scenario. Hence, in this case, we explicitly tell the user that
this is now a known-bad scenario and abort. In the GEM, we give the
user two options:

  1. Change their compiler flags so that sizeof(int) == sizeof(INTEGER)
    and re-run configure, or
  2. Explicitly disable the mpi_f08 module via --enable-mpi-fortran=usempi

Thanks to @ahaichen for reporting the issue.

Note: the proper fix has been implemented on master (i.e., what will
become v5.0.0), but since that breaks ABI, we can't cherry pick it
back here to an existing release branch series.

Signed-off-by: Jeff Squyres [email protected]

Refs #7918

@ggouaillardet
Copy link
Contributor

Are you sure you want configure to fail when Open MPI is built with -fdefault-integer-8?

My understanding is the issue only occurs with the mpi_f08 bindings, so I would simply refuse to build them if the sizes do not match (and have configure automatically fail only if mpi_f08 bindings were explicitly requested by the user, and issue a warning otherwise).

@jsquyres
Copy link
Member Author

My understanding is the issue only occurs with the mpi_f08 bindings, so I would simply refuse to build them if the sizes do not match (and have configure automatically fail only if mpi_f08 bindings were explicitly requested by the user, and issue a warning otherwise).

I guess that's the question: what would a user expect?

  1. Would you expect to have some of the Fortran bindings available (i.e., mpif.h and mpi) but not others (mpi_f08)?
  2. Or would you expect to have none of the Fortran bindings available?

I guess we have (lots of) precedent for having only things that are possible to be available. It still feels a little weird in this case, but I can't put my finger on the reason why.

Bottom line: I don't have a good argument against your suggestion, so I'll update this PR per your suggestion.

@bwbarrett
Copy link
Member

I think suddenly not building the f08 bindings in the 4.0.x series would be unexpected. I would suggest we do like we used to with C++, where we include the check and abort, but if the user disables the f08 bindings, then we don't run the check. If that is possible; can't remember if we have that much fine grained control over which bindings get built.

@jsquyres
Copy link
Member Author

I think suddenly not building the f08 bindings in the 4.0.x series would be unexpected. I would suggest we do like we used to with C++, where we include the check and abort, but if the user disables the f08 bindings, then we don't run the check. If that is possible; can't remember if we have that much fine grained control over which bindings get built.

Yes, I think we have this level of granularity. I will check.

@ggouaillardet
Copy link
Contributor

use mpi is working with 64 bits integer, so having a default configure break
would have a clear impact on endusers.
(I am not arguing use mpi_f08 since it is broken)

IIRC, the logic for the Fortran bindings is "by default, build all you can".
Only if the user explicitly requests some bindings, then abort if they cannot be built.

This is why I suggested we do not build mpi_f08 bindings with 64 bits Fortran INTEGER (because we cannot build them correctly), and keep the existing logic (e.g. only abort if the user configure --with-mpi-fortran=use-mpi-f08, and issue a warning and only build use mpi otherwise).

@jsquyres
Copy link
Member Author

FWIW, we do offer the desired granularity:

   --enable-mpi-fortran    specify which Fortran MPI bindings to build: yes,
                           none (or no), best-effort, mpifh (build only mpif.h
                           support), usempi (build mpif.h and the mpi module),
                           or usempif08 (or all, build mpifh, the mpi module,
                           and the mpi_f08 module) (default: "yes" if Fortran
                           compiler found)

I'm guessing that people who compile Open MPI with INTEGER=8 bytes and int4 bytes is pretty small. So we're probably talking about a change that only affects a small number of users.

I think Gilles is right to cite our "build everything you can, and mention-but-ignore what you can't build" philosophy. We have lots of precedence for that.

Per Brian's point: not building the mpi_f08 module in this case just moves the detection of the error up from confusing run time errors (i.e., buffer overruns) to configure time. More specifically: if sizeof(int)!=sizeof(INTEGER) and a customer was using any mpi_f08 subroutine involving an MPI_Status (even STATUS_IGNORE), they were potentially triggering silent data corruption.

The real issue here might be the decision not to fix this error (and simply decide not to build mpi_f08). We could fix it, but it's probably very, very invasive (because it will touch everywhere a status is used in -- at a minimum -- the mpi_f08 bindings). That might be the discussion that we need to have.

@bwbarrett Do you feel strongly about this?

@bwbarrett
Copy link
Member

I feel strongly that a customer will be confused if they have f08 bindings with 4.0.4 and 4.0.5 with the exact same configure arguments results in no f08 bindings. I think something like the following in the 4.x branches whenever we detect the mismatch:

Open MPI has detected that sizeof(int) is not equal to sizeof(INTEGER). This is an unsupported configuration with the f08 bindings. In previous versions of Open MPI, this configuration potentially resulted in memory corruption. You may either change the C or Fortran compiler options to make the two integer types the same size or disable the f08 bindings by specifying --enable-mpi-fortran=usempi

That way, no one gets surprised, but also no one has to do much work.

NOTE: This is intentionally not a cherry pick from master.  See below.

There is a problem with the mpi_f08 module when sizeof(int) !=
sizeof(INTEGER): the size of TYPE(MPI_Status) is too small.  This
causes buffer overruns when Open MPI is configured with (for example)
sizeof(int)==4 and sizeof(INTEGER)==8, and then you call the mpi_f08
MPI_RECV subroutine.  This will end up copying the resulting C
MPI_Status to the buffer pointing to the Fortran status, but the code
does not know if the Fortran status is an mpif.h status or a
TYPE(MPI_Status) -- it just blindly copies over as if the Fortran
status is an INTEGER array of length MPI_STATUS_SIZE.  Unfortunately,
TYPE(MPI_Status) is actually smaller than this, so we overrun the
buffer.  Hilarity ensues.

The simple fix for this is to make TYPE(MPI_Status) the same size as
INTEGER(MPI_STATUS_SIZE), but we can't do that here on the release
branch because it will break ABI.

This commit does the following:

- checks to see if we're in a sizeof(int) != sizeof(INTEGER) scenario
- if so, if the user has not specifically excluded building the
  mpi_f08 module, display a Giant Error Message (GEM) and abort
  configure.

This is unusual; we don't usually abort configure when feature XYZ
can't be built -- if the user didn't specifically ask for XYZ, we
just emit a notice that we won't build XYZ and continue.

This situation is a little different because we're on a release
branch: prior releases have built mpi_f08 by default -- even in this
"bad" scenario.  Hence, in this case, we explicitly tell the user that
this is now a known-bad scenario and abort.  In the GEM, we give the
user two options:

1. Change their compiler flags so that sizeof(int) == sizeof(INTEGER)
   and re-run configure, or
2. Explicitly disable the mpi_f08 module via --enable-mpi-fortran=usempi

Thanks to @ahaichen for reporting the issue.

Note: the proper fix has been implemented on master (i.e., what will
become v5.0.0), but since that breaks ABI, we can't cherry pick it
back here to an existing release branch series.

Signed-off-by: Jeff Squyres <[email protected]>
@jsquyres jsquyres force-pushed the pr/v4.0.x/disallow-when-cint-not-equal-to-finteger branch from affa853 to 27836a6 Compare July 10, 2020 21:32
@jsquyres jsquyres added the NEWS label Jul 10, 2020
@jsquyres
Copy link
Member Author

Here's what the new error message looks like:

checking if sizeof(C int) == sizeof(Fortran INTEGER)... no

=========================================================================
                        ERROR   ERROR    ERROR

Configure has detected that the size of a C integer (4 bytes) is
different than the size of a Fortran INTEGER (8 bytes).  In the
entire v3.x and v4.x series of Open MPI, this configuration is known
to cause data corruption with the mpi_f08 module (it does *not* cause
problems with mpif.h or the mpi module bindings).

You may either choose to change the C and/or Fortran compiler flags
(in order to have the size of a C int be the same as the size of a
Fortran INTEGER) and re-run configure, or you may specify the
--enable-mpi-fortran=usempi flag to configure to explicitly
disable building the mpi_f08 module.

(NOTE: this error has been fixed in Open MPI releases beyond v4.x)

=========================================================================

configure: error: Cannot continue

@jsquyres jsquyres marked this pull request as ready for review July 10, 2020 21:36
@jsquyres jsquyres requested a review from bwbarrett July 10, 2020 21:36
@gpaulsen gpaulsen merged commit 90d868a into open-mpi:v4.0.x Jul 13, 2020
@jsquyres jsquyres deleted the pr/v4.0.x/disallow-when-cint-not-equal-to-finteger branch July 13, 2020 19:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants