Skip to content

Conversation

t-bltg
Copy link
Contributor

@t-bltg t-bltg commented Jun 22, 2018

Avg response time coverage on master
Issue Stats Codecov branch

Summary of changes

Currents checks for memory reallocation is incorrectly triggered by a singleton on 1st dim when referencing a coarray (delta == 1) in a derived type.

@rouson @gutmann @zbeekman Steps to Reproduce mentioned in #552 passes with the .dat files given.

+ mpiexec -np 4 ./test-ideal
 Number of images =            4
           1 domain%initialize_from_file('input-parameters.txt')
 ximgs=           2 yimgs=           2
 call master_initialize(this)
 call this%variable%initialize(this%get_grid_dimensions(),variable_test_val)
  Layer height       Pressure        Temperature      Water Vapor
      [m]              [hPa]             [K]            [kg/kg]
   9750.00000       271.047180       206.509430       9.17085254E-06
   7750.00000       364.236786       224.725372       7.91714992E-05
   5750.00000       481.825287       243.449936       5.01311326E-04
   3750.00000       628.424316       262.669800       2.46796501E-03
   1750.00000       809.217651       282.372711       9.08217765E-03
 ThompMP: read qr_acr_qg.dat instead of computing
 qr_acr_qg initialized:  0.104000002            
 ThompMP: read qr_acr_qs.dat instead of computing
 qr_acr_qs initialized:   6.89999983E-02        
 ThompMP: read freezeH2O.dat instead of computing
 freezeH2O initialized:  0.500000000            
 qi_aut_qs initialized:   1.40000004E-02        
 
 Beginning simulation...
 Model run time:  61.352 seconds
<rest truncated ...>

Rationale for changes

Should fix #552
Added a regression test covering this patch.

WIP: please don't merge yet

EDIT 1: bug is only partially resolved, see the test case

EDIT 2: the issue seems to lie in send_for_ref ...

EDIT 3: ready to merge, stop committing to this branch

co % a(1, :, :)[remote] = co % b(1, :, :) => fails, I do not see (yet ?) how to fix this ...
The workaround is to use a range :
co % a(1:1, :, :)[remote] = co % b(1:1, :, :) => fixed by this PR
co % a(:, :, :)[remote] = co % b(:, :, :) => works as expected before the PR

This PR should merge without conflict from commit b09b88e
I known this PR is big and ugly, but without the whole cleanup, debugging is painful.

Maybe we should also enforce const qualifiers, but that would be for another PR.

In addition to the TRAVIS build, I tested this PR with the following compilers:

  • gcc 9.0.0 (devel - e255d1cb8f1e)
  • gcc 8.1.0
  • gcc 7.3.0
  • gcc 6.4.0

The caf-testsuite was ran to ensure non-regression, in addition to the OpenCoarrays embedded test suite.

Toolchain

gcc 8.1.0 + patches
mpich 3.2.1
caf 2.1.0 + patches
glibc 2.27

Additional info and certifications

This pull request (PR) is a:

  • Bug fix
  • Feature addition
  • Other, Please describe:

I certify that

  • I reviewed and followed the contributing guidelines, including
    - Increasing test coverage for all feature-addition PRs
    - Increasing test coverage for all bug-fix PRs for which there
    does not already exist a related test that failed before the PR
    - At least maintaining test coverage for all other PRs
    - Ensuring that all tests pass when run locally
    - Naming PR to indicate work in progress (WIP) and to attach the PR
    to the appropriate bug report or feature request issue
    - White space (no trailing white space or white space errors may
    be introduced)
    - Commenting code where it is non-obvious and non-trivial
    - Logically atomic, self consistent and coherent commits
    - Commit message content
    - Waiting 24 hours before self-approving the PR to give another
    OpenCoarrays developer a chance to review my proposed code

@gutmann
Copy link

gutmann commented Jun 22, 2018

This is great to see!

Copy link
Collaborator

@zbeekman zbeekman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@neok-m4700: Oh my! You rock! 💯

I have a few (hopefully quick) changes that I would like to request in the test. (Commented on the diff, should show up in the code review/discussion.)

If you don't think you'll be able to get to them, please let me know and I'll make some time this weekend to investigate & implement.

Otherwise, I'll merge this as soon as the changes are made and the tests pass.

Thanks, so much, this is a most welcome contribution!

! 3/4: _gfortran_caf_send_by_ref() remote desc dim[2] = (lb = 1, ub = 103, stride = 20)
! 3/4: _gfortran_caf_send_by_ref() extent(dst, 0): 1 != delta: 20. ! <====== mistmatch src_cur_dim not incremented
! 3/4: _gfortran_caf_send_by_ref() extent(dst, 1): 20 != delta: 101.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@neok-m4700 Can you please insert a global barrier (sync all) here? And then print the "Test passed." message from just the first image?


allocate(co % a(1, 10, 20)[*], co % b(1, 10, 20))
call random_number(co % b)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While we're not actually checking the output anywhere, don't we need a synchronization here, so that a is assigned from b after random_number returns?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zbeekman
Will include your changes ASAP !

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing, cheers!

! 3/4: _gfortran_caf_send_by_ref() extent(dst, 0): 1 != delta: 20. ! <====== mistmatch src_cur_dim not incremented
! 3/4: _gfortran_caf_send_by_ref() extent(dst, 1): 20 != delta: 101.

write(*, *) 'Test passed.'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to ensure that co % a(1,:,:)[2] == co % b(1,:,:)[1] .and. co % a(1,:,:)[1] == co % b(1,:,:)[2] before marking the test as successful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good advice, so this does not crash, but there seems to be an issue left related to rank reduction. Working on it ...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. If you can't fix this issue, a simpler test that doesn't trigger it, along with a new bug report would more than suffice.

@zbeekman
Copy link
Collaborator

FYI, test failure was on GCC 6 branch, due to #347 / #266 / #161 Re-triggered the test to see if it will pass on the second attempt, since this bug is non-deterministic

@codecov
Copy link

codecov bot commented Jun 23, 2018

Codecov Report

Merging #555 into master will increase coverage by 0.51%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master     #555      +/-   ##
==========================================
+ Coverage   52.58%   53.09%   +0.51%     
==========================================
  Files           4        4              
  Lines        3406     3343      -63     
==========================================
- Hits         1791     1775      -16     
+ Misses       1615     1568      -47

neok-m4700 added 2 commits June 23, 2018 17:59
Macro indentation
Convert func_name () => func_name() (manually), rationale: avoid splitting lines
Consistency across dprint(...) calls
Change binary (-/+) operator split accross multiple lines: a = b \n + c => a = \n b + c
Consistency of comments
{for,if,while}(...) => {for,if,while} (...)
Space around = and ; in for loops
Rationale: avoid splitting statements with multiple nested conditions, and enhance readability
@ghost ghost added the needs-review label Jun 23, 2018
Copy link
Collaborator

@zbeekman zbeekman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely needs attention from you:

  • The most recent commit triggers build failures of the library for debug builds (-DCMAKE_BUILD_TYPE=DEBUG)

May need attention from you, or just a build system update (I can do it):

  • The test you added started failing with GCC 7.3 debug builds after it was updated. It seems to be fine with 6. It's possible that the test requires functionality Andre added/fixed in GFortran >= 7.4. Could you please look at the error when you have a chance and let me know if:
  1. This is actually a new bug that we (you) have found
  2. This is expected and the test should not be run with GFortran 7.3 because it doesn't have the latest updates to the by ref stuff (this is appears to me to be the case, but I haven't dug into it)
  3. If I should adjust the build system to pass on running the test if GFortran >= 7.1 and <= 7.4

Thanks!

KINDCASE(1, int8_t);
KINDCASE(2, int16_t);
KINDCASE(4, int32_t);
KINDCASE(8, int64_t);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Debug builds in this commit fail to compile due to errors attributed to lines 3240-3243 with all versions of GCC/GFortran we are testing:

/home/travis/build/sourceryinstitute/OpenCoarrays/src/mpi/mpi_caf.c:3240:25: error: expected expression before ‘int8_t’
             KINDCASE(1, int8_t);
                         ^~~~~~
# etc.

Something in this commit broke all the builds. I appreciate the cleanup, and would welcome it if you can resolve the issue, but I would also be content with resolving this with a git reset HEAD^ ; git push --force

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, pushed too soon, fixing this issue right now.

end program

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes to this test seem to have broken it when running a debug build test with GFortran 7.3 on Ubuntu. (The error from the previous commit is a different, unrelated issue.) Does it require the latest changes in GFortran 8 and >= 7.4?

neok-m4700 added 16 commits June 23, 2018 22:56
Portable size_t and ptrdiff_t formatting
More mpi checks
More indent fixes
…ingleton.f90

Relax the test failure until the bug is found ...
FIXME: The data is not put correctly on remote by MPI_Put, only a single dtype value is set
mpi_caf.c: debug array referencing
libcaf.h: consistency tabs/spaces
… duplication, without additional code gen

Always insert a constant prefix: caf_this_image, caf_num_images, __FUNCTION__
Ref:
- gcc.gnu.org/onlinedocs/cpp/Variadic-Macros.html
- gcc.gnu.org/onlinedocs/cpp/Swallowing-the-Semicolon.html
dst_dim seems to be failing some increment in send_for_ref
see the log below:
1/2: send_for_ref(5119) Entering send_for_ref: dst_offset = 0, desc_offset = 0, ds_glb = 0, desc_glb = 0
1/2: send_for_ref(5216) image_index = 1, num = 1, src_size = 4, src_dim = 0, dst_dim = 0, ref_type = CAF_REF_ARRAY
1/2: send_for_ref(5302) remote desc rank: 2 (ref_rank: 1)
1/2: send_for_ref(5307) remote desc dim[0] = (lb = 1, ub = 8, stride = 1)
1/2: send_for_ref(5307) remote desc dim[1] = (lb = 1, ub = 4, stride = 8)
1/2: send_for_ref(5313) array_ref_dst[0] = CAF_ARR_REF_SINGLE := array_ref_src[0] = CAF_ARR_REF_SINGLE
1/2: send_for_ref(5119) Entering send_for_ref: dst_offset = 0, desc_offset = 0, ds_glb = 0, desc_glb = 0
1/2: send_for_ref(5216) image_index = 1, num = 1, src_size = 4, src_dim = 1, dst_dim = 0, ref_type = CAF_REF_ARRAY
1/2: send_for_ref(5313) array_ref_dst[0] = CAF_ARR_REF_SINGLE := array_ref_src[1] = CAF_ARR_REF_FULL
1/2: send_for_ref(5119) Entering send_for_ref: dst_offset = 0, desc_offset = 0, ds_glb = 0, desc_glb = 0
1/2: send_for_ref(5216) image_index = 1, num = 1, src_size = 4, src_dim = 2, dst_dim = 0, ref_type = CAF_REF_ARRAY
1/2: send_for_ref(5313) array_ref_dst[0] = CAF_ARR_REF_SINGLE := array_ref_src[2] = CAF_ARR_REF_FULL
1/2: send_for_ref(5119) Entering send_for_ref: dst_offset = 0, desc_offset = 0, ds_glb = 0, desc_glb = 0
1/2: put_data(5017) (win: -1610612734, image: 2, offset: 0) <- 0x21e7440, num: 1, size 4 -> 4, dst type 3(4), src type 3(4)
co % a(1, :, :)   => CAF_ARR_REF_SINGLE, test still failing
co % a(1:1, :, :) => CAF_ARR_REF_RANGE, test passes
Arithmetic overflow gcc-7 on test
Build failure with gcc-7

Runtim failure with gcc-9 on debug output
@zbeekman
Copy link
Collaborator

zbeekman commented Jun 26, 2018

LGTM

Thank you, oh so very much, @neok-m4700 🎉 🙇 🙏

Without contributors like you Open Source projects like OpenCoarrays would be doomed! Thanks again for this contribution and all your hard work! 💯

Approved with PullApprove

@zbeekman zbeekman merged commit 81f1114 into sourceryinstitute:master Jun 26, 2018
@ghost ghost removed the needs-review label Jun 26, 2018
@rouson
Copy link
Member

rouson commented Jun 27, 2018

Thanks for fixing this longstanding issue, @neok-m4700 !

@t-bltg t-bltg deleted the icar branch June 27, 2018 17:25
@gutmann
Copy link

gutmann commented Jul 2, 2018

I finally had a chance to test this out. While it does indeed fix the issue for 4 images, and is able to communicate with a remote image, for me this is still breaking with >64 images. It works with 63 and 64, but breaks with 65 (and up). I've tried a number of larger sizes up to 720 and nothing works above 64 :( I don't think this is an issue with coarray-icar code as that code base has run fine with gfortran 6.3 and cray fortran up to much larger numbers of images.

Can anyone else reproduce this?

r1i4n6:tests> mpiexec -np 65 ./test-ideal
Attaching 36339 to 685660.chadmin1

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 36431 RUNNING AT r2i1n30.ib0.cheyenne.ucar.edu
=   EXIT CODE: 11
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:0@r1i4n6] HYD_pmcd_pmip_control_cmd_cb (/glade/u/home/gutmann/scratch/opencoarrays/OpenCoarrays/prerequisites/downloads/mpich-3.2/src/pm/hydra/pm/pmiserv/pmip_cb.c:885): assert (!closed) failed
[proxy:0:0@r1i4n6] HYDT_dmxu_poll_wait_for_event (/glade/u/home/gutmann/scratch/opencoarrays/OpenCoarrays/prerequisites/downloads/mpich-3.2/src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:0@r1i4n6] main (/glade/u/home/gutmann/scratch/opencoarrays/OpenCoarrays/prerequisites/downloads/mpich-3.2/src/pm/hydra/pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec@r1i4n6] HYDT_bscu_wait_for_completion (/glade/u/home/gutmann/scratch/opencoarrays/OpenCoarrays/prerequisites/downloads/mpich-3.2/src/pm/hydra/tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec@r1i4n6] HYDT_bsci_wait_for_completion (/glade/u/home/gutmann/scratch/opencoarrays/OpenCoarrays/prerequisites/downloads/mpich-3.2/src/pm/hydra/tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@r1i4n6] HYD_pmci_wait_for_completion (/glade/u/home/gutmann/scratch/opencoarrays/OpenCoarrays/prerequisites/downloads/mpich-3.2/src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
[mpiexec@r1i4n6] main (/glade/u/home/gutmann/scratch/opencoarrays/OpenCoarrays/prerequisites/downloads/mpich-3.2/src/pm/hydra/ui/mpich/mpiexec.c:344): process manager error waiting for completion

@gutmann
Copy link

gutmann commented Jul 2, 2018

Also, I assume this is a new bug we have stumbled across, so I can open a new issue, but I thought I would drop this here for now in case it is related still.

@t-bltg
Copy link
Contributor Author

t-bltg commented Jul 2, 2018

@gutmann , I see some mpich-3.2 in your path.
Can you try with mpich 3.2.1 ?
see #496 (comment)

@zbeekman
Copy link
Collaborator

zbeekman commented Jul 2, 2018

@gutmann: I know we have experienced problems with Failed images in the past, because they rely on some very new MPI features (even features that are not part of MPI-3, that are proposed for MPI-4).

It would be great if you could open a new issue (if you haven't already done so) to track this.

I would recommend trying to compile OpenCoarrays with -DCAF_ENABLE_FAILED_IMAGES=FALSE

Also, as pointed out by @neok-m4700 there was a bug fix in MPICH that may not be in 3.2 but definitely seems to be in 3.2.1.

@gutmann
Copy link

gutmann commented Jul 2, 2018

Thanks @neok-m4700 @zbeekman, recompiling with mpich 3.2.1 fixed that problem (at least it runs with 72 images now), but now it prints , dst_type = 3). repeatedly (it seems to be something like once for each send... so with lots of images it gets to be a lot of output and it spends all it's time printing to the screen so it runs much slower.

@zbeekman
Copy link
Collaborator

zbeekman commented Jul 2, 2018 via email

@gutmann
Copy link

gutmann commented Jul 2, 2018

I also found out that dst_type message is printed to stderr not stdout if that helps. I installed mpich 3.2.1 using the opencoarrays install script via ./install.sh -y -j4 -p mpich -I 3.2.1. This is using gcc 8.1, also installed via the install script.

I think this is built as a "release" (at least that is what it prints when it is building, I didn't set anything for it.)

@gutmann
Copy link

gutmann commented Jul 2, 2018

FWIW, in opencoarrays src/mpi/mpi_caf.c I see fprintf(stderr, ", dst_type = %d", dst_type);

edit : line 5694

@gutmann
Copy link

gutmann commented Jul 2, 2018

git blame points to 250fb06 (For whatever that is worth)

@zbeekman
Copy link
Collaborator

zbeekman commented Jul 2, 2018 via email

@gutmann
Copy link

gutmann commented Jul 2, 2018

Also, just to add in response to:

It would be great if you could open a new issue (if you haven't already done so) to track this.

I think we should leave this discussion here for now as (I think) the relevant line was touched as part of this Pull Request, we can certainly move it elsewhere if you think that would be more useful.

@zbeekman
Copy link
Collaborator

zbeekman commented Jul 2, 2018 via email

@t-bltg
Copy link
Contributor Author

t-bltg commented Jul 3, 2018

@gutmann Sorry, yes this was part of the huge commit where we changed a lot of stuff. I has already been fixed in my branch https://github.com/neok-m4700/OpenCoarrays/tree/perf
especially this commit t-bltg@c7399d6

@zbeekman
Since this is a WIP on #556, there is no PR yet, I'm waiting updates in the comments ...

quick fix

mpi_caf.c l.5692-5696

#ifdef GCC_GE_8
  dprint("Entering send_by_ref(may_require_tmp = %d, dst_type = %d)\n",
         may_require_tmp, dst_type);
#else
  dprint("Entering send_by_ref(may_require_tmp = %d)\n", may_require_tmp);
#endif

For #555 (comment), the build also passes with open MPI 3.1.1

@gutmann
Copy link

gutmann commented Jul 3, 2018

Thanks, I've just commented out those lines for now and it seems to be running fine. You all are awesome!

@zbeekman zbeekman mentioned this pull request Jul 3, 2018
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Coarray ICAR mini-app generates runtime error in caf_send_by_ref
4 participants