FIX issue 552 #555
This is great to see!
@neok-m4700: Oh my! You rock! 💯
I have a few (hopefully quick) changes that I would like to request in the test. (Commented on the diff, should show up in the code review/discussion.)
If you don't think you'll be able to get to them, please let me know and I'll make some time this weekend to investigate & implement.
Otherwise, I'll merge this as soon as the changes are made and the tests pass.
Thanks so much, this is a most welcome contribution!
! 3/4: _gfortran_caf_send_by_ref() remote desc dim[2] = (lb = 1, ub = 103, stride = 20)
! 3/4: _gfortran_caf_send_by_ref() extent(dst, 0): 1 != delta: 20. ! <====== mismatch src_cur_dim not incremented
! 3/4: _gfortran_caf_send_by_ref() extent(dst, 1): 20 != delta: 101.
@neok-m4700 Can you please insert a global barrier (sync all) here? And then print the "Test passed." message from just the first image?
allocate(co % a(1, 10, 20)[*], co % b(1, 10, 20))
call random_number(co % b)
While we're not actually checking the output anywhere, don't we need a synchronization here, so that a is assigned from b after random_number returns?
@zbeekman Will include your changes ASAP!
Amazing, cheers!
! 3/4: _gfortran_caf_send_by_ref() extent(dst, 0): 1 != delta: 20. ! <====== mismatch src_cur_dim not incremented
! 3/4: _gfortran_caf_send_by_ref() extent(dst, 1): 20 != delta: 101.

write(*, *) 'Test passed.'
It would be nice to ensure that co % a(1,:,:)[2] == co % b(1,:,:)[1] .and. co % a(1,:,:)[1] == co % b(1,:,:)[2] before marking the test as successful.
Good advice, so this does not crash, but there seems to be an issue left related to rank reduction. Working on it ...
OK. If you can't fix this issue, a simpler test that doesn't trigger it, along with a new bug report would more than suffice.
Codecov Report
@@            Coverage Diff             @@
##           master     #555      +/-   ##
==========================================
+ Coverage   52.58%   53.09%   +0.51%
==========================================
  Files           4        4
  Lines        3406     3343      -63
==========================================
- Hits         1791     1775      -16
+ Misses       1615     1568      -47
Macro indentation
- Convert func_name () => func_name() (manually), rationale: avoid splitting lines
- Consistency across dprint(...) calls
- Change binary (-/+) operator split across multiple lines: a = b \n + c => a = \n b + c
- Consistency of comments
- {for,if,while}(...) => {for,if,while} (...)
- Space around = and ; in for loops
Rationale: avoid splitting statements with multiple nested conditions, and enhance readability
Definitely needs attention from you:
- The most recent commit triggers build failures of the library for debug builds (-DCMAKE_BUILD_TYPE=DEBUG)
May need attention from you, or just a build system update (I can do it):
- The test you added started failing with GCC 7.3 debug builds after it was updated. It seems to be fine with 6. It's possible that the test requires functionality Andre added/fixed in GFortran >= 7.4. Could you please look at the error when you have a chance and let me know whether:
  - This is actually a new bug that we (you) have found
  - This is expected and the test should not be run with GFortran 7.3 because it doesn't have the latest updates to the by ref stuff (this appears to me to be the case, but I haven't dug into it)
  - I should adjust the build system to skip running the test if GFortran >= 7.1 and <= 7.4
Thanks!
KINDCASE(1, int8_t);
KINDCASE(2, int16_t);
KINDCASE(4, int32_t);
KINDCASE(8, int64_t);
Debug builds in this commit fail to compile due to errors attributed to lines 3240-3243 with all versions of GCC/GFortran we are testing:
/home/travis/build/sourceryinstitute/OpenCoarrays/src/mpi/mpi_caf.c:3240:25: error: expected expression before ‘int8_t’
KINDCASE(1, int8_t);
^~~~~~
# etc.
Something in this commit broke all the builds. I appreciate the cleanup, and would welcome it if you can resolve the issue, but I would also be content with resolving this with a git reset HEAD^ ; git push --force
Sorry, pushed too soon, fixing this issue right now.
end program
Changes to this test seem to have broken it when running a debug build test with GFortran 7.3 on Ubuntu. (The error from the previous commit is a different, unrelated issue.) Does it require the latest changes in GFortran 8 and >= 7.4?
…ingleton.f90
Relax the test failure until the bug is found ...
FIXME: The data is not put correctly on remote by MPI_Put, only a single dtype value is set
… duplication, without additional code gen
Always insert a constant prefix: caf_this_image, caf_num_images, __FUNCTION__
Ref:
- gcc.gnu.org/onlinedocs/cpp/Variadic-Macros.html
- gcc.gnu.org/onlinedocs/cpp/Swallowing-the-Semicolon.html
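The commit message above describes a debug macro that always prepends the image number, the image count, and the function name. As a minimal sketch of that idea, and not the actual dprint from mpi_caf.c, a variadic macro wrapped in do { ... } while (0) could look like the following. DEMO_DPRINT, demo_this_image, demo_num_images, and demo_send are hypothetical names; formatting into a buffer rather than a stream, and using the standard __func__ in place of __FUNCTION__, are assumptions made to keep the example portable and easy to inspect.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for the runtime's image bookkeeping. */
static int demo_this_image = 1;
static int demo_num_images = 4;

/* Formats "<image>/<num_images>: <function>() " followed by the caller's
 * printf-style message into buf. The do { ... } while (0) wrapper makes the
 * expansion a single statement, so the macro is safe in an unbraced if/else. */
#define DEMO_DPRINT(buf, size, ...)                                      \
  do {                                                                   \
    int n_ = snprintf((buf), (size), "%d/%d: %s() ", demo_this_image,    \
                      demo_num_images, __func__);                        \
    if (n_ >= 0 && (size_t)n_ < (size_t)(size))                          \
      snprintf((buf) + n_, (size_t)(size) - (size_t)n_, __VA_ARGS__);    \
  } while (0)

/* Example caller: the unbraced if/else relies on the macro expanding to
 * exactly one statement that is terminated by the caller's semicolon. */
static void demo_send(char *buf, size_t size, int may_require_tmp)
{
  if (may_require_tmp)
    DEMO_DPRINT(buf, size, "may_require_tmp = %d\n", may_require_tmp);
  else
    DEMO_DPRINT(buf, size, "no temporary needed\n");
}
```

Because the prefix is generated inside the macro, every call site only supplies its own message, which is one way to avoid the duplication the commit message mentions.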
…ge synchronization)
dst_dim seems to be failing some increment in send_for_ref, see the log below:
1/2: send_for_ref(5119) Entering send_for_ref: dst_offset = 0, desc_offset = 0, ds_glb = 0, desc_glb = 0
1/2: send_for_ref(5216) image_index = 1, num = 1, src_size = 4, src_dim = 0, dst_dim = 0, ref_type = CAF_REF_ARRAY
1/2: send_for_ref(5302) remote desc rank: 2 (ref_rank: 1)
1/2: send_for_ref(5307) remote desc dim[0] = (lb = 1, ub = 8, stride = 1)
1/2: send_for_ref(5307) remote desc dim[1] = (lb = 1, ub = 4, stride = 8)
1/2: send_for_ref(5313) array_ref_dst[0] = CAF_ARR_REF_SINGLE := array_ref_src[0] = CAF_ARR_REF_SINGLE
1/2: send_for_ref(5119) Entering send_for_ref: dst_offset = 0, desc_offset = 0, ds_glb = 0, desc_glb = 0
1/2: send_for_ref(5216) image_index = 1, num = 1, src_size = 4, src_dim = 1, dst_dim = 0, ref_type = CAF_REF_ARRAY
1/2: send_for_ref(5313) array_ref_dst[0] = CAF_ARR_REF_SINGLE := array_ref_src[1] = CAF_ARR_REF_FULL
1/2: send_for_ref(5119) Entering send_for_ref: dst_offset = 0, desc_offset = 0, ds_glb = 0, desc_glb = 0
1/2: send_for_ref(5216) image_index = 1, num = 1, src_size = 4, src_dim = 2, dst_dim = 0, ref_type = CAF_REF_ARRAY
1/2: send_for_ref(5313) array_ref_dst[0] = CAF_ARR_REF_SINGLE := array_ref_src[2] = CAF_ARR_REF_FULL
1/2: send_for_ref(5119) Entering send_for_ref: dst_offset = 0, desc_offset = 0, ds_glb = 0, desc_glb = 0
1/2: put_data(5017) (win: -1610612734, image: 2, offset: 0) <- 0x21e7440, num: 1, size 4 -> 4, dst type 3(4), src type 3(4)
co % a(1, :, :) => CAF_ARR_REF_SINGLE, test still failing
co % a(1:1, :, :) => CAF_ARR_REF_RANGE, test passes
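The two forms differ because a scalar subscript in a Fortran array section removes that dimension from the result, while a one-element range keeps it. The sketch below only illustrates that rank-reduction rule in C; it is not code from mpi_caf.c, and demo_ref_kind, demo_dim_ref, and demo_section_shape are hypothetical names chosen for the example.

```c
#include <stddef.h>

/* Hypothetical reference kinds, loosely mirroring single/range/full. */
typedef enum { DEMO_REF_SINGLE, DEMO_REF_RANGE, DEMO_REF_FULL } demo_ref_kind;

typedef struct {
  demo_ref_kind kind;
  int lower, upper; /* inclusive bounds selected in this dimension */
} demo_dim_ref;

/* Fills extents[] with the shape of the referenced section and returns its
 * rank. A SINGLE subscript consumes one dimension of the referenced array
 * but contributes no dimension to the result. */
static int demo_section_shape(const demo_dim_ref *refs, int nrefs, int *extents)
{
  int rank = 0;
  for (int i = 0; i < nrefs; ++i) {
    if (refs[i].kind == DEMO_REF_SINGLE)
      continue; /* array dimension advances, result rank does not */
    extents[rank++] = refs[i].upper - refs[i].lower + 1;
  }
  return rank;
}
```

For an array of shape (1, 10, 20), the reference a(1, :, :) yields rank 2 with extents 10 and 20, whereas a(1:1, :, :) keeps rank 3 with extents 1, 10 and 20. Any loop that walks both the reference dimensions and the result dimensions therefore needs two separate counters.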
Arithmetic overflow gcc-7 on test
Build failure with gcc-7
Runtime failure with gcc-9 on debug output
Thanks for fixing this longstanding issue, @neok-m4700!
I finally had a chance to test this out. While it does indeed fix the issue for 4 images, and is able to communicate with a remote image, for me this is still breaking with >64 images. It works with 63 and 64, but breaks with 65 (and up). I've tried a number of larger sizes up to 720 and nothing works above 64 :( I don't think this is an issue with coarray-icar code as that code base has run fine with gfortran 6.3 and cray fortran up to much larger numbers of images. Can anyone else reproduce this?
Also, I assume this is a new bug we have stumbled across, so I can open a new issue, but I thought I would drop this here for now in case it is still related.
@gutmann , I see some
@gutmann: I know we have experienced problems with Failed images in the past, because they rely on some very new MPI features (even features that are not part of MPI-3, that are proposed for MPI-4). It would be great if you could open a new issue (if you haven't already done so) to track this. I would recommend trying to compile OpenCoarrays with Also, as pointed out by @neok-m4700 there was a bug fix in MPICH that may not be in 3.2 but definitely seems to be in 3.2.1.
Thanks @neok-m4700 @zbeekman, recompiling with mpich 3.2.1 fixed that problem (at least it runs with 72 images now), but now it prints
Uggg it seems like maybe someone left a debug statement somewhere. I have
never seen this before. Are you building a Release configuration of
OpenCoarrays?
…On Mon, Jul 2, 2018 at 2:41 PM Ethan Gutmann ***@***.***> wrote:
Thanks @neok-m4700 <https://github.com/neok-m4700> @zbeekman
<https://github.com/zbeekman>, recompiling with mpich 3.2.1 fixed that
problem (at least it runs with 72 images now), but now it prints ,
dst_type = 3). repeatedly (it seems to be something like once for each
send... so with lots of images it gets to be a lot of output and it spends
all its time printing to the screen so it runs much slower.
I also found out that I think this is built as a "release" (at least that is what it prints when it is building, I didn't set anything for it.)
FWIW, in opencoarrays edit : line 5694
git blame points to 250fb06 (For whatever that is worth)
Thanks for digging. I'll investigate further once I have some time to take
a look. It's possible that it just needs a simple preprocessor conditional
wrapping or similar.
…On Mon, Jul 2, 2018 at 3:49 PM Ethan Gutmann ***@***.***> wrote:
git blame points to 250fb06
<250fb06>
(For whatever that is worth)
Also, just to add in response to:
"It would be great if you could open a new issue (if you haven't already done so) to track this."
I think we should leave this discussion here for now, as (I think) the relevant line was touched as part of this Pull Request. We can certainly move it elsewhere if you think that would be more useful.
👍
…On Mon, Jul 2, 2018 at 5:03 PM Ethan Gutmann ***@***.***> wrote:
Also, just to add in response to:
It would be great if you could open a new issue (if you haven't already
done so) to track this.
I think we should leave this discussion here for now as (I think) the
relevant line was touched as part of this Pull Request, we can certainly
move it elsewhere if you think that would be more useful.
@gutmann Sorry, yes this was part of the huge commit where we changed a lot of stuff. It has already been fixed in my branch https://github.com/neok-m4700/OpenCoarrays/tree/perf
@zbeekman quick fix, mpi_caf.c l.5692-5696:
#ifdef GCC_GE_8
  dprint("Entering send_by_ref(may_require_tmp = %d, dst_type = %d)\n",
         may_require_tmp, dst_type);
#else
  dprint("Entering send_by_ref(may_require_tmp = %d)\n", may_require_tmp);
#endif
For #555 (comment), the build also passes with Open MPI
Thanks, I've just commented out those lines for now and it seems to be running fine. You all are awesome!
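The "simple preprocessor conditional wrapping" suggested earlier in the thread can be sketched roughly as follows. This is an illustration only: DEMO_EXTRA_DEBUG, DEMO_TRACE, and demo_put are hypothetical names, and the flags and dprint definition that OpenCoarrays actually uses may differ.

```c
#include <stdio.h>

/* DEMO_EXTRA_DEBUG is a hypothetical flag, e.g. passed as -DDEMO_EXTRA_DEBUG
 * for debug builds. Without it the trace macro expands to an empty statement,
 * so release builds print nothing and pay no formatting cost. */
#ifdef DEMO_EXTRA_DEBUG
#define DEMO_TRACE(...) do { fprintf(stderr, __VA_ARGS__); } while (0)
#else
#define DEMO_TRACE(...) do { } while (0)
#endif

/* Stand-in for a communication routine: traces on entry in debug builds only
 * and returns the number of elements it was asked to transfer. */
static int demo_put(int count)
{
  DEMO_TRACE("Entering demo_put(count = %d)\n", count);
  return count;
}
```

With this arrangement a per-send message cannot flood the terminal in a release configuration, which is the slowdown reported above with many images.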
Summary of changes
The current check for memory reallocation is incorrectly triggered by a singleton in the first dimension when referencing a coarray (delta == 1) in a derived type.
@rouson @gutmann @zbeekman The Steps to Reproduce mentioned in #552 pass with the .dat files given.
Rationale for changes
Should fix #552
Added a regression test covering this patch.
WIP: please don't merge yet
EDIT 1: bug is only partially resolved, see the test case
EDIT 2: the issue seems to lie in send_for_ref ...
EDIT 3: ready to merge, stop committing to this branch
co % a(1, :, :)[remote] = co % b(1, :, :)
=> fails, I do not see (yet?) how to fix this ...
The workaround is to use a range:
co % a(1:1, :, :)[remote] = co % b(1:1, :, :)
=> fixed by this PR
co % a(:, :, :)[remote] = co % b(:, :, :)
=> works as expected before the PR
This PR should merge without conflict from commit b09b88e
I know this PR is big and ugly, but without the whole cleanup, debugging is painful.
Maybe we should also enforce const qualifiers, but that would be for another PR.
In addition to the TRAVIS build, I tested this PR with the following compilers:
- 9.0.0 (devel - e255d1cb8f1e)
- 8.1.0
- 7.3.0
- 6.4.0
The caf-testsuite was run to ensure non-regression, in addition to the OpenCoarrays embedded test suite.
Toolchain
- gcc 8.1.0 + patches
- mpich 3.2.1
- caf 2.1.0 + patches
- glibc 2.27
Additional info and certifications
This pull request (PR) is a:
I certify that
- Increasing test coverage for all feature-addition PRs
- Increasing test coverage for all bug-fix PRs for which there
does not already exist a related test that failed before the PR
- At least maintaining test coverage for all other PRs
- Ensuring that all tests pass when run locally
- Naming PR to indicate work in progress (WIP) and to attach the PR
to the appropriate bug report or feature request issue
- White space (no trailing white space or white space errors may
be introduced)
- Commenting code where it is non-obvious and non-trivial
- Logically atomic, self consistent and coherent commits
- Commit message content
- Waiting 24 hours before self-approving the PR to give another
OpenCoarrays developer a chance to review my proposed code