-
-
Notifications
You must be signed in to change notification settings - Fork 57
Fix sync images race/hang (fixes #298, supersedes #299) #302
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This patch ensures that the array descriptor stored in the mpi-caf-token is set to NULL consistently, when no real descriptor is stored in it. Further does the patch ensure freeing memory with gcc-7. The testcases have been actived again for annoucing their passing.
Syncing a set of images using SYNC IMAGES is done by this simple algorithm: 1. Set up asynchronous mpi-receives for all images in the set to sync to, 2. Send the current image's status to all images in the set to sync to, and 3. Wait for the receives from step 1 to finish. When one of the receives returns a stopped image state, than abort the sync on the current image immediately, else wait until all receives have been answered. This commit also adds a new testcase: syncimages_ring, where each image is syncing to its neighbours with wrap around. The testcase is called for 3 (the minimum number of images required to show that the sync works), 13 and 23 images. This commit fixes the timeout of syncimage_status.f90 mentioned in #298. This commit superseeds pull request 299 (close #299).
Current coverage is 45.33% (diff: 77.77%)@@ master #302 diff @@
==========================================
Files 3 3
Lines 1025 1019 -6
Methods 63 64 +1
Messages 0 0
Branches 197 193 -4
==========================================
- Hits 466 462 -4
Misses 479 479
+ Partials 80 78 -2
|
@vehre please confirm that I have correctly understood you:
OK... but #298 is a pull request... so I take it this PR can supersede #298? If the commit SHAs match for the common commits, I'll just close #298 and use this PR instead...
OK.... so I'll close #299 then... |
OK, perfect. 0696306 is on both vehre/events-regression and vehre/sync_images_fix_new. Since the tests pass here, but not on #298 I'm going to close #298 in favor of this PR. In addition I'm going to squash 4164cba and 86d57af (or drop 86d57af and amend 4164cba so that the travis notifications are never disabled. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vehre can you please answer my questions which should be inline in the code review below? Thanks!
Also, assuming my understanding is correct, you don't have to make the requested changes unless you want to: I'm comfortable doing it. If you verify that my understanding is correct, then I think this is ready to merge.
@@ -0,0 +1,25 @@ | |||
! SYNC IMAGES([this_image - 1, this_image + 1]) with the STAT=STAT_STOPPED_IMAGE | |||
! specifier and wrap around of image numbers. The test is intended to check | |||
! that syncing in a ring with a stopped image still terminates all images. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vehre: Great test idea! Can you please rename the test to better reflect its purpose? Right now it sounds as though it is meant to test if a ring correctly syncs. A better name might be something like abort_sync_on_stopped_image
or something more indicative of it's purpose. You can just add one or more commits to vehre/sync_images_fix_new and push them, it will update this PR.
Also can you please add an extra !
and move these comments below program ...
that way the documentation will get picked up by FORD.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can rename it and adjust the comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The motivation of the name is simple: It is testing the intrinsic statement sync image
which should be modelled in the test's name. It is modelling a ring. That there is an abort to be expected can be reflected in the name: sync_image_ring_abort_on_stopped_image
Idea: A test sync_image_ring should be modelled @rouson where the abort is not expected and it is just tested whether a ring is syncing.
The code is just a copy from syncimage_status and I took over that style. If you want to polish the comments go ahead. Then I concentrate on fixing issues and adding functionality. It is always a good idea when the tests are not written by the dev who implemented a feature, because those tests tend to comply, because the dev knows what to do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That there is an abort to be expected can be reflected in the name: sync_image_ring_abort_on_stopped_image
Yes, I meant rename the test source file.
Idea: A test sync_image_ring should be modelled @rouson where the abort is not expected and it is just tested whether a ring is syncing.
This is a good idea. I'm trying to think of a good way to test this. My current thinking is that we could space images equally along a domain, say [0, 2pi), have each compute a periodic function (e.g. sine) at their location, and also perform a put to the left and right of what their neighbors value should be. (puts are asynchronous so the ring sync will actually be tested...) Then we do the ring sync and verify that the value put from the lhs neighbor and rhs neighbor match the local value (to within some suitable epsilon).
It is always a good idea when the tests are not written by the dev who implemented a feature, because those tests tend to comply, because the dev knows what to do.
This is very true. Also tools like quick check that can randomly generate tests to check the properties of the API rather than hard-coded values, which is a similar idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Far too complicated: This is a unit test, add image 1 to images executing the ring and remove the test for the stat_stopped_image and your done.
For an integration test you can go for that equally spaced and so on...
About that put is asynchronous I agree only in a perfect world. I doubt that it is fully asynchronous in the caf_mpi implementation, given there is locking in the send and that will lead to a specific portion of synchronisation.
sync images([lhs, rhs], STAT=stat_var) | ||
! Only on image 2 and num_images() testing whether a stopped image is | ||
! present can be done reliably. All other images could be up ahead. | ||
if (stat_var /= STAT_STOPPED_IMAGE .and. me == 2) error stop "Error: stat_var /= STAT_STOPPED_IMAGE: " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we please also test num_images()
in addition to 2
? That way we will test the LHS and RHS of the stopped image, image 1. Also if you add an extra !
to these comments then FORD will pick them up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can do this.
if (num_images() .lt. 2) error stop "Need at least two images to test." | ||
|
||
associate (me => this_image()) | ||
if (me /= 1) then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not 100% confident in the semantics of calling stop
(NOT error stop
) from a coarray program, but if you can call stop 0
from image me == 1
explicitly without causing all the other images to abort, that would clarify the test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can test and implement this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What? There is error stop "Need ..."
called. What do I not get here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to make it more clear that image 1 is expected to reach termination before the other images since it is not part of the sync images(*,...)
call. It might be a dumb/infeasable idea, I'll test it and see what happens. Otherwise I'll just add some additional comments to clarify.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, now I get you.
No, that does not work. stop_numeric() and stop_string() btw is not implemented correctly in caf_mpi. Both expect caf_finalize() to exit(0), which this function must not do! So until that is fixed, STOP 0 will not yield the result you expect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
stop_numeric() and stop_string() btw is not implemented correctly in caf_mpi. Both expect caf_finalize() to exit(0), which this function must not do! So until that is fixed, STOP 0 will not yield the result you expect.
I think we should open an issue for this, do you agree?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes
@@ -10,7 +10,6 @@ program sync_images_stat | |||
me = this_image() | |||
|
|||
if (me /= 1 ) then | |||
call sleep(1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if I'm understanding correctly, all the other images will wait for a response from image 1, but image 1 isn't included in the call to sync images()
they would wait forever. However, image 1 will reach program termination, and then the other images waiting via sync images()
will receive a STAT_STOPPED_IMAGE as the response. This is why there is no longer a race condition and/or hang in this test: all the images will be waiting for each other except image 1 which will continue until it terminates, at which point the other images will get the STAT_STOPPED_IMAGE notification. Is my understanding correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nearly correct. Only the reason for the race is not absolutely correct, but a little bit more difficult to explain. In the former implementation the status of the image was read using one-sided communication. When that read was before the aborting image set its abort-status, then the program would hang. When the set of the abort-status was interleaved with some images having read the status already and some not, then the race begun, that the images that had the abort received set their own abort-status while the other images where reading that status. So timing was everything here.
Now the abort-status is send using two-sided communication, which also means, that the sync is done at this stage. That way no race is possible anymore, because an aborted-image would publish its state and all other receive it a defined location.
@@ -391,6 +391,9 @@ if(opencoarrays_aware_compiler) | |||
add_mpi_test(duplicate_syncimages 8 ${tests_root}/unit/sync/duplicate_syncimages) | |||
add_mpi_test(co_reduce 4 ${tests_root}/unit/collectives/co_reduce_test) | |||
add_mpi_test(syncimages_status 32 ${tests_root}/unit/sync/syncimages_status) | |||
add_mpi_test(syncimages_ring_np3 3 ${tests_root}/unit/sync/syncimages_ring) | |||
add_mpi_test(syncimages_ring_np13 13 ${tests_root}/unit/sync/syncimages_ring) | |||
add_mpi_test(syncimages_ring_np23 23 ${tests_root}/unit/sync/syncimages_ring) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How did you arrive at 13 and 23?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Arbitrary, in fact, just by adding a 1 and a 2 before the 3 of the initial test. I did not want to have a power of two here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, primes are good, but 23 may start to get a bit slow on some systems. I may tweak these values to be smaller primes.
Pending confirmation from @vehre that my understanding is correct, and some small tweaks, this looks good to me |
@vehre are you done with |
I am done with the pull request. I don't expect it need more attention. Go ahead with merging, although I did not get your latest comment. |
This should fix #293 |
- Use FORD comments - Clarify intent of test
if (stat_var /= STAT_STOPPED_IMAGE .and. me == 2) & | ||
error stop "Error: stat_var /= STAT_STOPPED_IMAGE: " | ||
if (stat_var /= STAT_STOPPED_IMAGE .and. me == num_images()) & | ||
error stop "Error: stat_var /= STAT_STOPPED_IMAGE: " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vehre shouldn't this be OK too? It seems that this is failing on one Travis-CI OS X test (when code coverage flags are enabled): https://travis-ci.org/sourceryinstitute/opencoarrays/jobs/191684163#L463 Could there be an off-by-one error somewhere that I introduced, or in another part of the library/test?
Mmh, at first glance this looks right. I don't have access to a PC currently. So you have to wait until Monday when I return home before I can analyze this in depth. |
OK, thanks for taking a look. My guess is that if we re-run it, then it will go away due to a race condition somewhere... because it's not regularly showing up. At any rate, this is a fairly low priority bug since:
|
Also it happens on Mac OS X even when the coverage flags aren't enabled: https://travis-ci.org/sourceryinstitute/opencoarrays/jobs/191687068 |
Jikes, that was a bug. I read the sync result using the index into the array of receive requests, where it would have been better to use the MPI_SOURCE of the status parameter. This is fixed now by pull request #313. |
This pull request fixes the issue open in #298.
This pull request supersedes #299, which can be closed now.