Skip to content

Conversation

vehre
Copy link
Collaborator

@vehre vehre commented Jan 9, 2017

This pull request fixes the issue open in #298.

This pull request supersedes #299, which can be closed now.

This patch ensures that the array descriptor stored in the mpi-caf-token is
set to NULL consistently, when no real descriptor is stored in it.
Further does the patch ensure freeing memory with gcc-7.

The testcases have been actived again for annoucing their passing.
Syncing a set of images using SYNC IMAGES is done by this simple algorithm:

1. Set up asynchronous mpi-receives for all images in the set to sync to,
2. Send the current image's status to all images in the set to sync to, and
3. Wait for the receives from step 1 to finish. When one of the receives
   returns a stopped image state, than abort the sync on the current image
   immediately, else wait until all receives have been answered.

This commit also adds a new testcase: syncimages_ring, where each image
is syncing to its neighbours with wrap around. The testcase is called for
3 (the minimum number of images required to show that the sync works), 13
and 23 images.

This commit fixes the timeout of syncimage_status.f90 mentioned in #298.
This commit superseeds pull request 299 (close #299).
@codecov-io
Copy link

codecov-io commented Jan 9, 2017

Current coverage is 45.33% (diff: 77.77%)

Merging #302 into master will decrease coverage by 0.12%

@@             master       #302   diff @@
==========================================
  Files             3          3          
  Lines          1025       1019     -6   
  Methods          63         64     +1   
  Messages          0          0          
  Branches        197        193     -4   
==========================================
- Hits            466        462     -4   
  Misses          479        479          
+ Partials         80         78     -2   

Sunburst

Powered by Codecov. Last update 39421c4...aed6d2f

@zbeekman
Copy link
Collaborator

zbeekman commented Jan 9, 2017

@vehre please confirm that I have correctly understood you:

This pull request fixes the issue open in #298.

OK... but #298 is a pull request... so I take it this PR can supersede #298? If the commit SHAs match for the common commits, I'll just close #298 and use this PR instead...

This pull request supersedes #299, which can be closed now.

OK.... so I'll close #299 then...

@vehre
Copy link
Collaborator Author

vehre commented Jan 9, 2017

Well, this pull request contains or rather depends on/extends the code #298. In the discussion of #298 an issue is raised, which is now fixed by this pull request. When you merge, make sure, that you merge both, the changes to pull of #298 (which is 0696306) and this pull request.

@zbeekman
Copy link
Collaborator

zbeekman commented Jan 9, 2017

When you merge, make sure, that you merge both, the changes to pull of #298 (which is 0696306) and this pull request.

OK, perfect. 0696306 is on both vehre/events-regression and vehre/sync_images_fix_new. Since the tests pass here, but not on #298 I'm going to close #298 in favor of this PR. In addition I'm going to squash 4164cba and 86d57af (or drop 86d57af and amend 4164cba so that the travis notifications are never disabled.

Copy link
Collaborator

@zbeekman zbeekman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vehre can you please answer my questions which should be inline in the code review below? Thanks!

Also, assuming my understanding is correct, you don't have to make the requested changes unless you want to: I'm comfortable doing it. If you verify that my understanding is correct, then I think this is ready to merge.

@@ -0,0 +1,25 @@
! SYNC IMAGES([this_image - 1, this_image + 1]) with the STAT=STAT_STOPPED_IMAGE
! specifier and wrap around of image numbers. The test is intended to check
! that syncing in a ring with a stopped image still terminates all images.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vehre: Great test idea! Can you please rename the test to better reflect its purpose? Right now it sounds as though it is meant to test if a ring correctly syncs. A better name might be something like abort_sync_on_stopped_image or something more indicative of it's purpose. You can just add one or more commits to vehre/sync_images_fix_new and push them, it will update this PR.

Also can you please add an extra ! and move these comments below program ... that way the documentation will get picked up by FORD.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can rename it and adjust the comments

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The motivation of the name is simple: It is testing the intrinsic statement sync image which should be modelled in the test's name. It is modelling a ring. That there is an abort to be expected can be reflected in the name: sync_image_ring_abort_on_stopped_image

Idea: A test sync_image_ring should be modelled @rouson where the abort is not expected and it is just tested whether a ring is syncing.

The code is just a copy from syncimage_status and I took over that style. If you want to polish the comments go ahead. Then I concentrate on fixing issues and adding functionality. It is always a good idea when the tests are not written by the dev who implemented a feature, because those tests tend to comply, because the dev knows what to do.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That there is an abort to be expected can be reflected in the name: sync_image_ring_abort_on_stopped_image

Yes, I meant rename the test source file.

Idea: A test sync_image_ring should be modelled @rouson where the abort is not expected and it is just tested whether a ring is syncing.

This is a good idea. I'm trying to think of a good way to test this. My current thinking is that we could space images equally along a domain, say [0, 2pi), have each compute a periodic function (e.g. sine) at their location, and also perform a put to the left and right of what their neighbors value should be. (puts are asynchronous so the ring sync will actually be tested...) Then we do the ring sync and verify that the value put from the lhs neighbor and rhs neighbor match the local value (to within some suitable epsilon).

It is always a good idea when the tests are not written by the dev who implemented a feature, because those tests tend to comply, because the dev knows what to do.

This is very true. Also tools like quick check that can randomly generate tests to check the properties of the API rather than hard-coded values, which is a similar idea.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Far too complicated: This is a unit test, add image 1 to images executing the ring and remove the test for the stat_stopped_image and your done.

For an integration test you can go for that equally spaced and so on...
About that put is asynchronous I agree only in a perfect world. I doubt that it is fully asynchronous in the caf_mpi implementation, given there is locking in the send and that will lead to a specific portion of synchronisation.

sync images([lhs, rhs], STAT=stat_var)
! Only on image 2 and num_images() testing whether a stopped image is
! present can be done reliably. All other images could be up ahead.
if (stat_var /= STAT_STOPPED_IMAGE .and. me == 2) error stop "Error: stat_var /= STAT_STOPPED_IMAGE: "
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we please also test num_images() in addition to 2? That way we will test the LHS and RHS of the stopped image, image 1. Also if you add an extra ! to these comments then FORD will pick them up.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can do this.

if (num_images() .lt. 2) error stop "Need at least two images to test."

associate (me => this_image())
if (me /= 1) then
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not 100% confident in the semantics of calling stop (NOT error stop) from a coarray program, but if you can call stop 0 from image me == 1 explicitly without causing all the other images to abort, that would clarify the test.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can test and implement this

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What? There is error stop "Need ..." called. What do I not get here?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to make it more clear that image 1 is expected to reach termination before the other images since it is not part of the sync images(*,...) call. It might be a dumb/infeasable idea, I'll test it and see what happens. Otherwise I'll just add some additional comments to clarify.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, now I get you.

No, that does not work. stop_numeric() and stop_string() btw is not implemented correctly in caf_mpi. Both expect caf_finalize() to exit(0), which this function must not do! So until that is fixed, STOP 0 will not yield the result you expect.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stop_numeric() and stop_string() btw is not implemented correctly in caf_mpi. Both expect caf_finalize() to exit(0), which this function must not do! So until that is fixed, STOP 0 will not yield the result you expect.

I think we should open an issue for this, do you agree?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

@@ -10,7 +10,6 @@ program sync_images_stat
me = this_image()

if (me /= 1 ) then
call sleep(1)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if I'm understanding correctly, all the other images will wait for a response from image 1, but image 1 isn't included in the call to sync images() they would wait forever. However, image 1 will reach program termination, and then the other images waiting via sync images() will receive a STAT_STOPPED_IMAGE as the response. This is why there is no longer a race condition and/or hang in this test: all the images will be waiting for each other except image 1 which will continue until it terminates, at which point the other images will get the STAT_STOPPED_IMAGE notification. Is my understanding correct?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nearly correct. Only the reason for the race is not absolutely correct, but a little bit more difficult to explain. In the former implementation the status of the image was read using one-sided communication. When that read was before the aborting image set its abort-status, then the program would hang. When the set of the abort-status was interleaved with some images having read the status already and some not, then the race begun, that the images that had the abort received set their own abort-status while the other images where reading that status. So timing was everything here.

Now the abort-status is send using two-sided communication, which also means, that the sync is done at this stage. That way no race is possible anymore, because an aborted-image would publish its state and all other receive it a defined location.

@@ -391,6 +391,9 @@ if(opencoarrays_aware_compiler)
add_mpi_test(duplicate_syncimages 8 ${tests_root}/unit/sync/duplicate_syncimages)
add_mpi_test(co_reduce 4 ${tests_root}/unit/collectives/co_reduce_test)
add_mpi_test(syncimages_status 32 ${tests_root}/unit/sync/syncimages_status)
add_mpi_test(syncimages_ring_np3 3 ${tests_root}/unit/sync/syncimages_ring)
add_mpi_test(syncimages_ring_np13 13 ${tests_root}/unit/sync/syncimages_ring)
add_mpi_test(syncimages_ring_np23 23 ${tests_root}/unit/sync/syncimages_ring)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How did you arrive at 13 and 23?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arbitrary, in fact, just by adding a 1 and a 2 before the 3 of the initial test. I did not want to have a power of two here.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, primes are good, but 23 may start to get a bit slow on some systems. I may tweak these values to be smaller primes.

@zbeekman
Copy link
Collaborator

zbeekman commented Jan 9, 2017

Pending confirmation from @vehre that my understanding is correct, and some small tweaks, this

looks good to me

Approved with PullApprove

@zbeekman
Copy link
Collaborator

@vehre are you done with vehre/sync_images_fix_new? Can I force push and re-write history on this branch to clean up this PR, or will that cause problems for you?

@zbeekman
Copy link
Collaborator

zbeekman commented Jan 10, 2017

Just to make our lives simper, I'm not going to force push even though my OCD wants to squash 86d57af with 4164cba; I'll just push some new commits to this branch to cleanup the issues I want to address.

@vehre
Copy link
Collaborator Author

vehre commented Jan 10, 2017

I am done with the pull request. I don't expect it need more attention. Go ahead with merging, although I did not get your latest comment.

@rouson
Copy link
Member

rouson commented Jan 10, 2017

This should fix #293

@zbeekman
Copy link
Collaborator

zbeekman commented Jan 13, 2017

LGTM

Approved with PullApprove

@zbeekman zbeekman self-assigned this Jan 13, 2017
@zbeekman zbeekman merged commit 8053475 into master Jan 13, 2017
@zbeekman zbeekman deleted the vehre/sync_images_fix_new branch January 13, 2017 16:38
if (stat_var /= STAT_STOPPED_IMAGE .and. me == 2) &
error stop "Error: stat_var /= STAT_STOPPED_IMAGE: "
if (stat_var /= STAT_STOPPED_IMAGE .and. me == num_images()) &
error stop "Error: stat_var /= STAT_STOPPED_IMAGE: "
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vehre shouldn't this be OK too? It seems that this is failing on one Travis-CI OS X test (when code coverage flags are enabled): https://travis-ci.org/sourceryinstitute/opencoarrays/jobs/191684163#L463 Could there be an off-by-one error somewhere that I introduced, or in another part of the library/test?

@vehre
Copy link
Collaborator Author

vehre commented Jan 13, 2017

Mmh, at first glance this looks right. I don't have access to a PC currently. So you have to wait until Monday when I return home before I can analyze this in depth.

@zbeekman
Copy link
Collaborator

Mmh, at first glance this looks right. I don't have access to a PC currently. So you have to wait until Monday when I return home before I can analyze this in depth.

OK, thanks for taking a look. My guess is that if we re-run it, then it will go away due to a race condition somewhere... because it's not regularly showing up. At any rate, this is a fairly low priority bug since:

  1. It often works correctly
  2. If you already have stopped images with a sync images then this is likely an exception or logic error to begin with, and shouldn't have a big impact on the correct execution of most programs

@zbeekman
Copy link
Collaborator

Also it happens on Mac OS X even when the coverage flags aren't enabled: https://travis-ci.org/sourceryinstitute/opencoarrays/jobs/191687068

@vehre
Copy link
Collaborator Author

vehre commented Jan 16, 2017

Jikes, that was a bug. I read the sync result using the index into the array of receive requests, where it would have been better to use the MPI_SOURCE of the status parameter. This is fixed now by pull request #313.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants