Skip to content

Conversation

vehre
Copy link
Collaborator

@vehre vehre commented Dec 29, 2016

This patch ensures that the array descriptor stored in the mpi-caf-token is
set to NULL consistently, when no real descriptor is stored in it.
Further does the patch ensure freeing memory with gcc-7.

The testcases have been actived again for annoucing their passing.

Reffing #293

This patch ensures that the array descriptor stored in the mpi-caf-token is
set to NULL consistently, when no real descriptor is stored in it.
Further does the patch ensure freeing memory with gcc-7.

The testcases have been actived again for annoucing their passing.
@codecov-io
Copy link

codecov-io commented Dec 29, 2016

Current coverage is 45.46% (diff: 100%)

Merging #298 into master will not change coverage

@@             master       #298   diff @@
==========================================
  Files             3          3          
  Lines          1025       1025          
  Methods          63         63          
  Messages          0          0          
  Branches        197        197          
==========================================
  Hits            466        466          
  Misses          479        479          
  Partials         80         80          

Sunburst

Powered by Codecov. Last update 3ccb984...0696306

@rouson
Copy link
Member

rouson commented Dec 29, 2016

@vehre Thanks for working on this. It appears that this pull request causes the syncimages_status test to hang as Travis-CI reports a the bottom of the text of Job 945.6, where the aforementioned test times out after not responding for 10 minutes. I need to review the standard in order to understand line 14 in the test better. I haven't made much use of sync images(*... so I'm not that familiar with its effects.

@rouson
Copy link
Member

rouson commented Dec 29, 2016

The GitHub History page for the sync_images test shows two contributors: @zbeekman and me with me first merging it in from afanfa-master branch. I'm not sure I'm understanding the history, but maybe @afanfa wrote it, I merged it into the master branch, and @zbeekman made minor white-space edits.

At any rate, I think the the test is incorrect. Section 11.6.4, paragraph 3 of the draft Fortran 2015 standard states, "An image-set that is an asterisk specifies all images in the current team." Given that we don't yet support teams, I believe this quote implies that sync images(*... is equivalent to sync all, which might explain why it hangs because all images other than image 1 are waiting for image 1 to execute a corresponding synchronization (presumably sync images(*...). I'll rewrite the test and see if that fixes the problem. It would also be nice to eliminate the non-standard sleep(1) for purposes of supporting compilers that don't offer that extension. The second code listing here provides what I assume is a standard-conforming Fortran implementation of sleep. For now, I'll include it as an internal program in the test, but we at some point should probably collect a few standard utilities like this into a library for use throughout the test suite (another candidate would be the assert procedure used in the async_comp_alloc test.

vehre added a commit that referenced this pull request Dec 30, 2016
Sync images works by setting up asynchronous receivers for each image in the sync set.
Next all images check the image status of all other images participating in the sync.
Then each image sends a zero int or the stopped image special code to all other images
in the sync set. At last all images wait for the asynchronous receivers to get their data.

The race here was, that an image could be in the waiting phase while the stopped image
had not set its status correctly yet. The waiting image did not return then, because it
never got the stopped image code from the stopped image. To solve this two changes had
to be made:

1. caf_finalize() now calls sync_image_internal ()
2. After waiting sync_image_internal() checks the status of the image, that send its data,
   again.

sync_image() has be renamed to sync_image_internal(). A flag was added to distinguish
calls to sync_image_internal() from caf_finalize and regular sync image calls. The latter
shall report an error on failure, while the former keeps silent.

This commit fixes the timeout of syncimage_status.f90 mentioned in #298.
@vehre
Copy link
Collaborator Author

vehre commented Dec 30, 2016

When I examine the changes my patch has on non gcc-7 compiled executables, I see no change at all. The only change is in line 471 of mpi_caf.c where the MPI_Win_free(*p) is moved out of the conditional. But that statement was always executed on non-gcc-7 compiled executables, because it is on the else path of the #if GCC_GE_7 preprocessor conditional.

I could not find a F2015 standard that has a section 11.6.4, so I assume you mean 8.6.4 as in the F2015 standard from https://gcc.gnu.org/wiki/GFortranStandards. At least there is the same formulation as cited.

Nevertheless do I disagree, that the test in syncimages_status.f90 is wrong. The test is to check whether calling sync_images() on all images with at least one image stopped succeeds in reporting the stat-code for stopped images. This of course is a race condition. When image 1 is slow to exit (due to load on the machine), the test randomly will fail. An attempt to fix this is made by #299.

@rouson
Copy link
Member

rouson commented Dec 30, 2016

@vehre Section 11.6.4 is on page 208 (which is the 228th page in the PDF document) of the August 31 draft located here. Clearly I was mistaken in my interpretation of of the behavior of sync images (which I've used less often than sync all) and stat_stopped_image (which I've never used). I agree the test is correct. At some point soon, I'll still replace the non-standard sleep call in the test.

I received a new CodeCoverage message a few minutes ago, which I interpret as meaning that your commit (84b4e7) launched a new round of checks, but I can't see that the Travis-CI tests are running and I don't know how to l launch them by hand. If the failure is intermittent due to load, then hopefully they'll pass this time. Let's see.

@rouson
Copy link
Member

rouson commented Dec 30, 2016

@vehre I see now that your latest commit removes the sleep. Thanks.

@vehre
Copy link
Collaborator Author

vehre commented Dec 30, 2016

Thanks for the pointer to the most recent F2015 standard.

vehre added a commit that referenced this pull request Jan 9, 2017
Syncing a set of images using SYNC IMAGES is done by this simple algorithm:

1. Set up asynchronous mpi-receives for all images in the set to sync to,
2. Send the current image's status to all images in the set to sync to, and
3. Wait for the receives from step 1 to finish. When one of the receives
   returns a stopped image state, than abort the sync on the current image
   immediately, else wait until all receives have been answered.

This commit also adds a new testcase: syncimages_ring, where each image
is syncing to its neighbours with wrap around. The testcase is called for
3 (the minimum number of images required to show that the sync works), 13
and 23 images.

This commit fixes the timeout of syncimage_status.f90 mentioned in #298.
This commit superseeds pull request 299 (close #299).
vehre added a commit that referenced this pull request Jan 9, 2017
Sync images works by setting up asynchronous receivers for each image in the sync set.
Next all images check the image status of all other images participating in the sync.
Then each image sends a zero int or the stopped image special code to all other images
in the sync set. At last all images wait for the asynchronous receivers to get their data.

The race here was, that an image could be in the waiting phase while the stopped image
had not set its status correctly yet. The waiting image did not return then, because it
never got the stopped image code from the stopped image. To solve this two changes had
to be made:

1. caf_finalize() now calls sync_image_internal ()
2. After waiting sync_image_internal() checks the status of the image, that send its data,
   again.

sync_image() has be renamed to sync_image_internal(). A flag was added to distinguish
calls to sync_image_internal() from caf_finalize and regular sync image calls. The latter
shall report an error on failure, while the former keeps silent.

This commit fixes the timeout of syncimage_status.f90 mentioned in #298.
@zbeekman
Copy link
Collaborator

zbeekman commented Jan 9, 2017

Superseded by #302

@zbeekman zbeekman closed this Jan 9, 2017
@zbeekman zbeekman deleted the vehre/events-regression branch January 9, 2017 17:10
zbeekman added a commit that referenced this pull request Jan 13, 2017
Fix sync images race/hang (supersedes/fixes #298, supersedes #299)

This *should* resolve (but I need to confirm) #293
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants