Skip to content

Conversation

vehre
Copy link
Collaborator

@vehre vehre commented Dec 30, 2016

Please see the commit message for a detailed explanation.

This patch ensures that the array descriptor stored in the mpi-caf-token is
set to NULL consistently, when no real descriptor is stored in it.
Further does the patch ensure freeing memory with gcc-7.

The testcases have been actived again for annoucing their passing.
Sync images works by setting up asynchronous receivers for each image in the sync set.
Next all images check the image status of all other images participating in the sync.
Then each image sends a zero int or the stopped image special code to all other images
in the sync set. At last all images wait for the asynchronous receivers to get their data.

The race here was, that an image could be in the waiting phase while the stopped image
had not set its status correctly yet. The waiting image did not return then, because it
never got the stopped image code from the stopped image. To solve this two changes had
to be made:

1. caf_finalize() now calls sync_image_internal ()
2. After waiting sync_image_internal() checks the status of the image, that send its data,
   again.

sync_image() has be renamed to sync_image_internal(). A flag was added to distinguish
calls to sync_image_internal() from caf_finalize and regular sync image calls. The latter
shall report an error on failure, while the former keeps silent.

This commit fixes the timeout of syncimage_status.f90 mentioned in #298.
@codecov-io
Copy link

codecov-io commented Dec 30, 2016

Current coverage is 46.28% (diff: 100%)

Merging #299 into master will increase coverage by 0.81%

@@             master       #299   diff @@
==========================================
  Files             3          3          
  Lines          1025       1035    +10   
  Methods          63         64     +1   
  Messages          0          0          
  Branches        197        198     +1   
==========================================
+ Hits            466        479    +13   
+ Misses          479        476     -3   
  Partials         80         80          

Sunburst

Powered by Codecov. Last update 3ccb984...3a27be0

Add locking for writing the image status to caf_finalize.
@vehre vehre self-assigned this Jan 3, 2017
@vehre
Copy link
Collaborator Author

vehre commented Jan 3, 2017

Well, at first it looked like pull request #300 was causing random failure on MacOS X for the co_heat testcase, but now it shows here also.
Therefore, please DO NOT MERGE until I have fixed this reliably. Working on it.
I am still testing this but on a private repo to prevent sending out lots of annoying mails about failed Travis runs.

@zbeekman zbeekman requested review from afanfa and removed request for afanfa January 3, 2017 17:51
@zbeekman
Copy link
Collaborator

zbeekman commented Jan 7, 2017

@vehre Just wondering what the status of this is. It looks like you're still working on it, no?

vehre added a commit that referenced this pull request Jan 9, 2017
Syncing a set of images using SYNC IMAGES is done by this simple algorithm:

1. Set up asynchronous mpi-receives for all images in the set to sync to,
2. Send the current image's status to all images in the set to sync to, and
3. Wait for the receives from step 1 to finish. When one of the receives
   returns a stopped image state, than abort the sync on the current image
   immediately, else wait until all receives have been answered.

This commit also adds a new testcase: syncimages_ring, where each image
is syncing to its neighbours with wrap around. The testcase is called for
3 (the minimum number of images required to show that the sync works), 13
and 23 images.

This commit fixes the timeout of syncimage_status.f90 mentioned in #298.
This commit superseeds pull request 299 (close #299).
@vehre
Copy link
Collaborator Author

vehre commented Jan 9, 2017

Please close and delete the branch: vehre/sync_image_fix, when satisfied with #302.

@vehre vehre assigned zbeekman and unassigned vehre Jan 9, 2017
@zbeekman zbeekman closed this Jan 9, 2017
@zbeekman zbeekman deleted the vehre/sync_image_fix branch January 9, 2017 16:58
zbeekman added a commit that referenced this pull request Jan 13, 2017
Fix sync images race/hang (supersedes/fixes #298, supersedes #299)

This *should* resolve (but I need to confirm) #293
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants