Fix sync image race/hang for syncimage_status #299
This patch ensures that the array descriptor stored in the mpi-caf-token is consistently set to NULL when no real descriptor is stored in it. The patch also ensures that memory is freed with gcc-7. The testcases have been re-enabled so that they report as passing.
Sync images works by setting up asynchronous receivers for each image in the sync set. Next, all images check the image status of all other images participating in the sync. Then each image sends a zero int or the stopped image special code to all other images in the sync set. Finally, all images wait for the asynchronous receivers to get their data.
The race here was that an image could be in the waiting phase while the stopped image had not yet set its status correctly. The waiting image then did not return, because it never got the stopped image code from the stopped image.
To solve this, two changes had to be made:
1. caf_finalize() now calls sync_image_internal().
2. After waiting, sync_image_internal() checks the status of the image that sent its data again.
sync_image() has been renamed to sync_image_internal(). A flag was added to distinguish calls to sync_image_internal() from caf_finalize() from regular sync image calls. The latter shall report an error on failure, while the former keeps silent.
This commit fixes the timeout of syncimage_status.f90 mentioned in #298.
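The second change and the silent-on-finalize flag can be illustrated with a small model in plain C. This is only a sketch under stated assumptions, not the code from the patch: it uses no MPI, and the names `model_sync_after_wait`, `model_sync_result`, `MODEL_STAT_OK` and `MODEL_STAT_STOPPED_IMAGE` (including its numeric value) are hypothetical. The arrays stand in for the codes delivered by the receives and for the partner statuses re-read after the wait.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical status codes for this model only. */
enum { MODEL_STAT_OK = 0, MODEL_STAT_STOPPED_IMAGE = 6000 };

/* Outcome of one modelled sync on the current image. */
typedef struct {
  int stat;            /* MODEL_STAT_OK or MODEL_STAT_STOPPED_IMAGE */
  bool error_reported; /* true if the failure was reported to the user */
} model_sync_result;

/*
 * received[i]   : code delivered by the receive posted for partner i
 * status_now[i] : status of partner i, re-read after the wait finished
 * from_finalize : true when the (modelled) caller is caf_finalize()
 */
model_sync_result
model_sync_after_wait(const int *received, const int *status_now,
                      int count, bool from_finalize)
{
  model_sync_result r = { MODEL_STAT_OK, false };
  assert(count >= 0);

  for (int i = 0; i < count; ++i) {
    /* Checking the status again catches a partner that stopped after
       it had already sent a zero. */
    if (received[i] == MODEL_STAT_STOPPED_IMAGE
        || status_now[i] == MODEL_STAT_STOPPED_IMAGE) {
      r.stat = MODEL_STAT_STOPPED_IMAGE;
      break;
    }
  }

  /* Regular sync image calls report the failure; finalize keeps silent. */
  if (r.stat != MODEL_STAT_OK && !from_finalize) {
    fprintf(stderr, "sync images: partner image has stopped\n");
    r.error_reported = true;
  }
  return r;
}
```

In this model, a partner whose receive delivered zero but whose re-read status is the stopped code still ends the sync with the stopped-image result, which is the case the original race missed.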
Current coverage is 46.28% (diff: 100%)
@@            master   #299   diff @@
==========================================
  Files          3      3
  Lines       1025   1035    +10
  Methods       63     64     +1
  Messages       0      0
  Branches     197    198     +1
==========================================
+ Hits         466    479    +13
+ Misses       479    476     -3
  Partials      80     80
Add locking for writing the image status to caf_finalize.
Well, at first it looked like pull request #300 was causing random failures on MacOS X for the co_heat testcase, but now it shows up here as well.
@vehre Just wondering what the status of this is. It looks like you're still working on it, no?
Syncing a set of images using SYNC IMAGES is done by this simple algorithm:
1. Set up asynchronous mpi-receives for all images in the set to sync to,
2. Send the current image's status to all images in the set to sync to, and
3. Wait for the receives from step 1 to finish.
When one of the receives returns a stopped image state, then abort the sync on the current image immediately, else wait until all receives have been answered.
This commit also adds a new testcase: syncimages_ring, where each image is syncing to its neighbours with wrap around. The testcase is called for 3 (the minimum number of images required to show that the sync works), 13 and 23 images.
This commit fixes the timeout of syncimage_status.f90 mentioned in #298. This commit supersedes pull request 299 (close #299).
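The neighbour selection with wrap around used by a ring test like syncimages_ring can be sketched as follows. This is an illustration only, not the testcase itself: the function names `ring_left` and `ring_right` are hypothetical, and the sketch assumes 1-based image numbers as in Fortran.

```c
#include <assert.h>

/* Left neighbour of image `me` in a ring of `num_images` images.
   Image numbers are 1-based; image 1 wraps around to the last image. */
int ring_left(int me, int num_images)
{
  assert(num_images >= 1 && me >= 1 && me <= num_images);
  return me == 1 ? num_images : me - 1;
}

/* Right neighbour of image `me`; the last image wraps around to image 1. */
int ring_right(int me, int num_images)
{
  assert(num_images >= 1 && me >= 1 && me <= num_images);
  return me == num_images ? 1 : me + 1;
}
```

With 3 images, each image has two neighbours that differ from each other and from itself, which fits the commit message's note that 3 is the minimum number of images for the test.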
Please close and delete the branch vehre/sync_image_fix when satisfied with #302.
Please see the commit message for a detailed explanation.