Null array-descriptor in mpi-token causing segfault on finalize. (#293) #298
This patch ensures that the array descriptor stored in the mpi-caf-token is consistently set to NULL when no real descriptor is stored in it. The patch also ensures that memory is freed with gcc-7. The testcases have been activated again so that they report passing.
Current coverage is 45.46% (diff: 100%)

@@             master   #298   diff @@
======================================
  Files           3      3
  Lines        1025   1025
  Methods        63     63
  Messages        0      0
  Branches      197    197
======================================
  Hits          466    466
  Misses        479    479
  Partials       80     80
@vehre Thanks for working on this. It appears that this pull request causes the syncimages_status test to hang: Travis-CI reports at the bottom of the text of Job 945.6 that the test times out after not responding for 10 minutes. I need to review the standard in order to understand line 14 in the test better. I haven't made much use of
The GitHub History page for the sync_images test shows two contributors: @zbeekman and me, with me first merging it in from the afanfa-master branch. I'm not sure I understand the history, but maybe @afanfa wrote it, I merged it into the master branch, and @zbeekman made minor white-space edits. At any rate, I think the test is incorrect. Section 11.6.4, paragraph 3 of the draft Fortran 2015 standard states, "An image-set that is an asterisk specifies all images in the current team." Given that we don't yet support teams, I believe this quote implies that
Sync images works by setting up asynchronous receivers for each image in the sync set. Next, all images check the image status of all other images participating in the sync. Then each image sends a zero int or the special stopped-image code to all other images in the sync set. Finally, all images wait for the asynchronous receivers to get their data.

The race was that an image could be in the waiting phase while the stopped image had not yet set its status correctly. The waiting image then never returned, because it never got the stopped-image code from the stopped image. To solve this, two changes had to be made:
1. caf_finalize() now calls sync_image_internal().
2. After waiting, sync_image_internal() checks again the status of the image that sent its data.

sync_image() has been renamed to sync_image_internal(). A flag was added to distinguish calls to sync_image_internal() made from caf_finalize() from regular sync image calls. The latter report an error on failure, while the former stay silent. This commit fixes the timeout of syncimages_status.f90 mentioned in #298.
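The flag mentioned in this commit message can be pictured with a small sketch. It is an illustration only: `sync_images_sketch`, `reported_errors`, the `internal` parameter, and the value of `SKETCH_STAT_STOPPED_IMAGE` are assumptions made for this example, not the identifiers or values used in mpi_caf.c, and the MPI exchange is replaced by a status value passed in by the caller.

```c
#include <stdbool.h>
#include <stdio.h>

/* Placeholder status code for a stopped image (assumed value, for illustration). */
enum { SKETCH_STAT_STOPPED_IMAGE = 6000 };

/* Counts reported errors so the behaviour of the flag can be observed. */
int reported_errors = 0;

/* Stand-in for the sync routine. 'status' is what the status exchange would
 * have produced. 'internal' is true when the call comes from finalization,
 * in which case a failure is returned but not reported. */
int sync_images_sketch(int status, bool internal)
{
    if (status != 0 && !internal) {
        fprintf(stderr, "sync images failed with status %d\n", status);
        reported_errors++;
    }
    return status;
}
```

In this sketch a call such as `sync_images_sketch(SKETCH_STAT_STOPPED_IMAGE, true)` returns the status without printing anything, whereas the same call with `false` also reports the failure.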
When I examine the effect of my patch on executables not compiled with gcc-7, I see no change at all. The only change is in line 471 of mpi_caf.c, where MPI_Win_free(*p) is moved out of the conditional. But that statement was always executed on non-gcc-7 compiled executables, because it is on the else path of the #if GCC_GE_7 preprocessor conditional. I could not find an F2015 standard that has a section 11.6.4, so I assume you mean 8.6.4 as in the F2015 standard from https://gcc.gnu.org/wiki/GFortranStandards. At least the same wording as cited appears there. Nevertheless, I disagree that the test in syncimages_status.f90 is wrong. The test checks whether calling sync_images() on all images, with at least one image stopped, succeeds in reporting the stat-code for stopped images. This is of course a race condition: when image 1 is slow to exit (due to load on the machine), the test will fail at random. #299 is an attempt to fix this.
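The argument about the preprocessor conditional at the start of this comment can be illustrated with a reduced sketch. It does not reproduce mpi_caf.c: `GCC_GE_7` is the macro named above, while `free_window_stub`, `deregister_before`, `deregister_after`, and `windows_freed` are hypothetical stand-ins, and the gcc-7 branch is left as a placeholder comment because its contents are not shown in this discussion.

```c
/* Simulate a build with a compiler older than gcc-7 unless told otherwise. */
#ifndef GCC_GE_7
#define GCC_GE_7 0
#endif

/* Counts calls to the stub so both variants can be compared. */
int windows_freed = 0;

/* Hypothetical stand-in for the MPI_Win_free(*p) call. */
void free_window_stub(void)
{
    windows_freed++;
}

/* Before the patch: the free sits on the #else path only. */
void deregister_before(void)
{
#if GCC_GE_7
    /* gcc-7 specific handling would go here (not shown in the discussion). */
#else
    free_window_stub();
#endif
}

/* After the patch: the free is moved out of the conditional. */
void deregister_after(void)
{
#if GCC_GE_7
    /* gcc-7 specific handling would go here (not shown in the discussion). */
#endif
    free_window_stub();
}
```

With `GCC_GE_7` evaluating to 0, both functions call the stub exactly once, which matches the observation that the patch changes nothing for executables not compiled with gcc-7.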
@vehre Section 11.6.4 is on page 208 (which is the 228th page in the PDF document) of the August 31 draft located here. Clearly I was mistaken in my interpretation of the behavior of I received a new CodeCoverage message a few minutes ago, which I interpret as meaning that your commit (84b4e7) launched a new round of checks, but I can't see that the Travis-CI tests are running and I don't know how to launch them by hand. If the failure is intermittent due to load, then hopefully they'll pass this time. Let's see.
@vehre I see now that your latest commit removes the sleep. Thanks.
Thanks for the pointer to the most recent F2015 standard.
Syncing a set of images using SYNC IMAGES is done by this simple algorithm:
1. Set up asynchronous mpi-receives for all images in the set to sync to,
2. Send the current image's status to all images in the set to sync to, and
3. Wait for the receives from step 1 to finish. When one of the receives returns a stopped image state, then abort the sync on the current image immediately, else wait until all receives have been answered.

This commit also adds a new testcase, syncimages_ring, where each image syncs to its neighbours with wrap-around. The testcase is run with 3 (the minimum number of images required to show that the sync works), 13 and 23 images. This commit fixes the timeout of syncimages_status.f90 mentioned in #298. This commit supersedes pull request 299 (close #299).
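Two pieces of this commit message lend themselves to a short sketch: choosing the neighbours in a ring with wrap-around, and step 3's rule of aborting as soon as a received status is non-zero. The code below is an assumption-laden illustration, not the library's implementation or the Fortran testcase itself: `ring_neighbours` and `first_failed_status` are hypothetical helpers, images are numbered from 1 as in Fortran, and the MPI receives are replaced by an array of already received status values.

```c
#include <assert.h>
#include <stddef.h>

/* Left and right neighbours of the 1-based image 'me' among 'num_images'
 * images, wrapping around at both ends of the ring. */
void ring_neighbours(int me, int num_images, int *left, int *right)
{
    assert(num_images >= 1 && me >= 1 && me <= num_images);
    *left  = (me == 1) ? num_images : me - 1;
    *right = (me == num_images) ? 1 : me + 1;
}

/* Step 3 in miniature: walk the received statuses in completion order and
 * return the first non-zero one (the sync is aborted at that point).
 * Returns 0 when every partner answered with a zero status. */
int first_failed_status(const int *received, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (received[i] != 0)
            return received[i];
    }
    return 0;
}
```

With three images, image 1 has neighbours 3 and 2, so every image has two distinct partners, which is consistent with 3 being given as the minimum image count for the ring test.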
Superseded by #302