-
Notifications
You must be signed in to change notification settings - Fork 936
romio Type_create_darray (mirroring previous Type_create_darray fix) #3975
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
This is the previous fix for the Type_create_darray in the main MPI lib: |
|
bot:retest |
1 similar comment
|
bot:retest |
|
bot:ompi:retest |
|
bot:mellanox:retest |
|
bot:ompi:retest |
21ea97c to
584717b
Compare
|
And here's a testcase that can be used to see the original bug and that this fixes it: |
584717b to
3338dcf
Compare
|
I just downloaded the branch, compiled the code, but the testcase that you point to is still not passing for me. I double checked that the modification that is shown in this patch is in fact in the code (which is correct). Not sure how to proceed on this. |
|
@markalle This PR is really old. Is it still needed on master? Can we just close this? |
|
@markalle do we still want this? Status? |
|
Can one of the admins verify this patch? |
|
Hey @markalle Is this PR still needed? |
The romio code for Type_create_darray had the same bug I fixed a while back in the main Type_create_darray. Basically the cyclic() function is the main function that's always considered right, and block() is an optimization that left out a necessary piece of code. Signed-off-by: Mark Allen <[email protected]>
This bug is hit by the same code I used long ago, darray_bug_romio.c. I'm not totally sure on the archeology of when the bug reappeared, but I think it was because of changes to COMBINER_RESIZED in ADIOI_Flatten that weren't mirrored in ADIOI_Count_contiguous_blocks. Count and Flatten are very similar functions, but they diverge a lot around the edges and in degenerate cases. The two changes I made are 1. A book-keeping rule in Flatten is *curr_index should be updated in sync with whatever is being added to flat->indices[] and flat->blocklens[]. There were a couple places (when counts[] arrays have 0s) where that wasn't the case (see the "nonzeroth" var in this commit). 2. The main change I made was to reset flat->count based on Flatten's curr_index. This is because the divergence between the two functions is too great to reliably fix. 3. A third change is just a safety valve, using a FIX_ARRAYS_IN_CASE_OF_WRONG_COUNT() macro just in case the Count function returns a smaller value than Flatten ends up trying to write into the array, and related to this I got rid of the various --(flat->count) lines, since that var now represents the allocated size of the array, up until right after the Flatten function when it's reset to represent the data written to the array like it used to be. Signed-off-by: Mark Allen <[email protected]>
3338dcf to
075ae82
Compare
|
I repushed. Over in Spectrum MPI I had made several related fixes toward syncing up the Count and Flatten functions, but that's been a loosing battle that keeps re-breaking. So in this version I'm taking the simpler fix where Count is only used to allocate the arrays that are filled by Flatten, and even then I put an extra safety valve to resize those arrays if needed (I don't think that code ever gets activated though). Anyway the final flat->count is then reset based on the *curr_index which I think is the best book-keeping variable for keeping track of the count since it was used every time data is written into those arrays |
|
As I stated on the user mailing list, this PR should not be committed. It modifies an imported software package without a corresponding upstream fix. This leaves the OMPI community responsible for maintaining ROMIO - a position we cannot absorb. This PR should instead be filed against ROMIO itself, and then monitored for acceptance so we can update OMPI once ROMIO is updated. |
|
Mark has submitted a ROMIO PR upstream here: pmodels/mpich#4943 |
|
So what about OMPI 4.0.x? That presumably won't get a romio update, so once 4943 goes into mpich I should plan to to merge that into 4.0.x? |
|
Hopefully once the fix gets into romio, they will spin a bugfix release on whatever branch v4.0.x's ROMIO is based on and then we can pull that into v4.0.x But I'm not ceratin of the details or the branch mappings between projects. |
|
I'll close this one. The overall status of getting things into MPICH and getting them back from mpich into our branches is far from done, but this one is at least fixed in MPICH master so it can be closed here. And the job of making sure we have a merge from MPICH master of important fixes back to our various branches is ongoing. |
The romio code for Type_create_darray had the same bug I fixed a while back in the main MPI lib's Type_create_darray. Basically the cyclic() function is the main function that's considered right, and block() is an optimization that left out a necessary piece of code to resize the datatype it created.
For what it's worth I took the conservative approach and made block() match cyclic(). I like the code a bit better in the main MPI library (ompi_datatype_create_darray.c) that uses opal_datatype_resize() where the implementation of cyclic() matches the current MPI standard's description. The romio code uses LB/UB and Type_struct, and I'm presuming that matched the older MPI standard's description of Type_create_darray.
Anyway I didn't change to the new style, I just made block() match cyclic() in the romio code.
I don't have a complete testcase handy, but can make one if needed. Basically it involves creating a new datatype with Type_create_darray (mixing BLOCK and CYCLIC) and then
and then you can read from the file or hexdump it etc and the data will be written in the wrong locations.