
Conversation

@markalle
Contributor

The romio code for Type_create_darray had the same bug I fixed a while back in the main MPI lib's Type_create_darray. Basically cyclic() is the function that's considered authoritative, and block() is an optimization that left out a necessary piece of code: resizing the datatype it created.

For what it's worth I took the conservative approach and made block() match cyclic(). I like the code a bit better in the main MPI library (ompi_datatype_create_darray.c) that uses opal_datatype_resize() where the implementation of cyclic() matches the current MPI standard's description. The romio code uses LB/UB and Type_struct, and I'm presuming that matched the older MPI standard's description of Type_create_darray.

Anyway I didn't change to the new style, I just made block() match cyclic() in the romio code.

I don't have a complete testcase handy, but can make one if needed. Basically it involves creating a new datatype with Type_create_darray (mixing BLOCK and CYCLIC) and then

    MPI_File_set_view(fd, 0, MPI_INT, newtype, "native", MPI_INFO_NULL);
    MPI_File_write(fd, sbuf, count, MPI_INT, MPI_STATUS_IGNORE);

and then you can read from the file or hexdump it etc and the data will be written in the wrong locations.

@markalle
Contributor Author

This is the previous fix for the Type_create_darray in the main MPI lib:
#3672

@markalle
Contributor Author

markalle commented Sep 1, 2017

bot:retest

@markalle
Contributor Author

markalle commented Oct 4, 2017

bot:retest

@hppritcha
Member

bot:ompi:retest

@hppritcha
Member

bot:mellanox:retest

@hppritcha
Member

bot:ompi:retest

@markalle
Contributor Author

And here's a testcase that can be used to see the original bug and that this fixes it:
https://gist.github.com/markalle/940de93d64fd779e304ee124855b8e6a
    mpicc -o x darray_bug_romio.c
    mpirun -mca io romio321 -np 1 ./x

@edgargabriel
Member

I just downloaded the branch and compiled the code, but the testcase you point to still isn't passing for me. I double-checked that the modification shown in this patch is in fact in the code (it is).

    gabriel@salmon:~/tmp> mpicc -o darray_bug_romio darray_bug_romio.c
    gabriel@salmon:~/tmp> mpirun --mca io romio321 -np 1 ./darray_bug_romio
    [pt2pt] what locations in rbuf appear to have data:
    data at disp 0 (sz 8)
    data at disp 16 (sz 8)
    data at disp 64 (sz 8)
    data at disp 80 (sz 8)
    data at disp 96 (sz 8)
    data at disp 112 (sz 8)
    data at disp 160 (sz 8)
    data at disp 176 (sz 8)
    [romio] what locations in rbuf appear to have data:
    [romio] data not found in expected locations -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1

Not sure how to proceed on this.

@gpaulsen
Member

gpaulsen commented Sep 5, 2019

@markalle This PR is really old. Is it still needed on master? Can we just close this?
I'm adding WorkInProgress label until we can verify if this PR is still desirable.

@awlauria
Contributor

@markalle do we still want this? Status?

@lanl-ompi
Contributor

Can one of the admins verify this patch?

@jsquyres jsquyres marked this pull request as draft October 26, 2020 21:53
@jsquyres
Member

Hey @markalle Is this PR still needed?

The romio code for Type_create_darray had the same bug I fixed a while
back in the main Type_create_darray.  Basically the cyclic() function
is the main function that's always considered right, and block() is an
optimization that left out a necessary piece of code.

Signed-off-by: Mark Allen <[email protected]>
This bug is hit by the same code I used long ago, darray_bug_romio.c.
I'm not totally sure on the archeology of when the bug reappeared,
but I think it was because of changes to COMBINER_RESIZED in
ADIOI_Flatten that weren't mirrored in ADIOI_Count_contiguous_blocks.

Count and Flatten are very similar functions, but they diverge a lot
around the edges and in degenerate cases.

The three changes I made are

1. A book-keeping rule in Flatten is *curr_index should be updated
in sync with whatever is being added to flat->indices[] and
flat->blocklens[].  There were a couple places (when counts[] arrays
have 0s) where that wasn't the case (see the "nonzeroth" var in this
commit).

2. The main change I made was to reset flat->count based on
Flatten's curr_index.  This is because the divergence between the
two functions is too great to reliably fix.

3. A third change is a safety valve: a FIX_ARRAYS_IN_CASE_OF_WRONG_COUNT()
macro that resizes the arrays in case the Count function returns a
smaller value than Flatten ends up trying to write into the array.
Related to this, I got rid of the various --(flat->count) lines,
since that variable now represents the allocated size of the array
up until right after the Flatten function, when it's reset to
represent the data written to the array, as it used to be.

Signed-off-by: Mark Allen <[email protected]>
@markalle
Contributor Author

I repushed. Over in Spectrum MPI I had made several related fixes toward syncing up the Count and Flatten functions, but that's been a losing battle that keeps re-breaking. So in this version I'm taking the simpler fix: Count is only used to allocate the arrays that are filled by Flatten, and even then I added an extra safety valve to resize those arrays if needed (I don't think that code ever gets activated, though). The final flat->count is then reset based on *curr_index, which I think is the best book-keeping variable for tracking the count, since it's updated every time data is written into those arrays.

@rhc54
Contributor

rhc54 commented Dec 2, 2020

As I stated on the user mailing list, this PR should not be committed. It modifies an imported software package without a corresponding upstream fix, which would leave the OMPI community responsible for maintaining ROMIO, a responsibility we cannot take on.

This PR should instead be filed against ROMIO itself, and then monitored for acceptance so we can update OMPI once ROMIO is updated.

@gpaulsen
Member

gpaulsen commented Dec 3, 2020

Mark has submitted a ROMIO PR upstream here: pmodels/mpich#4943
we still need to go through their contributor process, so it may take some time.

@markalle
Contributor Author

markalle commented Dec 3, 2020

So what about OMPI 4.0.x? That presumably won't get a romio update, so once 4943 goes into mpich I should plan to merge that into 4.0.x?

@gpaulsen
Member

gpaulsen commented Dec 3, 2020

Hopefully once the fix gets into romio, they will spin a bugfix release on whatever branch v4.0.x's ROMIO is based on, and then we can pull that into v4.0.x. But I'm not certain of the details or the branch mappings between projects.

@markalle
Contributor Author

markalle commented Apr 5, 2021

I'll close this one. The overall work of getting things into MPICH and back from MPICH into our branches is far from done, but this one is at least fixed in MPICH master, so it can be closed here. The job of making sure important fixes from MPICH master get merged back into our various branches is ongoing.

@markalle markalle closed this Apr 5, 2021