Skip to content

Conversation

@lrbison
Copy link
Contributor

@lrbison lrbison commented Feb 15, 2024

cherry-pick #12338 to fix #12270 on v4.1.x

This change fixes open-mpi#12270

Testing on c7g instance type (arm64) confirms this change elminates
hangs and crashes that were previously observed in 1 in 30 runs of
IMB alltoall benchmark.  Tested with over 300 runs and no failures.

The write memory barrier prevents other CPUs from observing the fifo
get updated before they observe the updated contents of the header
itself.  Without the barrier, uninitialized header contents caused
the crashes and invalid data.

Signed-off-by: Luke Robison <[email protected]>
(cherry picked from commit 71f378d)
Copy link
Contributor

@wenduwan wenduwan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix merged in main and v5.0.x.

@jsquyres
Copy link
Member

Does this PR replace #12005? Or is this in addition to #12005?

@wenduwan
Copy link
Contributor

This PR is not directly related to #12005 - it addresses #12270.

@jsquyres
Copy link
Member

This PR is not directly related to #12005 - it addresses #12270.

Just to be clear: does that mean we need both this PR and #12005?

@wenduwan
Copy link
Contributor

Just to be clear: does that mean we need both this PR and #12005?

Nope this PR goes alone.

@jsquyres jsquyres merged commit 16721dd into open-mpi:v4.1.x Feb 20, 2024
@lrbison lrbison deleted the smcuda_wmb_41x branch February 21, 2024 04:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants