Skip to content

Conversation

@vspetrov
Copy link

open-mpi/ompi@0fe756d4166eecf2f0ee2598da690c69a7c824c4 Introduced
a bug in coll/hcoll component. The ompi_requests allocated by
libhcoll would be treated as coll_base_nbc_request during
ompi_coll_base_retain_<> call. Afterwards this would lead to a
segv in the request cleanup.

Fix: since libhcoll interface does not distinguish between the
blocling/non-blocking requests use coll_base_nbc_request all the
time and initialize it properly in
coll/hcoll/get_coll_handle(). It is still within 2 cache lines.

@gpaulsen
Copy link
Member

This sounds like a serious issue (segv in request cleanup when using hcoll). Is this an issue only on master, or did 0fe756d make it to any of the release branches?

@amaslenn
Copy link

amaslenn commented Aug 27, 2019

@gpaulsen found in v4.0.x branch

    open-mpi/ompi@0fe756d Introduced
    a bug in coll/hcoll component. The ompi_requests allocated by
    libhcoll would be treated as coll_base_nbc_request during
    ompi_coll_base_retain_<> call. Afterwards this would lead to a
    segv in the request cleanup.

    Fix: since libhcoll interface does not distinguish between the
    blocling/non-blocking requests use coll_base_nbc_request all the
    time and initialize it properly in
    coll/hcoll/get_coll_handle(). It is still within 2 cache lines.

Signed-off-by: Valentin Petrov <[email protected]>
@vspetrov vspetrov force-pushed the coll_hcoll_nbc_request_bugfix branch from 587dba3 to a0d99ad Compare August 27, 2019 14:23
@artpol84
Copy link
Contributor

@vspetrov
Per telecon, let's merge it ASAP and PR to 4.0.x

@jladd-mlnx jladd-mlnx merged commit 8b6f2d9 into open-mpi:master Aug 27, 2019
@jladd-mlnx
Copy link
Member

@vspetrov please cherry-pick to 4.0.x branch.

@vspetrov
Copy link
Author

v4.0.x PR #6935

@vspetrov vspetrov deleted the coll_hcoll_nbc_request_bugfix branch September 2, 2019 08:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants