HCOLL is causing XRC problems in v2.x #4087

@artpol84

This problem was originally treated as a btl/openib issue: #3890.
However, a more detailed investigation indicates that it is caused by the coll/hcoll component: #4082

With hcoll excluded (-mca coll '^hcoll'), the run succeeds:

$ bash -x ./run.sh
+ ./mpirun -np 8 -bind-to none -mca orte_tmpdir_base /tmp/tmp.8mj45mghXh -mca btl_openib_if_include mlx5_0:1 -x MXM_RDMA_PORTS=mlx5_0:1 -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_TLS=rc,cm -mca pml ob1 -mca btl self,openib -mca coll '^hcoll' -mca btl_openib_receive_queues X,4096,1024:X,12288,512:X,65536,512 taskset -c 6,7 /hpc/home/USERS/artemp/scrap/OMPI/ompi/examples/hello_c
Hello, world, I am 2 of 8, (Open MPI v2.1.2rc1, package: Open MPI artemp@jenkins03 Distribution, ident: 2.1.2rc1, repo rev: v2.1.1-154-g459e5ae, Unreleased developer copy, 143)
Hello, world, I am 7 of 8, (Open MPI v2.1.2rc1, package: Open MPI artemp@jenkins03 Distribution, ident: 2.1.2rc1, repo rev: v2.1.1-154-g459e5ae, Unreleased developer copy, 143)
Hello, world, I am 4 of 8, (Open MPI v2.1.2rc1, package: Open MPI artemp@jenkins03 Distribution, ident: 2.1.2rc1, repo rev: v2.1.1-154-g459e5ae, Unreleased developer copy, 143)
Hello, world, I am 6 of 8, (Open MPI v2.1.2rc1, package: Open MPI artemp@jenkins03 Distribution, ident: 2.1.2rc1, repo rev: v2.1.1-154-g459e5ae, Unreleased developer copy, 143)
Hello, world, I am 3 of 8, (Open MPI v2.1.2rc1, package: Open MPI artemp@jenkins03 Distribution, ident: 2.1.2rc1, repo rev: v2.1.1-154-g459e5ae, Unreleased developer copy, 143)
Hello, world, I am 5 of 8, (Open MPI v2.1.2rc1, package: Open MPI artemp@jenkins03 Distribution, ident: 2.1.2rc1, repo rev: v2.1.1-154-g459e5ae, Unreleased developer copy, 143)
Hello, world, I am 0 of 8, (Open MPI v2.1.2rc1, package: Open MPI artemp@jenkins03 Distribution, ident: 2.1.2rc1, repo rev: v2.1.1-154-g459e5ae, Unreleased developer copy, 143)
Hello, world, I am 1 of 8, (Open MPI v2.1.2rc1, package: Open MPI artemp@jenkins03 Distribution, ident: 2.1.2rc1, repo rev: v2.1.1-154-g459e5ae, Unreleased developer copy, 143)
+ exit 0

With hcoll enabled (the same command without -mca coll '^hcoll'), the problem appears:

$ bash -x ./run.sh
+ ./mpirun -np 8 -bind-to none -mca orte_tmpdir_base /tmp/tmp.8mj45mghXh -mca btl_openib_if_include mlx5_0:1 -x MXM_RDMA_PORTS=mlx5_0:1 -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_TLS=rc,cm -mca pml ob1 -mca btl self,openib -mca btl_openib_receive_queues X,4096,1024:X,12288,512:X,65536,512 taskset -c 6,7 /hpc/home/USERS/artemp/scrap/OMPI/ompi/examples/hello_c
[1502826187.186194] [jenkins03:21885:0]         sys.c:744  MXM  WARN  Conflicting CPU frequencies detected, using: 3499.45
[1502826187.188457] [jenkins03:21888:0]         sys.c:744  MXM  WARN  Conflicting CPU frequencies detected, using: 3499.45
[1502826187.192728] [jenkins03:21886:0]         sys.c:744  MXM  WARN  Conflicting CPU frequencies detected, using: 3499.45
[1502826187.194526] [jenkins03:21884:0]         sys.c:744  MXM  WARN  Conflicting CPU frequencies detected, using: 3499.45
[1502826187.200721] [jenkins03:21887:0]         sys.c:744  MXM  WARN  Conflicting CPU frequencies detected, using: 3499.45
[1502826187.206316] [jenkins03:21890:0]         sys.c:744  MXM  WARN  Conflicting CPU frequencies detected, using: 3499.45
[1502826187.209630] [jenkins03:21889:0]         sys.c:744  MXM  WARN  Conflicting CPU frequencies detected, using: 3499.45
[1502826187.215512] [jenkins03:21891:0]         sys.c:744  MXM  WARN  Conflicting CPU frequencies detected, using: 3499.45
+ echo 255
255
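
The two runs above differ only in the `-mca coll '^hcoll'` option, so they can be compared from a single script. The following is a minimal sketch, not the actual `run.sh` (its contents are not shown in this report). The `run_case` function and the `MPIRUN`, `HELLO`, and `DRY_RUN` variables are hypothetical names introduced for illustration, and the run-specific `-mca orte_tmpdir_base` argument is left out. By default the script only prints the commands; set `DRY_RUN=0` to execute them.

```shell
#!/bin/sh
# Hypothetical A/B reproducer: same mpirun options as in the report,
# with the hcoll exclusion toggled by the first argument.
MPIRUN=${MPIRUN:-./mpirun}     # assumption: path to the mpirun under test
HELLO=${HELLO:-./hello_c}      # assumption: path to the hello_c example binary

run_case() {
    mode=$1                    # "no-hcoll" or "with-hcoll"
    # Options common to both runs, taken from the report.
    set -- -np 8 -bind-to none \
        -mca btl_openib_if_include mlx5_0:1 \
        -x MXM_RDMA_PORTS=mlx5_0:1 -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_TLS=rc,cm \
        -mca pml ob1 -mca btl self,openib
    # The only difference between the two runs.
    if [ "$mode" = "no-hcoll" ]; then
        set -- "$@" -mca coll '^hcoll'
    fi
    set -- "$@" \
        -mca btl_openib_receive_queues X,4096,1024:X,12288,512:X,65536,512 \
        taskset -c 6,7 "$HELLO"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Dry run: print the command instead of executing it.
        printf '%s\n' "$MPIRUN $*"
    else
        "$MPIRUN" "$@"
    fi
}

for mode in no-hcoll with-hcoll; do
    run_case "$mode"
    echo "$mode: exit status $?"
done
```

Running both cases back to back makes it easier to confirm that the failure follows the hcoll setting rather than the XRC receive-queue configuration, which is identical in both commands.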
