Including openib in the btl list causes horrible vader or sm performance #1252

@gpaulsen

On the master branch:

I'm observing strange behavior. I think openib may be using too large a hammer for NUMA memory binding, possibly setting the wrong memory binding policy for the vader and sm shared-memory segments. I've reached this conclusion only empirically, from the performance numbers.
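One way to check the memory-binding hypothesis directly is to ask the kernel, from inside each rank, which memory policy is in effect and which NUMA node a freshly touched shared page lands on. The following is a minimal Linux-only sketch, not Open MPI code. The function names are hypothetical, it calls the raw get_mempolicy(2) syscall so that libnuma is not needed, and an anonymous MAP_SHARED page stands in for a real vader/sm segment.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Flag values from the Linux uapi header <linux/mempolicy.h>. */
#ifndef MPOL_F_NODE
#define MPOL_F_NODE (1 << 0)
#endif
#ifndef MPOL_F_ADDR
#define MPOL_F_ADDR (1 << 1)
#endif

/* Memory policy mode of the calling thread (0 = MPOL_DEFAULT, 1 = MPOL_PREFERRED,
 * 2 = MPOL_BIND, 3 = MPOL_INTERLEAVE, ...), or -1 if the call fails, e.g. on a
 * kernel built without NUMA support. Mode flag bits may be OR'd into the result. */
static int current_mempolicy_mode(void)
{
    int mode = -1;
    if (syscall(SYS_get_mempolicy, &mode, NULL, 0UL, NULL, 0UL) != 0)
        return -1;
    return mode;
}

/* NUMA node backing the page that contains addr (the kernel faults it in if
 * needed), or -1 on failure. */
static int numa_node_of(const void *addr)
{
    int node = -1;
    if (syscall(SYS_get_mempolicy, &node, NULL, 0UL, addr,
                (unsigned long)(MPOL_F_NODE | MPOL_F_ADDR)) != 0)
        return -1;
    return node;
}

/* Map one shared page (a stand-in for a shared-memory BTL segment), touch it
 * from the calling process, and report which NUMA node it landed on. */
static int shared_page_node(void)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    char *seg = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (seg == MAP_FAILED)
        return -1;
    memset(seg, 0, (size_t)page);   /* first touch allocates under the current policy */
    int node = numa_node_of(seg);
    munmap(seg, (size_t)page);
    return node;
}
```

Calling these after MPI_Init in each rank, once with openib in the btl list and once without, could show whether the policy or the page placement changes. A mode other than 0 (MPOL_DEFAULT), or a node that does not correspond to the rank's socket in the --report-bindings output, would be worth following up; `numastat -p <pid>` gives a similar per-process view from the command line.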

My test node runs RHEL 6.5 and has a single Mellanox Technologies MT25204 [InfiniHost III Lx HCA] ConnectX-3 card with one active port.

Bad latency, single-host run (openib,vader,self):

$ mpirun -host "mpi03" -np 4 --bind-to core --report-bindings --mca btl openib,vader,self ./ping_pong_ring.x2
[mpi03:12941] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../..][../../../../../../../..]
[mpi03:12941] MCW rank 1 bound to socket 1[core 8[hwt 0-1]]: [../../../../../../../..][BB/../../../../../../..]
[mpi03:12941] MCW rank 2 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../..][../../../../../../../..]
[mpi03:12941] MCW rank 3 bound to socket 1[core 9[hwt 0-1]]: [../../../../../../../..][../BB/../../../../../..]
[0:mpi03] ping-pong 0 bytes ...
0 bytes: 7.11 usec/msg
[1:mpi03] ping-pong 0 bytes ...
0 bytes: 7.10 usec/msg
[2:mpi03] ping-pong 0 bytes ...
0 bytes: 7.15 usec/msg
[3:mpi03] ping-pong 0 bytes ...
0 bytes: 7.17 usec/msg

Similar behavior with sm:

$ mpirun -host "mpi03" -np 4 --bind-to core --report-bindings --mca btl openib,sm,self ./ping_pong_ring.x2
[mpi03:14928] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../..][../../../../../../../..]
[mpi03:14928] MCW rank 1 bound to socket 1[core 8[hwt 0-1]]: [../../../../../../../..][BB/../../../../../../..]
[mpi03:14928] MCW rank 2 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../..][../../../../../../../..]
[mpi03:14928] MCW rank 3 bound to socket 1[core 9[hwt 0-1]]: [../../../../../../../..][../BB/../../../../../..]
[0:mpi03] ping-pong 0 bytes ...
0 bytes: 7.45 usec/msg
[1:mpi03] ping-pong 0 bytes ...
0 bytes: 7.38 usec/msg
[2:mpi03] ping-pong 0 bytes ...
0 bytes: 7.35 usec/msg
[3:mpi03] ping-pong 0 bytes ...
0 bytes: 7.38 usec/msg

When I remove openib, the results look much better: latency drops from about 7.1 usec/msg to about 0.5 usec/msg, roughly 14x lower:

$ mpirun -host "mpi03" -np 4 --bind-to core --report-bindings --mca btl vader,self ./ping_pong_ring.x2
[mpi03:15819] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../..][../../../../../../../..]
[mpi03:15819] MCW rank 1 bound to socket 1[core 8[hwt 0-1]]: [../../../../../../../..][BB/../../../../../../..]
[mpi03:15819] MCW rank 2 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../..][../../../../../../../..]
[mpi03:15819] MCW rank 3 bound to socket 1[core 9[hwt 0-1]]: [../../../../../../../..][../BB/../../../../../..]
[0:mpi03] ping-pong 0 bytes ...
0 bytes: 0.50 usec/msg
[1:mpi03] ping-pong 0 bytes ...
0 bytes: 0.50 usec/msg
[2:mpi03] ping-pong 0 bytes ...
0 bytes: 0.49 usec/msg
[3:mpi03] ping-pong 0 bytes ...
0 bytes: 0.51 usec/msg

Similar behavior with sm, though sm's latency (0.93-1.00 usec/msg) is roughly double vader's:

$ mpirun -host "mpi03" -np 4 --bind-to core --report-bindings --mca btl sm,self ./ping_pong_ring.x2
[mpi03:16608] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../..][../../../../../../../..]
[mpi03:16608] MCW rank 1 bound to socket 1[core 8[hwt 0-1]]: [../../../../../../../..][BB/../../../../../../..]
[mpi03:16608] MCW rank 2 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../..][../../../../../../../..]
[mpi03:16608] MCW rank 3 bound to socket 1[core 9[hwt 0-1]]: [../../../../../../../..][../BB/../../../../../..]
[0:mpi03] ping-pong 0 bytes ...
0 bytes: 0.98 usec/msg
[1:mpi03] ping-pong 0 bytes ...
0 bytes: 1.00 usec/msg
[2:mpi03] ping-pong 0 bytes ...
0 bytes: 0.95 usec/msg
[3:mpi03] ping-pong 0 bytes ...
0 bytes: 0.93 usec/msg

If I explicitly disable binding with --bind-to none, I see the expected results even when openib is included, with either vader or sm. Oddly, sm is now as fast as vader (about 0.50 usec/msg):

$ mpirun -host "mpi03" -np 4 --bind-to none --report-bindings --mca btl openib,vader,self ./ping_pong_ring.x2
[mpi03:20206] MCW rank 1 is not bound (or bound to all available processors)
[mpi03:20205] MCW rank 0 is not bound (or bound to all available processors)
[mpi03:20207] MCW rank 2 is not bound (or bound to all available processors)
[mpi03:20208] MCW rank 3 is not bound (or bound to all available processors)
[0:mpi03] ping-pong 0 bytes ...
0 bytes: 0.50 usec/msg
[1:mpi03] ping-pong 0 bytes ...
0 bytes: 0.50 usec/msg
[2:mpi03] ping-pong 0 bytes ...
0 bytes: 0.50 usec/msg
[3:mpi03] ping-pong 0 bytes ...
0 bytes: 0.49 usec/msg
$ mpirun -host "mpi03" -np 4 --bind-to none --report-bindings --mca btl openib,sm,self ./ping_pong_ring.x2
[mpi03:21058] MCW rank 0 is not bound (or bound to all available processors)
[mpi03:21059] MCW rank 1 is not bound (or bound to all available processors)
[mpi03:21060] MCW rank 2 is not bound (or bound to all available processors)
[mpi03:21061] MCW rank 3 is not bound (or bound to all available processors)
[0:mpi03] ping-pong 0 bytes ...
0 bytes: 0.50 usec/msg
[1:mpi03] ping-pong 0 bytes ...
0 bytes: 0.51 usec/msg
[2:mpi03] ping-pong 0 bytes ...
0 bytes: 0.51 usec/msg
[3:mpi03] ping-pong 0 bytes ...
0 bytes: 0.49 usec/msg
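To confirm from inside the benchmark what binding each rank actually received, a rank can inspect its own affinity mask. A small sketch, assuming Linux and glibc; the helper names are hypothetical:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stdio.h>

/* Number of CPUs in the calling process's affinity mask, or -1 on error. */
static int bound_cpu_count(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof set, &set) != 0)
        return -1;
    return CPU_COUNT(&set);
}

/* Print a one-line binding summary for the given rank. */
static void report_binding(int rank)
{
    assert(rank >= 0);
    printf("rank %d: %d CPU(s) in affinity mask, currently on CPU %d\n",
           rank, bound_cpu_count(), sched_getcpu());
}
```

With --bind-to core on this node, each rank would be expected to report 2 CPUs (hwt 0-1 of its core); with --bind-to none, every CPU on the node.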

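Finally, for completeness: the best 0-byte ping-pong ring times I could get (about 0.37 usec/msg) were with --bind-to core --map-by core, which places all four ranks on socket 0 instead of alternating them between sockets 0 and 1 as in the runs above: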

$ mpirun -host "mpi03" -np 4 --bind-to core --map-by core --report-bindings --mca btl vader,self ./ping_pong_ring.x2
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs1
[mpi03:32149] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../..][../../../../../../../..]
[mpi03:32149] MCW rank 1 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../..][../../../../../../../..]
[mpi03:32149] MCW rank 2 bound to socket 0[core 2[hwt 0-1]]: [../../BB/../../../../..][../../../../../../../..]
[mpi03:32149] MCW rank 3 bound to socket 0[core 3[hwt 0-1]]: [../../../BB/../../../..][../../../../../../../..]
[0:mpi03] ping-pong 0 bytes ...
0 bytes: 0.37 usec/msg
[1:mpi03] ping-pong 0 bytes ...
0 bytes: 0.37 usec/msg
[2:mpi03] ping-pong 0 bytes ...
0 bytes: 0.38 usec/msg
[3:mpi03] ping-pong 0 bytes ...
0 bytes: 0.38 usec/msg

I've attached my source for ping_pong_ring.c:

ping_pong_ring.txt
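The attached ping_pong_ring.c is not reproduced here. As an MPI-free baseline for the same node, the following hypothetical sketch (not the attached program, and not how vader is implemented) times 0-byte ping-pongs between a process and a forked child through one shared page, optionally pinning each to a CPU to loosely mimic --bind-to core.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Pin the calling process to one CPU; cpu < 0 leaves the binding alone.
 * Note: this permanently changes the caller's affinity. */
static void pin_to(int cpu)
{
    if (cpu < 0 || cpu >= CPU_SETSIZE)
        return;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    (void)sched_setaffinity(0, sizeof set, &set);   /* best effort */
}

/* Spin until *flag == value, yielding every 1024 spins in case both
 * processes share a CPU. */
static void wait_for(atomic_int *flag, int value)
{
    unsigned spins = 0;
    while (atomic_load_explicit(flag, memory_order_acquire) != value)
        if (++spins % 1024 == 0)
            sched_yield();
}

/* Hypothetical helper: average one-way latency in usec of a 0-byte "message"
 * between this process and a forked child through one shared page.
 * Returns a negative value on error. */
static double shm_pingpong_usec(int iters, int parent_cpu, int child_cpu)
{
    assert(iters > 0);
    atomic_int *flag = mmap(NULL, sizeof *flag, PROT_READ | PROT_WRITE,
                            MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (flag == MAP_FAILED)
        return -1.0;
    atomic_init(flag, 0);

    pid_t pid = fork();
    if (pid < 0) {
        munmap(flag, sizeof *flag);
        return -1.0;
    }
    if (pid == 0) {                /* child: answer each odd ping with an even pong */
        pin_to(child_cpu);
        for (int i = 0; i <= iters; i++) {
            wait_for(flag, 2 * i + 1);
            atomic_store_explicit(flag, 2 * i + 2, memory_order_release);
        }
        _exit(0);
    }

    pin_to(parent_cpu);
    /* Round 0 is an untimed warm-up that also waits for the child to start. */
    atomic_store_explicit(flag, 1, memory_order_release);
    wait_for(flag, 2);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 1; i <= iters; i++) {
        atomic_store_explicit(flag, 2 * i + 1, memory_order_release);  /* ping */
        wait_for(flag, 2 * i + 2);                                      /* pong */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    waitpid(pid, NULL, 0);
    munmap(flag, sizeof *flag);

    double usec = (double)(t1.tv_sec - t0.tv_sec) * 1e6
                + (double)(t1.tv_nsec - t0.tv_nsec) / 1e3;
    return usec / (2.0 * iters);   /* each round trip is two one-way hops */
}
```

For example, calling shm_pingpong_usec(100000, a, b) with CPU ids a and b on the same socket, and then on different sockets (lscpu or hwloc's lstopo shows the numbering), gives a rough hardware floor to compare against the MPI numbers above. Because the wait loop yields periodically and the result is not a real MPI message, treat it as indicative only.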
