We have observed this both in OSHMEM UCX testing and now in HCOLL shared-memory testing on the POWER8 Garrison platform. In the 2.x series, the BTLs are silently loaded to cover OSC, and with them loaded the 8-byte Allreduce latency of the HCOLL shared-memory collectives nearly doubles (21.9 vs. 11.58 in the runs below).
Running with 160 MPI processes (-np 160):
mpirun -np 160 -mca coll_hcoll_enable 1 ./IBM-MPI1 Allreduce
8-byte allreduce latency:
21.9
mpirun -np 160 -mca btl ^tcp,openib -mca coll_hcoll_enable 1 ./IBM-MPI1 Allreduce
8-byte allreduce latency:
11.58
@hjelmn and I discussed this at the meeting, and the root cause appears to be the asynchronous progress thread in the two BTLs excluded above, tcp and openib.
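Until the progress-thread behavior is addressed, the exclusion used in the second run can also be applied through the environment instead of the command line. The following is only a workaround sketch, not a fix: it relies on Open MPI reading MCA parameters from `OMPI_MCA_<name>` environment variables, and it assumes `./IBM-MPI1` is in the current directory as in the runs above. The guard around `mpirun` is added here purely so the snippet does nothing harmful on a machine without Open MPI.

```shell
#!/bin/sh
# Equivalent of "-mca btl ^tcp,openib -mca coll_hcoll_enable 1":
# exclude the tcp and openib BTLs and enable HCOLL via the environment.
export OMPI_MCA_btl='^tcp,openib'
export OMPI_MCA_coll_hcoll_enable=1

# Run the benchmark only if both mpirun and the binary are available
# (the binary path is taken from the commands above).
if command -v mpirun >/dev/null 2>&1 && [ -x ./IBM-MPI1 ]; then
    mpirun -np 160 ./IBM-MPI1 Allreduce
else
    echo "mpirun or ./IBM-MPI1 not found; MCA settings exported only"
fi
```

Setting the variables this way affects every subsequent `mpirun` in the same shell, which is convenient for repeated benchmark runs but easy to forget; unset them when comparing against the default BTL selection.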