Description
This code:
export fn foo(x: @Vector(32, u8)) @TypeOf(x) {
    return @clz(x);
}
LLVM IR version:
define dso_local range(i8 0, 9) <32 x i8> @foo(<32 x i8> %0) local_unnamed_addr {
Entry:
%1 = tail call range(i8 0, 9) <32 x i8> @llvm.ctlz.v32i8(<32 x i8> %0, i1 false)
ret <32 x i8> %1
}
declare <32 x i8> @llvm.ctlz.v32i8(<32 x i8>, i1 immarg) #1
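For reference, @clz on a u8 lane counts leading zero bits, so every lane of the result is in 0 through 8, which is what the range(i8 0, 9) annotation above encodes. A quick scalar sanity check (my own, not from the report):
const std = @import("std");

test "scalar @clz on u8" {
    // @clz on a u8 returns a u4 holding the number of leading zero bits (0 through 8).
    try std.testing.expectEqual(@as(u4, 8), @clz(@as(u8, 0)));
    try std.testing.expectEqual(@as(u4, 5), @clz(@as(u8, 5))); // 0b0000_0101
    try std.testing.expectEqual(@as(u4, 0), @clz(@as(u8, 0xFF)));
}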
On Zen 4, this results in the following emit:
.LCPI0_0:
.zero 32,24
foo:
vpmovzxbd zmm1, xmm0
vextracti128 xmm0, ymm0, 1
vpmovzxbd zmm0, xmm0
vplzcntd zmm1, zmm1
vplzcntd zmm0, zmm0
vpmovdb xmm1, zmm1
vpmovdb xmm0, zmm0
vinserti128 ymm0, ymm1, xmm0, 1
vpsubb ymm0, ymm0, ymmword ptr [rip + .LCPI0_0]
ret
LLVM-mca claims this should take ~23 cycles per iteration.
This is pretty unfortunate because if we downgrade to Zen 3, we get:
.LCPI0_1:
.zero 32,15
.LCPI0_2:
.byte 4
.byte 3
.byte 2
.byte 2
.byte 1
.byte 1
.byte 1
.byte 1
.byte 0
.byte 0
.byte 0
.byte 0
.byte 0
.byte 0
.byte 0
.byte 0
foo:
vbroadcasti128 ymm1, xmmword ptr [rip + .LCPI0_2]
vpxor xmm3, xmm3, xmm3
vpshufb ymm2, ymm1, ymm0
vpsrlw ymm0, ymm0, 4
vpand ymm0, ymm0, ymmword ptr [rip + .LCPI0_1]
vpcmpeqb ymm3, ymm0, ymm3
vpshufb ymm0, ymm1, ymm0
vpand ymm2, ymm2, ymm3
vpaddb ymm0, ymm2, ymm0
ret
LLVM-mca says Zen 3 should be able to compute this in ~11 cycles per iteration.
We can reproduce this functionality in Zig like so:
const std = @import("std");

export fn foo2(x: @Vector(32, u8)) @TypeOf(x) {
    const vec: @TypeOf(x) = comptime std.simd.repeat(@sizeOf(@TypeOf(x)), [16]u8{ 4, 3, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0 });
    return @select(u8, x == @as(@TypeOf(x), @splat(0)), vpshufb(vec, x), @as(@TypeOf(x), @splat(0))) + vpshufb(vec, x >> @splat(4));
}
This gives us functionally equivalent assembly, albeit reordered; LLVM-mca says the reordered instructions get ~10 cycles per iteration. I'm not sure if there is anything to that. Godbolt link here
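(vpshufb above is a helper that isn't shown in the snippet; it is assumed to wrap the x86 byte-shuffle instruction. A minimal sketch of such a helper, assuming x86-64 with AVX2 and using Zig inline assembly, could look like the following; the actual definition behind the Godbolt results may differ.)
fn vpshufb(table: @Vector(32, u8), indices: @Vector(32, u8)) @Vector(32, u8) {
    // Hypothetical helper: per-byte table lookup with pshufb semantics, i.e. lanes
    // whose index byte has the high bit set come back as zero.
    // AT&T operand order: control (indices), source (table), destination.
    return asm ("vpshufb %[indices], %[table], %[out]"
        : [out] "=x" (-> @Vector(32, u8)),
        : [table] "x" (table),
          [indices] "x" (indices),
    );
}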
Upgrading back to Zen 4, foo2 gives us:
.LCPI0_2:
.byte 4
.byte 3
.byte 2
.byte 2
.byte 1
.byte 1
.byte 1
.byte 1
.byte 0
.byte 0
.byte 0
.byte 0
.byte 0
.byte 0
.byte 0
.byte 0
.LCPI0_3:
.byte 0
.byte 0
.byte 0
.byte 0
.byte 128
.byte 64
.byte 32
.byte 16
foo2:
vbroadcasti128 ymm1, xmmword ptr [rip + .LCPI0_2]
vptestnmb k1, ymm0, ymm0
vpshufb ymm2, ymm1, ymm0
vgf2p8affineqb ymm0, ymm0, qword ptr [rip + .LCPI0_3]{1to4}, 0
vpshufb ymm0, ymm1, ymm0
vpaddb ymm0 {k1}, ymm0, ymm2
ret
LLVM-mca says this gives us ~12 cycles of latency per iteration, which, I notice, is higher than the ~10 cycles from Zen 3.
Not sure what the best course of action is, but probably one or both of these things should be done:
- The latter implementation should be used to implement @clz(32 x u8), or at least @clz(64 x u8)
- The Zen 3 implementation should be lifted directly to Zen 4 for @clz(32 x u8).
Thank you!