
[AVX-512] clz(32 x u8) and clz(64 x u8) should use an algorithm similar to avx2 #110308

@Validark

Description

This Zig code:

export fn foo(x: @Vector(32, u8)) @TypeOf(x) {
    return @clz(x);
}

This lowers to the following LLVM IR:

define dso_local range(i8 0, 9) <32 x i8> @foo(<32 x i8> %0) local_unnamed_addr {
Entry:
  %1 = tail call range(i8 0, 9) <32 x i8> @llvm.ctlz.v32i8(<32 x i8> %0, i1 false)
  ret <32 x i8> %1
}

declare <32 x i8> @llvm.ctlz.v32i8(<32 x i8>, i1 immarg) #1

This results in the following emit for Zen 4:

.LCPI0_0:
        .zero   32,24
foo:
        vpmovzxbd       zmm1, xmm0
        vextracti128    xmm0, ymm0, 1
        vpmovzxbd       zmm0, xmm0
        vplzcntd        zmm1, zmm1
        vplzcntd        zmm0, zmm0
        vpmovdb xmm1, zmm1
        vpmovdb xmm0, zmm0
        vinserti128     ymm0, ymm1, xmm0, 1
        vpsubb  ymm0, ymm0, ymmword ptr [rip + .LCPI0_0]
        ret

llvm-mca claims this should take ~23 cycles per iteration.

This is pretty unfortunate because if we downgrade to Zen 3, we get:

.LCPI0_1:
        .zero   32,15
.LCPI0_2:
        .byte   4
        .byte   3
        .byte   2
        .byte   2
        .byte   1
        .byte   1
        .byte   1
        .byte   1
        .byte   0
        .byte   0
        .byte   0
        .byte   0
        .byte   0
        .byte   0
        .byte   0
        .byte   0
foo:
        vbroadcasti128  ymm1, xmmword ptr [rip + .LCPI0_2]
        vpxor   xmm3, xmm3, xmm3
        vpshufb ymm2, ymm1, ymm0
        vpsrlw  ymm0, ymm0, 4
        vpand   ymm0, ymm0, ymmword ptr [rip + .LCPI0_1]
        vpcmpeqb        ymm3, ymm0, ymm3
        vpshufb ymm0, ymm1, ymm0
        vpand   ymm2, ymm2, ymm3
        vpaddb  ymm0, ymm2, ymm0
        ret

llvm-mca says Zen 3 should be able to compute this in ~11 cycles per iteration.

We can reproduce this functionality in Zig like so:

const std = @import("std");

// `vpshufb` is a small wrapper around the x86 byte shuffle;
// its definition is in the Godbolt link below.
export fn foo2(x: @Vector(32, u8)) @TypeOf(x) {
    const zero: @TypeOf(x) = @splat(0);
    const vec: @TypeOf(x) = comptime std.simd.repeat(
        @sizeOf(@TypeOf(x)),
        [16]u8{ 4, 3, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0 },
    );
    return @select(u8, x == zero, vpshufb(vec, x), zero) +
        vpshufb(vec, x >> @splat(4));
}

This gives us functionally equivalent assembly, albeit reordered. llvm-mca says we can get ~10 cycles per iteration with the instructions reordered; not sure if there is anything to that. Godbolt link here

Upgrading back to Zen 4, foo2 gives us:

.LCPI0_2:
        .byte   4
        .byte   3
        .byte   2
        .byte   2
        .byte   1
        .byte   1
        .byte   1
        .byte   1
        .byte   0
        .byte   0
        .byte   0
        .byte   0
        .byte   0
        .byte   0
        .byte   0
        .byte   0
.LCPI0_3:
        .byte   0
        .byte   0
        .byte   0
        .byte   0
        .byte   128
        .byte   64
        .byte   32
        .byte   16
foo2:
        vbroadcasti128  ymm1, xmmword ptr [rip + .LCPI0_2]
        vptestnmb       k1, ymm0, ymm0
        vpshufb ymm2, ymm1, ymm0
        vgf2p8affineqb  ymm0, ymm0, qword ptr [rip + .LCPI0_3]{1to4}, 0
        vpshufb ymm0, ymm1, ymm0
        vpaddb  ymm0 {k1}, ymm0, ymm2
        ret

llvm-mca says this gives us ~12 cycles of latency per iteration, which, I notice, is higher than the ~10 cycles from Zen 3.

Not sure what the best course of action is, but probably one or both of these things should be done:

  1. The latter implementation should be used to implement @clz(32 x u8), or at least @clz(64 x u8).
  2. The Zen 3 implementation should be lifted directly to Zen 4 for @clz(32 x u8).

Thank you!
