
[LLVM] Suboptimal code generated for rounding right shifts on NEON, AArch64 SIMD, LSX, and RISC-V V #147863

Open
@johnplatts

Description

Here is a link to a snippet for which LLVM generates suboptimal code on NEON, AArch64 SIMD, LSX, and RISC-V V:
https://alive2.llvm.org/ce/z/V82FtF

Alive2 verifies that rewriting (a >> b) + ((b == 0) ? 0 : ((a >> (b - 1)) & 1)) as (b == 0) ? a : ((a >> (b - 1)) - (a >> b)) is correct.
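
For reference, here is a minimal scalar C sketch of the two forms (the names rshr_src and rshr_tgt are illustrative; the actual Alive2 snippet operates on <4 x i32> vectors, per the listings below):

#include <stdint.h>

/* src form: shift, then add back the last bit shifted out,
   guarding the b == 0 case. */
uint32_t rshr_src(uint32_t a, uint32_t b) {
    return (a >> b) + ((b == 0) ? 0 : ((a >> (b - 1)) & 1));
}

/* tgt form verified by Alive2: for b != 0,
   a >> (b - 1) == 2 * (a >> b) + rounding bit,
   so the subtraction yields (a >> b) + rounding bit. */
uint32_t rshr_tgt(uint32_t a, uint32_t b) {
    return (b == 0) ? a : ((a >> (b - 1)) - (a >> b));
}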

The above snippet can be further optimized to the following on ARMv7 NEON (arm-linux-gnueabihf):

src1:                                   @ @src1
        vneg.s32        q9, q1
        vrshl.u32       q0, q0, q9
        mov     pc, lr
tgt1:                                   @ @tgt1
        vneg.s32        q9, q1
        vrshl.u32       q0, q0, q9
        mov     pc, lr
src2:                                   @ @src2
        vneg.s32        q9, q1
        vrshl.s32       q0, q0, q9
        mov     pc, lr
tgt2:                                   @ @tgt2
        vneg.s32        q9, q1
        vrshl.s32       q0, q0, q9
        mov     pc, lr
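
As a sketch of what the backend could target here: the ACLE rounding-shift intrinsics express exactly this pattern, since VRSHL takes a signed per-lane shift amount and treats negative amounts as rounding right shifts (the wrapper names below are illustrative):

#include <arm_neon.h>

/* Per-lane rounding right shift of a by b: VRSHL by -b.
   Compiles to the vneg.s32 + vrshl.u32 pair shown above. */
uint32x4_t rshr_u32x4(uint32x4_t a, int32x4_t b) {
    return vrshlq_u32(a, vnegq_s32(b));
}

/* Signed variant: vneg.s32 + vrshl.s32. */
int32x4_t rshr_s32x4(int32x4_t a, int32x4_t b) {
    return vrshlq_s32(a, vnegq_s32(b));
}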

The above snippet can be further optimized to the following on AArch64:

src1:                                   // @src1
        neg     v1.4s, v1.4s
        urshl   v0.4s, v0.4s, v1.4s
        ret
tgt1:                                   // @tgt1
        neg     v1.4s, v1.4s
        urshl   v0.4s, v0.4s, v1.4s
        ret
src2:                                   // @src2
        neg     v1.4s, v1.4s
        srshl   v0.4s, v0.4s, v1.4s
        ret
tgt2:                                   // @tgt2
        neg     v1.4s, v1.4s
        srshl   v0.4s, v0.4s, v1.4s
        ret
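
The same vrshlq_u32/vrshlq_s32 calls sketched above should map to these neg + urshl and neg + srshl pairs on AArch64, since URSHL and SRSHL likewise interpret negative per-lane shift amounts as rounding right shifts.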

The above snippet can be further optimized to the following on LoongArch64 with LSX:

src1:                                   # @src1
        vsrlr.w  $vr0, $vr0, $vr1
        ret
tgt1:                                   # @tgt1
        vsrlr.w  $vr0, $vr0, $vr1
        ret
src2:                                   # @src2
        vsrar.w  $vr0, $vr0, $vr1
        ret
tgt2:                                   # @tgt2
        vsrar.w  $vr0, $vr0, $vr1
        ret
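
LSX has dedicated rounding right shift instructions, so no shift negation is needed. A minimal sketch assuming the lsxintrin.h builtins shipped by Clang and GCC (wrapper names illustrative):

#include <lsxintrin.h>

/* vsrlr.w: logical (unsigned) rounding right shift per 32-bit lane. */
__m128i rshr_u32x4(__m128i a, __m128i b) {
    return __lsx_vsrlr_w(a, b);
}

/* vsrar.w: arithmetic (signed) rounding right shift. */
__m128i rshr_s32x4(__m128i a, __m128i b) {
    return __lsx_vsrar_w(a, b);
}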

The above snippet can be further optimized to the following on 64-bit RISC-V with the "V" extension:

src1:                                   # @src1
        csrwi   vxrm, 0
        vsetivli        zero, 4, e32, m1, ta, ma
        vssrl.vv        v8, v8, v9
        ret
tgt1:                                   # @tgt1
        csrwi   vxrm, 0
        vsetivli        zero, 4, e32, m1, ta, ma
        vssrl.vv        v8, v8, v9
        ret
src2:                                   # @src2
        csrwi   vxrm, 0
        vsetivli        zero, 4, e32, m1, ta, ma
        vssra.vv        v8, v8, v9
        ret
tgt2:                                   # @tgt2
        csrwi   vxrm, 0
        vsetivli        zero, 4, e32, m1, ta, ma
        vssra.vv        v8, v8, v9
        ret
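
With the RVV C intrinsics (the v0.12+ interface, where the fixed-point rounding mode is an explicit argument), the same operations look roughly like this; __RISCV_VXRM_RNU is round-to-nearest-up, matching the csrwi vxrm, 0 in the listing (wrapper names illustrative):

#include <riscv_vector.h>

/* vssrl.vv with RNU rounding: unsigned rounding right shift. */
vuint32m1_t rshr_u32(vuint32m1_t a, vuint32m1_t b, size_t vl) {
    return __riscv_vssrl_vv_u32m1(a, b, __RISCV_VXRM_RNU, vl);
}

/* vssra.vv: signed variant; the shift-amount vector is unsigned. */
vint32m1_t rshr_s32(vint32m1_t a, vuint32m1_t b, size_t vl) {
    return __riscv_vssra_vv_i32m1(a, b, __RISCV_VXRM_RNU, vl);
}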
