@frasercrmck
Contributor

@frasercrmck frasercrmck commented Dec 11, 2024

These functions all map to the corresponding LLVM intrinsics, but the vector intrinsics weren't being generated. The mapping from CLC vector function to vector intrinsic was working correctly, but the mapping from OpenCL builtin to CLC function was recursively splitting vectors into halves, which is suboptimal.

For example, with this change, ceil(float16) calls llvm.ceil.v16f32 directly once optimizations are applied.
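
To make the difference between the two shapes concrete, here is a minimal sketch in plain C rather than OpenCL C, so that it can be compiled on a host. It uses fabs (another function touched by this PR) instead of ceil to avoid depending on libm. The names `f32x2`, `f32x4`, `abs4_split` and `abs4_direct` are hypothetical and are not libclc's. The vector types come from the GCC/Clang `vector_size` extension, and the whole-vector path assumes a compiler that provides `__builtin_elementwise_abs`; otherwise it falls back to a per-lane loop.

```c
/* Fixed-width vector types via the GCC/Clang vector_size extension.
   Hypothetical stand-ins for OpenCL's float2 and float4. */
typedef float f32x2 __attribute__((vector_size(8)));
typedef float f32x4 __attribute__((vector_size(16)));

/* Detect Clang's elementwise builtin; other compilers take the fallback. */
#if defined(__has_builtin)
#  if __has_builtin(__builtin_elementwise_abs)
#    define HAVE_ELEMENTWISE_ABS 1
#  endif
#endif

/* Scalar leaf: the only operation the halving strategy ever reaches. */
float abs1(float x) { return __builtin_fabsf(x); }

/* Old shape: a 2-wide call is built from two scalar calls... */
f32x2 abs2_split(f32x2 v) {
    f32x2 r = { abs1(v[0]), abs1(v[1]) };
    return r;
}

/* ...and a 4-wide call is built from two 2-wide calls. */
f32x4 abs4_split(f32x4 v) {
    f32x2 lo = { v[0], v[1] };
    f32x2 hi = { v[2], v[3] };
    lo = abs2_split(lo);
    hi = abs2_split(hi);
    f32x4 r = { lo[0], lo[1], hi[0], hi[1] };
    return r;
}

/* New shape: a single operation on the whole vector. */
f32x4 abs4_direct(f32x4 v) {
#ifdef HAVE_ELEMENTWISE_ABS
    return __builtin_elementwise_abs(v);
#else
    /* Fallback (assumption: no elementwise builtin): per-lane loop. */
    f32x4 r = v;
    for (int i = 0; i < 4; ++i) r[i] = __builtin_fabsf(v[i]);
    return r;
#endif
}
```

Both functions compute the same result. The difference is only in what the optimizer is given: the split version has to reassemble a vector operation from scalar pieces, while the direct version hands over the whole vector at once.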

Additionally, instead of generating LLVM intrinsics through __asm, each CLC builtin now calls the corresponding clang elementwise builtin. This should be a more standard way of achieving the same result.

The CLC versions of each of these builtins are also now built and enabled for SPIR-V targets. The LLVM -> SPIR-V translator maps the intrinsics to the appropriate OpExtInst, so there should be no difference in semantics, despite the newly introduced indirection from OpenCL builtin through the CLC builtin to the intrinsic.

The AMDGPU targets make use of the same _CLC_DEFINE_UNARY_BUILTIN macro to override sqrt, so those functions also appear more optimal with this change, calling the vector llvm.sqrt.vXf32 intrinsics directly.

These functions all map to the corresponding LLVM intrinsics, but the
vector intrinsics weren't being generated. The mapping from CLC vector
function to vector intrinsic was working correctly, but the mapping from
OpenCL builtin to CLC function was recursively splitting vectors into
halves, which is suboptimal.

For example, with this change, `ceil(float16)` calls `llvm.ceil.v16f32`
directly.

The CLC versions of each of these builtins are also now enabled for
SPIR-V targets. The LLVM -> SPIR-V translator maps the intrinsics to the
appropriate OpExtInst. As such, there is no diff to the SPIR-V binaries
before/after this change.

The clspv targets show a difference, but it's not expected to be a
problem:

    >   %call = tail call spir_func double @llvm.fabs.f64(double noundef %x) #9
    <   %call = tail call spir_func double @_Z4fabsd(double noundef %x) #9

The AMDGPU targets make use of the same _CLC_DEFINE_UNARY_BUILTIN macro
to override sqrt, so those functions also appear more optimal with this
change, calling the vector `llvm.sqrt.vXf32` intrinsics directly.
@frasercrmck frasercrmck added the libclc libclc OpenCL library label Dec 11, 2024
@frasercrmck frasercrmck requested a review from arsenm December 11, 2024 17:40
@frasercrmck
Contributor Author

CC @rjodinchr, @karolherbst

Contributor

@arsenm arsenm left a comment

LGTM. I'm not sure how this all ends up expanding, I was expecting to see the elementwise builtins used.

It would be great if we had update_cc_test_checks style testing for the resulting implementation

@frasercrmck
Contributor Author

> LGTM. I'm not sure how this all ends up expanding, I was expecting to see the elementwise builtins used.

Yes, I suspect that this code originates from before the builtins were available? The builtins would probably make more sense, tbh. The current method is that we have the OpenCL builtin call the corresponding CLC builtin, which in its header uses this strange __asm__ method of calling LLVM intrinsics directly. It should maybe just do: OpenCL builtin -> CLC builtin -> clang builtin?

> It would be great if we had update_cc_test_checks style testing for the resulting implementation

Oh yes, I agree. My efforts to introduce testing stalled somewhat. Maybe we can pick up that discussion on #87989?

@arsenm
Contributor

arsenm commented Dec 12, 2024

> which in its header uses this strange __asm__ method of calling LLVM intrinsics directly.

That's something that's always surprised me it works. It's rather unsafe (you can bypass immarg validation for instance). Plus asm callsites get infected with overly conservative attributes (like convergent, which you can't remove)

> It should maybe just do: OpenCL builtin -> CLC builtin -> clang builtin?

That's the simplest way to go

@frasercrmck
Contributor Author

> which in its header uses this strange __asm__ method of calling LLVM intrinsics directly.

> That's something that's always surprised me it works. It's rather unsafe (you can bypass immarg validation for instance). Plus asm callsites get infected with overly conservative attributes (like convergent, which you can't remove)

Yeah, good point.

> It should maybe just do: OpenCL builtin -> CLC builtin -> clang builtin?

> That's the simplest way to go

I've updated the patch to do just that, using the builtins. I'll update the description accordingly.
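
For readers following along, the "OpenCL builtin -> CLC builtin -> clang builtin" chain can be sketched roughly as below. This is host-compilable C, not the actual libclc source: the macro `DEFINE_UNARY_FORWARD` and the `ocl_fabs_*`, `clc_fabs_*` and `builtin_fabs_*` names are hypothetical, chosen only to show the layering. It again uses fabs, and it assumes a compiler with `__builtin_elementwise_abs` for the whole-vector path, with a per-lane fallback otherwise.

```c
/* Vector type via the GCC/Clang vector_size extension.
   Hypothetical stand-in for OpenCL's float4. */
typedef float f32x4 __attribute__((vector_size(16)));

/* Detect Clang's elementwise builtin; other compilers take the fallback. */
#if defined(__has_builtin)
#  if __has_builtin(__builtin_elementwise_abs)
#    define HAVE_ELEMENTWISE_ABS 1
#  endif
#endif

/* Layer 3: the compiler builtin, one entry point per width. */
float builtin_fabs_f32(float x) { return __builtin_fabsf(x); }

f32x4 builtin_fabs_f32x4(f32x4 x) {
#ifdef HAVE_ELEMENTWISE_ABS
    return __builtin_elementwise_abs(x);  /* one whole-vector operation */
#else
    /* Fallback (assumption: no elementwise builtin): per-lane loop. */
    f32x4 r = x;
    for (int i = 0; i < 4; ++i) r[i] = __builtin_fabsf(x[i]);
    return r;
#endif
}

/* Hypothetical forwarding macro: defines NAME_f32 and NAME_f32x4, each a
   thin wrapper over the IMPL function of the same width. Neither wrapper
   splits the vector. */
#define DEFINE_UNARY_FORWARD(NAME, IMPL)                       \
    float NAME##_f32(float x) { return IMPL##_f32(x); }        \
    f32x4 NAME##_f32x4(f32x4 x) { return IMPL##_f32x4(x); }

/* Layer 2: the "CLC" builtin forwards to the compiler builtin. */
DEFINE_UNARY_FORWARD(clc_fabs, builtin_fabs)

/* Layer 1: the "OpenCL" builtin forwards to the CLC builtin. */
DEFINE_UNARY_FORWARD(ocl_fabs, clc_fabs)
```

Because every layer forwards at the same vector width, inlining should leave only the innermost operation; whether that ends up as a single vector intrinsic call depends on the compiler and optimization level.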

@frasercrmck frasercrmck merged commit 06789cc into llvm:main Dec 13, 2024
8 checks passed
@frasercrmck frasercrmck deleted the libclc-optimize-intrinsics branch December 13, 2024 08:47