[libclc] Optimize ceil/fabs/floor/rint/trunc #119596
Conversation
These functions all map to the corresponding LLVM intrinsics, but the
vector intrinsics weren't being generated. The intrinsic mapping from
CLC vector function to vector intrinsic was working correctly, but the
mapping from OpenCL builtin to CLC function was suboptimally recursively
splitting vectors in halves.
For example, with this change, `ceil(float16)` calls `llvm.ceil.v16f32`
directly.
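Roughly, the old splitting pattern looked like the sketch below (plain OpenCL C; `my_ceil` is an illustrative name, not the verbatim libclc source):

```c
/* Sketch of the recursive vector splitting: each wide overload is built from
 * two calls on the half-width vectors, so a float16 call reaches the backend
 * as a tree of narrower operations instead of one llvm.ceil.v16f32. */
__attribute__((overloadable)) float8 my_ceil(float8 x);

__attribute__((overloadable)) float16 my_ceil(float16 x) {
  /* x.lo and x.hi are the standard OpenCL accessors for the two float8 halves. */
  return (float16)(my_ceil(x.lo), my_ceil(x.hi));
}
```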
The CLC versions of each of these builtins are also now enabled for
SPIR-V targets. The LLVM -> SPIR-V translator maps the intrinsics to the
appropriate OpExtInst. As such, there is no diff to the SPIR-V binaries
before/after this change.
The clspv targets show a difference, but it's not expected to be a
problem:
```
> %call = tail call spir_func double @llvm.fabs.f64(double noundef %x) #9
< %call = tail call spir_func double @_Z4fabsd(double noundef %x) #9
```
The AMDGPU targets make use of the same `_CLC_DEFINE_UNARY_BUILTIN` macro to override `sqrt`, so those functions also appear more optimal with this change, calling the vector `llvm.sqrt.vXf32` intrinsics directly.
arsenm left a comment:
LGTM. I'm not sure how this all ends up expanding; I was expecting to see the elementwise builtins used.
It would be great if we had `update_cc_test_checks`-style testing for the resulting implementation.
Yes, I suspect that this code originates from before the builtins were available? The builtins would probably make more sense, tbh. The current method is that we have the OpenCL builtin call the corresponding CLC builtin, which in its header uses this strange `__asm` construct to map the function directly to the LLVM intrinsic.
Oh yes, I agree. My efforts to introduce testing stalled somewhat. Maybe we can pick up that discussion on #87989?
That's something that's always surprised me it works. It's rather unsafe (you can bypass immarg validation for instance). Plus asm callsites get infected with overly conservative attributes (like convergent, which you can't remove)
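For context, the `__asm` mechanism under discussion is roughly of this shape (an illustrative sketch, not a verbatim libclc header; `my_clc_floor` is a hypothetical name):

```c
/* An ordinary function declaration whose assembler name is forced to that of
 * an LLVM intrinsic, so call sites are emitted as calls to the intrinsic even
 * though clang never validates them as intrinsic calls. */
float my_clc_floor(float x) __asm("llvm.floor.f32");

float call_floor(float x) {
  /* Lowers to a call to @llvm.floor.f32, but with whatever conservative
   * call-site attributes an unknown external function call would get. */
  return my_clc_floor(x);
}
```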
That's the simplest way to go.
Yeah, good point.
I've updated the patch to do just that, using the builtins. I'll update the description accordingly.
These functions all map to the corresponding LLVM intrinsics, but the vector intrinsics weren't being generated. The intrinsic mapping from CLC vector function to vector intrinsic was working correctly, but the mapping from OpenCL builtin to CLC function was suboptimally recursively splitting vectors in halves.
For example, with this change, `ceil(float16)` calls `llvm.ceil.v16f32` directly once optimizations are applied.

Also, instead of generating LLVM intrinsics through `__asm`, we now call clang elementwise builtins for each CLC builtin. This should be a more standard way of achieving the same result.

The CLC versions of each of these builtins are also now built and enabled for SPIR-V targets. The LLVM -> SPIR-V translator maps the intrinsics to the appropriate OpExtInst, so there should be no difference in semantics, despite the newly introduced indirection from OpenCL builtin through the CLC builtin to the intrinsic.
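As a sketch of the new approach (illustrative function names; not the verbatim libclc sources), each CLC-level function can now simply forward to the corresponding clang elementwise builtin:

```c
/* __builtin_elementwise_ceil accepts scalar and vector floating-point
 * arguments and lowers to the matching llvm.ceil.* intrinsic, so the float16
 * overload becomes a single llvm.ceil.v16f32 call. */
__attribute__((overloadable)) float my_clc_ceil(float x) {
  return __builtin_elementwise_ceil(x);
}

__attribute__((overloadable)) float16 my_clc_ceil(float16 x) {
  return __builtin_elementwise_ceil(x);
}
```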
The AMDGPU targets make use of the same `_CLC_DEFINE_UNARY_BUILTIN` macro to override `sqrt`, so those functions also appear more optimal with this change, calling the vector `llvm.sqrt.vXf32` intrinsics directly.
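A rough illustration of how such a macro-based override ends up at the vector intrinsics (the macro shown here is a guess at the general shape, not the actual `_CLC_DEFINE_UNARY_BUILTIN` definition, and `my_sqrt` is a hypothetical name):

```c
/* Hypothetical macro in the spirit of _CLC_DEFINE_UNARY_BUILTIN: define one
 * overload of BUILTIN that forwards to FUNCTION. */
#define MY_DEFINE_UNARY_BUILTIN(RET_TYPE, BUILTIN, FUNCTION, ARG_TYPE)        \
  __attribute__((overloadable)) RET_TYPE BUILTIN(ARG_TYPE x) {                \
    return FUNCTION(x);                                                       \
  }

/* Forwarding to the elementwise builtin lets a target's sqrt override lower
 * straight to llvm.sqrt.f32, llvm.sqrt.v4f32, and so on. */
MY_DEFINE_UNARY_BUILTIN(float, my_sqrt, __builtin_elementwise_sqrt, float)
MY_DEFINE_UNARY_BUILTIN(float4, my_sqrt, __builtin_elementwise_sqrt, float4)
```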