You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
[SYCL][NVPTX][AMDGCN] Move devicelib cmath to header (#18706)
### Overview
Currently to support C++ builtins in SYCL kernels, we rely on
`libdevice` which
provides implementations for standard library builtins. This library is
built
either to bitcode or SPIR-V and linked in our kernels.
On some targets this causes issues because clang sometimes turns
standard
library calls into LLVM intrinsics that not all targets support.
Specifically on
NVPTX and AMDGCN we can't easily support these intrinsics because we
currently
use implementations provided by CUDA and HIP in the form of a bitcode
library,
which is not something we can use from the LLVM backend.
In upstream LLVM for CUDA and HIP kernels, the way this is handled is
that they
have clang headers providing device-side overloads of C++ library
functions that
hook into the target specific versions of the builtins (for example
`std::sin`
to `__nv_sin`). This way on the device side C++ builtins are hijacked
before
clang can turn them to intrinsics which solves the issue mentioned
above.
This patch is adding the infrastructure to support handling C++ builtins
in SYCL
in the same way as it is done for CUDA and HIP in upstream LLVM. And is
using it
to support `cmath` in NVPTX and AMDGCN compilation.
## Breakdown
* Add `sycl_device_only` attribute: This new attribute allows functions
marked
with it to be treated as device-side overload of existing functions.
This is
what allows us to overload C++ library functions for device in SYCL.
* Remove clang hack to prevent generating LLVM intrinsics from standard
library
builtins for NVPTX and AMDGCN. In theory since this is only moving
`cmath`, the
hack could still be needed, but it looks fine in testing and if we run
into
issues we should just move the problematic builtins to this solution.
The test
`sycl-libdevice-cmath.cpp` was testing this hack, so it was removed.
* `cmath` support for NVPTX and AMDGCN in `libdevice` was disabled. To
limit the
scope of the patch `libdevice` is still fully wired up for these
targets, but it
just won't provide the `cmath` functions.
* Added a `cmath-fallback.h` header providing the device-side math
function
overloads. They are defined using SPIR-V builtins, so in theory this
header
could be used as-is for other targets.
* Use our existing `cmath` stl wrapper to include `cmath-fallback.h` for
NVPTX
and AMDGCN. In upstream LLVM `clang-cuda` always includes with
`-include` the
header with these overloads, using the stl wrappers is a bit more
selective.
* Add `rint` to device lib tests and stl wrapper, this was added in
#18857 but wasn't in E2E testing.
## Compile-time performance
A quick check of compile-time shows that this seems to provide a small
performance improvement. Using two samples, one using cmath (the E2E
`cmath_test.cpp`), and a sample not using cmath, over 10 iterations, I'm
getting the following results:
| Run | Mean | Stdev |
|:--:|:--:|:--:|
|With patch, cmath sample | 4.2229s | 0.0294s |
|With patch, no cmath sample | 5.7484s | 0.0525s |
|Without patch, cmath sample | 4.3817s | 0.0424s |
|Without patch, no cmath sample | 5.7941s | 0.0452s |
Which suggest that the no cmath compile time performance is pretty much
equivalent, and the cmath compile-time performance is faster by roughly
~0.12s.
And this is with the whole `libdevice` setup still in place, so it's
possible this approach could be even more beneficial with more work.
## Future work
* Investigate commented out standard math builtins in
`cmath-fallback.h`, these
weren't defined in libdevice, we should either remove the commented out
lines or
implement them properly.
* Untangle `cmath` and `math.h`, the current `cmath-fallback.h`
implements both
which seems to work fine, but ideally we should split it up.
* Deal with `nearbyint`, this was only implemented for NVPTX and AMDGCN
in
`libdevice`, this patch keeps it the same, but we should look into
proper
support and testing for this.
* Move more of `libdevice` into headers (complex, assert, crt, etc ...).
* Try this approach for SPIR-V or other targets.
---------
Co-authored-by: Alexey Bader <[email protected]>
0 commit comments