Commit 4f0afd5

npmiller and bader authored
[SYCL][NVPTX][AMDGCN] Move devicelib cmath to header (#18706)
## Overview

Currently, to support C++ builtins in SYCL kernels, we rely on `libdevice`, which provides implementations for standard library builtins. This library is built either to bitcode or SPIR-V and linked into our kernels.

On some targets this causes issues because clang sometimes turns standard library calls into LLVM intrinsics that not all targets support. Specifically, on NVPTX and AMDGCN we can't easily support these intrinsics because we currently use implementations provided by CUDA and HIP in the form of a bitcode library, which is not something we can use from the LLVM backend.

In upstream LLVM, CUDA and HIP kernels handle this with clang headers that provide device-side overloads of C++ library functions and hook into the target-specific versions of the builtins (for example `std::sin` to `__nv_sin`). This way, on the device side, C++ builtins are hijacked before clang can turn them into intrinsics, which solves the issue mentioned above.

This patch adds the infrastructure to handle C++ builtins in SYCL the same way as is done for CUDA and HIP in upstream LLVM, and uses it to support `cmath` in NVPTX and AMDGCN compilation.

## Breakdown

* Add the `sycl_device_only` attribute: functions marked with it are treated as device-side overloads of existing functions. This is what allows us to overload C++ library functions for the device in SYCL.
* Remove the clang hack that prevented generating LLVM intrinsics from standard library builtins for NVPTX and AMDGCN. In theory, since this patch only moves `cmath`, the hack could still be needed, but it looks fine in testing, and if we run into issues we should just move the problematic builtins to this solution. The test `sycl-libdevice-cmath.cpp` was testing this hack, so it was removed.
* Disable `cmath` support for NVPTX and AMDGCN in `libdevice`. To limit the scope of the patch, `libdevice` is still fully wired up for these targets, but it no longer provides the `cmath` functions.
* Add a `cmath-fallback.h` header providing the device-side math function overloads. They are defined using SPIR-V builtins, so in theory this header could be used as-is for other targets.
* Use our existing `cmath` STL wrapper to include `cmath-fallback.h` for NVPTX and AMDGCN. In upstream LLVM, `clang-cuda` always includes the header with these overloads via `-include`; using the STL wrappers is a bit more selective.
* Add `rint` to the device lib tests and STL wrapper. It was added in #18857 but wasn't covered by E2E testing.

## Compile-time performance

A quick check of compile time suggests a small performance improvement. Using two samples, one using cmath (the E2E `cmath_test.cpp`) and one not using cmath, over 10 iterations, I'm getting the following results:

| Run | Mean | Stdev |
|:--:|:--:|:--:|
| With patch, cmath sample | 4.2229s | 0.0294s |
| With patch, no cmath sample | 5.7484s | 0.0525s |
| Without patch, cmath sample | 4.3817s | 0.0424s |
| Without patch, no cmath sample | 5.7941s | 0.0452s |

This suggests that the no-cmath compile time is pretty much equivalent, and the cmath compile time is faster by roughly 0.16s (4.3817s vs. 4.2229s). This is with the whole `libdevice` setup still in place, so it's possible this approach could be even more beneficial with more work.

## Future work

* Investigate the commented-out standard math builtins in `cmath-fallback.h`. These weren't defined in `libdevice`; we should either remove the commented-out lines or implement them properly.
* Untangle `cmath` and `math.h`. The current `cmath-fallback.h` implements both, which seems to work fine, but ideally we should split it up.
* Deal with `nearbyint`. It was only implemented for NVPTX and AMDGCN in `libdevice`; this patch keeps it the same, but we should look into proper support and testing for it.
* Move more of `libdevice` into headers (complex, assert, crt, etc.).
* Try this approach for SPIR-V or other targets.

---------

Co-authored-by: Alexey Bader <[email protected]>
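The header-based approach described in the overview can be sketched roughly as follows. This is a minimal illustration and not the contents of `cmath-fallback.h`: the `sketch` namespace, the `sine` and `wave` functions, the `SKETCH_DEVICE_ONLY` macro, and the `__has_attribute` guard are all assumptions, added so that the snippet also builds with a compiler that does not know the attribute (in which case only the host definition is compiled). The `__nv_sin` declaration follows the example named in the overview.

```cpp
#include <cassert>
#include <cmath>

// Define the helper macro only when the compiler reports the attribute.
// With any other compiler the macro stays undefined and the device
// overload below is skipped entirely.
#if defined(__has_attribute)
#if __has_attribute(sycl_device_only)
#define SKETCH_DEVICE_ONLY __attribute__((sycl_device_only))
#endif
#endif

namespace sketch {

// Host definition: forwards to the regular standard library.
inline double sine(double x) { return std::sin(x); }

#ifdef SKETCH_DEVICE_ONLY
// Target-specific builtin named in the overview. Assumption: it is supplied
// by the vendor bitcode library when the device code is linked.
extern "C" double __nv_sin(double);

// Device overload with the same signature as the host definition.
SKETCH_DEVICE_ONLY inline double sine(double x) { return __nv_sin(x); }
#endif

// Call sites are written once; which definition they reach depends on
// whether the translation unit is compiled for host or device.
inline double wave(double x) { return 2.0 * sketch::sine(x); }

} // namespace sketch
```

Because the call is resolved to a user-provided overload, the compiler has no standard library builtin call left to rewrite into an intrinsic at that call site, which is the effect the overview describes.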
1 parent ff5d4bd commit 4f0afd5

23 files changed (+1252 -213 lines changed)
clang/include/clang/Basic/Attr.td

Lines changed: 8 additions & 0 deletions
@@ -1607,6 +1607,14 @@ def SYCLDevice : InheritableAttr {
   let Documentation = [SYCLDeviceDocs];
 }
 
+def SYCLDeviceOnly : InheritableAttr {
+  let Spellings = [Clang<"sycl_device_only">];
+  let Subjects = SubjectList<[Function]>;
+  let LangOpts = [SYCLIsDevice, SilentlyIgnoreSYCLIsHost];
+  let Documentation = [SYCLDeviceOnlyDocs];
+}
+def : MutualExclusions<[SYCLDevice, SYCLDeviceOnly]>;
+
 def SYCLGlobalVar : InheritableAttr {
   let Spellings = [GNU<"sycl_global_var">];
   let Subjects = SubjectList<[GlobalStorageNonLocalVar], ErrorDiag>;

clang/include/clang/Basic/AttrDocs.td

Lines changed: 14 additions & 0 deletions
@@ -4518,6 +4518,20 @@ implicitly inherit this attribute.
   }];
 }
 
+def SYCLDeviceOnlyDocs : Documentation {
+  let Category = DocCatFunction;
+  let Heading = "sycl_device_only";
+  let Content = [{
+This attribute can only be applied to functions and indicates that the function
+is only available for the device. It allows functions marked with it to
+overload existing functions without the attribute, in which case the overload
+with the attribute will be used on the device side and the overload without
+will be used on the host side. Note: as opposed to ``sycl_device`` this does
+not mark the function as being exported, both attributes are incompatible and
+can't be used together.
+  }];
+}
+
 def RISCVInterruptDocs : Documentation {
   let Category = DocCatFunction;
   let Heading = "interrupt (RISC-V)";

clang/lib/AST/Decl.cpp

Lines changed: 7 additions & 0 deletions
@@ -3729,6 +3729,13 @@ unsigned FunctionDecl::getBuiltinID(bool ConsiderWrapperFunctions) const {
       !(BuiltinID == Builtin::BIprintf || BuiltinID == Builtin::BImalloc))
     return 0;
 
+  // SYCL doesn't have a device-side standard library. SYCLDeviceOnlyAttr may
+  // be used to provide device-side definitions of standard functions, so
+  // anything with that attribute shouldn't be treated as a builtin.
+  if (Context.getLangOpts().isSYCL() && hasAttr<SYCLDeviceOnlyAttr>()) {
+    return 0;
+  }
+
   // As AMDGCN implementation of OpenMP does not have a device-side standard
   // library, none of the predefined library functions except printf and malloc
   // should be treated as a builtin i.e. 0 should be returned for them.

clang/lib/CodeGen/CGBuiltin.cpp

Lines changed: 2 additions & 5 deletions
@@ -2782,10 +2782,7 @@ RValue CodeGenFunction::EmitBuiltinExpr(const GlobalDecl GD, unsigned BuiltinID,
     GenerateIntrinsics =
         ConstWithoutErrnoOrExceptions && ErrnoOverridenToFalseWithOpt;
   }
-  bool IsSYCLDeviceWithoutIntrinsics =
-      getLangOpts().SYCLIsDevice &&
-      (getTarget().getTriple().isNVPTX() || getTarget().getTriple().isAMDGCN());
-  if (GenerateIntrinsics && !IsSYCLDeviceWithoutIntrinsics) {
+  if (GenerateIntrinsics) {
     switch (BuiltinIDIfNoAsmLabel) {
     case Builtin::BIacos:
     case Builtin::BIacosf:
@@ -3885,7 +3882,7 @@ RValue CodeGenFunction::EmitBuiltinExpr(const GlobalDecl GD, unsigned BuiltinID,
   case Builtin::BI__builtin_modf:
   case Builtin::BI__builtin_modff:
   case Builtin::BI__builtin_modfl:
-    if (Builder.getIsFPConstrained() || IsSYCLDeviceWithoutIntrinsics)
+    if (Builder.getIsFPConstrained())
       break; // TODO: Emit constrained modf intrinsic once one exists.
     return RValue::get(emitModfBuiltin(*this, E, Intrinsic::modf));
   case Builtin::BI__builtin_isgreater:

clang/lib/CodeGen/CodeGenModule.cpp

Lines changed: 34 additions & 0 deletions
@@ -4357,6 +4357,12 @@ void CodeGenModule::EmitGlobal(GlobalDecl GD) {
     }
   }
 
+  // Don't emit 'sycl_device_only' function in SYCL host compilation.
+  if (LangOpts.SYCLIsHost && isa<FunctionDecl>(Global) &&
+      Global->hasAttr<SYCLDeviceOnlyAttr>()) {
+    return;
+  }
+
   if (LangOpts.OpenMP) {
     // If this is OpenMP, check if it is legal to emit this global normally.
     if (OpenMPRuntime && OpenMPRuntime->emitTargetGlobal(GD))
@@ -4446,6 +4452,34 @@ void CodeGenModule::EmitGlobal(GlobalDecl GD) {
     }
   }
 
+  // When using SYCLDeviceOnlyAttr, there can be two functions with the same
+  // mangling, the host function and the device overload. So when compiling for
+  // device we need to make sure we're selecting the SYCLDeviceOnlyAttr
+  // overload and dropping the host overload.
+  if (LangOpts.SYCLIsDevice) {
+    StringRef MangledName = getMangledName(GD);
+    auto DDI = DeferredDecls.find(MangledName);
+    // If we have an existing declaration with the same mangling for this
+    // symbol it may be a SYCLDeviceOnlyAttr case.
+    if (DDI != DeferredDecls.end()) {
+      auto *PreviousGlobal = cast<ValueDecl>(DDI->second.getDecl());
+      // If the host declaration was already processed, replace it with the
+      // device only declaration.
+      if (!PreviousGlobal->hasAttr<SYCLDeviceOnlyAttr>() &&
+          Global->hasAttr<SYCLDeviceOnlyAttr>()) {
+        DeferredDecls[MangledName] = GD;
+        return;
+      }
+
+      // If the device only declaration was already processed, skip the
+      // host declaration.
+      if (PreviousGlobal->hasAttr<SYCLDeviceOnlyAttr>() &&
+          !Global->hasAttr<SYCLDeviceOnlyAttr>()) {
+        return;
+      }
+    }
+  }
+
   // clang::ParseAST ensures that we emit the SYCL devices at the end, so
   // anything that is a device (or indirectly called) will be handled later.
   if (LangOpts.SYCLIsDevice && MustBeEmitted(Global)) {
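The device-side bookkeeping in the second hunk can be modeled outside of clang with a plain map keyed by mangled name. This is a simplified sketch under stated assumptions: `Decl`, `DeferredTable`, and the `id` field are hypothetical stand-ins, and the model always records a declaration that is not filtered out, whereas the real `EmitGlobal` continues into further logic at that point.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for a declaration: only what the check needs.
struct Decl {
  std::string mangledName;
  bool deviceOnly; // models hasAttr<SYCLDeviceOnlyAttr>()
  int id;          // illustrative tag to tell the two definitions apart
};

// Models the deferred declarations seen during device compilation:
// at most one entry per mangled name.
class DeferredTable {
public:
  void add(const Decl &d) {
    auto it = table_.find(d.mangledName);
    if (it != table_.end()) {
      const Decl &prev = it->second;
      // Host declaration seen first: replace it with the device-only one.
      if (!prev.deviceOnly && d.deviceOnly) {
        it->second = d;
        return;
      }
      // Device-only declaration seen first: drop the host one.
      if (prev.deviceOnly && !d.deviceOnly)
        return;
    }
    // Simplification: everything else is recorded directly.
    table_[d.mangledName] = d;
  }

  const Decl &get(const std::string &name) const { return table_.at(name); }

private:
  std::map<std::string, Decl> table_;
};
```

Adding the host and device-only declarations of the same mangled name in either order leaves the device-only one in the table, which mirrors the order independence that the accompanying test checks with `foo` and `fooswap`.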

clang/lib/Sema/SemaDecl.cpp

Lines changed: 20 additions & 0 deletions
@@ -1486,6 +1486,17 @@ void Sema::ActOnExitFunctionContext() {
 static bool AllowOverloadingOfFunction(const LookupResult &Previous,
                                        ASTContext &Context,
                                        const FunctionDecl *New) {
+  // SYCLDeviceOnlyAttr allows device side overloads of SYCL function, but it
+  // is incompatible with SYCLDeviceAttr, so don't allow overloads when both
+  // attributes are present.
+  if (Context.getLangOpts().isSYCL() &&
+      Previous.getResultKind() == LookupResultKind::Found &&
+      ((New->hasAttr<SYCLDeviceOnlyAttr>() &&
+        Previous.getFoundDecl()->hasAttr<SYCLDeviceAttr>()) ||
+       (New->hasAttr<SYCLDeviceAttr>() &&
+        Previous.getFoundDecl()->hasAttr<SYCLDeviceOnlyAttr>())))
+    return false;
+
   if (Context.getLangOpts().CPlusPlus || New->hasAttr<OverloadableAttr>())
     return true;
 
@@ -3702,6 +3713,11 @@ bool Sema::MergeFunctionDecl(FunctionDecl *New, NamedDecl *&OldD, Scope *S,
     return true;
   }
 
+  // Never merge SYCLDeviceOnlyAttr functions in their host variant
+  if (getLangOpts().isSYCL() &&
+      Old->hasAttr<SYCLDeviceOnlyAttr>() != New->hasAttr<SYCLDeviceOnlyAttr>())
+    return false;
+
   diag::kind PrevDiag;
   SourceLocation OldLocation;
   std::tie(PrevDiag, OldLocation) =
@@ -7354,6 +7370,10 @@ static bool isIncompleteDeclExternC(Sema &S, const T *D) {
     if (S.getLangOpts().CUDA && (D->template hasAttr<CUDADeviceAttr>() ||
                                  D->template hasAttr<CUDAHostAttr>()))
       return false;
+
+    // So does SYCL's device_only attribute.
+    if (S.getLangOpts().isSYCL() && D->template hasAttr<SYCLDeviceOnlyAttr>())
+      return false;
   }
   return D->isExternC();
 }
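The attribute pairing rules added to `AllowOverloadingOfFunction` and `MergeFunctionDecl` reduce to two small predicates. The sketch below models them with plain booleans; `FnAttrs` and the function names are hypothetical, and the model assumes SYCL mode, C++ code, and a lookup that found exactly one previous declaration.

```cpp
#include <cassert>

// Hypothetical stand-in for the two attributes a function may carry.
struct FnAttrs {
  bool deviceOnly; // models hasAttr<SYCLDeviceOnlyAttr>()
  bool device;     // models hasAttr<SYCLDeviceAttr>()
};

// True when one declaration is sycl_device_only and the other is sycl_device.
bool mixesDeviceAndDeviceOnly(const FnAttrs &prev, const FnAttrs &next) {
  return (next.deviceOnly && prev.device) || (next.device && prev.deviceOnly);
}

// Models the early exit added to AllowOverloadingOfFunction: the mixed
// case is rejected, everything else falls through to the C++ default.
bool allowsOverloading(const FnAttrs &prev, const FnAttrs &next) {
  if (mixesDeviceAndDeviceOnly(prev, next))
    return false;
  return true; // C++ otherwise permits overloading
}

// Models the check added to MergeFunctionDecl: two declarations are kept
// as separate functions when exactly one of them is device-only.
bool keptSeparate(const FnAttrs &prev, const FnAttrs &next) {
  return prev.deviceOnly != next.deviceOnly;
}
```

For a plain host function followed by a `sycl_device_only` function with the same signature, `allowsOverloading` and `keptSeparate` both return true, which is the case the patch is designed to enable.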

clang/lib/Sema/SemaDeclAttr.cpp

Lines changed: 3 additions & 0 deletions
@@ -7224,6 +7224,9 @@ ProcessDeclAttribute(Sema &S, Scope *scope, Decl *D, const ParsedAttr &AL,
   case ParsedAttr::AT_SYCLDevice:
     S.SYCL().handleSYCLDeviceAttr(D, AL);
     break;
+  case ParsedAttr::AT_SYCLDeviceOnly:
+    handleSimpleAttribute<SYCLDeviceOnlyAttr>(S, D, AL);
+    break;
   case ParsedAttr::AT_SYCLScope:
     S.SYCL().handleSYCLScopeAttr(D, AL);
     break;

clang/lib/Sema/SemaOverload.cpp

Lines changed: 35 additions & 0 deletions
@@ -1629,6 +1629,23 @@ static bool IsOverloadOrOverrideImpl(Sema &SemaRef, FunctionDecl *New,
     }
   }
 
+  // Allow overloads with SYCLDeviceOnlyAttr
+  if (SemaRef.getLangOpts().isSYCL() && (Old->hasAttr<SYCLDeviceOnlyAttr>() !=
+                                         New->hasAttr<SYCLDeviceOnlyAttr>())) {
+    // SYCLDeviceOnlyAttr and SYCLDeviceAttr functions can't overload
+    if (((New->hasAttr<SYCLDeviceOnlyAttr>() &&
+          Old->hasAttr<SYCLDeviceAttr>()) ||
+         (New->hasAttr<SYCLDeviceAttr>() &&
+          Old->hasAttr<SYCLDeviceOnlyAttr>()))) {
+      SemaRef.Diag(New->getLocation(), diag::err_redefinition)
+          << New->getDeclName();
+      SemaRef.notePreviousDefinition(Old, New->getLocation());
+      return false;
+    }
+
+    return true;
+  }
+
   // The signatures match; this is not an overload.
   return false;
 }
@@ -11020,6 +11037,15 @@ bool clang::isBetterOverloadCandidate(
            S.CUDA().IdentifyPreference(Caller, Cand2.Function);
   }
 
+  // In SYCL device compilation mode prefer the overload with the
+  // SYCLDeviceOnly attribute.
+  if (S.getLangOpts().SYCLIsDevice && Cand1.Function && Cand2.Function) {
+    if (Cand1.Function->hasAttr<SYCLDeviceOnlyAttr>() !=
+        Cand2.Function->hasAttr<SYCLDeviceOnlyAttr>()) {
+      return Cand1.Function->hasAttr<SYCLDeviceOnlyAttr>();
+    }
+  }
+
   // General member function overloading is handled above, so this only handles
   // constructors with address spaces.
   // This only handles address spaces since C++ has no other
@@ -11374,6 +11400,15 @@ OverloadingResult OverloadCandidateSet::BestViableFunctionImpl(
   if (S.getLangOpts().CUDA)
     CudaExcludeWrongSideCandidates(S, Candidates);
 
+  // In SYCL host compilation remove candidates marked SYCLDeviceOnly.
+  if (S.getLangOpts().SYCLIsHost) {
+    auto IsDeviceCand = [&](const OverloadCandidate *Cand) {
+      return Cand->Viable && Cand->Function &&
+             Cand->Function->hasAttr<SYCLDeviceOnlyAttr>();
+    };
+    llvm::erase_if(Candidates, IsDeviceCand);
+  }
+
   Best = end();
   for (auto *Cand : Candidates) {
     Cand->Best = false;
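Taken together, the last two hunks give a simple selection rule: on the host, device-only candidates are removed before a best candidate is picked, and on the device, a device-only candidate beats an otherwise equal one. The following is a minimal model of that rule only. `Candidate` and `pickBest` are hypothetical names, and all of clang's other ranking criteria are left out, so candidates are assumed to be equally good apart from the attribute.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for an overload candidate.
struct Candidate {
  int id;          // illustrative tag identifying the function
  bool viable;     // models Cand->Viable
  bool deviceOnly; // models hasAttr<SYCLDeviceOnlyAttr>()
};

// Returns the id of the selected candidate, or -1 if none is viable.
int pickBest(std::vector<Candidate> cands, bool isDevice) {
  if (!isDevice) {
    // Host compilation: drop viable device-only candidates up front.
    cands.erase(std::remove_if(cands.begin(), cands.end(),
                               [](const Candidate &c) {
                                 return c.viable && c.deviceOnly;
                               }),
                cands.end());
  }

  const Candidate *best = nullptr;
  for (const Candidate &c : cands) {
    if (!c.viable)
      continue;
    if (!best) {
      best = &c;
      continue;
    }
    // Device compilation: when exactly one of the two is device-only,
    // the device-only candidate is the better one.
    if (isDevice && c.deviceOnly != best->deviceOnly && c.deviceOnly)
      best = &c;
  }
  return best ? best->id : -1;
}
```

With one host candidate and one device-only candidate of the same signature, the model selects the device-only one when `isDevice` is true and the host one otherwise, independent of the order in which the candidates appear.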
Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
+// RUN: %clang_cc1 -fsycl-is-device -triple spir64-unknown-unknown -disable-llvm-passes -emit-llvm %s -o - | FileCheck %s --check-prefix=CHECKD
+// RUN: %clang_cc1 -fsycl-is-host -triple spir64-unknown-unknown -disable-llvm-passes -emit-llvm %s -o - | FileCheck %s --check-prefix=CHECKH
+// Test code generation for sycl_device_only attribute.
+
+// Verify that the device overload is used on device.
+//
+// CHECK-LABEL: _Z3fooi
+// CHECKH: %add = add nsw i32 %0, 10
+// CHECKD: %add = add nsw i32 %0, 20
+int foo(int a) { return a + 10; }
+__attribute__((sycl_device_only)) int foo(int a) { return a + 20; }
+
+// Use a `sycl_device` function as entry point
+__attribute__((sycl_device)) int bar(int b) { return foo(b); }
+
+// Verify that the order of declaration doesn't change the behavior.
+//
+// CHECK-LABEL: _Z3fooswapi
+// CHECKH: %add = add nsw i32 %0, 10
+// CHECKD: %add = add nsw i32 %0, 20
+__attribute__((sycl_device_only)) int fooswap(int a) { return a + 20; }
+int fooswap(int a) { return a + 10; }
+
+// Use a `sycl_device` function as entry point.
+__attribute__((sycl_device)) int barswap(int b) { return fooswap(b); }
+
+// Verify that in extern C the attribute enables mangling.
+extern "C" {
+// CHECK-LABEL: _Z3fooci
+// CHECKH: %add = add nsw i32 %0, 10
+// CHECKD: %add = add nsw i32 %0, 20
+int fooc(int a) { return a + 10; }
+__attribute__((sycl_device_only)) int fooc(int a) { return a + 20; }
+
+// Use a `sycl_device` function as entry point.
+__attribute__((sycl_device)) int barc(int b) { return fooc(b); }
+}
