down to the NVPTX Back End. All produced bitcode depends on two libraries,
`libdevice.bc` (provided by the CUDA SDK) and `libspirv-nvptx64--nvidiacl.bc`
(built by the libclc project).

##### Device code post-link step

During the "PTX target processing" in the device linking step [Device
code post-link step](#device-code-post-link-step), the llvm bitcode
objects for the CUDA target are linked together alongside
`libspirv-nvptx64--nvidiacl.bc` and `libdevice.bc`, compiled to PTX
using the NVPTX backend and assembled into a cubin using the `ptxas`
tool (part of the CUDA SDK). The PTX file and cubin are assembled
together using `fatbinary` to produce a CUDA fatbin. The CUDA fatbin
is then passed to the offload wrapper tool.

![NVPTX AOT build](images/DevicePTXProcessing.svg)

##### Checking if the compiler is targeting NVPTX
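
A sketch of such a check in device code, assuming the usual
clang-predefined macros for SYCL device compilation and the NVPTX
target:

```
// Sketch only: __SYCL_DEVICE_ONLY__ and __NVPTX__ are assumed to be
// the relevant predefined macros in this compilation mode.
#if defined(__SYCL_DEVICE_ONLY__) && defined(__NVPTX__)
// NVPTX-specific device code path
#else
// Generic path
#endif
```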

##### Local memory support

Arguments pointing to local memory are transformed into `i32` offsets
into a single dynamically allocated shared memory region, so the
generated kernel takes the form:

```
define void @SYCL_generated_kernel(i32 %local_ptr_offset, i32 %arg, i32 %local_ptr_offset2)
```

On the runtime side, when setting local memory arguments, the CUDA PI
implementation will internally set the argument as the offset with respect to
the accumulated size of used local memory. This approach preserves the existing
PI interface.
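
On the plugin side, the accumulation might look like the following
sketch (hypothetical names; the actual CUDA PI plugin code differs):

```
// Sketch of local-argument handling in a CUDA PI plugin. All names
// here are illustrative assumptions, not the actual PI implementation.
#include <cstddef>
#include <cstdint>

class LocalArgTracker {
  size_t LocalMemUsed = 0; // running total of dynamic shared memory

public:
  // Instead of a device pointer, the kernel argument is set to the
  // current accumulated size of local memory, aligned for the type.
  uint32_t setLocalArg(size_t Size, size_t Alignment) {
    LocalMemUsed = (LocalMemUsed + Alignment - 1) & ~(Alignment - 1);
    uint32_t Offset = static_cast<uint32_t>(LocalMemUsed);
    LocalMemUsed += Size;
    return Offset; // this value is passed as the kernel argument
  }

  // Total dynamic shared memory to request when launching the kernel.
  size_t sharedMemBytes() const { return LocalMemUsed; }
};
```

At launch, the plugin would request `sharedMemBytes()` of dynamic
shared memory, and each local memory argument receives its offset
value instead of a pointer.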

##### Global offset support

The CUDA API does not natively support the global offset parameter
expected by SYCL.

To emulate this parameter and make the generated kernels compliant, an
intrinsic `llvm.nvvm.implicit.offset` (exposed as the clang builtin
`__builtin_ptx_implicit_offset`) was introduced to materialize the use
of this implicit parameter for the NVPTX backend. The intrinsic
returns a pointer to `i32` referring to a 3-element array.
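
For illustration, a device-side helper built on this builtin might
look as follows (a sketch; the builtin's exact declaration in clang is
assumed here):

```
// Sketch only: assumes the builtin returns a pointer to the 3 x i32
// offset array described above.
extern "C" unsigned int global_offset_x() {
  return __builtin_ptx_implicit_offset()[0]; // x component
}
```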

Each non-kernel function that reaches the implicit offset intrinsic in
the call graph is augmented with an extra implicit parameter of type
pointer to `i32`. Each kernel calling one of these functions, or using
the intrinsic directly, is cloned:

- the original kernel initializes an array of 3 `i32` to 0 and passes
  a pointer to this array to each function taking the implicit
  parameter;
- the cloned kernel's signature is augmented with an implicit
  parameter of type array of 3 `i32` (passed `byval`). A pointer to
  this array is then passed to each function taking the implicit
  parameter.

The runtime will query both kernels and call the appropriate one based
on the following logic (see the sketch after the example below):

- if both versions exist, the original kernel is called when the
  global offset is all zeros; otherwise the cloned kernel is called
  and the offset is passed by value;
- if only one version exists, the kernel is assumed to make no use of
  this parameter, and the offset is ignored.

As an example, the following code:
```
declare i32* @llvm.nvvm.implicit.offset()

define weak_odr dso_local i64 @other_function() {
%1 = tail call i32* @llvm.nvvm.implicit.offset()
%2 = getelementptr inbounds i32, i32* %1, i64 2
%3 = load i32, i32* %2, align 4
%4 = zext i32 %3 to i64
ret i64 %4
}

define weak_odr dso_local void @other_function2() {
  ret void
}

define weak_odr dso_local void @example_kernel() {
entry:
%0 = call i64 @other_function()
call void @other_function2()
ret void
}
```

is transformed into the following in the `sycldevice` environment:
```
define weak_odr dso_local i64 @other_function(i32* %0) {
%2 = getelementptr inbounds i32, i32* %0, i64 2
%3 = load i32, i32* %2, align 4
%4 = zext i32 %3 to i64
ret i64 %4
}

define weak_odr dso_local void @example_kernel() {
entry:
%0 = alloca [3 x i32], align 4
%1 = bitcast [3 x i32]* %0 to i8*
call void @llvm.memset.p0i8.i64(i8* nonnull align 4 dereferenceable(12) %1, i8 0, i64 12, i1 false)
%2 = getelementptr inbounds [3 x i32], [3 x i32]* %0, i32 0, i32 0
%3 = call i64 @other_function(i32* %2)
call void @other_function2()
ret void
}

define weak_odr dso_local void @example_kernel_with_offset([3 x i32]* byval([3 x i32]) %0) {
entry:
%1 = bitcast [3 x i32]* %0 to i32*
%2 = call i64 @other_function(i32* %1)
call void @other_function2()
ret void
}
```

Note: kernel naming is not yet fully stable.
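
On the runtime side, the selection logic described above might look
like the following sketch (hypothetical names and signatures; the
actual PI implementation differs):

```
// Sketch of the runtime dispatch between the two kernel versions.
// All names here are illustrative assumptions.
#include <array>
#include <cstdint>

struct Kernel {}; // opaque handle to a compiled kernel

// Assumed launch primitives (illustrative stubs).
void launch(Kernel *K) { (void)K; }
void launchWithOffset(Kernel *K, const std::array<uint32_t, 3> &Offset) {
  (void)K;
  (void)Offset;
}

// Original is always present; WithOffset is null when the module only
// contains one version of the kernel.
void launchSyclKernel(Kernel *Original, Kernel *WithOffset,
                      const std::array<uint32_t, 3> &GlobalOffset) {
  bool ZeroOffset = GlobalOffset[0] == 0 && GlobalOffset[1] == 0 &&
                    GlobalOffset[2] == 0;
  if (WithOffset && !ZeroOffset) {
    // Both versions exist and the offset is non-zero: call the cloned
    // kernel and pass the offset by value.
    launchWithOffset(WithOffset, GlobalOffset);
  } else {
    // Offset is zero, or the kernel never uses it: call the original.
    launch(Original);
  }
}
```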

### Integration with SPIR-V format

This section explains how to generate SPIR-V specific types and operations from