down to the NVPTX Back End. All produced bitcode depends on two libraries,
`libdevice.bc` (provided by the CUDA SDK) and `libspirv-nvptx64--nvidiacl.bc`
(built by the libclc project).

##### Device code post-link step

During the "PTX target processing" stage of the device linking step (the
[device code post-link step](#device-code-post-link-step) illustrated below),
the llvm bitcode objects for the CUDA target are linked together alongside
`libspirv-nvptx64--nvidiacl.bc` and `libdevice.bc`, compiled to PTX using the
NVPTX backend, and assembled into a cubin using the `ptxas` tool (part of the
CUDA SDK). The PTX file and cubin are assembled together using `fatbinary` to
produce a CUDA fatbin. The CUDA fatbin is then passed to the offload wrapper
tool.

![NVPTX AOT build](images/DevicePTXProcessing.svg)

##### Checking if the compiler is targeting NVPTX

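As a minimal sketch of such a check, assuming the standard clang macros
(`__SYCL_DEVICE_ONLY__` is defined during the SYCL device compilation pass and
`__NVPTX__` is defined when the device target is NVPTX; the function name is
made up):

```
// Hypothetical helper showing the macro check; not part of any real API.
int device_arch_tag() {
#if defined(__SYCL_DEVICE_ONLY__) && defined(__NVPTX__)
  return 1; // compiling SYCL device code for the NVPTX (CUDA) target
#else
  return 0; // host compilation, or a non-NVPTX device target
#endif
}
```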

On the runtime side, when setting local memory arguments, the CUDA PI
implementation will internally set the argument as the offset with respect to
the accumulated size of used local memory. This approach preserves the
existing PI interface.

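A minimal sketch of that bookkeeping, with hypothetical names rather than the
actual CUDA PI plugin code (alignment handling is omitted for brevity):

```
#include <cstdint>
#include <vector>

// Hypothetical bookkeeping, not the actual PI plugin: a local-memory
// argument carries only a size, and its "value" becomes the current offset
// into one accumulated dynamic shared-memory allocation.
struct LocalMemArgs {
  uint32_t totalSize = 0;          // accumulated dynamic shared memory size
  std::vector<uint32_t> argValues; // offset passed in place of each pointer

  void setLocalArg(uint32_t sizeInBytes) {
    argValues.push_back(totalSize); // argument value = offset so far
    totalSize += sizeInBytes;       // grow the shared allocation
  }
};
```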

##### Global offset support

The CUDA API does not natively support the global offset parameter expected
by SYCL.

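For illustration, this is the kind of SYCL code that relies on a global
offset: a sketch using the offset-taking `parallel_for` overload from SYCL
1.2.1 (function and kernel names here are placeholders):

```
#include <CL/sycl.hpp>

// Placeholder function: launches a kernel over [off, off + n) using the
// offset-taking parallel_for overload (deprecated in SYCL 2020).
void fill_from(cl::sycl::queue &q, cl::sycl::buffer<float, 1> &buf,
               std::size_t n, std::size_t off) {
  q.submit([&](cl::sycl::handler &cgh) {
    auto acc = buf.get_access<cl::sycl::access::mode::write>(cgh);
    cgh.parallel_for<class offset_kernel>(
        cl::sycl::range<1>{n}, cl::sycl::id<1>{off},
        [=](cl::sycl::item<1> item) {
          // item.get_id() already includes the global offset.
          acc[item.get_id()] = 1.0f;
        });
  });
}
```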

In order to emulate this and make the generated kernels compliant, an
intrinsic `llvm.nvvm.implicit.offset` (clang builtin
`__builtin_ptx_implicit_offset`) was introduced to materialize the use of this
implicit parameter for the NVPTX backend. The intrinsic returns a pointer to
`i32` referring to a 3-element array.

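As a rough illustration of what the builtin exposes (a hypothetical wrapper;
user code does not normally call the builtin, and its exact C-level return
type is an assumption here):

```
#include <cstddef>

#if defined(__SYCL_DEVICE_ONLY__) && defined(__NVPTX__)
// Hypothetical wrapper; the compiler normally emits this builtin itself.
std::size_t global_offset_z() {
  auto *offset = __builtin_ptx_implicit_offset(); // pointer to 3 x i32
  return offset[2]; // third component of the global offset
}
#endif
```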

Each non-kernel function that reaches the implicit offset intrinsic in the
call graph is augmented with an extra implicit parameter of type pointer to
`i32`. Kernels calling one of these functions, and therefore using the
intrinsic, are cloned:

- the original kernel initializes an array of 3 `i32` to 0 and passes a
  pointer to this array to each called function with the implicit parameter;
- the cloned kernel's function type is augmented with an implicit parameter
  of type array of 3 `i32` (passed `byval`). A pointer to this array is then
  passed to each called function with the implicit parameter.

The runtime will query both kernels and call the appropriate one based on the
following logic (sketched below):

- If both versions exist, the original kernel is called when the global
  offset is 0; otherwise the cloned kernel is called and the offset is passed
  by value;
- If only one version exists, it is assumed that the kernel makes no use of
  this parameter, and the offset is ignored.

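A hedged sketch of this dispatch logic, using hypothetical `lookup` and
`launch` helpers in place of the CUDA driver calls and an illustrative
`_with_offset` name suffix (see the naming note below):

```
#include <cstdint>
#include <string>
#include <vector>

struct Kernel; // opaque stand-in for a compiled kernel handle

Kernel *lookup(const std::string &name); // assumed: nullptr if absent
void launch(Kernel *kernel, std::vector<void *> args);

void launchWithGlobalOffset(const std::string &name,
                            std::vector<void *> args, uint32_t offset[3]) {
  Kernel *original = lookup(name);
  Kernel *cloned = lookup(name + "_with_offset"); // illustrative suffix

  bool zeroOffset = offset[0] == 0 && offset[1] == 0 && offset[2] == 0;
  if (cloned == nullptr || zeroOffset) {
    // Only one version exists, or the offset is all zeros: the original
    // kernel takes no offset parameter, so the offset is simply dropped.
    launch(original, std::move(args));
  } else {
    // Append the 3 x i32 offset array as an extra by-value argument.
    args.push_back(offset);
    launch(cloned, std::move(args));
  }
}
```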

As an example, the following code:
```
declare i32* @llvm.nvvm.implicit.offset()

define weak_odr dso_local i64 @other_function() {
  %1 = tail call i32* @llvm.nvvm.implicit.offset()
  %2 = getelementptr inbounds i32, i32* %1, i64 2
  %3 = load i32, i32* %2, align 4
  %4 = zext i32 %3 to i64
  ret i64 %4
}

define weak_odr dso_local void @other_function2() {
  ret void
}

define weak_odr dso_local void @example_kernel() {
entry:
  %0 = call i64 @other_function()
  call void @other_function2()
  ret void
}
```

This is transformed into the following in the `sycldevice` environment:
```
define weak_odr dso_local i64 @other_function(i32* %0) {
  %2 = getelementptr inbounds i32, i32* %0, i64 2
  %3 = load i32, i32* %2, align 4
  %4 = zext i32 %3 to i64
  ret i64 %4
}

define weak_odr dso_local void @example_kernel() {
entry:
  %0 = alloca [3 x i32], align 4
  %1 = bitcast [3 x i32]* %0 to i8*
  call void @llvm.memset.p0i8.i64(i8* nonnull align 4 dereferenceable(12) %1, i8 0, i64 12, i1 false)
  %2 = getelementptr inbounds [3 x i32], [3 x i32]* %0, i32 0, i32 0
  %3 = call i64 @other_function(i32* %2)
  call void @other_function2()
  ret void
}

define weak_odr dso_local void @example_kernel_with_offset([3 x i32]* byval([3 x i32]) %0) {
entry:
  %1 = bitcast [3 x i32]* %0 to i32*
  %2 = call i64 @other_function(i32* %1)
  call void @other_function2()
  ret void
}
```

Note: the kernel naming scheme is not fully stable for now.

### Integration with SPIR-V format

This section explains how to generate SPIR-V specific types and operations from