down to the NVPTX Back End. All produced bitcode depends on two libraries,
`libdevice.bc` (provided by the CUDA SDK) and `libspirv-nvptx64--nvidiacl.bc`
(built by the libclc project).

##### Device code post-link step

During the "PTX target processing" stage of the device linking step (the
[device code post-link step](#device-code-post-link-step) illustrated below),
the llvm bitcode objects for the CUDA target are linked together alongside
`libspirv-nvptx64--nvidiacl.bc` and `libdevice.bc`, compiled to PTX using the
NVPTX backend, and assembled into a cubin using the `ptxas` tool (part of the
CUDA SDK). The PTX file and cubin are assembled together using `fatbinary` to
produce a CUDA fatbin. The CUDA fatbin is then passed to the offload wrapper
tool.

![NVPTX AOT build](images/DevicePTXProcessing.svg)

##### Checking if the compiler is targeting NVPTX

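As a minimal sketch of such a check, assuming the standard clang macros
(`__SYCL_DEVICE_ONLY__` is defined during the SYCL device compilation pass and
`__NVPTX__` is defined when the device target is NVPTX; the function name is
made up):

```
// Hypothetical helper showing the macro check; not part of any real API.
int device_arch_tag() {
#if defined(__SYCL_DEVICE_ONLY__) && defined(__NVPTX__)
  return 1; // compiling SYCL device code for the NVPTX (CUDA) target
#else
  return 0; // host compilation, or a non-NVPTX device target
#endif
}
```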

On the runtime side, when setting local memory arguments, the CUDA PI
implementation will internally set the argument as the offset with respect to
the accumulated size of used local memory. This approach preserves the
existing PI interface.

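A minimal sketch of that bookkeeping, with hypothetical names rather than the
actual CUDA PI plugin code (alignment handling is omitted for brevity):

```
#include <cstdint>
#include <vector>

// Hypothetical bookkeeping, not the actual PI plugin: a local-memory
// argument carries only a size, and its "value" becomes the current offset
// into one accumulated dynamic shared-memory allocation.
struct LocalMemArgs {
  uint32_t totalSize = 0;          // accumulated dynamic shared memory size
  std::vector<uint32_t> argValues; // offset passed in place of each pointer

  void setLocalArg(uint32_t sizeInBytes) {
    argValues.push_back(totalSize); // argument value = offset so far
    totalSize += sizeInBytes;       // grow the shared allocation
  }
};
```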

##### Global offset support

The CUDA API does not natively support the global offset parameter expected
by SYCL.

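For illustration, this is the kind of SYCL code that relies on a global
offset: a sketch using the offset-taking `parallel_for` overload from SYCL
1.2.1 (function and kernel names here are placeholders):

```
#include <CL/sycl.hpp>

// Placeholder function: launches a kernel over [off, off + n) using the
// offset-taking parallel_for overload (deprecated in SYCL 2020).
void fill_from(cl::sycl::queue &q, cl::sycl::buffer<float, 1> &buf,
               std::size_t n, std::size_t off) {
  q.submit([&](cl::sycl::handler &cgh) {
    auto acc = buf.get_access<cl::sycl::access::mode::write>(cgh);
    cgh.parallel_for<class offset_kernel>(
        cl::sycl::range<1>{n}, cl::sycl::id<1>{off},
        [=](cl::sycl::item<1> item) {
          // item.get_id() already includes the global offset.
          acc[item.get_id()] = 1.0f;
        });
  });
}
```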

In order to emulate this and make the generated kernels compliant, an
intrinsic `llvm.nvvm.implicit.offset` (clang builtin
`__builtin_ptx_implicit_offset`) was introduced to materialize the use of this
implicit parameter for the NVPTX backend. The intrinsic returns a pointer to
`i32` referring to a 3-element array.

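As a rough illustration of what the builtin exposes (a hypothetical wrapper;
user code does not normally call the builtin, and its exact C-level return
type is an assumption here):

```
#include <cstddef>

#if defined(__SYCL_DEVICE_ONLY__) && defined(__NVPTX__)
// Hypothetical wrapper; the compiler normally emits this builtin itself.
std::size_t global_offset_z() {
  auto *offset = __builtin_ptx_implicit_offset(); // pointer to 3 x i32
  return offset[2]; // third component of the global offset
}
#endif
```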

Each non-kernel function that reaches the implicit offset intrinsic in the
call graph is augmented with an extra implicit parameter of type pointer to
`i32`. Kernels calling one of these functions, and therefore using the
intrinsic, are cloned:

- the original kernel initializes an array of 3 `i32` to 0 and passes a
  pointer to this array to each called function with the implicit parameter;
- the cloned kernel's function type is augmented with an implicit parameter
  of type array of 3 `i32` (passed `byval`). A pointer to this array is then
  passed to each called function with the implicit parameter.

The runtime will query both kernels and call the appropriate one based on the
following logic (sketched below):

- If both versions exist, the original kernel is called when the global
  offset is 0; otherwise the cloned kernel is called and the offset is passed
  by value;
- If only one version exists, it is assumed that the kernel makes no use of
  this parameter, and the offset is ignored.

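A hedged sketch of this dispatch logic, using hypothetical `lookup` and
`launch` helpers in place of the CUDA driver calls and an illustrative
`_with_offset` name suffix (see the naming note below):

```
#include <cstdint>
#include <string>
#include <vector>

struct Kernel; // opaque stand-in for a compiled kernel handle

Kernel *lookup(const std::string &name); // assumed: nullptr if absent
void launch(Kernel *kernel, std::vector<void *> args);

void launchWithGlobalOffset(const std::string &name,
                            std::vector<void *> args, uint32_t offset[3]) {
  Kernel *original = lookup(name);
  Kernel *cloned = lookup(name + "_with_offset"); // illustrative suffix

  bool zeroOffset = offset[0] == 0 && offset[1] == 0 && offset[2] == 0;
  if (cloned == nullptr || zeroOffset) {
    // Only one version exists, or the offset is all zeros: the original
    // kernel takes no offset parameter, so the offset is simply dropped.
    launch(original, std::move(args));
  } else {
    // Append the 3 x i32 offset array as an extra by-value argument.
    args.push_back(offset);
    launch(cloned, std::move(args));
  }
}
```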

As an example, the following code:
```
declare i32* @llvm.nvvm.implicit.offset()

define weak_odr dso_local i64 @other_function() {
  %1 = tail call i32* @llvm.nvvm.implicit.offset()
  %2 = getelementptr inbounds i32, i32* %1, i64 2
  %3 = load i32, i32* %2, align 4
  %4 = zext i32 %3 to i64
  ret i64 %4
}

define weak_odr dso_local void @other_function2() {
  ret void
}

define weak_odr dso_local void @example_kernel() {
entry:
  %0 = call i64 @other_function()
  call void @other_function2()
  ret void
}
```

This is transformed into the following in the `sycldevice` environment:
```
define weak_odr dso_local i64 @other_function(i32* %0) {
  %2 = getelementptr inbounds i32, i32* %0, i64 2
  %3 = load i32, i32* %2, align 4
  %4 = zext i32 %3 to i64
  ret i64 %4
}

define weak_odr dso_local void @example_kernel() {
entry:
  %0 = alloca [3 x i32], align 4
  %1 = bitcast [3 x i32]* %0 to i8*
  call void @llvm.memset.p0i8.i64(i8* nonnull align 4 dereferenceable(12) %1, i8 0, i64 12, i1 false)
  %2 = getelementptr inbounds [3 x i32], [3 x i32]* %0, i32 0, i32 0
  %3 = call i64 @other_function(i32* %2)
  call void @other_function2()
  ret void
}

define weak_odr dso_local void @example_kernel_with_offset([3 x i32]* byval([3 x i32]) %0) {
entry:
  %1 = bitcast [3 x i32]* %0 to i32*
  %2 = call i64 @other_function(i32* %1)
  call void @other_function2()
  ret void
}
```

Note: the kernel naming scheme is not fully stable for now.

### Integration with SPIR-V format

This section explains how to generate SPIR-V specific types and operations from