# Integrate Intel GPU to OpenXLA
| Status | Proposed |
| ------ | -------- |
| RFC#   | [99](https://github.com/openxla/community/pull/99) |
| Author(s) | Teng, Lu ([email protected]), Yang, Sheng ([email protected]), Zhoulong, Jiang ([email protected]), Jianhui, Li ([email protected]) |
| Updated | 2023-11-30 |

## Objective
`XLA GPU` is a mechanism to extend a new GPU to OpenXLA as an in-tree build device. This RFC introduces the changes needed to integrate Intel GPU into `XLA GPU`.
More generally, it adds support for a `SPIRV target` based on the `SYCL` runtime.

## Motivation
Intel has released the experimental [Intel® Extension for OpenXLA*](https://github.com/intel/intel-extension-for-openxla), based on the `PJRT C API`, to support running applications on Intel GPU with OpenXLA.
However, an in-tree build is a better way to maximize the capabilities of OpenXLA and improve the user experience,
so Intel would like to upstream the changes inside **Intel® Extension for OpenXLA*** to make Intel GPU an in-tree build device.
In addition, this extends OpenXLA to support new `SPIRV target` devices based on the `SYCL` runtime.

## User benefit
This RFC allows users to run their applications on Intel GPU directly with OpenXLA, without installing any extra extensions or modifying any code.

## Design proposal
### Overview
The following components in OpenXLA will be modified to support Intel GPU:

Here we distinguish these components by two priorities:
* **P1**: Related to basic functionality and covered by this RFC
  - `LLVM IR`: Basic code generator for Intel GPU
  - `Lib Call`: Advanced `oneDNN` library calls that replace `LLVM IR` for core ops, to improve performance on Intel GPU
  - `XLA GPU Runtime` (based on Stream Executor): Basic runtime for Intel GPU
* **P2**: Related to performance or user experience and not covered by this RFC. We will propose new RFCs to track these features
  - `Global Cost Model`
  - `Tool` (including Profiler, Debug API, etc.)

In the future, we will follow the community to enable a more **advanced code generator** than `LLVM IR` for Intel GPU.

### Code integration & binary release
**Note: The integration method below, based on a macro, is an intermediate state;
all code will be merged with other in-tree devices once the software stack is stable.**

We would like to introduce a new macro `INTEL_GPU` (tentative) for code integration:
```c++
#ifndef INTEL_GPU
// Original functions
#else
// Intel GPU functions
#endif
```
It is only enabled with `config=xpu` (tentative) in OpenXLA, to differentiate Intel GPU from other devices.
In this way we separate the binary release for Intel GPU from the original OpenXLA release and minimize the impact on other in-tree devices.

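As an illustration, the `config=xpu` build configuration could be wired up with a `.bazelrc` fragment along the following lines (hypothetical; the actual flag names are tentative and may differ in the final implementation):

```
# Hypothetical .bazelrc fragment enabling the Intel GPU build (tentative).
build:xpu --define=INTEL_GPU=1
build:xpu --copt=-DINTEL_GPU
```

Keeping the device behind a dedicated `--config` flag means the default build and release artifacts are unchanged for other in-tree devices.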
### LLVM IR
Most `LLVM IR` work in the community can be reused directly; only a few modifications are needed for Intel GPU, as below:
* Integrate the [SPIRV translator](https://github.com/KhronosGroup/SPIRV-LLVM-Translator). Intel GPU can't use `LLVM IR` directly, so it must first be converted to `SPIRV IR` by this component
* Add target-specific intrinsics. Here's an example of what the OpenXLA function [`TargetIntrinsicID()`](https://github.com/openxla/xla/blob/fb9e7064dade52134a0858a865f4be97e894bb81/xla/service/gpu/target_util.cc#L52) looks like for Intel GPU:
  ```c++
  // Gets the llvm intrinsic ids on different platforms (NVPTX, AMDGPU)
  // corresponding to the given TargetIntrinsicID.
  struct TargetIntrinsics GetIntrinsic(TargetIntrinsicID intrin) {
    switch (intrin) {
      case TargetIntrinsicID::kThreadIdx: {
        return {
            llvm::Intrinsic::nvvm_read_ptx_sreg_tid_x,
            llvm::Intrinsic::amdgcn_workitem_id_x,
            [](llvm::IRBuilder<>* b_) -> llvm::CallInst* {
              return EmitDeviceFunctionCall("__builtin_IB_get_local_id_x", {}, {},
                                            U32, {b_->getContext()}, b_);
            },
        };
      }
      ...
  ```
* Change the index of the address space. Intel GPU has no extra pass in OpenXLA to handle its address space,
  so it needs to use index `1` in the OpenXLA function [`BuildKernelPrototype()`](https://github.com/openxla/xla/blob/main/xla/service/gpu/fusions/fusion_emitter.cc#L83C1-L116), which differs from other in-tree devices:
  ```c++
  IrEmitterUnnested::KernelAndIrArrays IrEmitterUnnested::BuildKernelPrototype(
    absl::string_view suggested_name,
    absl::Span<const KernelArgument> arguments,
    const LaunchDimensions& launch_dimensions) {
  ...
  // Create the kernel and add it to the module.
  llvm::LLVMContext& context = module_->getContext();
  llvm::FunctionType* kernel_type = llvm::FunctionType::get(
      /*Result=*/llvm::Type::getVoidTy(context),
      // SYCL: Hardcode to the global device address space (index 1).
      std::vector<llvm::Type*>(
          kNumLlvmArgs,
          llvm::Type::getInt8PtrTy(context, /*AddressSpace=*/1)),
      /*isVarArg=*/false);
  ...
  ```
* Turn off advanced LLVM optimization passes to avoid unsupported LLVM features on Intel GPU

**~250 LoC** are estimated for all of the `LLVM IR` changes.

### Lib Call
Some core ops (Conv/MatMul) will be lowered to [`oneDNN`](https://github.com/oneapi-src/oneDNN) library calls instead of `LLVM IR` for better performance, so `oneDNN` will be integrated as a third-party dependency.
Currently the lib call list is hard-coded for specific core ops; it will be combined with the `Global Cost Model` in the future for dynamic dispatching.

### XLA GPU Runtime
Intel GPU is based on the `SYCL` runtime from the [Intel® oneAPI DPC++/C++ Compiler](https://www.intel.com/content/www/us/en/developer/tools/oneapi/dpc-compiler.html),
so the `Intel® oneAPI DPC++/C++ Compiler` will be required as the runtime environment for users to execute applications on Intel GPU.
Based on the current `XLA GPU` runtime implementation, we chose `Stream Executor` as the runtime infrastructure and reimplemented it with the `SYCL` runtime, including `Allocator`, `Event`, `Executor`, `Kernel`, `Platform`, `Stream`, etc.
The initial implementation can be found in [Intel® Extension for OpenXLA*](https://github.com/intel/intel-extension-for-openxla/tree/main/xla/stream_executor/sycl). It will be revised to align with OpenXLA code style before upstreaming.

**~3000 LoC** are estimated for all of the `XLA GPU Runtime` changes.

### Performance Implications
We don't expect a performance impact from this RFC. The functions described by this RFC take effect at the initialization stage.

### Dependencies
* Build dependencies:
  - [oneDNN](https://github.com/oneapi-src/oneDNN)
  - [oneMKL](https://github.com/oneapi-src/oneMKL)
  - [SPIRV-LLVM-Translator](https://github.com/KhronosGroup/SPIRV-LLVM-Translator)
* Execution (runtime) dependency: [Intel® oneAPI DPC++/C++ Compiler](https://www.intel.com/content/www/us/en/developer/tools/oneapi/dpc-compiler.html)

This RFC also relies on some upcoming `XLA GPU` RFCs from the OpenXLA team, so some details will change as those RFCs progress, e.g.:
  - Command Buffer: A new proposal from the OpenXLA community that we haven't implemented in **Intel® Extension for OpenXLA**.
    At this early initialization stage, it should not block the current work based on `LLVM IR` + `Thunk` + `Stream Executor`

### Engineering Impact
The impact on binary size / startup time / build time is minimal, but test time will increase due to the newly added device.

The whole OpenXLA community (Intel is a contributor, as part of the community) will maintain this code. Intel will help to set up CI as below to ensure project quality:
* Enable CI on Intel Dev Cloud with Intel® Data Center GPU MAX Series

### Platforms and Environments
Intel GPU hardware (with the correct driver) and the `Intel® oneAPI DPC++/C++ Compiler` runtime environment are required. Other dependencies are the same as for the original OpenXLA.

### Compatibility
This RFC follows the `XLA GPU` [roadmap](https://docs.google.com/presentation/d/1FPVjZUkTApV80TKJ-WbPvLynjIxb3sdFGwn6Qs9UCrw/edit#slide=id.g224a3cf318c_0_1047) (WW33'2023) to integrate a new GPU into OpenXLA.
We don't expect this proposal to impact other parts of the OpenXLA ecosystem. At the moment it only supports basic OpenXLA functionality; some advanced features, including `Profiler` and `Debug API`, are not yet supported and will be addressed in future RFCs.