# Integrate Intel GPU to OpenXLA
| Status | Proposed |
| ------ | -------- |
| RFC#   | [99](https://github.com/openxla/community/pull/99) |
| Author(s) | Teng, Lu ([email protected]), Yang, Sheng ([email protected]), Zhoulong, Jiang ([email protected]), Jianhui, Li ([email protected]) |
| Updated | 2023-11-30 |

## Objective
`XLA GPU` is a mechanism to extend a new GPU to OpenXLA as an in-tree build device. This RFC introduces the changes needed to integrate Intel GPU into `XLA GPU`.
More generally, it adds support for a `SPIRV target` based on the `SYCL` runtime.

## Motivation
Intel has released the experimental [Intel® Extension for OpenXLA*](https://github.com/intel/intel-extension-for-openxla), based on the `PJRT C API`, to support running applications on Intel GPU with OpenXLA.
However, an in-tree build is a better way to maximize the capabilities of OpenXLA and improve the user experience,
so Intel would like to upstream the changes inside **Intel® Extension for OpenXLA*** to make Intel GPU an in-tree build device.
In addition, this extends OpenXLA to support new `SPIRV target` devices based on the `SYCL` runtime.

## User benefit
This RFC allows users to run their applications on Intel GPU directly with OpenXLA, without installing any extra extensions or modifying any code.

## Design proposal
### Overview
The following components in OpenXLA will be modified to support Intel GPU:

Here we distinguish these components by two priorities:
* **P1**: Related to basic functionality and covered by this RFC
  - `LLVM IR`: Basic code generator for Intel GPU
  - `Lib Call`: Advanced `oneDNN` library calls that replace `LLVM IR` for core ops, to improve performance on Intel GPU
  - `XLA GPU Runtime` (based on Stream Executor): Basic runtime for Intel GPU
* **P2**: Related to performance or user experience and not covered by this RFC. We will propose new RFCs to track these features
  - `Global Cost Model`
  - `Tool` (including Profiler, Debug API, etc.)

In the future, we will follow the community to enable a more **advanced code generator** than `LLVM IR` for Intel GPU.

### Code integration & binary release
**Note: The integration method below, based on a macro, is an intermediate state;
all code will be merged with other in-tree devices once the software stack is stable.**

We would like to introduce a new macro `INTEL_GPU` (tentative) for code integration:
```c++
#ifndef INTEL_GPU
// Original functions
#else
// Intel GPU functions
#endif
```
It is only enabled with `config=xpu` (tentative) in OpenXLA, to differentiate Intel GPU from other devices.
In this way we separate the binary release for Intel GPU from the original OpenXLA release and minimize the impact on other in-tree devices.

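As an illustration, the `config=xpu` build configuration could be wired up with a `.bazelrc` fragment along the following lines (hypothetical; the actual flag names are tentative and may differ in the final implementation):

```
# Hypothetical .bazelrc fragment enabling the Intel GPU build (tentative).
build:xpu --define=INTEL_GPU=1
build:xpu --copt=-DINTEL_GPU
```

Keeping the device behind a dedicated `--config` flag means the default build and release artifacts are unchanged for other in-tree devices.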
### LLVM IR
Most `LLVM IR` work in the community can be reused directly; only a few modifications are needed for Intel GPU, as below:
* Integrate the [SPIRV translator](https://github.com/KhronosGroup/SPIRV-LLVM-Translator). Intel GPU can't use `LLVM IR` directly, so it must first be converted to `SPIRV IR` by this component
* Add target-specific intrinsics. Here's an example of what the OpenXLA function [`TargetIntrinsicID()`](https://github.com/openxla/xla/blob/fb9e7064dade52134a0858a865f4be97e894bb81/xla/service/gpu/target_util.cc#L52) looks like for Intel GPU:
  ```c++
  // Gets the llvm intrinsic ids on different platforms (NVPTX, AMDGPU)
  // corresponding to the given TargetIntrinsicID.
  struct TargetIntrinsics GetIntrinsic(TargetIntrinsicID intrin) {
    switch (intrin) {
      case TargetIntrinsicID::kThreadIdx: {
        return {
            llvm::Intrinsic::nvvm_read_ptx_sreg_tid_x,
            llvm::Intrinsic::amdgcn_workitem_id_x,
            [](llvm::IRBuilder<>* b_) -> llvm::CallInst* {
              return EmitDeviceFunctionCall("__builtin_IB_get_local_id_x", {}, {},
                                            U32, {b_->getContext()}, b_);
            },
        };
      }
      ...
  ```
* Change the index of the address space. Intel GPU has no extra pass in OpenXLA to handle its address space,
  so it needs to use index `1` in the OpenXLA function [`BuildKernelPrototype()`](https://github.com/openxla/xla/blob/main/xla/service/gpu/fusions/fusion_emitter.cc#L83C1-L116), which differs from other in-tree devices:
  ```c++
  IrEmitterUnnested::KernelAndIrArrays IrEmitterUnnested::BuildKernelPrototype(
    absl::string_view suggested_name,
    absl::Span<const KernelArgument> arguments,
    const LaunchDimensions& launch_dimensions) {
  ...
  // Create the kernel and add it to the module.
  llvm::LLVMContext& context = module_->getContext();
  llvm::FunctionType* kernel_type = llvm::FunctionType::get(
      /*Result=*/llvm::Type::getVoidTy(context),
      // SYCL: Hardcode to the global device address space (index 1).
      std::vector<llvm::Type*>(
          kNumLlvmArgs,
          llvm::Type::getInt8PtrTy(context, /*AddressSpace=*/1)),
      /*isVarArg=*/false);
  ...
  ```
* Turn off advanced LLVM optimization passes to avoid unsupported LLVM features on Intel GPU

**~250 LoC** are estimated for all of the `LLVM IR` changes.

### Lib Call
Some core ops (Conv/MatMul) will be lowered to [`oneDNN`](https://github.com/oneapi-src/oneDNN) library calls instead of `LLVM IR` for better performance, so `oneDNN` will be integrated as a third-party dependency.
Currently the lib call list is hard-coded for specific core ops; it will be combined with the `Global Cost Model` in the future for dynamic dispatching.

### XLA GPU Runtime
Intel GPU is based on the `SYCL` runtime from the [Intel® oneAPI DPC++/C++ Compiler](https://www.intel.com/content/www/us/en/developer/tools/oneapi/dpc-compiler.html),
so the `Intel® oneAPI DPC++/C++ Compiler` will be required as the runtime environment for users to execute applications on Intel GPU.
Based on the current `XLA GPU` runtime implementation, we chose `Stream Executor` as the runtime infrastructure and reimplemented it with the `SYCL` runtime, including `Allocator`, `Event`, `Executor`, `Kernel`, `Platform`, `Stream`, etc.
The initial implementation can be found in [Intel® Extension for OpenXLA*](https://github.com/intel/intel-extension-for-openxla/tree/main/xla/stream_executor/sycl). It will be revised to align with OpenXLA code style before upstreaming.

**~3000 LoC** are estimated for all of the `XLA GPU Runtime` changes.

### Performance Implications
We don't expect a performance impact from this RFC. The functions described by this RFC take effect at the initialization stage.

### Dependencies
* Build dependencies:
  - [oneDNN](https://github.com/oneapi-src/oneDNN)
  - [oneMKL](https://github.com/oneapi-src/oneMKL)
  - [SPIRV-LLVM-Translator](https://github.com/KhronosGroup/SPIRV-LLVM-Translator)
* Execution (runtime) dependency: [Intel® oneAPI DPC++/C++ Compiler](https://www.intel.com/content/www/us/en/developer/tools/oneapi/dpc-compiler.html)

This RFC also relies on some upcoming `XLA GPU` RFCs from the OpenXLA team, so some details will change as those RFCs progress, e.g.:
  - Command Buffer: A new proposal from the OpenXLA community that we haven't implemented in **Intel® Extension for OpenXLA**.
    At this early initialization stage, it should not block the current work based on `LLVM IR` + `Thunk` + `Stream Executor`

### Engineering Impact
The impact on binary size / startup time / build time is minimal, but test time will increase due to the newly added device.

The whole OpenXLA community (Intel is a contributor, as part of the community) will maintain this code. Intel will help to set up CI as below to ensure project quality:
* Enable CI on Intel Dev Cloud with Intel® Data Center GPU MAX Series

### Platforms and Environments
Intel GPU hardware (with the correct driver) and the `Intel® oneAPI DPC++/C++ Compiler` runtime environment are required. Other dependencies are the same as for the original OpenXLA.

### Compatibility
This RFC follows the `XLA GPU` [roadmap](https://docs.google.com/presentation/d/1FPVjZUkTApV80TKJ-WbPvLynjIxb3sdFGwn6Qs9UCrw/edit#slide=id.g224a3cf318c_0_1047) (WW33'2023) to integrate a new GPU into OpenXLA.
We don't expect this proposal to impact other parts of the OpenXLA ecosystem. At the moment it only supports basic OpenXLA functionality; some advanced features, including `Profiler` and `Debug API`, are not yet supported and will be addressed in future RFCs.