oneapi-src · anjgola · Aug 21, 2020 · Aug 21, 2020
diff --git a/DirectProgramming/DPC++FPGA/Tutorials/Features/fpga_reg/CMakeLists.txt b/DirectProgramming/DPC++FPGA/Tutorials/Features/fpga_reg/CMakeLists.txt
@@ -0,0 +1,11 @@
+set(CMAKE_CXX_COMPILER "dpcpp")
+
+cmake_minimum_required (VERSION 2.8)
+
+project(FPGARegister)
+
+set(CMAKE_ARCHIVE_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR})
+set(CMAKE_LIBRARY_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR})
+set(CMAKE_RUNTIME_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR})
+
+add_subdirectory (src)
diff --git a/DirectProgramming/DPC++FPGA/Tutorials/Features/fpga_reg/License.txt b/DirectProgramming/DPC++FPGA/Tutorials/Features/fpga_reg/License.txt
@@ -0,0 +1,7 @@
+Copyright Intel Corporation
+
+Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
diff --git a/DirectProgramming/DPC++FPGA/Tutorials/Features/fpga_reg/README.md b/DirectProgramming/DPC++FPGA/Tutorials/Features/fpga_reg/README.md
@@ -0,0 +1,188 @@
+# Explicit Pipeline Register Insertion with `fpga_reg`
+
+This FPGA tutorial demonstrates how a power user can apply the DPC++ extension `intel::fpga_reg` to tweak the hardware generated by the compiler.
+
+***Documentation***: The [oneAPI DPC++ FPGA Optimization Guide](https://software.intel.com/content/www/us/en/develop/documentation/oneapi-fpga-optimization-guide) provides comprehensive instructions for targeting FPGAs through DPC++. The [oneAPI Programming Guide](https://software.intel.com/en-us/oneapi-programming-guide) is a general resource for target-independent DPC++ programming.
+
+| Optimized for                     | Description
+| ---                               | ---
+| OS                                | Linux* Ubuntu* 18.04
+| Hardware                          | Intel® Programmable Acceleration Card (PAC) with Intel Arria® 10 GX FPGA; <br> Intel® Programmable Acceleration Card (PAC) with Intel Stratix® 10 SX FPGA
+| Software                          | Intel® oneAPI DPC++ Compiler (Beta) <br> Intel® FPGA Add-On for oneAPI Base Toolkit
+| What you will learn               | How to use the `intel::fpga_reg` extension <br> How `intel::fpga_reg` can be used to re-structure the compiler-generated hardware <br> Situations in which applying `intel::fpga_reg` might be beneficial
+| Time to complete                  | 20 minutes
+
+_Notice: This code sample is not yet supported in Windows*_
+
+## Purpose
+
+This FPGA tutorial demonstrates an example of using the `intel::fpga_reg` extension to:
+
+* Help reduce the fanout of specific signals in the DPC++ design
+* Improve the overall f<sub>MAX</sub> of the generated hardware
+
+Note that this is an advanced tutorial for FPGA power users.
+
+### Simple Code Example
+
+The signature of `intel::fpga_reg` is as follows:
+
+```cpp
+template <typename T>
+T intel::fpga_reg(T input)
+```
+
+To use this function in your code, you must include the following header:
+
+```cpp
+#include <CL/sycl/intel/fpga_extensions.hpp>
+```
+
+When you use this function on any value in your code, the compiler will insert at least one register stage between the input and output of the `intel::fpga_reg` function. For example:
+
+```cpp
+int func (int input) {
+  int output = intel::fpga_reg(input);
+  return output;
+}
+```
+
+This forces the compiler to insert a register between the input and output. You can observe this in the optimization report's System Viewer.
+
+### Understanding the Tutorial Design
+
+The basic function performed by the tutorial kernel is a vector dot product with a pre-adder. The loop is unrolled so that the core part of the algorithm is a feed-forward datapath. The coefficient array is implemented as a circular shift register and rotates by one for each iteration of the outer loop.
+
+The optimization applied in this tutorial impacts the system f<sub>MAX</sub> or the maximum frequency that the design can run at. Since the compiler implements all kernels in a common clock domain, f<sub>MAX</sub> is a global system parameter. To see the impact of the `intel::fpga_reg` optimization in this tutorial, you will need to compile the design twice.
+
+Part 1 compiles the kernel code without setting the `USE_FPGA_REG` macro, whereas Part 2 compiles the kernel with this macro set. The macro chooses between two code segments that are functionally equivalent, but the latter version makes use of `intel::fpga_reg`. In the `USE_FPGA_REG` version of the code, the compiler is guaranteed to insert at least one register stage between the input and output of each call to `intel::fpga_reg`.
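
The two functionally equivalent variants can be sketched on the host as follows. This is a minimal illustration, not the tutorial's kernel: it assumes a four-element coefficient array, an illustrative pre-add of the loop index `j`, and a hypothetical `fpga_reg_stand_in` identity function in place of `intel::fpga_reg` so that it builds with any standard C++ compiler. The names `DotPlain` and `DotRegistered` are invented for this sketch.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for intel::fpga_reg. On the host it is an identity
// function; the real extension also requests a register stage in hardware.
template <typename T>
T fpga_reg_stand_in(T input) {
  return input;
}

constexpr std::size_t kCoeffCount = 4;  // assumed unroll factor
using Coeffs = std::array<int, kCoeffCount>;

// Part 1 style: every inner iteration reads the same copy of `val`.
std::vector<int> DotPlain(const std::vector<int>& in, Coeffs coeff) {
  std::vector<int> out;
  out.reserve(in.size());
  for (int val : in) {
    int acc = 0;
    for (std::size_t j = 0; j < kCoeffCount; ++j) {
      acc += (val + static_cast<int>(j)) * coeff[j];  // assumed pre-adder
    }
    out.push_back(acc);
    // Rotate the coefficients by one position per outer-loop iteration.
    std::rotate(coeff.begin(), coeff.begin() + 1, coeff.end());
  }
  return out;
}

// Part 2 style: `val` and `acc` pass through the stand-in on each iteration.
std::vector<int> DotRegistered(const std::vector<int>& in, Coeffs coeff) {
  std::vector<int> out;
  out.reserve(in.size());
  for (int val : in) {
    int acc = 0;
    for (std::size_t j = 0; j < kCoeffCount; ++j) {
      val = fpga_reg_stand_in(val);
      acc = fpga_reg_stand_in(acc) + (val + static_cast<int>(j)) * coeff[j];
    }
    out.push_back(acc);
    std::rotate(coeff.begin(), coeff.begin() + 1, coeff.end());
  }
  return out;
}
```

Both functions return identical results for the same input. With the real extension, only the structure of the generated hardware is expected to differ.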
+
+#### Part 1: Without `USE_FPGA_REG`
+
+The compiler will generate the following hardware for Part 1. The diagram below has been simplified for illustration.
+
+<img src="no_fpga_reg.png" alt="Part 1" title="Part 1" width="400" />
+
+Note the following:
+
+* The compiler automatically infers a tree structure for the series of adders.
+* There is a large fanout (of up to 4 in this simplified example) from `val` to each of the adders.
+
+The fanout grows linearly with the unroll factor in this tutorial. In FPGA designs, signals with large fanout can sometimes degrade system f<sub>MAX</sub>. This happens because the FPGA placement algorithm cannot place *all* of the fanout logic elements physically close to the fanout source, leading to longer wires. In this situation, it can be helpful to add explicit fanout control in your DPC++ code via `intel::fpga_reg`. This is an advanced optimization for FPGA power users.
+
+#### Part 2: with `USE_FPGA_REG`
+
+In this part, we added two sets of `intel::fpga_reg` calls within the unrolled loop. The first pipelines `val` once per iteration. This reduces the fanout of `val` from 4 in the Part 1 example to just 2. The second is inserted between successive accumulations into the `acc` value. This generates the following structure in hardware.
+
+<img src="fpga_reg.png" alt="Part 2" title="Part 2" width="400" />
+
+In this version, the adder tree has been transformed into a vine-like structure. This increases latency, but it helps us achieve our goal of reducing the fanout and improving f<sub>MAX</sub>.
+Since the outer loop in this tutorial is pipelined and has a high trip count, the increased latency of the inner loop has negligible impact on throughput. The tradeoff pays off, as the f<sub>MAX</sub> improvement yields a higher performing design.
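
The latency argument above can be checked with rough arithmetic. The sketch below assumes the outer loop launches one iteration per clock cycle, so the total cycle count is approximately the trip count plus the pipeline latency. The function names and every number in the usage note are hypothetical, not measurements from this design.

```cpp
#include <cstdint>

// Approximate cycle count of a pipelined loop that starts one iteration per
// clock cycle (an assumption): the trip count plus the pipeline latency.
constexpr std::uint64_t PipelinedCycles(std::uint64_t trip_count,
                                        std::uint64_t latency_cycles) {
  return trip_count + latency_cycles;
}

// Estimated run time in microseconds for a clock frequency given in MHz.
constexpr double RunTimeUs(std::uint64_t cycles, double fmax_mhz) {
  return static_cast<double>(cycles) / fmax_mhz;
}
```

For example, with a trip count of 1,000,000, a latency of 20 cycles gives 1,000,020 cycles and a latency of 84 cycles gives 1,000,084 cycles, a difference of well under 0.01%. At hypothetical clocks of 250 MHz and 300 MHz respectively, the deeper pipeline still finishes sooner (about 3334 µs versus about 4000 µs).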
+
+## Key Concepts
+
+* How to use the `intel::fpga_reg` extension
+* How `intel::fpga_reg` can be used to re-structure the compiler-generated hardware
+* Situations in which applying `intel::fpga_reg` might be beneficial
+
+## License
+
+This code sample is licensed under MIT license.
+
+## Building the `fpga_reg` Design
+
+### Include Files
+
+The included header `dpc_common.hpp` is located at `%ONEAPI_ROOT%\dev-utilities\latest\include` on your development system.
+
+### Running Samples in DevCloud
+
+If running a sample in the Intel DevCloud, remember that you must specify the compute node (fpga_compile or fpga_runtime) as well as whether to run in batch or interactive mode. For more information see the Intel® oneAPI Base Toolkit Get Started Guide ([https://devcloud.intel.com/oneapi/get-started/base-toolkit/](https://devcloud.intel.com/oneapi/get-started/base-toolkit/)).
+
+When compiling for FPGA hardware, it is recommended to increase the job timeout to 12h.
+
+### On a Linux* System
+
+1. Create a `build` directory inside the design directory and generate the `Makefile` there by running `cmake`:
+
+   ```bash
+   mkdir build
+   cd build
+   ```
+
+   If you are compiling for the Intel® PAC with Intel Arria® 10 GX FPGA, run `cmake` using the command:
+
+   ```bash
+   cmake ..
+   ```
+
+   Alternatively, to compile for the Intel® PAC with Intel Stratix® 10 SX FPGA, run `cmake` using the command:
+
+   ```bash
+   cmake .. -DFPGA_BOARD=intel_s10sx_pac:pac_s10
+   ```
+
+2. Compile the design using the generated `Makefile`. The following build targets are provided, matching the recommended development flow:
+
+   * Compile and run for emulation (fast compile time, targets an emulated FPGA device) using:
+
+     ```bash
+     make fpga_emu
+     ```
+
+   * Generate HTML optimization reports using:
+
+     ```bash
+     make report
+     ```
+
+   * Compile and run on FPGA hardware (longer compile time, targets an FPGA device) using:
+
+     ```bash
+     make fpga
+     ```
+
+3. (Optional) As the above hardware compile may take several hours to complete, an Intel® PAC with Intel Arria® 10 GX FPGA pre-compiled binary can be downloaded <a href="https://software.intel.com/content/dam/develop/external/us/en/documents/fpga_reg.fpga.tar.gz" download>here</a>.
+
+
+### In Third-Party Integrated Development Environments (IDEs)
+
+You can compile and run this tutorial in the Eclipse* IDE (in Linux*).
+For instructions, refer to the following link: [Intel® oneAPI DPC++ FPGA Workflows on Third-Party IDEs](https://software.intel.com/en-us/articles/intel-oneapi-dpcpp-fpga-workflow-on-ide)
+
+## Examining the Reports
+
+Locate the pair of `report.html` files in either:
+
+* **Report-only compile**:  `fpga_reg_report.prj` and `fpga_reg_registered_report.prj`
+* **FPGA hardware compile**: `fpga_reg.prj` and `fpga_reg_registered.prj`
+
+Open the reports in any of Chrome*, Firefox*, Edge*, or Internet Explorer*. Observe the structure of the design in the optimization report's System Viewer and notice the changes within `Cluster 2` of the `SimpleMath.B1` block. In the report for Part 1, the viewer shows a much shallower graph than the one for Part 2. This is because the operations are performed much closer to one another in Part 1 than in Part 2. By transforming the code in Part 2 to use more register stages, the compiler was able to achieve a higher f<sub>MAX</sub>.
+
+>**NOTE**: Only the report generated after the FPGA hardware compile will reflect the performance benefit of using the `fpga_reg` extension. The difference is *not* apparent in the reports generated by `make report` because a design's f<sub>MAX</sub> cannot be predicted. The final achieved f<sub>MAX</sub> can be found in `fpga_reg.prj/reports/report.html` and `fpga_reg_registered.prj/reports/report.html` (after `make fpga` completes).
+
+## Running the Sample
+
+1. Run the sample on the FPGA emulator (the kernel executes on the CPU):
+
+   ```bash
+   ./fpga_reg.fpga_emu    # Linux
+   ```
+
+2. Run the sample on the FPGA device:
+
+   ```bash
+   ./fpga_reg.fpga             # Linux
+   ./fpga_reg_registered.fpga  # Linux
+   ```
+
+### Example of Output
+
+```txt
+Throughput for kernel with input size 1000000 and coefficient array size 64: 2.819272 GFlops
+PASSED: Results are correct.
+```
+
+### Discussion of Results
+
+You should observe an improvement in throughput going from Part 1 to Part 2. You will also note that the f<sub>MAX</sub> of Part 2 is significantly higher than that of Part 1.
diff --git a/DirectProgramming/DPC++FPGA/Tutorials/Features/fpga_reg/fpga_reg.png b/DirectProgramming/DPC++FPGA/Tutorials/Features/fpga_reg/fpga_reg.png
diff --git a/DirectProgramming/DPC++FPGA/Tutorials/Features/fpga_reg/no_fpga_reg.png b/DirectProgramming/DPC++FPGA/Tutorials/Features/fpga_reg/no_fpga_reg.png
diff --git a/DirectProgramming/DPC++FPGA/Tutorials/Features/fpga_reg/sample.json b/DirectProgramming/DPC++FPGA/Tutorials/Features/fpga_reg/sample.json
@@ -0,0 +1,34 @@
+{
+  "guid": "D661A5C2-5FE0-40F2-BFE7-70E3BA60F088",
+  "name": "Explicit Pipeline Register Insertion with fpga_reg",
+  "categories": ["Toolkit/Intel® oneAPI Base Toolkit/FPGA/Tutorials"],
+  "description": "FPGA advanced tutorial demonstrating how to apply the DPC++ extension intel::fpga_reg",
+  "toolchain": ["dpcpp"],
+  "os": ["linux"],
+  "targetDevice": ["FPGA"],
+  "builder": ["cmake"],
+  "languages": [{"cpp":{}}],
+  "ciTests": {
+    "linux": [
+      {
+        "id": "fpga_emu",
+        "steps": [
+          "mkdir build",
+          "cd build",
+          "cmake ..",
+          "make fpga_emu",
+          "./fpga_reg.fpga_emu"
+        ]
+      },
+      {
+        "id": "report",
+        "steps": [
+          "mkdir build",
+          "cd build",
+          "cmake ..",
+          "make report"
+        ]
+      }
+    ]
+  }
+}
diff --git a/DirectProgramming/DPC++FPGA/Tutorials/Features/fpga_reg/src/CMakeLists.txt b/DirectProgramming/DPC++FPGA/Tutorials/Features/fpga_reg/src/CMakeLists.txt
@@ -0,0 +1,111 @@
+set(SOURCE_FILE fpga_reg.cpp)
+set(TARGET_NAME fpga_reg)
+set(TARGET_NAME_REG fpga_reg_registered)
+set(EMULATOR_TARGET ${TARGET_NAME}.fpga_emu)
+set(FPGA_TARGET ${TARGET_NAME}.fpga)
+set(FPGA_TARGET_REG ${TARGET_NAME_REG}.fpga)
+
+# Intel supported FPGA Boards and their names
+set(A10_PAC_BOARD_NAME "intel_a10gx_pac:pac_a10")
+set(S10_PAC_BOARD_NAME "intel_s10sx_pac:pac_s10")
+
+# Assume target is the Intel(R) PAC with Intel Arria(R) 10 GX FPGA 
+SET(_FPGA_BOARD ${A10_PAC_BOARD_NAME})
+
+# Check if target is the Intel(R) PAC with Intel Stratix(R) 10 SX FPGA
+IF (NOT DEFINED FPGA_BOARD)
+    MESSAGE(STATUS "\tFPGA_BOARD was not specified. Configuring the design to run on the Intel(R) Programmable Acceleration Card (PAC) with Intel Arria(R) 10 GX FPGA. Please refer to the README for more information on how to run the design on the Intel(R) PAC with Intel Stratix(R) 10 SX FPGA.")
+
+ELSEIF(FPGA_BOARD STREQUAL ${A10_PAC_BOARD_NAME})
+    MESSAGE(STATUS "\tConfiguring the design to run on the Intel(R) Programmable Acceleration Card (PAC) with Intel Arria(R) 10 GX FPGA.")
+
+ELSEIF(FPGA_BOARD STREQUAL ${S10_PAC_BOARD_NAME})
+    MESSAGE(STATUS "\tConfiguring the design to run on the Intel(R) Programmable Acceleration Card (PAC) with Intel Stratix(R) 10 SX FPGA.")
+    SET(_FPGA_BOARD ${S10_PAC_BOARD_NAME})
+
+ELSE()
+    MESSAGE(STATUS "\tAn invalid board name was passed in using the FPGA_BOARD flag. Configuring the design to run on the Intel(R) Programmable Acceleration Card (PAC) with Intel Arria(R) 10 GX FPGA. Please refer to the README for the list of valid board names.")
+ENDIF()
+
+set(HARDWARE_COMPILE_FLAGS "-fintelfpga")
+
+# use cmake -D USER_HARDWARE_FLAGS=<flags> to set extra flags for FPGA backend compilation
+set(HARDWARE_LINK_FLAGS "-fintelfpga -Xshardware -Xsboard=${_FPGA_BOARD} ${USER_HARDWARE_FLAGS}")
+
+set(EMULATOR_COMPILE_FLAGS "-fintelfpga -DFPGA_EMULATOR")
+set(EMULATOR_LINK_FLAGS "-fintelfpga")
+
+# fpga emulator
+if(WIN32)
+    set(WIN_EMULATOR_TARGET ${EMULATOR_TARGET}.exe)
+    add_custom_target(fpga_emu DEPENDS ${WIN_EMULATOR_TARGET})
+    separate_arguments(WIN_EMULATOR_COMPILE_FLAGS WINDOWS_COMMAND "${EMULATOR_COMPILE_FLAGS}")
+    add_custom_command(OUTPUT ${WIN_EMULATOR_TARGET} 
+             COMMAND ${CMAKE_CXX_COMPILER} ${WIN_EMULATOR_COMPILE_FLAGS} /GX ${CMAKE_CURRENT_SOURCE_DIR}/${SOURCE_FILE} -o ${CMAKE_BINARY_DIR}/${WIN_EMULATOR_TARGET}
+             DEPENDS ${SOURCE_FILE})
+
+else()
+    add_executable(${EMULATOR_TARGET} ${SOURCE_FILE})
+    add_custom_target(fpga_emu DEPENDS ${EMULATOR_TARGET})
+    set_target_properties(${EMULATOR_TARGET} PROPERTIES COMPILE_FLAGS ${EMULATOR_COMPILE_FLAGS})
+    set_target_properties(${EMULATOR_TARGET} PROPERTIES LINK_FLAGS ${EMULATOR_LINK_FLAGS})
+endif()
+
+# fpga
+if(WIN32)
+    add_custom_target(fpga
+            COMMAND echo "FPGA hardware flow is not supported in Windows")
+else()
+    add_executable(${FPGA_TARGET} EXCLUDE_FROM_ALL ${SOURCE_FILE})
+    add_executable(${FPGA_TARGET_REG} EXCLUDE_FROM_ALL ${SOURCE_FILE})
+    add_custom_target(fpga DEPENDS ${FPGA_TARGET} ${FPGA_TARGET_REG})
+
+    set_target_properties(${FPGA_TARGET} PROPERTIES COMPILE_FLAGS ${HARDWARE_COMPILE_FLAGS})
+    set_target_properties(${FPGA_TARGET} PROPERTIES LINK_FLAGS ${HARDWARE_LINK_FLAGS})
+
+    set_target_properties(${FPGA_TARGET_REG} PROPERTIES COMPILE_FLAGS "${HARDWARE_COMPILE_FLAGS} -DUSE_FPGA_REG")
+    set_target_properties(${FPGA_TARGET_REG} PROPERTIES LINK_FLAGS ${HARDWARE_LINK_FLAGS})
+endif()
+
+# report
+if(WIN32)
+    set(REPORT ${TARGET_NAME}_report.a)
+    set(REPORT_REG ${TARGET_NAME_REG}_report.a)
+
+    add_custom_target(report DEPENDS ${REPORT} ${REPORT_REG})
+
+    separate_arguments(HARDWARE_LINK_FLAGS_LIST WINDOWS_COMMAND "${HARDWARE_LINK_FLAGS}")
+
+    configure_file(${CMAKE_CURRENT_SOURCE_DIR}/${SOURCE_FILE} ${CMAKE_BINARY_DIR}/${TARGET_NAME}/${SOURCE_FILE} COPYONLY)
+    configure_file(${CMAKE_CURRENT_SOURCE_DIR}/${SOURCE_FILE} ${CMAKE_BINARY_DIR}/${TARGET_NAME_REG}/${SOURCE_FILE} COPYONLY)
+
+    add_custom_command(OUTPUT ${REPORT}
+        COMMAND ${CMAKE_CXX_COMPILER} /EHsc ${CMAKE_CXX_FLAGS} ${HARDWARE_LINK_FLAGS_LIST} -fsycl-link ${CMAKE_BINARY_DIR}/${TARGET_NAME}/${SOURCE_FILE} -o ${CMAKE_BINARY_DIR}/${REPORT}
+                 DEPENDS ${SOURCE_FILE})
+
+    add_custom_command(OUTPUT ${REPORT_REG}
+        COMMAND ${CMAKE_CXX_COMPILER} /EHsc ${CMAKE_CXX_FLAGS} ${HARDWARE_LINK_FLAGS_LIST} -DUSE_FPGA_REG -fsycl-link ${CMAKE_BINARY_DIR}/${TARGET_NAME_REG}/${SOURCE_FILE} -o ${CMAKE_BINARY_DIR}/${REPORT_REG}
+                 DEPENDS ${SOURCE_FILE})
+
+else()
+    set(REPORT ${TARGET_NAME}_report.a)
+    set(REPORT_REG ${TARGET_NAME_REG}_report.a)
+
+    add_custom_target(report DEPENDS ${REPORT} ${REPORT_REG})
+
+    configure_file(${CMAKE_CURRENT_SOURCE_DIR}/${SOURCE_FILE} ${SOURCE_FILE} COPYONLY)
+
+    separate_arguments(HARDWARE_LINK_FLAGS_LIST UNIX_COMMAND "${HARDWARE_LINK_FLAGS}")
+    add_custom_command(OUTPUT ${REPORT}
+                 COMMAND ${CMAKE_CXX_COMPILER} ${CMAKE_CXX_FLAGS} ${HARDWARE_LINK_FLAGS_LIST} -fsycl-link ${SOURCE_FILE} -o ${CMAKE_BINARY_DIR}/${REPORT}
+                 DEPENDS ${SOURCE_FILE})
+
+    add_custom_command(OUTPUT ${REPORT_REG}
+                 COMMAND ${CMAKE_CXX_COMPILER} ${CMAKE_CXX_FLAGS} ${HARDWARE_LINK_FLAGS_LIST} -DUSE_FPGA_REG -fsycl-link ${SOURCE_FILE} -o ${CMAKE_BINARY_DIR}/${REPORT_REG}
+                 DEPENDS ${SOURCE_FILE})
+endif()
+
+# run
+add_custom_target(run
+            COMMAND ../${TARGET_NAME}.fpga_emu
+            DEPENDS ${TARGET_NAME}.fpga_emu)