diff --git a/DirectProgramming/DPC++FPGA/Tutorials/Features/fpga_reg/CMakeLists.txt b/DirectProgramming/DPC++FPGA/Tutorials/Features/fpga_reg/CMakeLists.txt new file mode 100755 index 0000000000..325cc3fa42 --- /dev/null +++ b/DirectProgramming/DPC++FPGA/Tutorials/Features/fpga_reg/CMakeLists.txt @@ -0,0 +1,11 @@ +set(CMAKE_CXX_COMPILER "dpcpp") + +cmake_minimum_required (VERSION 2.8) + +project(FPGARegister) + +set(CMAKE_ARCHIVE_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}) +set(CMAKE_LIBRARY_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}) +set(CMAKE_RUNTIME_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}) + +add_subdirectory (src) diff --git a/DirectProgramming/DPC++FPGA/Tutorials/Features/fpga_reg/License.txt b/DirectProgramming/DPC++FPGA/Tutorials/Features/fpga_reg/License.txt new file mode 100755 index 0000000000..e63c6e13dc --- /dev/null +++ b/DirectProgramming/DPC++FPGA/Tutorials/Features/fpga_reg/License.txt @@ -0,0 +1,7 @@ +Copyright Intel Corporation + +Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: + +The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. + +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
diff --git a/DirectProgramming/DPC++FPGA/Tutorials/Features/fpga_reg/README.md b/DirectProgramming/DPC++FPGA/Tutorials/Features/fpga_reg/README.md new file mode 100755 index 0000000000..18e2a1f244 --- /dev/null +++ b/DirectProgramming/DPC++FPGA/Tutorials/Features/fpga_reg/README.md @@ -0,0 +1,188 @@ +# Explicit Pipeline Register Insertion with `fpga_reg` + +This FPGA tutorial demonstrates how a power user can apply the DPC++ extension `intel::fpga_reg` to tweak the hardware generated by the compiler. + +***Documentation***: The [oneAPI DPC++ FPGA Optimization Guide](https://software.intel.com/content/www/us/en/develop/documentation/oneapi-fpga-optimization-guide) provides comprehensive instructions for targeting FPGAs through DPC++. The [oneAPI Programming Guide](https://software.intel.com/en-us/oneapi-programming-guide) is a general resource for target-independent DPC++ programming. + +| Optimized for | Description +--- |--- +| OS | Linux* Ubuntu* 18.04 +| Hardware | Intel® Programmable Acceleration Card (PAC) with Intel Arria® 10 GX FPGA;
Intel® Programmable Acceleration Card (PAC) with Intel Stratix® 10 SX FPGA +| Software | Intel® oneAPI DPC++ Compiler (Beta)
Intel® FPGA Add-On for oneAPI Base Toolkit +| What you will learn | How to use the `intel::fpga_reg` extension
How `intel::fpga_reg` can be used to re-structure the compiler-generated hardware
Situations in which applying `intel::fpga_reg` might be beneficial +| Time to complete | 20 minutes + +_Notice: This code sample is not yet supported in Windows*_ + +## Purpose + +This FPGA tutorial demonstrates an example of using the `intel::fpga_reg` extension to: + +* Help reduce the fanout of specific signals in the DPC++ design +* Improve the overall fMAX of the generated hardware + +Note that this is an advanced tutorial for FPGA power users. + +### Simple Code Example + +The signature of `intel::fpga_reg` is as follows: + +```cpp +template <typename T> +T intel::fpga_reg(T input) +``` + +To use this function in your code, you must include the following header: + +```cpp +#include <CL/sycl/intel/fpga_extensions.hpp> +``` + +When you use this function on any value in your code, the compiler will insert at least one register stage between the input and output of the `intel::fpga_reg` function. For example: + +```cpp +int func (int input) { + int output = intel::fpga_reg(input); + return output; +} +``` + +This forces the compiler to insert a register between the input and output. You can observe this in the optimization report's System Viewer. + +### Understanding the Tutorial Design + +The basic function performed by the tutorial kernel is a vector dot product with a pre-adder. The loop is unrolled so that the core part of the algorithm is a feed-forward datapath. The coefficient array is implemented as a circular shift register and rotates by one for each iteration of the outer loop. + +The optimization applied in this tutorial impacts the system fMAX, the maximum frequency at which the design can run. Since the compiler implements all kernels in a common clock domain, fMAX is a global system parameter. To see the impact of the `intel::fpga_reg` optimization in this tutorial, you will need to compile the design twice. + +Part 1 compiles the kernel code without setting the `USE_FPGA_REG` macro, whereas Part 2 compiles the kernel while setting this macro.
This chooses between two code segments that are functionally equivalent, but the latter version makes use of `intel::fpga_reg`. In the `USE_FPGA_REG` version of the code, the compiler is guaranteed to insert at least one register stage between the input and output of each of the calls to the `intel::fpga_reg` function. + +#### Part 1: Without `USE_FPGA_REG` + +The compiler will generate the following hardware for Part 1. The diagram below has been simplified for illustration. + +Part 1 + +Note the following: + +* The compiler automatically infers a tree structure for the series of adders. +* There is a large fanout (of up to 4 in this simplified example) from `val` to each of the adders. + +The fanout grows linearly with the unroll factor in this tutorial. In FPGA designs, signals with large fanout can sometimes degrade system fMAX. This happens because the FPGA placement algorithm cannot place *all* of the fanout logic elements physically close to the fanout source, leading to longer wires. In this situation, it can be helpful to add explicit fanout control in your DPC++ code via `intel::fpga_reg`. This is an advanced optimization for FPGA power users. + +#### Part 2: With `USE_FPGA_REG` + +In this part, we added two sets of `intel::fpga_reg` within the unrolled loop. The first is added to pipeline `val` once per iteration. This reduces the fanout of `val` from 4 in the example in Part 1 to just 2. The second `intel::fpga_reg` is inserted between successive accumulations into the `acc` value. This generates the following structure in hardware. + +Part 2 + +In this version, the adder tree has been transformed into a vine-like structure. This increases latency, but it helps us achieve our goal of reducing the fanout and improving fMAX. +Since the outer loop in this tutorial is pipelined and has a high trip count, the increased latency of the inner loop has negligible impact on throughput. The tradeoff pays off, as the fMAX improvement yields a higher performing design.
+ +## Key Concepts + +* How to use the `intel::fpga_reg` extension +* How `intel::fpga_reg` can be used to re-structure the compiler-generated hardware +* Situations in which applying `intel::fpga_reg` might be beneficial + +## License + +This code sample is licensed under the MIT license. + +## Building the `fpga_reg` Design + +### Include Files + +The included header `dpc_common.hpp` is located at `%ONEAPI_ROOT%\dev-utilities\latest\include` on your development system. + +### Running Samples in DevCloud + +If running a sample in the Intel DevCloud, remember that you must specify the compute node (fpga_compile or fpga_runtime) as well as whether to run in batch or interactive mode. For more information see the Intel® oneAPI Base Toolkit Get Started Guide ([https://devcloud.intel.com/oneapi/get-started/base-toolkit/](https://devcloud.intel.com/oneapi/get-started/base-toolkit/)). + +When compiling for FPGA hardware, it is recommended to increase the job timeout to 12h. + +### On a Linux* System + +1. Create a `build` directory in the design directory and generate the `Makefile` by running `cmake`: + + ```bash + mkdir build + cd build + ``` + + If you are compiling for the Intel® PAC with Intel Arria® 10 GX FPGA, run `cmake` using the command: + + ```bash + cmake .. + ``` + + Alternatively, to compile for the Intel® PAC with Intel Stratix® 10 SX FPGA, run `cmake` using the command: + + ```bash + cmake .. -DFPGA_BOARD=intel_s10sx_pac:pac_s10 + ``` + +2. Compile the design using the generated `Makefile`. The following three build targets are provided, matching the recommended development flow: + + * Compile and run for emulation (fast compile time, targets an emulated FPGA device) using: + + ```bash + make fpga_emu + ``` + + * Generate HTML optimization reports using: + + ```bash + make report + ``` + + * Compile and run on FPGA hardware (longer compile time, targets an FPGA device) using: + + ```bash + make fpga + ``` + +3. 
(Optional) As the above hardware compile may take several hours to complete, an Intel® PAC with Intel Arria® 10 GX FPGA pre-compiled binary can be downloaded here. + + +### In Third-Party Integrated Development Environments (IDEs) + +You can compile and run this tutorial in the Eclipse* IDE (in Linux*). +For instructions, refer to the following link: [Intel® oneAPI DPC++ FPGA Workflows on Third-Party IDEs](https://software.intel.com/en-us/articles/intel-oneapi-dpcpp-fpga-workflow-on-ide) + +## Examining the Reports + +Locate the pair of `report.html` files in either: + +* **Report-only compile**: `fpga_reg_report.prj` and `fpga_reg_registered_report.prj` +* **FPGA hardware compile**: `fpga_reg.prj` and `fpga_reg_registered.prj` + +Open the reports in any of Chrome*, Firefox*, Edge*, or Internet Explorer*. Observe the structure of the design in the optimization report's System Viewer and notice the changes within `Cluster 2` of the `SimpleMath.B1` block. Notice that the report for Part 1 shows a much shallower graph than the one for Part 2. This is because the operations are scheduled much closer together in Part 1 than in Part 2. By transforming the code in Part 2 with additional register stages, the compiler was able to achieve a higher fMAX. + +>**NOTE**: Only the report generated after the FPGA hardware compile will reflect the performance benefit of using the `fpga_reg` extension. The difference is *not* apparent in the reports generated by `make report` because a design's fMAX cannot be accurately predicted before the hardware compile. The final achieved fMAX can be found in `fpga_reg.prj/reports/report.html` and `fpga_reg_registered.prj/reports/report.html` (after `make fpga` completes). + +## Running the Sample + +1. Run the sample on the FPGA emulator (the kernel executes on the CPU): + + ```bash + ./fpga_reg.fpga_emu # Linux + ``` + +2. 
Run the sample on the FPGA device: + + ```bash + ./fpga_reg.fpga # Linux + ./fpga_reg_registered.fpga # Linux + ``` + +### Example of Output + +```txt +Throughput for kernel with input size 1000000 and coefficient array size 64: 2.819272 GFlops +PASSED: Results are correct. +``` + +### Discussion of Results + +You should observe an improvement in throughput going from Part 1 to Part 2. You will also note that the fMAX of Part 2 is significantly higher than that of Part 1. diff --git a/DirectProgramming/DPC++FPGA/Tutorials/Features/fpga_reg/fpga_reg.png b/DirectProgramming/DPC++FPGA/Tutorials/Features/fpga_reg/fpga_reg.png new file mode 100755 index 0000000000..fe33916939 Binary files /dev/null and b/DirectProgramming/DPC++FPGA/Tutorials/Features/fpga_reg/fpga_reg.png differ diff --git a/DirectProgramming/DPC++FPGA/Tutorials/Features/fpga_reg/no_fpga_reg.png b/DirectProgramming/DPC++FPGA/Tutorials/Features/fpga_reg/no_fpga_reg.png new file mode 100755 index 0000000000..5383063625 Binary files /dev/null and b/DirectProgramming/DPC++FPGA/Tutorials/Features/fpga_reg/no_fpga_reg.png differ diff --git a/DirectProgramming/DPC++FPGA/Tutorials/Features/fpga_reg/sample.json b/DirectProgramming/DPC++FPGA/Tutorials/Features/fpga_reg/sample.json new file mode 100755 index 0000000000..57573588f4 --- /dev/null +++ b/DirectProgramming/DPC++FPGA/Tutorials/Features/fpga_reg/sample.json @@ -0,0 +1,34 @@ +{ + "guid": "D661A5C2-5FE0-40F2-BFE7-70E3BA60F088", + "name": "Explicit Pipeline Register Insertion with fpga_reg", + "categories": ["Toolkit/Intel® oneAPI Base Toolkit/FPGA/Tutorials"], + "description": "FPGA advanced tutorial demonstrating how to apply the DPC++ extension intel::fpga_reg", + "toolchain": ["dpcpp"], + "os": ["linux"], + "targetDevice": ["FPGA"], + "builder": ["cmake"], + "languages": [{"cpp":{}}], + "ciTests": { + "linux": [ + { + "id": "fpga_emu", + "steps": [ + "mkdir build", + "cd build", + "cmake ..", + "make fpga_emu", + "./fpga_reg.fpga_emu" + ] + }, 
{ + "id": "report", + "steps": [ + "mkdir build", + "cd build", + "cmake ..", + "make report" + ] + } + ] + } +} diff --git a/DirectProgramming/DPC++FPGA/Tutorials/Features/fpga_reg/src/CMakeLists.txt b/DirectProgramming/DPC++FPGA/Tutorials/Features/fpga_reg/src/CMakeLists.txt new file mode 100755 index 0000000000..2880b9dcf9 --- /dev/null +++ b/DirectProgramming/DPC++FPGA/Tutorials/Features/fpga_reg/src/CMakeLists.txt @@ -0,0 +1,111 @@ +set(SOURCE_FILE fpga_reg.cpp) +set(TARGET_NAME fpga_reg) +set(TARGET_NAME_REG fpga_reg_registered) +set(EMULATOR_TARGET ${TARGET_NAME}.fpga_emu) +set(FPGA_TARGET ${TARGET_NAME}.fpga) +set(FPGA_TARGET_REG ${TARGET_NAME_REG}.fpga) + +# Intel supported FPGA Boards and their names +set(A10_PAC_BOARD_NAME "intel_a10gx_pac:pac_a10") +set(S10_PAC_BOARD_NAME "intel_s10sx_pac:pac_s10") + +# Assume target is the Intel(R) PAC with Intel Arria(R) 10 GX FPGA +SET(_FPGA_BOARD ${A10_PAC_BOARD_NAME}) + +# Check if target is the Intel(R) PAC with Intel Stratix(R) 10 SX FPGA +IF (NOT DEFINED FPGA_BOARD) + MESSAGE(STATUS "\tFPGA_BOARD was not specified. Configuring the design to run on the Intel(R) Programmable Acceleration Card (PAC) with Intel Arria(R) 10 GX FPGA. Please refer to the README for more information on how to run the design on the Intel(R) PAC with Intel Stratix(R) 10 SX FPGA.") + +ELSEIF(FPGA_BOARD STREQUAL ${A10_PAC_BOARD_NAME}) + MESSAGE(STATUS "\tConfiguring the design to run on the Intel(R) Programmable Acceleration Card (PAC) with Intel Arria(R) 10 GX FPGA.") + +ELSEIF(FPGA_BOARD STREQUAL ${S10_PAC_BOARD_NAME}) + MESSAGE(STATUS "\tConfiguring the design to run on the Intel(R) Programmable Acceleration Card (PAC) with Intel Stratix(R) 10 SX FPGA.") + SET(_FPGA_BOARD ${S10_PAC_BOARD_NAME}) + +ELSE() + MESSAGE(STATUS "\tAn invalid board name was passed in using the FPGA_BOARD flag. Configuring the design to run on the Intel(R) Programmable Acceleration Card (PAC) with Intel Arria(R) 10 GX FPGA. 
Please refer to the README for the list of valid board names.") +ENDIF() + +set(HARDWARE_COMPILE_FLAGS "-fintelfpga") + +# use cmake -D USER_HARDWARE_FLAGS= to set extra flags for FPGA backend compilation +set(HARDWARE_LINK_FLAGS "-fintelfpga -Xshardware -Xsboard=${_FPGA_BOARD} ${USER_HARDWARE_FLAGS}") + +set(EMULATOR_COMPILE_FLAGS "-fintelfpga -DFPGA_EMULATOR") +set(EMULATOR_LINK_FLAGS "-fintelfpga") + +# fpga emulator +if(WIN32) + set(WIN_EMULATOR_TARGET ${EMULATOR_TARGET}.exe) + add_custom_target(fpga_emu DEPENDS ${WIN_EMULATOR_TARGET}) + separate_arguments(WIN_EMULATOR_COMPILE_FLAGS WINDOWS_COMMAND "${EMULATOR_COMPILE_FLAGS}") + add_custom_command(OUTPUT ${WIN_EMULATOR_TARGET} + COMMAND ${CMAKE_CXX_COMPILER} ${WIN_EMULATOR_COMPILE_FLAGS} /GX ${CMAKE_CURRENT_SOURCE_DIR}/${SOURCE_FILE} -o ${CMAKE_BINARY_DIR}/${WIN_EMULATOR_TARGET} + DEPENDS ${SOURCE_FILE}) + +else() + add_executable(${EMULATOR_TARGET} ${SOURCE_FILE}) + add_custom_target(fpga_emu DEPENDS ${EMULATOR_TARGET}) + set_target_properties(${EMULATOR_TARGET} PROPERTIES COMPILE_FLAGS ${EMULATOR_COMPILE_FLAGS}) + set_target_properties(${EMULATOR_TARGET} PROPERTIES LINK_FLAGS ${EMULATOR_LINK_FLAGS}) +endif() + +# fpga +if(WIN32) + add_custom_target(fpga + COMMAND echo "FPGA hardware flow is not supported in Windows") +else() + add_executable(${FPGA_TARGET} EXCLUDE_FROM_ALL ${SOURCE_FILE}) + add_executable(${FPGA_TARGET_REG} EXCLUDE_FROM_ALL ${SOURCE_FILE}) + add_custom_target(fpga DEPENDS ${FPGA_TARGET} ${FPGA_TARGET_REG}) + + set_target_properties(${FPGA_TARGET} PROPERTIES COMPILE_FLAGS ${HARDWARE_COMPILE_FLAGS}) + set_target_properties(${FPGA_TARGET} PROPERTIES LINK_FLAGS ${HARDWARE_LINK_FLAGS}) + + set_target_properties(${FPGA_TARGET_REG} PROPERTIES COMPILE_FLAGS "${HARDWARE_COMPILE_FLAGS} -DUSE_FPGA_REG") + set_target_properties(${FPGA_TARGET_REG} PROPERTIES LINK_FLAGS ${HARDWARE_LINK_FLAGS}) +endif() + +# report +if(WIN32) + set(REPORT ${TARGET_NAME}_report.a) + set(REPORT_REG 
${TARGET_NAME_REG}_report.a) + + add_custom_target(report DEPENDS ${REPORT} ${REPORT_REG}) + + separate_arguments(HARDWARE_LINK_FLAGS_LIST WINDOWS_COMMAND "${HARDWARE_LINK_FLAGS}") + + configure_file(${CMAKE_CURRENT_SOURCE_DIR}/${SOURCE_FILE} ${CMAKE_BINARY_DIR}/${TARGET_NAME}/${SOURCE_FILE} COPYONLY) + configure_file(${CMAKE_CURRENT_SOURCE_DIR}/${SOURCE_FILE} ${CMAKE_BINARY_DIR}/${TARGET_NAME_REG}/${SOURCE_FILE} COPYONLY) + + add_custom_command(OUTPUT ${REPORT} + COMMAND ${CMAKE_CXX_COMPILER} /EHsc ${CMAKE_CXX_FLAGS} ${HARDWARE_LINK_FLAGS_LIST} -fsycl-link ${CMAKE_BINARY_DIR}/${TARGET_NAME}/${SOURCE_FILE} -o ${CMAKE_BINARY_DIR}/${REPORT} + DEPENDS ${SOURCE_FILE}) + + add_custom_command(OUTPUT ${REPORT_REG} + COMMAND ${CMAKE_CXX_COMPILER} /EHsc ${CMAKE_CXX_FLAGS} ${HARDWARE_LINK_FLAGS_LIST} -DUSE_FPGA_REG -fsycl-link ${CMAKE_BINARY_DIR}/${TARGET_NAME_REG}/${SOURCE_FILE} -o ${CMAKE_BINARY_DIR}/${REPORT_REG} + DEPENDS ${SOURCE_FILE}) + +else() + set(REPORT ${TARGET_NAME}_report.a) + set(REPORT_REG ${TARGET_NAME_REG}_report.a) + + add_custom_target(report DEPENDS ${REPORT} ${REPORT_REG}) + + configure_file(${CMAKE_CURRENT_SOURCE_DIR}/${SOURCE_FILE} ${SOURCE_FILE} COPYONLY) + + separate_arguments(HARDWARE_LINK_FLAGS_LIST UNIX_COMMAND "${HARDWARE_LINK_FLAGS}") + add_custom_command(OUTPUT ${REPORT} + COMMAND ${CMAKE_CXX_COMPILER} ${CMAKE_CXX_FLAGS} ${HARDWARE_LINK_FLAGS_LIST} -fsycl-link ${SOURCE_FILE} -o ${CMAKE_BINARY_DIR}/${REPORT} + DEPENDS ${SOURCE_FILE}) + + add_custom_command(OUTPUT ${REPORT_REG} + COMMAND ${CMAKE_CXX_COMPILER} ${CMAKE_CXX_FLAGS} ${HARDWARE_LINK_FLAGS_LIST} -DUSE_FPGA_REG -fsycl-link ${SOURCE_FILE} -o ${CMAKE_BINARY_DIR}/${REPORT_REG} + DEPENDS ${SOURCE_FILE}) +endif() + +# run +add_custom_target(run + COMMAND ../${TARGET_NAME}.fpga_emu + DEPENDS ${TARGET_NAME}.fpga_emu) diff --git a/DirectProgramming/DPC++FPGA/Tutorials/Features/fpga_reg/src/fpga_reg.cpp b/DirectProgramming/DPC++FPGA/Tutorials/Features/fpga_reg/src/fpga_reg.cpp new file mode 
100755 index 0000000000..c15255631b --- /dev/null +++ b/DirectProgramming/DPC++FPGA/Tutorials/Features/fpga_reg/src/fpga_reg.cpp @@ -0,0 +1,216 @@ +//============================================================== +// Copyright Intel Corporation +// +// SPDX-License-Identifier: MIT +// ============================================================= +#include <CL/sycl.hpp> +#include <CL/sycl/intel/fpga_extensions.hpp> +#include <array> +#include <iomanip> +#include <iostream> +#include <vector> +#include "dpc_common.hpp" + +using namespace sycl; +using namespace std; + +// Artificial coefficient and offset data for our math function +constexpr size_t kSize = 64; +constexpr std::array<int, kSize> kCoeff = { + 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, + 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, + 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, + 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64}; +constexpr std::array<int, kSize> kOffset = { + 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, + 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, + 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, + 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1}; + +// The function our kernel will compute +// The "golden result" will be computed on the host to check the kernel result. +vector<int> GoldenResult(vector<int> vec) { + + // The coefficients will be modified with each iteration of the outer loop. + std::array<int, kSize> coeff = kCoeff; + + for (int &val : vec) { + // Do some arithmetic + int acc = 0; + for (size_t i = 0; i < kSize; i++) { + acc += coeff[i] * (val + kOffset[i]); + } + + // Update coeff by rotating the values of the array + int tmp = coeff[0]; + for (size_t i = 0; i < kSize - 1; i++) { + coeff[i] = coeff[i + 1]; + } + coeff[kSize - 1] = tmp; + + // Result + val = acc; + } + + return vec; +} + +// Forward declaration of the kernel name +// (This will become unnecessary in a future compiler version.) 
+class SimpleMath; + +void RunKernel(const device_selector &selector, + const std::vector<int> &vec_a, + std::vector<int> &vec_r) { + + size_t input_size = vec_a.size(); + + try { + queue q(selector, dpc_common::exception_handler, + property::queue::enable_profiling{}); + + buffer device_a(vec_a); + // Use verbose SYCL 1.2 syntax for the output buffer. + // (This will become unnecessary in a future compiler version.) + buffer<int, 1> device_r(vec_r.data(), input_size); + + event e = q.submit([&](handler &h) { + auto a = device_a.get_access<access::mode::read>(h); + auto r = device_r.get_access<access::mode::discard_write>(h); + + // FPGA-optimized kernel + // Using kernel_args_restrict tells the compiler that the input + // and output buffers won't alias. + h.single_task<SimpleMath>([=]() [[intel::kernel_args_restrict]] { + + // Force the compiler to implement the coefficient array in FPGA + // pipeline registers rather than in on-chip memory. + [[intelfpga::register]] std::array<int, kSize> coeff = kCoeff; + + // The compiler will pipeline the outer loop. + for (size_t i = 0; i < input_size; ++i) { + int acc = 0; + int val = a[i]; + + // Fully unroll the accumulator loop. + // All of the unrolled operations can be freely scheduled by the + // DPC++ compiler's FPGA backend as part of a common data pipeline. + #pragma unroll + for (size_t j = 0; j < kSize; j++) { +#ifdef USE_FPGA_REG + // Use fpga_reg to insert a register between the copies of val used + // in successive unrolled iterations. + val = intel::fpga_reg(val); + // Since val is held constant across the kSize unrolled iterations, + // the FPGA hardware structure of val's distribution changes from a + // kSize-way fanout (without fpga_reg) to a chain of registers + // with intermediate tap offs. Refer to the diagram in the README. + + // Use fpga_reg to insert a register between each step in the acc + // adder chain. + acc = intel::fpga_reg(acc) + (coeff[j] * (val + kOffset[j])); + // This transforms a compiler-inferred adder tree into an adder + // chain, altering the structure of the pipeline. 
Refer to the +// diagram in the README. +#else + // Without fpga_reg, the compiler schedules the operations here + // according to its default optimization heuristics. + acc += (coeff[j] * (val + kOffset[j])); +#endif + } + + // Rotate the values of the coefficient array. + // The loop is fully unrolled. This is a canonical code structure; + // the DPC++ compiler's FPGA backend infers a shift register here. + int tmp = coeff[0]; + #pragma unroll + for (size_t j = 0; j < kSize - 1; j++) { + coeff[j] = coeff[j + 1]; + } + coeff[kSize - 1] = tmp; + + // Result + r[i] = acc; + } + }); + }); + + // Measure kernel execution time + double start = e.get_profiling_info<info::event_profiling::command_start>(); + double end = e.get_profiling_info<info::event_profiling::command_end>(); + // Convert from nanoseconds to milliseconds. + double kernel_time = (end - start) * 1e-6; + + // Kernel consists of two nested loops with 3 operations in the innermost + // loop: 2 additions and 1 multiplication operation. + size_t num_ops_per_kernel = input_size * kSize * 3; + cout << "Throughput for kernel with input size " << input_size + << " and coefficient array size " << kSize << ": "; + cout << std::fixed << std::setprecision(6) + << ((double)num_ops_per_kernel / kernel_time) / 1.0e6 << " GFlops\n"; + + } catch (sycl::exception const &e) { + // Catches exceptions in the host code + std::cout << "Caught a SYCL host exception:\n" << e.what() << "\n"; + + // Most likely the runtime couldn't find FPGA hardware! 
+ if (e.get_cl_code() == CL_DEVICE_NOT_FOUND) { + std::cout << "If you are targeting an FPGA, please ensure that your " + "system has a correctly configured FPGA board.\n"; + std::cout << "If you are targeting the FPGA emulator, compile with " + "-DFPGA_EMULATOR.\n"; + } + std::terminate(); + } +} + +int main(int argc, char *argv[]) { + size_t input_size = 1e6; + + // Optional command line override of default input size + if (argc > 1) { + string option(argv[1]); + if (option == "-h" || option == "--help") { + cout << "Usage: \n<executable> <input size>\n\nFAILED\n"; + return 1; + } else { + input_size = stoi(option); + } + } + + // Initialize input vector + constexpr int max_val = 1 << 10; // Conservative max to avoid integer overflow + vector<int> vec_a(input_size); + for (size_t i = 0; i < input_size; i++) { + vec_a[i] = rand() % max_val; + } + // Kernel result vector + vector<int> vec_r(input_size); + + // Run the kernel on either the FPGA emulator, or FPGA +#if defined(FPGA_EMULATOR) + intel::fpga_emulator_selector selector; +#else + intel::fpga_selector selector; +#endif + RunKernel(selector, vec_a, vec_r); + + // Test the results. 
+ vector<int> golden_ref = GoldenResult(vec_a); + bool correct = true; + for (size_t i = 0; i < input_size; i++) { + if (vec_r[i] != golden_ref[i]) { + cout << "Found mismatch at " << i << ", " + << vec_r[i] << " != " << golden_ref[i] << "\n"; + correct = false; + } + } + + if (correct) { + cout << "PASSED: Results are correct.\n"; + } else { + cout << "FAILED: Results are incorrect.\n"; + return 1; + } + + return 0; +} diff --git a/DirectProgramming/DPC++FPGA/Tutorials/Features/loop_unroll/CMakeLists.txt b/DirectProgramming/DPC++FPGA/Tutorials/Features/loop_unroll/CMakeLists.txt new file mode 100755 index 0000000000..a94ffc91b3 --- /dev/null +++ b/DirectProgramming/DPC++FPGA/Tutorials/Features/loop_unroll/CMakeLists.txt @@ -0,0 +1,11 @@ +set(CMAKE_CXX_COMPILER "dpcpp") + +cmake_minimum_required (VERSION 2.8) + +project(LoopUnroll) + +set(CMAKE_ARCHIVE_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}) +set(CMAKE_LIBRARY_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}) +set(CMAKE_RUNTIME_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}) + +add_subdirectory (src) diff --git a/DirectProgramming/DPC++FPGA/Tutorials/Features/loop_unroll/License.txt b/DirectProgramming/DPC++FPGA/Tutorials/Features/loop_unroll/License.txt new file mode 100755 index 0000000000..e63c6e13dc --- /dev/null +++ b/DirectProgramming/DPC++FPGA/Tutorials/Features/loop_unroll/License.txt @@ -0,0 +1,7 @@ +Copyright Intel Corporation + +Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: + +The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. 
+ +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. diff --git a/DirectProgramming/DPC++FPGA/Tutorials/Features/loop_unroll/README.md b/DirectProgramming/DPC++FPGA/Tutorials/Features/loop_unroll/README.md new file mode 100755 index 0000000000..5c2528eeb0 --- /dev/null +++ b/DirectProgramming/DPC++FPGA/Tutorials/Features/loop_unroll/README.md @@ -0,0 +1,188 @@ + +# Unrolling Loops +This FPGA tutorial demonstrates a simple example of unrolling loops to improve the throughput of a DPC++ FPGA program. + +***Documentation***: The [oneAPI DPC++ FPGA Optimization Guide](https://software.intel.com/content/www/us/en/develop/documentation/oneapi-fpga-optimization-guide) provides comprehensive instructions for targeting FPGAs through DPC++. The [oneAPI Programming Guide](https://software.intel.com/en-us/oneapi-programming-guide) is a general resource for target-independent DPC++ programming. + +| Optimized for | Description +--- |--- +| OS | Linux* Ubuntu* 18.04 +| Hardware | Intel® Programmable Acceleration Card (PAC) with Intel Arria® 10 GX FPGA;
Intel® Programmable Acceleration Card (PAC) with Intel Stratix® 10 SX FPGA +| Software | Intel® oneAPI DPC++ Compiler (Beta)
Intel® FPGA Add-On for oneAPI Base Toolkit +| What you will learn | Basics of loop unrolling
How to unroll loops in your program
Determining the optimal unroll factor for your program +| Time to complete | 15 minutes + +_Notice: This code sample is not yet supported in Windows*_ + +## Purpose + +The loop unrolling mechanism is used to increase program parallelism by duplicating the compute logic within a loop. The number of times the loop logic is duplicated is called the *unroll factor*. Depending on whether the *unroll factor* is equal to the number of loop iterations, loop unrolling methods can be categorized as *full-loop unrolling* and *partial-loop unrolling*. + +### Example: Full-Loop Unrolling +```c++ +// Before unrolling loop +#pragma unroll +for (int i = 0; i < 5; i++) { + a[i] += 1; +} + +// Equivalent code after unrolling +// There is no longer any loop +a[0] += 1; +a[1] += 1; +a[2] += 1; +a[3] += 1; +a[4] += 1; +``` +A full unroll is a special case where the unroll factor is equal to the number of loop iterations. Here, the Intel® oneAPI DPC++ Compiler for FPGA instantiates five adders instead of one adder. + +### Example: Partial-Loop Unrolling + +```c++ +// Before unrolling loop +#pragma unroll 4 +for (int i = 0; i < 20; i++) { + a[i] += 1; +} + +// Equivalent code after unrolling by a factor of 4 +// The resulting loop has five (20 / 4) iterations +for (int i = 0; i < 5; i++) { + a[i * 4] += 1; + a[i * 4 + 1] += 1; + a[i * 4 + 2] += 1; + a[i * 4 + 3] += 1; +} +``` +Each loop iteration in the "equivalent code" contains four unrolled invocations of the original loop body. The Intel® oneAPI DPC++ Compiler (Beta) for FPGA instantiates four adders instead of one adder. Because there is no data dependency between iterations in the loop in this case, the compiler schedules all four adds in parallel. + +### Determining the optimal unroll factor +In an FPGA design, unrolling loops is a common strategy to directly trade off on-chip resources for increased throughput. When selecting the unroll factor for a specific loop, the intent is to improve throughput while minimizing resource utilization. 
It is also important to be mindful of other throughput constraints in your system, such as memory bandwidth. + +### Tutorial design +This tutorial demonstrates this trade-off with a simple vector add kernel. The tutorial shows how increasing the unroll factor on a loop increases throughput... until another bottleneck is encountered. This example is constructed to run up against global memory bandwidth constraints. + +The memory bandwidth on an Intel® Programmable Acceleration Card with Intel Arria® 10 GX FPGA system is about 6 GB/s. The tutorial design will likely run at around 300 MHz. In this design, the FPGA processes a new iteration every cycle in a pipeline-parallel fashion. The theoretical computation limit for 1 adder is: + +**GFlops**: 300 MHz \* 1 float = 0.3 GFlops + +**Computation Bandwidth**: 300 MHz \* 1 float \* 4 Bytes = 1.2 GB/s + +You can repeat this back-of-the-envelope calculation for different unroll factors: + +Unroll Factor | GFlops | Computation Bandwidth (GB/s) +------------- | ------------- | ----------------------- +1 | 0.3 | 1.2 +2 | 0.6 | 2.4 +4 | 1.2 | 4.8 +8 | 2.4 | 9.6 +16 | 4.8 | 19.2 + +On an Intel® Programmable Acceleration Card with Intel Arria® 10 GX FPGA, it is reasonable to predict that this program will become memory-bandwidth limited when the unroll factor grows from 4 to 8. Check this prediction by running the design following the instructions below. + + +## Key Concepts +* Basics of loop unrolling. +* How to unroll loops in your program. +* Determining the optimal unroll factor for your program. + +## License +This code sample is licensed under the MIT license. + + +## Building the `loop_unroll` Tutorial + +### Include Files +The included header `dpc_common.hpp` is located at `%ONEAPI_ROOT%\dev-utilities\latest\include` on your development system. 
+ +### Running Samples in DevCloud +If running a sample in the Intel DevCloud, remember that you must specify the compute node (fpga_compile or fpga_runtime) as well as whether to run in batch or interactive mode. For more information, see the Intel® oneAPI Base Toolkit Get Started Guide ([https://devcloud.intel.com/oneapi/get-started/base-toolkit/](https://devcloud.intel.com/oneapi/get-started/base-toolkit/)). + +When compiling for FPGA hardware, it is recommended to increase the job timeout to 12h. + +### On a Linux* System + +1. Generate the `Makefile` by running `cmake`. + ``` + mkdir build + cd build + ``` + To compile for the Intel® PAC with Intel Arria® 10 GX FPGA, run `cmake` using the command: + ``` + cmake .. + ``` + Alternatively, to compile for the Intel® PAC with Intel Stratix® 10 SX FPGA, run `cmake` using the command: + + ``` + cmake .. -DFPGA_BOARD=intel_s10sx_pac:pac_s10 + ``` + +2. Compile the design through the generated `Makefile`. The following build targets are provided, matching the recommended development flow: + + * Compile for emulation (fast compile time, targets emulated FPGA device): + ``` + make fpga_emu + ``` + * Generate the optimization report: + ``` + make report + ``` + * Compile for FPGA hardware (longer compile time, targets FPGA device): + ``` + make fpga + ``` +3. (Optional) As the above hardware compile may take several hours to complete, an Intel® PAC with Intel Arria® 10 GX FPGA precompiled binary can be downloaded here. + + +### In Third-Party Integrated Development Environments (IDEs) + +You can compile and run this tutorial in the Eclipse* IDE (in Linux*). For instructions, refer to the following link: [Intel® oneAPI DPC++ FPGA Workflows on Third-Party IDEs](https://software.intel.com/en-us/articles/intel-oneapi-dpcpp-fpga-workflow-on-ide) + +## Examining the Reports +Locate `report.html` in the `loop_unroll_report.prj/reports/` or `loop_unroll_s10_pac_report.prj/reports/` directory.
Open the report in Chrome*, Firefox*, Edge*, or Internet Explorer*. + +Navigate to the Area Report and compare the FPGA resource utilization of the kernels with unroll factors of 1, 2, 4, 8, and 16. In particular, check the number of DSP resources consumed. You should see the area grow roughly linearly with the unroll factor. + +You can also check the achieved system fMAX to verify the earlier calculations. + +## Running the Sample + +1. Run the sample on the FPGA emulator (the kernel executes on the CPU): + ``` + ./loop_unroll.fpga_emu (Linux) + ``` +2. Run the sample on the FPGA device: + ``` + ./loop_unroll.fpga (Linux) + ``` + +### Example of Output +``` +Input Array Size: 67108864 +UnrollFactor 1 kernel time : 255.749 ms +Throughput for kernel with UnrollFactor 1: 0.262 GFlops +UnrollFactor 2 kernel time : 140.285 ms +Throughput for kernel with UnrollFactor 2: 0.478 GFlops +UnrollFactor 4 kernel time : 68.296 ms +Throughput for kernel with UnrollFactor 4: 0.983 GFlops +UnrollFactor 8 kernel time : 44.567 ms +Throughput for kernel with UnrollFactor 8: 1.506 GFlops +UnrollFactor 16 kernel time : 39.175 ms +Throughput for kernel with UnrollFactor 16: 1.713 GFlops +PASSED: The results are correct +``` + +### Discussion of Results +The following table summarizes the execution time (in ms), throughput (in GFlops), and number of DSPs used for unroll factors of 1, 2, 4, 8, and 16 for a default input array size of 64M floats (2^26 floats) on the Intel® Programmable Acceleration Card with Intel® Arria® 10 GX FPGA: + +Unroll Factor | Kernel Time (ms) | Throughput (GFlops) | Num of DSPs +------------- | ------------- | -----------------------| ------- +1 | 242 | 0.277 | 1 +2 | 127 | 0.528 | 2 +4 | 63 | 1.065 | 4 +8 | 46 | 1.459 | 8 +16 | 44 | 1.525 | 16 + +Notice that when the unroll factor increases from 1 to 2 and from 2 to 4, the kernel execution time decreases by a factor of two. Correspondingly, the kernel throughput doubles.
However, when the unroll factor increases from 4 to 8 and from 8 to 16, the throughput no longer scales by a factor of two at each step. The design is now bound by memory bandwidth limitations instead of compute unit limitations even though the hardware is replicated. + +These performance differences will be apparent only when running on FPGA hardware. The emulator, while useful for verifying functionality, will generally not reflect differences in performance. diff --git a/DirectProgramming/DPC++FPGA/Tutorials/Features/loop_unroll/sample.json b/DirectProgramming/DPC++FPGA/Tutorials/Features/loop_unroll/sample.json new file mode 100755 index 0000000000..3863df9d59 --- /dev/null +++ b/DirectProgramming/DPC++FPGA/Tutorials/Features/loop_unroll/sample.json @@ -0,0 +1,34 @@ +{ + "guid": "2760C1B6-25E5-4280-9F8F-34CA8DDEDA7C", + "name": "Unrolling Loops", + "categories": ["Toolkit/Intel® oneAPI Base Toolkit/FPGA/Tutorials"], + "description": "FPGA tutorial design demonstrating the loop_unroll pragma", + "toolchain": ["dpcpp"], + "os": ["linux"], + "targetDevice": ["FPGA"], + "builder": ["cmake"], + "languages": [{"cpp":{}}], + "ciTests": { + "linux": [ + { + "id": "fpga_emu", + "steps": [ + "mkdir build", + "cd build", + "cmake ..", + "make fpga_emu", + "./loop_unroll.fpga_emu" + ] + }, + { + "id": "report", + "steps": [ + "mkdir build", + "cd build", + "cmake ..", + "make report" + ] + } + ] + } +} diff --git a/DirectProgramming/DPC++FPGA/Tutorials/Features/loop_unroll/src/CMakeLists.txt b/DirectProgramming/DPC++FPGA/Tutorials/Features/loop_unroll/src/CMakeLists.txt new file mode 100755 index 0000000000..3ca0487ff3 --- /dev/null +++ b/DirectProgramming/DPC++FPGA/Tutorials/Features/loop_unroll/src/CMakeLists.txt @@ -0,0 +1,89 @@ +set(SOURCE_FILE loop_unroll.cpp) +set(TARGET_NAME loop_unroll) + +set(EMULATOR_TARGET ${TARGET_NAME}.fpga_emu) +set(FPGA_TARGET ${TARGET_NAME}.fpga) + +# Intel supported FPGA Boards and their names +set(A10_PAC_BOARD_NAME
"intel_a10gx_pac:pac_a10") +set(S10_PAC_BOARD_NAME "intel_s10sx_pac:pac_s10") + +# Assume target is the Intel(R) PAC with Intel Arria(R) 10 GX FPGA +SET(_FPGA_BOARD ${A10_PAC_BOARD_NAME}) + +# Check if target is the Intel(R) PAC with Intel Stratix(R) 10 SX FPGA +IF (NOT DEFINED FPGA_BOARD) + MESSAGE(STATUS "\tFPGA_BOARD was not specified. Configuring the design to run on the Intel(R) Programmable Acceleration Card (PAC) with Intel Arria(R) 10 GX FPGA. Please refer to the README for more information on how to run the design on the Intel(R) PAC with Intel Stratix(R) 10 SX FPGA.") + +ELSEIF(FPGA_BOARD STREQUAL ${A10_PAC_BOARD_NAME}) + MESSAGE(STATUS "\tConfiguring the design to run on the Intel(R) Programmable Acceleration Card (PAC) with Intel Arria(R) 10 GX FPGA.") + +ELSEIF(FPGA_BOARD STREQUAL ${S10_PAC_BOARD_NAME}) + MESSAGE(STATUS "\tConfiguring the design to run on the Intel(R) Programmable Acceleration Card (PAC) with Intel Stratix(R) 10 SX FPGA.") + SET(_FPGA_BOARD ${S10_PAC_BOARD_NAME}) + +ELSE() + MESSAGE(STATUS "\tAn invalid board name was passed in using the FPGA_BOARD flag. Configuring the design to run on the Intel(R) Programmable Acceleration Card (PAC) with Intel Arria(R) 10 GX FPGA. 
Please refer to the README for the list of valid board names.") +ENDIF() + +set(HARDWARE_COMPILE_FLAGS "-fintelfpga") + +# use cmake -D USER_HARDWARE_FLAGS= to set extra flags for FPGA backend compilation +set(HARDWARE_LINK_FLAGS "-fintelfpga -Xshardware -Xsboard=${_FPGA_BOARD} ${USER_HARDWARE_FLAGS}") + +set(EMULATOR_COMPILE_FLAGS "-fintelfpga -DFPGA_EMULATOR") +set(EMULATOR_LINK_FLAGS "-fintelfpga") + +# fpga emulator +if(WIN32) + set(WIN_EMULATOR_TARGET ${EMULATOR_TARGET}.exe) + add_custom_target(fpga_emu DEPENDS ${WIN_EMULATOR_TARGET}) + separate_arguments(WIN_EMULATOR_COMPILE_FLAGS WINDOWS_COMMAND "${EMULATOR_COMPILE_FLAGS}") + add_custom_command(OUTPUT ${WIN_EMULATOR_TARGET} + COMMAND ${CMAKE_CXX_COMPILER} ${WIN_EMULATOR_COMPILE_FLAGS} /GX ${CMAKE_CURRENT_SOURCE_DIR}/${SOURCE_FILE} -o ${CMAKE_BINARY_DIR}/${WIN_EMULATOR_TARGET} + DEPENDS ${SOURCE_FILE}) +else() + add_executable(${EMULATOR_TARGET} ${SOURCE_FILE}) + add_custom_target(fpga_emu DEPENDS ${EMULATOR_TARGET}) + set_target_properties(${EMULATOR_TARGET} PROPERTIES COMPILE_FLAGS ${EMULATOR_COMPILE_FLAGS}) + set_target_properties(${EMULATOR_TARGET} PROPERTIES LINK_FLAGS ${EMULATOR_LINK_FLAGS}) +endif() + + +# fpga +if(WIN32) + add_custom_target(fpga + COMMAND echo "FPGA hardware flow is not supported in Windows") +else() + add_executable(${FPGA_TARGET} EXCLUDE_FROM_ALL ${SOURCE_FILE}) + add_custom_target(fpga DEPENDS ${FPGA_TARGET}) + set_target_properties(${FPGA_TARGET} PROPERTIES COMPILE_FLAGS ${HARDWARE_COMPILE_FLAGS}) + set_target_properties(${FPGA_TARGET} PROPERTIES LINK_FLAGS ${HARDWARE_LINK_FLAGS}) +endif() + +# generate report +if(WIN32) + set(DEVICE_OBJ_FILE ${TARGET_NAME}_report.a) + add_custom_target(report DEPENDS ${DEVICE_OBJ_FILE}) + + separate_arguments(HARDWARE_LINK_FLAGS_LIST WINDOWS_COMMAND "${HARDWARE_LINK_FLAGS}") + add_custom_command(OUTPUT ${DEVICE_OBJ_FILE} + COMMAND ${CMAKE_CXX_COMPILER} /EHsc ${CMAKE_CXX_FLAGS} ${HARDWARE_LINK_FLAGS_LIST} -fsycl-link 
${CMAKE_CURRENT_SOURCE_DIR}/${SOURCE_FILE} -o ${CMAKE_BINARY_DIR}/${DEVICE_OBJ_FILE} + DEPENDS ${SOURCE_FILE}) + +else() + set(DEVICE_OBJ_FILE ${TARGET_NAME}_report.a) + add_custom_target(report DEPENDS ${DEVICE_OBJ_FILE}) + + configure_file(${CMAKE_CURRENT_SOURCE_DIR}/${SOURCE_FILE} ${SOURCE_FILE} COPYONLY) + + separate_arguments(HARDWARE_LINK_FLAGS_LIST UNIX_COMMAND "${HARDWARE_LINK_FLAGS}") + add_custom_command(OUTPUT ${DEVICE_OBJ_FILE} + COMMAND ${CMAKE_CXX_COMPILER} ${CMAKE_CXX_FLAGS} ${HARDWARE_LINK_FLAGS_LIST} -fsycl-link ${SOURCE_FILE} -o ${CMAKE_BINARY_DIR}/${DEVICE_OBJ_FILE} + DEPENDS ${SOURCE_FILE}) +endif() + +# run +add_custom_target(run + COMMAND ../${TARGET_NAME}.fpga_emu + DEPENDS ${TARGET_NAME}.fpga_emu) diff --git a/DirectProgramming/DPC++FPGA/Tutorials/Features/loop_unroll/src/loop_unroll.cpp b/DirectProgramming/DPC++FPGA/Tutorials/Features/loop_unroll/src/loop_unroll.cpp new file mode 100755 index 0000000000..bab7954bea --- /dev/null +++ b/DirectProgramming/DPC++FPGA/Tutorials/Features/loop_unroll/src/loop_unroll.cpp @@ -0,0 +1,138 @@ +//============================================================== +// Copyright Intel Corporation +// +// SPDX-License-Identifier: MIT +// ============================================================= +#include <CL/sycl.hpp> +#include <CL/sycl/intel/fpga_extensions.hpp> +#include <iomanip> +#include <iostream> +#include <vector> +#include "dpc_common.hpp" + +using namespace sycl; + +template <int unroll_factor> class VAdd; + +// This function instantiates the vector add kernel, which contains +// a loop that adds up the two summand arrays and stores the result +// into sum. This loop will be unrolled by the specified unroll_factor.
+template <int unroll_factor> +void VecAdd(const std::vector<float> &summands1, + const std::vector<float> &summands2, std::vector<float> &sum, + size_t array_size) { + + +#if defined(FPGA_EMULATOR) + intel::fpga_emulator_selector device_selector; +#else + intel::fpga_selector device_selector; +#endif + + try { + queue q(device_selector, dpc_common::exception_handler, + property::queue::enable_profiling{}); + + buffer buffer_summands1(summands1); + buffer buffer_summands2(summands2); + // Use verbose SYCL 1.2 syntax for the output buffer. + // (This will become unnecessary in a future compiler version.) + buffer<float, 1> buffer_sum(sum.data(), array_size); + + event e = q.submit([&](handler &h) { + auto acc_summands1 = buffer_summands1.get_access<access::mode::read>(h); + auto acc_summands2 = buffer_summands2.get_access<access::mode::read>(h); + auto acc_sum = buffer_sum.get_access<access::mode::discard_write>(h); + + h.single_task<VAdd<unroll_factor>>([=]() + [[intel::kernel_args_restrict]] { + // Unroll the loop fully or partially, depending on unroll_factor + #pragma unroll unroll_factor + for (size_t i = 0; i < array_size; i++) { + acc_sum[i] = acc_summands1[i] + acc_summands2[i]; + } + }); + }); + + double start = e.get_profiling_info<info::event_profiling::command_start>(); + double end = e.get_profiling_info<info::event_profiling::command_end>(); + // convert from nanoseconds to ms + double kernel_time = (double)(end - start) * 1e-6; + + std::cout << "UnrollFactor " << unroll_factor + << " kernel time : " << kernel_time << " ms\n"; + std::cout << "Throughput for kernel with UnrollFactor " << unroll_factor + << ": "; + std::cout << std::fixed << std::setprecision(3) + << ((double)array_size / kernel_time) / 1e6f << " GFlops\n"; + + } catch (sycl::exception const &e) { + // Catches exceptions in the host code + std::cout << "Caught a SYCL host exception:\n" << e.what() << "\n"; + + // Most likely the runtime couldn't find FPGA hardware!
+ if (e.get_cl_code() == CL_DEVICE_NOT_FOUND) { + std::cout << "If you are targeting an FPGA, please ensure that your " + "system has a correctly configured FPGA board.\n"; + std::cout << "If you are targeting the FPGA emulator, compile with " + "-DFPGA_EMULATOR.\n"; + } + std::terminate(); + } +} + +int main(int argc, char *argv[]) { + size_t array_size = 1 << 26; + + if (argc > 1) { + std::string option(argv[1]); + if (option == "-h" || option == "--help") { + std::cout << "Usage: \n<executable> <data size>\n\nFAILED\n"; + return 1; + } else { + array_size = std::stoi(option); + } + } + + std::vector<float> summands1(array_size); + std::vector<float> summands2(array_size); + + std::vector<float> sum_unrollx1(array_size); + std::vector<float> sum_unrollx2(array_size); + std::vector<float> sum_unrollx4(array_size); + std::vector<float> sum_unrollx8(array_size); + std::vector<float> sum_unrollx16(array_size); + + // Initialize the two summand arrays (arrays to be added to each other) to + // 1:N and N:1, so that each pair of elements sums to N + 1 + for (size_t i = 0; i < array_size; i++) { + summands1[i] = static_cast<float>(i + 1); + summands2[i] = static_cast<float>(array_size - i); + } + + std::cout << "Input Array Size: " << array_size << "\n"; + + // Instantiate VecAdd kernel with different unroll factors: 1, 2, 4, 8, 16 + // The VecAdd kernel contains a loop that adds up the two summand arrays. + // This loop will be unrolled by the specified unroll factor. + // The sum array is expected to be identical, regardless of the unroll factor.
+ VecAdd<1>(summands1, summands2, sum_unrollx1, array_size); + VecAdd<2>(summands1, summands2, sum_unrollx2, array_size); + VecAdd<4>(summands1, summands2, sum_unrollx4, array_size); + VecAdd<8>(summands1, summands2, sum_unrollx8, array_size); + VecAdd<16>(summands1, summands2, sum_unrollx16, array_size); + + // Verify that the output data is the same for every unroll factor + for (size_t i = 0; i < array_size; i++) { + if (sum_unrollx1[i] != summands1[i] + summands2[i] || + sum_unrollx1[i] != sum_unrollx2[i] || + sum_unrollx1[i] != sum_unrollx4[i] || + sum_unrollx1[i] != sum_unrollx8[i] || + sum_unrollx1[i] != sum_unrollx16[i]) { + std::cout << "FAILED: The results are incorrect\n"; + return 1; + } + } + std::cout << "PASSED: The results are correct\n"; + return 0; +}