diff --git a/.github/images/FileStructure.png b/.github/images/FileStructure.png new file mode 100644 index 0000000000..d7b9f36809 Binary files /dev/null and b/.github/images/FileStructure.png differ diff --git a/.github/images/README.md b/.github/images/README.md new file mode 100644 index 0000000000..8b13789179 --- /dev/null +++ b/.github/images/README.md @@ -0,0 +1 @@ + diff --git a/DirectProgramming/DPC++/GraphAlgorithms/all-pairs-shortest-paths/README.md b/DirectProgramming/DPC++/GraphAlgorithms/all-pairs-shortest-paths/README.md index 23db7ecd8a..ade650a2d2 100644 --- a/DirectProgramming/DPC++/GraphAlgorithms/all-pairs-shortest-paths/README.md +++ b/DirectProgramming/DPC++/GraphAlgorithms/all-pairs-shortest-paths/README.md @@ -1,5 +1,5 @@ # All Pairs Shortest Paths sample -`All Pairs Shortest Paths` uses the Floyd Warshall algorithm to find the shortest paths between pairs of vertices in a graph. It uses a parallel blocked algorithm that enables the application to efficiently offload compute intensive work to the GPU. +`All Pairs Shortest Paths` uses the Floyd-Warshall algorithm to find the shortest paths between pairs of vertices in a graph. It uses a parallel blocked algorithm that enables the application to efficiently offload compute-intensive work to the GPU. For comprehensive instructions regarding DPC++ Programming, go to https://software.intel.com/en-us/oneapi-programming-guide and search based on relevant terms noted in the comments. @@ -11,23 +11,27 @@ For comprehensive instructions regarding DPC++ Programming, go to https://softwa | What you will learn | The All Pairs Shortest Paths sample demonstrates the following using the Intel® oneAPI DPC++/C++ Compiler | Time to complete | 15 minutes -## Purpose -This sample uses blocked Floyd-Warshall all pairs shortest paths algorithm to compute a matrix that represents the minimum distance from any node to all other nodes in the graph.
Using parallel blocked processing, blocks can be calculated simultaneously by distributing task computations to the GPU. -## Key implementation details -The basic DPC++ implementation explained in the code includes device selector, unified shared memory, kernel, and command groups. +## Purpose +This sample uses the blocked Floyd-Warshall all pairs shortest paths algorithm to compute a matrix that represents the minimum distance from any node to all other nodes in the graph. Using parallel blocked processing, blocks can be calculated simultaneously by distributing task computations to the GPU. For comparison, the application is run sequentially and in parallel, with run times for each displayed in the application's output. The device where the code is run is also identified. -The parallel implementation of blocked Floyd Warshall algorithm has three phases. Given a prior round of these computation phases are complete, phase 1 is independent; Phase 2 can only execute after phase 1 completes; Similarly phase 3 depends on phase 2 so can only execute after phase 2 is complete. +The parallel implementation of the blocked Floyd-Warshall algorithm has three phases. Once a prior round of these computation phases is complete, phase 1 is independent; phase 2 can only execute after phase 1 completes; similarly, phase 3 depends on phase 2, so it can only execute after phase 2 is complete. The inner loop of the sequential implementation is: g[i][j] = min(g[i][j], g[i][k] + g[k][j]) -A careful observation shows that for the kth iteration of the outer loop, the computation depends on cells either on the kth column, g[i][k] or on the kth row, g[k][j] of the graph. Phase 1 handles g[k][k], phase 2 handles g[\*][k] and g[k][\*], and phase 3 handles g[\*][\*] in that sequence. This cell level observations largely propagate to the blocks as well.
+A careful observation shows that for the kth iteration of the outer loop, the computation depends on cells either on the kth column, g[i][k], or on the kth row, g[k][j], of the graph. Phase 1 handles g[k][k], phase 2 handles g[\*][k] and g[k][\*], and phase 3 handles g[\*][\*] in that sequence. These cell-level observations largely propagate to the blocks as well. + +In each phase, computation within a block can proceed independently in parallel. + + +## Key implementation details +The implementation includes a device selector, unified shared memory, kernels, and command groups in order to implement a solution using the parallel blocked method targeting the GPU. -In each phase computation within a block can proceed independently in parallel. ## License -This code sample is licensed under MIT license +This code sample is licensed under the MIT license. + ## Building the Program for CPU and GPU @@ -61,6 +65,7 @@ Perform the following steps: * Build the program using VS2017 or VS2019: Right click on the solution file and open using either VS2017 or VS2019 IDE. Right click on the project in Solution explorer and select Rebuild. From top menu select Debug -> Start without Debugging. * Build the program using MSBuild: Open "x64 Native Tools Command Prompt for VS2017" or "x64 Native Tools Command Prompt for VS2019".
Run - MSBuild all-pairs-shortest-paths.sln /t:Rebuild /p:Configuration="Release" + ## Running the sample ### Example Output diff --git a/DirectProgramming/DPC++/GraphAlgorithms/all-pairs-shortest-paths/sample.json b/DirectProgramming/DPC++/GraphAlgorithms/all-pairs-shortest-paths/sample.json index 4550d7da16..9a1b7b9dad 100644 --- a/DirectProgramming/DPC++/GraphAlgorithms/all-pairs-shortest-paths/sample.json +++ b/DirectProgramming/DPC++/GraphAlgorithms/all-pairs-shortest-paths/sample.json @@ -2,7 +2,7 @@ "guid": "4F0F5FBD-C237-4A9F-B259-2C56ABFB40D9", "name": "all-pairs-shortest-paths", "categories": [ "Toolkit/Intel® oneAPI Base Toolkit/oneAPI DPC++/C++ Compiler/CPU and GPU" ], - "description": "all-pairs-shortest-paths using Intel® oneAPI DPC++ Language", + "description": "All Pairs Shortest Paths finds the shortest paths between pairs of vertices in a graph using a parallel blocked algorithm that enables the application to efficiently offload compute-intensive work to the GPU.", "toolchain": [ "dpcpp" ], "targetDevice": [ "CPU", "GPU" ], "languages": [ { "cpp": {} } ], diff --git a/DirectProgramming/DPC++/SparseLinearAlgebra/merge-spmv/README.md b/DirectProgramming/DPC++/SparseLinearAlgebra/merge-spmv/README.md index c1c06face0..0bd841ffe2 100644 --- a/DirectProgramming/DPC++/SparseLinearAlgebra/merge-spmv/README.md +++ b/DirectProgramming/DPC++/SparseLinearAlgebra/merge-spmv/README.md @@ -11,11 +11,11 @@ For comprehensive instructions regarding DPC++ Programming, go to https://softwa | What you will learn | The Sparse Matrix Vector sample demonstrates the following using the Intel® oneAPI DPC++/C++ Compiler | Time to complete | 15 minutes + ## Purpose -Sparse linear algebra algorithms are common in HPC, in fields as machine learning and computational science. In this sample, a merge based sparse matrix and vector multiplication algorithm is implemented. The input matrix is in compressed sparse row format.
Use a parallel merge model enables the application to efficiently offload compute intensive operation to the GPU. +Sparse linear algebra algorithms are common in HPC, in fields such as machine learning and computational science. In this sample, a merge-based sparse matrix-vector multiplication algorithm is implemented. The input matrix is in compressed sparse row format. Using a parallel merge model enables the application to efficiently offload compute-intensive operations to the GPU. For comparison, the application is run sequentially and in parallel, with run times for each displayed in the application's output. The device where the code is run is also identified. -## Key implementation details -The basic DPC++ implementation explained in the code includes device selector, unified shared memory, kernel, and command groups. +The workgroup size requirement is 256. If your hardware cannot support this, the application will present an error. The Compressed Sparse Row (CSR) representation of a sparse matrix has three components:
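As a hedged illustration of the three CSR components (conventionally the nonzero values, the column index of each nonzero, and the per-row offsets; the sample's own identifiers may differ), a minimal sequential sparse matrix-vector multiply in plain C++:

```cpp
#include <cstddef>
#include <vector>

// Minimal CSR (Compressed Sparse Row) matrix: the three components are
// the nonzero values stored row by row, the column index of each
// nonzero, and offsets marking where each row's nonzeros begin and end.
struct Csr {
    std::vector<double> values;
    std::vector<int> col_indices;
    std::vector<int> row_offsets;  // size = number of rows + 1
};

// Sequential reference y = A * x; the sample's merge-based GPU kernel
// computes the same product with a different work partitioning.
std::vector<double> spmv(const Csr& a, const std::vector<double>& x) {
    std::vector<double> y(a.row_offsets.size() - 1, 0.0);
    for (std::size_t i = 0; i + 1 < a.row_offsets.size(); ++i)
        for (int k = a.row_offsets[i]; k < a.row_offsets[i + 1]; ++k)
            y[i] += a.values[k] * x[a.col_indices[k]];
    return y;
}
```

For example, the 2x3 matrix [[1, 0, 2], [0, 3, 0]] is stored as values {1, 2, 3}, column indices {0, 2, 1}, and row offsets {0, 2, 3}; multiplying it by x = [1, 1, 1] yields [3, 3].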
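The Floyd-Warshall inner loop quoted in the all-pairs-shortest-paths README above, g[i][j] = min(g[i][j], g[i][k] + g[k][j]), can be sketched as a sequential reference in plain C++ (a hedged sketch only; the sample's blocked DPC++ kernels split this work into the three phases described there):

```cpp
#include <algorithm>
#include <vector>

// Sequential Floyd-Warshall on an n x n distance matrix stored row-major
// in g; g[i * n + j] is the current best distance from i to j. Each k
// iteration relaxes every path through intermediate vertex k, which is
// exactly the g[i][j] = min(g[i][j], g[i][k] + g[k][j]) inner loop.
void floyd_warshall(std::vector<int>& g, int n) {
    for (int k = 0; k < n; ++k)
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                g[i * n + j] =
                    std::min(g[i * n + j], g[i * n + k] + g[k * n + j]);
}
```

On a 3-vertex graph where edges 0→1 and 1→2 each cost 1 and the direct 0→2 edge costs 5, the loop lowers the 0→2 distance to 2. Using a large finite weight (such as 50 here) for missing edges avoids the signed overflow a true INT_MAX sentinel would cause in the addition.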