Describe the bug
When targeting AMD HIP gfx906, simple SYCL code compiled by DPC++ often shows poor memory bandwidth, roughly 50% below the hardware peak. Profiling shows that the memory write traffic is inflated to roughly 2.4 times the theoretical volume for an unclear reason, and this amplification is suspected to be the cause.
To Reproduce
The problem seems to exist in many simple memory read-write kernels.
To illustrate the point, consider the following A[X] = A[X] * B[X] + C[X] vector triad kernel, test1.cpp:

```cpp
#include <sycl/sycl.hpp>

#include <chrono>
#include <cstdio>

void benchmark(
    float *__restrict x,
    float *__restrict y,
    float *__restrict z,
    sycl::range<1> global_size,
    sycl::queue Q
)
{
    int timesteps = 1000;
    sycl::range local_size{1024};

    auto t1 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < timesteps; i++) {
        Q.submit([&](sycl::handler &h) {
            h.parallel_for<class Bandwidth>(
                sycl::nd_range<1>{global_size, local_size},
                [=](sycl::nd_item<1> item) {
                    size_t i = item.get_global_id()[0];
                    float local_x = x[i];
                    float local_y = y[i];
                    float local_z = z[i];
                    local_x *= local_y;
                    local_x += local_z;
                    x[i] = local_x;
                }
            );
        });
    }
    Q.wait_and_throw();
    auto t2 = std::chrono::high_resolution_clock::now();

    double dt = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count() / 1e6;
    // Each work-item performs 3 loads and 1 store of one 4-byte float.
    size_t bytes_per_iteration = global_size[0] * 4 * sizeof(float);
    fprintf(stderr, "speed: %.0f MB/s\n",
        (double) bytes_per_iteration * timesteps / dt / 1024 / 1024
    );
}

int main(int argc, char** argv)
{
    sycl::queue Q({sycl::property::queue::in_order()});
    sycl::range global_size{256 * 256 * 256};
    size_t size = global_size[0];

    // Zero and prefetch the shared USM arrays so the benchmark measures
    // steady-state device bandwidth rather than first-touch migration.
    float *x = sycl::malloc_shared<float>(size, Q);
    Q.memset(x, 0, sizeof(float) * size);
    Q.prefetch(x, sizeof(float) * size);
    float *y = sycl::malloc_shared<float>(size, Q);
    Q.memset(y, 0, sizeof(float) * size);
    Q.prefetch(y, sizeof(float) * size);
    float *z = sycl::malloc_shared<float>(size, Q);
    Q.memset(z, 0, sizeof(float) * size);
    Q.prefetch(z, sizeof(float) * size);
    Q.wait_and_throw();

    benchmark(x, y, z, global_size, Q);
    return 0;
}
```

Compile the code via Intel DPC++ using the command:

```
clang++ test1.cpp -o test1_dpcpp.elf -fsycl -fsycl-targets=amdgcn-amd-amdhsa -Xsycl-target-backend --offload-arch=gfx906 --rocm-device-lib-path=/usr/lib/amdgcn/bitcode/ -Ofast -march=native -Wall -Wextra
```
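Before benchmarking, it may also be worth confirming that the queue really selects the gfx906 device (a minimal check of mine using standard SYCL 2020 device queries; it is not part of the original reproduction):

```cpp
#include <sycl/sycl.hpp>
#include <iostream>

int main()
{
    // Same queue construction as in test1.cpp; which device is picked
    // depends on the available backends and the device selector in use.
    sycl::queue Q({sycl::property::queue::in_order()});
    std::cout << "Running on: "
              << Q.get_device().get_info<sycl::info::device::name>() << "\n";
    return 0;
}
```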
Running test1_dpcpp.elf shows:

```
speed: 486333 MB/s
```

This is only about half of the hardware peak memory bandwidth.
Compile the same code via OpenSYCL using the command:

```
syclcc test1.cpp -o test1_opensycl.elf --opensycl-targets=hip:gfx906 --rocm-device-lib-path=/usr/lib/amdgcn/bitcode/ -Ofast -march=native -Wall -Wextra --save-temps
```

Running test1_opensycl.elf shows:

```
speed: 746092 MB/s
```

This is close to the realizable peak memory bandwidth of the AMD Radeon Pro VII / Instinct MI50.
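For context, converting the reported figures (my arithmetic, assuming the printed MB are MiB and taking the nominal 1024 GB/s HBM2 bandwidth of the Radeon Pro VII / Instinct MI50 as peak):

$$486333\ \text{MiB/s} \times 2^{20}\ \text{B/MiB} \approx 510\ \text{GB/s} \approx 50\%\ \text{of}\ 1024\ \text{GB/s}$$

$$746092\ \text{MiB/s} \times 2^{20}\ \text{B/MiB} \approx 782\ \text{GB/s} \approx 76\%\ \text{of}\ 1024\ \text{GB/s}$$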
To profile the kernels via AMD rocprof, create a file named rocprof_input.txt with the following content:

```
pmc: Wavefronts
pmc: FETCH_SIZE
pmc: WRITE_SIZE
pmc: L2CacheHit
```

Then run:

```
rocprof -i rocprof_input.txt -o rocprof_dpcpp.csv ./test1_dpcpp.elf
rocprof -i rocprof_input.txt -o rocprof_opensycl.csv ./test1_opensycl.elf
```
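For reference, the expected per-launch traffic follows directly from the array size (my back-of-the-envelope figures; the counters appear to report KiB):

$$\text{WRITE\_SIZE} = 16777216 \times 4\ \text{B} = 64\ \text{MiB} = 65536\ \text{KiB per launch}$$

$$\text{FETCH\_SIZE} = 3 \times 65536\ \text{KiB} = 196608\ \text{KiB per launch}$$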
According to rocprof_opensycl.csv, around 65536 kilobytes are written to memory in each iteration, as expected for an FP32 array with 16777216 elements (the first three rows are the memset() calls). The L2CacheHit rate is close to 0%, as expected, since there is no data reuse.
```
Wavefronts         FETCH_SIZE         WRITE_SIZE        L2CacheHit
131072.0000000000    1916.1250000000  65536.0000000000  0.8920779599
131072.0000000000    1938.1250000000  65536.0000000000  0.2148528725
131072.0000000000    2099.6875000000  65536.0000000000  0.2160909461
262144.0000000000  196851.1250000000  65536.0000000000  0.0043812751
262144.0000000000  196863.4375000000  65536.0000000000  0.0021905332
262144.0000000000  196892.8125000000  65536.0000000000  0.0021903406
262144.0000000000  196888.1875000000  65536.0000000000  0.0021904032
262144.0000000000  196890.9375000000  65536.0000000000  0.0021903774
262144.0000000000  196888.4375000000  65536.0000000000  0.0021903733
```
But according to rocprof_dpcpp.csv, around 159336 kilobytes are written to memory in each iteration (again excluding the first three rows, which are the memset() calls). This is roughly 2.4 times the theoretical value. The L2CacheHit rate is around 48%, which also indicates that redundant loads or stores are happening somewhere.
```
Wavefronts         FETCH_SIZE         WRITE_SIZE         L2CacheHit
131072.0000000000    1996.3750000000   65536.0000000000   0.6253965377
131072.0000000000    1858.9375000000   65536.0000000000   0.2319405694
131072.0000000000    1827.9375000000   65536.0000000000   0.2351738241
262144.0000000000  196781.1875000000  159211.2187500000  49.2742437845
262144.0000000000  196806.5000000000  159336.3125000000  48.5368960684
262144.0000000000  196822.6250000000  160237.6875000000  48.5208721942
262144.0000000000  196833.9375000000  158601.8750000000  48.5464310007
262144.0000000000  196827.8125000000  159001.8125000000  48.5544115839
262144.0000000000  196823.6250000000  161945.0625000000  48.5346288889
```
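If it helps triage, the write path can be isolated from the loads with a store-only kernel profiled using the same rocprof input (a minimal diagnostic sketch of mine, not part of the original reproduction; if WRITE_SIZE is still inflated here, the amplification comes from the store path alone):

```cpp
#include <sycl/sycl.hpp>

int main()
{
    sycl::queue Q({sycl::property::queue::in_order()});
    sycl::range<1> global_size{256 * 256 * 256};
    float *x = sycl::malloc_shared<float>(global_size[0], Q);

    Q.parallel_for<class StoreOnly>(
        sycl::nd_range<1>{global_size, sycl::range<1>{1024}},
        [=](sycl::nd_item<1> item) {
            // One 4-byte store per work-item: the expected WRITE_SIZE is
            // 16777216 * 4 B = 65536 KiB per launch, with no fetch traffic.
            x[item.get_global_id()[0]] = 1.0f;
        });
    Q.wait_and_throw();

    sycl::free(x, Q);
    return 0;
}
```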
Environment (please complete the following information):
- OS: Linux 6.4.3
- Target device and vendor: AMD HIP gfx906 (Radeon Pro VII / Instinct MI50)
- DPC++ version: clang version 17.0.0 (https://github.com/intel/llvm.git 8ea3e8eb65b863dfacb3c970d4403ae322e8d02e)
- Dependencies version: ROCm 5.4.3, HIP 5.4.3