
[SYCL][HIP] Poor Memory Bandwidth due to Unnecessary Memory Write Traffic, 50% slower than OpenSYCL #10624

@biergaizi

Describe the bug
When targeting AMD HIP gfx906, simple SYCL code compiled by DPC++ often shows poor memory bandwidth, around 50% below the hardware peak. Profiling shows that the memory write traffic is roughly 2.4x the expected amount for unclear reasons, and this amplification is suspected to be the cause.

To Reproduce

The problem seems to exist in many simple memory read-write kernels.

To illustrate the point, consider the following A[X] = A[X] * B[X] + C[X] vector triad kernel, test1.cpp.

#include <sycl/sycl.hpp>
#include <chrono>
#include <cstdio>

void benchmark(
	float *__restrict x,
	float *__restrict y,
	float *__restrict z,
	sycl::range<1> global_size,
	sycl::queue Q
)
{
	int timesteps = 1000;
	sycl::range local_size{1024};

	auto t1 = std::chrono::high_resolution_clock::now();

	// Launch the kernel timesteps times; the in-order queue runs them back-to-back.
	for (int i = 0; i < timesteps; i++) {
		Q.submit([&](sycl::handler &h) {
			h.parallel_for<class Bandwidth>(
				sycl::nd_range<1>{global_size, local_size}, [=](sycl::nd_item<1> item) {
					size_t i = item.get_global_id()[0];
					
					float local_x = x[i];
					float local_y = y[i];
					float local_z = z[i];

					local_x *= local_y;
					local_x += local_z;

					x[i] = local_x;
				}
			);
		});
	}
	Q.wait_and_throw();

	auto t2 = std::chrono::high_resolution_clock::now();
	double dt = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count() / 1e6;

	// Each work-item does three 4-byte loads (x, y, z) and one 4-byte store.
	size_t bytes_per_iteration = global_size[0] * 4 * sizeof(float);
	fprintf(stderr, "speed: %.0f MB/s\n",
		(double) bytes_per_iteration * timesteps / dt / 1024 / 1024
	);
}

int main(int argc, char** argv)
{
	sycl::queue Q({sycl::property::queue::in_order()});

	sycl::range global_size{256 * 256 * 256};

	size_t size = global_size[0];

	// Three 64 MiB shared-USM arrays; prefetch migrates them to the device up front.
	float *x = sycl::malloc_shared<float>(size, Q);
	Q.memset(x, 0, sizeof(float) * size);
	Q.prefetch(x, sizeof(float) * size);

	float *y = sycl::malloc_shared<float>(size, Q);
	Q.memset(y, 0, sizeof(float) * size);
	Q.prefetch(y, sizeof(float) * size);

	float *z = sycl::malloc_shared<float>(size, Q);
	Q.memset(z, 0, sizeof(float) * size);
	Q.prefetch(z, sizeof(float) * size);

	Q.wait_and_throw();

	benchmark(x, y, z, global_size, Q);

	return 0;
}

Compile the code via Intel DPC++ using the command clang++ test1.cpp -o test1_dpcpp.elf -fsycl -fsycl-targets=amdgcn-amd-amdhsa -Xsycl-target-backend --offload-arch=gfx906 --rocm-device-lib-path=/usr/lib/amdgcn/bitcode/ -Ofast -march=native -Wall -Wextra.

Running test1_dpcpp.elf shows:

speed: 486333 MB/s

The performance is roughly 50% below the hardware peak.

Compile the same code via OpenSYCL using the command syclcc test1.cpp -o test1_opensycl.elf --opensycl-targets=hip:gfx906 --rocm-device-lib-path=/usr/lib/amdgcn/bitcode/ -Ofast -march=native -Wall -Wextra --save-temps.

Running test1_opensycl.elf shows:

speed: 746092 MB/s

This is close to the realizable peak memory bandwidth on AMD Radeon Pro VII / Instinct MI50.
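
For reference, assuming the commonly quoted 1024 GB/s nominal HBM2 bandwidth of these cards (note that the benchmark's "MB/s" are mebibytes per second):

486333 MiB/s ≈ 510 GB/s, about 50% of the nominal peak (DPC++)
746092 MiB/s ≈ 782 GB/s, about 76% of the nominal peak (OpenSYCL)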

To profile the kernels via AMD rocprof, create a file named rocprof_input.txt with the following content:

pmc: Wavefronts
pmc: FETCH_SIZE
pmc: WRITE_SIZE
pmc: L2CacheHit

Then run rocprof -i rocprof_input.txt -o rocprof_dpcpp.csv ./test1_dpcpp.elf and rocprof -i rocprof_input.txt -o rocprof_opensycl.csv ./test1_opensycl.elf.

According to rocprof_opensycl.csv, around 65536 KiB is written to memory in each benchmark iteration, as expected for an FP32 array with 16777216 elements (16777216 x 4 bytes = 64 MiB). FETCH_SIZE is likewise around 196608 KiB, matching the three input arrays (3 x 65536 KiB). The L2CacheHit rate is around 0%, as expected, since there is no data reuse.

Wavefronts         FETCH_SIZE         WRITE_SIZE        L2CacheHit
131072.0000000000  1916.1250000000    65536.0000000000  0.8920779599
131072.0000000000  1938.1250000000    65536.0000000000  0.2148528725
131072.0000000000  2099.6875000000    65536.0000000000  0.2160909461
262144.0000000000  196851.1250000000  65536.0000000000  0.0043812751
262144.0000000000  196863.4375000000  65536.0000000000  0.0021905332
262144.0000000000  196892.8125000000  65536.0000000000  0.0021903406
262144.0000000000  196888.1875000000  65536.0000000000  0.0021904032
262144.0000000000  196890.9375000000  65536.0000000000  0.0021903774
262144.0000000000  196888.4375000000  65536.0000000000  0.0021903733

But according to rocprof_dpcpp.csv, around 159336 KiB is written to memory in each benchmark iteration (excluding the first three rows, which are the memset() calls). This is roughly 2.4x the theoretical value (159336 / 65536 ≈ 2.43). The L2CacheHit rate is around 48%, also indicating that there are somehow redundant loads or stores; a store-only variant for narrowing this down is sketched after the table.

Wavefronts         FETCH_SIZE         WRITE_SIZE         L2CacheHit
131072.0000000000  1996.3750000000    65536.0000000000   0.6253965377
131072.0000000000  1858.9375000000    65536.0000000000   0.2319405694
131072.0000000000  1827.9375000000    65536.0000000000   0.2351738241
262144.0000000000  196781.1875000000  159211.2187500000  49.2742437845
262144.0000000000  196806.5000000000  159336.3125000000  48.5368960684
262144.0000000000  196822.6250000000  160237.6875000000  48.5208721942
262144.0000000000  196833.9375000000  158601.8750000000  48.5464310007
262144.0000000000  196827.8125000000  159001.8125000000  48.5544115839
262144.0000000000  196823.6250000000  161945.0625000000  48.5346288889
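
To help isolate whether the extra write traffic comes from the loads or from the single store, a store-only kernel can be profiled with the same rocprof_input.txt. Below is a minimal sketch, assuming the same shared-USM setup as test1.cpp; the file name store_only.cpp and the kernel name StoreOnly are hypothetical and not part of the measurements above:

#include <sycl/sycl.hpp>

int main()
{
	sycl::queue Q({sycl::property::queue::in_order()});

	sycl::range global_size{256 * 256 * 256};
	sycl::range local_size{1024};

	// Same 64 MiB shared-USM array as in test1.cpp.
	float *x = sycl::malloc_shared<float>(global_size[0], Q);
	Q.memset(x, 0, sizeof(float) * global_size[0]).wait();

	// Pure streaming store: there are no loads, so FETCH_SIZE should stay
	// near zero and WRITE_SIZE should be exactly 65536 KiB for the dispatch.
	Q.submit([&](sycl::handler &h) {
		h.parallel_for<class StoreOnly>(
			sycl::nd_range<1>{global_size, local_size},
			[=](sycl::nd_item<1> item) {
				x[item.get_global_id()[0]] = 1.0f;
			}
		);
	}).wait_and_throw();

	sycl::free(x, Q);
	return 0;
}

If WRITE_SIZE is still amplified for this kernel, the extra traffic is produced by the store path alone; if it is exactly 65536 KiB, the redundancy involves the loads, for example whole-cache-line read-modify-write on the stored array.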

Environment (please complete the following information):

  • OS: Linux 6.4.3
  • Target device and vendor: AMD HIP gfx906 (Radeon Pro VII / Instinct MI50)
  • DPC++ version: clang version 17.0.0 (https://github.com/intel/llvm.git 8ea3e8eb65b863dfacb3c970d4403ae322e8d02e)
  • Dependencies version: ROCm 5.4.3, HIP 5.4.3.
