
[SYCL][HIP] Poor Memory Bandwidth due to Unnecessary Memory Write Traffic, 50% slower than OpenSYCL #10624

@biergaizi

Describe the bug
When targeting AMD HIP gfx906, simple SYCL code compiled by DPC++ often shows poor memory bandwidth, around 50% below the hardware peak. Profiling shows that the memory write traffic is roughly 2.4x the expected amount for unclear reasons, and this amplification is suspected to be the cause.

To Reproduce

The problem seems to exist in many simple memory read-write kernels.

To illustrate the point, consider the following A[X] = A[X] * B[X] + C[X] vector triad kernel, test1.cpp.

#include <sycl/sycl.hpp>
#include <chrono>
#include <cstdio>

void benchmark(
	float *__restrict x,
	float *__restrict y,
	float *__restrict z,
	sycl::range<1> global_size,
	sycl::queue Q
)
{
	int timesteps = 1000;
	sycl::range local_size{1024};

	auto t1 = std::chrono::high_resolution_clock::now();

	// Launch the kernel timesteps times; the in-order queue runs them back-to-back.
	for (int i = 0; i < timesteps; i++) {
		Q.submit([&](sycl::handler &h) {
			h.parallel_for<class Bandwidth>(
				sycl::nd_range<1>{global_size, local_size}, [=](sycl::nd_item<1> item) {
					size_t i = item.get_global_id()[0];
					
					float local_x = x[i];
					float local_y = y[i];
					float local_z = z[i];

					local_x *= local_y;
					local_x += local_z;

					x[i] = local_x;
				}
			);
		});
	}
	Q.wait_and_throw();

	auto t2 = std::chrono::high_resolution_clock::now();
	double dt = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count() / 1e6;

	// Each work-item does three 4-byte loads (x, y, z) and one 4-byte store.
	size_t bytes_per_iteration = global_size[0] * 4 * sizeof(float);
	fprintf(stderr, "speed: %.0f MB/s\n",
		(double) bytes_per_iteration * timesteps / dt / 1024 / 1024
	);
}

int main(int argc, char** argv)
{
	sycl::queue Q({sycl::property::queue::in_order()});

	sycl::range global_size{256 * 256 * 256};

	size_t size = global_size[0];

	// Three 64 MiB shared-USM arrays; prefetch migrates them to the device up front.
	float *x = sycl::malloc_shared<float>(size, Q);
	Q.memset(x, 0, sizeof(float) * size);
	Q.prefetch(x, sizeof(float) * size);

	float *y = sycl::malloc_shared<float>(size, Q);
	Q.memset(y, 0, sizeof(float) * size);
	Q.prefetch(y, sizeof(float) * size);

	float *z = sycl::malloc_shared<float>(size, Q);
	Q.memset(z, 0, sizeof(float) * size);
	Q.prefetch(z, sizeof(float) * size);

	Q.wait_and_throw();

	benchmark(x, y, z, global_size, Q);

	return 0;
}

Compile the code via Intel DPC++ using the command clang++ test1.cpp -o test1_dpcpp.elf -fsycl -fsycl-targets=amdgcn-amd-amdhsa -Xsycl-target-backend --offload-arch=gfx906 --rocm-device-lib-path=/usr/lib/amdgcn/bitcode/ -Ofast -march=native -Wall -Wextra.

Running test1_dpcpp.elf shows:

speed: 486333 MB/s

The performance is roughly 50% below the hardware peak.

Compile the same code via OpenSYCL using the command syclcc test1.cpp -o test1_opensycl.elf --opensycl-targets=hip:gfx906 --rocm-device-lib-path=/usr/lib/amdgcn/bitcode/ -Ofast -march=native -Wall -Wextra --save-temps.

Running test1_opensycl.elf shows:

speed: 746092 MB/s

This is close to the realizable peak memory bandwidth on AMD Radeon Pro VII / Instinct MI50.
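
For reference, assuming the commonly quoted 1024 GB/s nominal HBM2 bandwidth of these cards (note that the benchmark's "MB/s" are mebibytes per second):

486333 MiB/s ≈ 510 GB/s, about 50% of the nominal peak (DPC++)
746092 MiB/s ≈ 782 GB/s, about 76% of the nominal peak (OpenSYCL)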

To profile the kernels via AMD rocprof, create a file named rocprof_input.txt with the following content:

pmc: Wavefronts
pmc: FETCH_SIZE
pmc: WRITE_SIZE
pmc: L2CacheHit

Then run rocprof -i rocprof_input.txt -o rocprof_dpcpp.csv ./test1_dpcpp.elf and rocprof -i rocprof_input.txt -o rocprof_opensycl.csv ./test1_opensycl.elf.

According to rocprof_opensycl.csv, around 65536 KiB is written to memory in each benchmark iteration, as expected for an FP32 array with 16777216 elements (16777216 x 4 bytes = 64 MiB). FETCH_SIZE is likewise around 196608 KiB, matching the three input arrays (3 x 65536 KiB). The L2CacheHit rate is around 0%, as expected, since there is no data reuse.

Wavefronts         FETCH_SIZE         WRITE_SIZE        L2CacheHit
131072.0000000000  1916.1250000000    65536.0000000000  0.8920779599
131072.0000000000  1938.1250000000    65536.0000000000  0.2148528725
131072.0000000000  2099.6875000000    65536.0000000000  0.2160909461
262144.0000000000  196851.1250000000  65536.0000000000  0.0043812751
262144.0000000000  196863.4375000000  65536.0000000000  0.0021905332
262144.0000000000  196892.8125000000  65536.0000000000  0.0021903406
262144.0000000000  196888.1875000000  65536.0000000000  0.0021904032
262144.0000000000  196890.9375000000  65536.0000000000  0.0021903774
262144.0000000000  196888.4375000000  65536.0000000000  0.0021903733

But according to rocprof_dpcpp.csv, around 159336 KiB is written to memory in each benchmark iteration (excluding the first three rows, which are the memset() calls). This is roughly 2.4x the theoretical value (159336 / 65536 ≈ 2.43). The L2CacheHit rate is around 48%, also indicating that there are somehow redundant loads or stores; a store-only variant for narrowing this down is sketched after the table.

Wavefronts         FETCH_SIZE         WRITE_SIZE         L2CacheHit
131072.0000000000  1996.3750000000    65536.0000000000   0.6253965377
131072.0000000000  1858.9375000000    65536.0000000000   0.2319405694
131072.0000000000  1827.9375000000    65536.0000000000   0.2351738241
262144.0000000000  196781.1875000000  159211.2187500000  49.2742437845
262144.0000000000  196806.5000000000  159336.3125000000  48.5368960684
262144.0000000000  196822.6250000000  160237.6875000000  48.5208721942
262144.0000000000  196833.9375000000  158601.8750000000  48.5464310007
262144.0000000000  196827.8125000000  159001.8125000000  48.5544115839
262144.0000000000  196823.6250000000  161945.0625000000  48.5346288889
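
To help isolate whether the extra write traffic comes from the loads or from the single store, a store-only kernel can be profiled with the same rocprof_input.txt. Below is a minimal sketch, assuming the same shared-USM setup as test1.cpp; the file name store_only.cpp and the kernel name StoreOnly are hypothetical and not part of the measurements above:

#include <sycl/sycl.hpp>

int main()
{
	sycl::queue Q({sycl::property::queue::in_order()});

	sycl::range global_size{256 * 256 * 256};
	sycl::range local_size{1024};

	// Same 64 MiB shared-USM array as in test1.cpp.
	float *x = sycl::malloc_shared<float>(global_size[0], Q);
	Q.memset(x, 0, sizeof(float) * global_size[0]).wait();

	// Pure streaming store: there are no loads, so FETCH_SIZE should stay
	// near zero and WRITE_SIZE should be exactly 65536 KiB for the dispatch.
	Q.submit([&](sycl::handler &h) {
		h.parallel_for<class StoreOnly>(
			sycl::nd_range<1>{global_size, local_size},
			[=](sycl::nd_item<1> item) {
				x[item.get_global_id()[0]] = 1.0f;
			}
		);
	}).wait_and_throw();

	sycl::free(x, Q);
	return 0;
}

If WRITE_SIZE is still amplified for this kernel, the extra traffic is produced by the store path alone; if it is exactly 65536 KiB, the redundancy involves the loads, for example whole-cache-line read-modify-write on the stored array.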

Environment (please complete the following information):

  • OS: Linux 6.4.3
  • Target device and vendor: AMD HIP gfx906 (Radeon Pro VII / Instinct MI50)
  • DPC++ version: clang version 17.0.0 (https://github.com/intel/llvm.git 8ea3e8eb65b863dfacb3c970d4403ae322e8d02e)
  • Dependencies version: ROCm 5.4.3, HIP 5.4.3.
