diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_cuda_tex_cache_read.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_cuda_tex_cache_read.asciidoc new file mode 100644 index 0000000000000..8ed8c738651e4 --- /dev/null +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_cuda_tex_cache_read.asciidoc @@ -0,0 +1,118 @@ += sycl_ext_oneapi_cuda_tex_cache_read + +:source-highlighter: coderay +:coderay-linenums-mode: table + +// This section needs to be after the document title. +:doctype: book +:toc2: +:toc: left +:encoding: utf-8 +:lang: en +:dpcpp: pass:[DPC++] + +// Set the default source code type in this document to C++, +// for syntax highlighting purposes. This is needed because +// docbook uses c++ and html5 uses cpp. +:language: {basebackend@docbook:c++:cpp} + + +== Notice + +[%hardbreaks] +Copyright (C) 2022-2023 Intel Corporation. All rights reserved. + +Khronos(R) is a registered trademark and SYCL(TM) and SPIR(TM) are trademarks +of The Khronos Group Inc. OpenCL(TM) is a trademark of Apple Inc. used by +permission by Khronos. + + +== Contact + +To report problems with this extension, please open a new issue at: + +https://github.com/intel/llvm/issues + + +== Dependencies + +This extension is written against the SYCL 2020 revision 6 specification. All +references below to the "core SYCL specification" or to section numbers in the +SYCL specification refer to that revision. + + +== Status + +This is an experimental extension specification, intended to provide early +access to features and gather community feedback. Interfaces defined in this +specification are implemented in {dpcpp}, but they are not finalized and may +change incompatibly in future versions of {dpcpp} without prior notice. +*Shipping software products should not rely on APIs defined in this +specification.* + +== Backend support status + +This extension is supported by all backends. However, as described in the following sections, this extension is currently only functionally useful in the `ext_oneapi_cuda` backend. + + +== Overview + +When used with the `ext_oneapi_cuda` backend only, this extension exposes the https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-ld-global-nc[ld-global-nc] ptx instruction so that users can load a register variable to the non-coherent read-only texture cache. The texture cache is designed for random access reads and is most performant when work-items within a sub-group read a set of addresses of texture memory that are close to one another in the cache. This extension introduces a free function, `ldg`, that is available for all backends and devices; however this function will only make use of a special hardware feature (the texture cache) when called from the `ext_oneapi_cuda` backend. The only reason that `ldg` is supported on backends other than `ext_oneapi_cuda` is to allow for code portability. + +== Specification + +=== Feature test macro + +This extension provides a feature-test macro as described in the core SYCL +specification. An implementation supporting this extension must predefine the +macro `SYCL_EXT_ONEAPI_CUDA_TEX_CACHE_READ` to one of the values defined in the table +below. Applications can test for the existence of this macro to determine if +the implementation supports this feature, or applications can test the macro's +value to determine which of the extension's features the implementation +supports. + +[%header,cols="1,5"] +|=== +|Value +|Description + +|1 +|The APIs of this experimental extension are not versioned, so the + feature-test macro always has this value. +|=== + +=== `ldg` free function + +This extension adds a single templated free function which may be called from device +code. This function is not available in host code. + +``` +namespace sycl::ext::oneapi::experimental::cuda { + +template +T ldg(const T* ptr); + +} // namespace sycl::ext::oneapi::experimental::cuda +``` + +`ldg` returns the data of type `T` located at address `ptr`. When called from the `ext_oneapi_cuda` backend the data is cached in the read-only texture cache. +When called from any other backend a copy of the data stored at address `ptr` is returned without using any special cache. + +The template parameter `T` can be one of `char`, `signed char`, `short`, `int`, `long`, `long long`, `unsigned char`, `unsigned short`, `unsigned int`, `unsigned long`, `unsigned long long`, `vec`, `vec`, `vec`, `vec`, `vec`, `vec`, `vec`, `vec`, `vec`, `vec`, `vec`, `vec`, `vec`, `vec`, `float`, `vec`, `vec`, `double`, or `vec`. + +=== Example of usage + +``` + h.parallel_for(range, [=](sycl::nd_item<1> item) { + const int idx = item.get_global_id(0); + auto cachedA = sycl::ext::oneapi::experimental::cuda::ldg(&A[idx]); + auto cachedB = sycl::ext::oneapi::experimental::cuda::ldg(&B[idx]); + C[idx] = cachedA + cachedB; + }); +``` + +IMPORTANT: Sometimes the compiler may not be sure that the read-only condition is satisfied. In such cases users can mark the declaration of the pointer used as the argument to `ldg` with both the `const` and `$$__$$restrict$$__$$` qualifiers, which may aid the compiler in detecting the read-only condition. Additionally, data returned from `ldg`, e.g. `cacheA` and `cacheB` in the above example, should not be written to at any point within the kernel. If such data is written to at any point in the kernel, the code will compile and execute correctly, however the texture cache will not be used. + +=== Issues + +- Investigate exposing this functionality through annotated_ptr once the sycl_ext_oneapi_annotated_ptr becomes finalized. diff --git a/sycl/include/sycl/ext/oneapi/experimental/cuda/builtins.hpp b/sycl/include/sycl/ext/oneapi/experimental/cuda/builtins.hpp new file mode 100644 index 0000000000000..cdbf6ab642844 --- /dev/null +++ b/sycl/include/sycl/ext/oneapi/experimental/cuda/builtins.hpp @@ -0,0 +1,225 @@ +//==--- builtins.hpp - SYCL_ONEAPI_CUDA experimental builtins -------------==// +// +// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. +// See https://llvm.org/LICENSE.txt for license information. +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception +// +//===----------------------------------------------------------------------===// + +#pragma once + +#define SYCL_EXT_ONEAPI_CUDA_TEX_CACHE_READ 1 + +#include + +#if defined(_WIN32) || defined(_WIN64) +#define ATTRIBUTE_EXT_VEC_TYPE(N) __declspec(ext_vector_type(N)) +#else +#define ATTRIBUTE_EXT_VEC_TYPE(N) __attribute__((ext_vector_type(N))) +#endif + +namespace sycl { +__SYCL_INLINE_VER_NAMESPACE(_V1) { +namespace ext { +namespace oneapi { +namespace experimental { +namespace cuda { + +namespace detail { +using ldg_vector_types = sycl::detail::type_list< + sycl::char2, sycl::char4, sycl::short2, sycl::short4, sycl::int2, + sycl::int4, sycl::longlong2, sycl::uchar2, sycl::uchar4, sycl::ushort2, + sycl::ushort4, sycl::uint2, sycl::uint4, sycl::ulonglong2, sycl::float2, + sycl::float4, sycl::double2>; + +using ldg_types = + sycl::detail::type_list; +} // namespace detail + +template +inline __SYCL_ALWAYS_INLINE std::enable_if_t< + sycl::detail::is_contained< + T, sycl::ext::oneapi::experimental::cuda::detail::ldg_types>::value, + T> +ldg(const T *ptr) { +#if defined(__SYCL_DEVICE_ONLY__) +#if defined(__NVPTX__) + if constexpr (std::is_same_v) { + return __nvvm_ldg_c(ptr); + } else if constexpr (std::is_same_v) { + return __nvvm_ldg_s(ptr); + } else if constexpr (std::is_same_v) { + return __nvvm_ldg_i(ptr); + } else if constexpr (std::is_same_v) { + return __nvvm_ldg_l(ptr); + } else if constexpr (std::is_same_v) { + return __nvvm_ldg_ll(ptr); + } else if constexpr (std::is_same_v) { + return __nvvm_ldg_uc(ptr); + } else if constexpr (std::is_same_v) { + return __nvvm_ldg_us(ptr); + } else if constexpr (std::is_same_v) { + return __nvvm_ldg_ui(ptr); + } else if constexpr (std::is_same_v) { + return __nvvm_ldg_ul(ptr); + } else if constexpr (std::is_same_v) { + return __nvvm_ldg_ull(ptr); + } else if constexpr (std::is_same_v) { + return __nvvm_ldg_f(ptr); + } else if constexpr (std::is_same_v) { + return __nvvm_ldg_d(ptr); + } else if constexpr (std::is_same_v) { + // We can assume that ptr is aligned at least to char2's alignment, but the + // load will assume that ptr is aligned to char2's alignment. This is only + // safe if alignof(f2) <= alignof(char2). + typedef char c2 ATTRIBUTE_EXT_VEC_TYPE(2); + c2 rv = __nvvm_ldg_c2(reinterpret_cast(ptr)); + sycl::char2 ret; + ret.x() = rv[0]; + ret.y() = rv[1]; + return ret; + } else if constexpr (std::is_same_v) { + typedef char c4 ATTRIBUTE_EXT_VEC_TYPE(4); + c4 rv = __nvvm_ldg_c4(reinterpret_cast(ptr)); + sycl::char4 ret; + ret.x() = rv[0]; + ret.y() = rv[1]; + ret.z() = rv[2]; + ret.w() = rv[3]; + return ret; + } else if constexpr (std::is_same_v) { + typedef short s2 ATTRIBUTE_EXT_VEC_TYPE(2); + s2 rv = __nvvm_ldg_s2(reinterpret_cast(ptr)); + sycl::short2 ret; + ret.x() = rv[0]; + ret.y() = rv[1]; + return ret; + } else if constexpr (std::is_same_v) { + typedef short s4 ATTRIBUTE_EXT_VEC_TYPE(4); + s4 rv = __nvvm_ldg_s4(reinterpret_cast(ptr)); + sycl::short4 ret; + ret.x() = rv[0]; + ret.y() = rv[1]; + ret.z() = rv[2]; + ret.w() = rv[3]; + return ret; + } else if constexpr (std::is_same_v) { + typedef int i2 ATTRIBUTE_EXT_VEC_TYPE(2); + i2 rv = __nvvm_ldg_i2(reinterpret_cast(ptr)); + sycl::int2 ret; + ret.x() = rv[0]; + ret.y() = rv[1]; + return ret; + } else if constexpr (std::is_same_v) { + typedef int i4 ATTRIBUTE_EXT_VEC_TYPE(4); + i4 rv = __nvvm_ldg_i4(reinterpret_cast(ptr)); + sycl::int4 ret; + ret.x() = rv[0]; + ret.y() = rv[1]; + ret.z() = rv[2]; + ret.w() = rv[3]; + return ret; + } else if constexpr (std::is_same_v) { + typedef long long ll2 ATTRIBUTE_EXT_VEC_TYPE(2); + ll2 rv = __nvvm_ldg_ll2(reinterpret_cast(ptr)); + sycl::longlong2 ret; + ret.x() = rv[0]; + ret.y() = rv[1]; + return ret; + } else if constexpr (std::is_same_v) { + typedef unsigned char uc2 ATTRIBUTE_EXT_VEC_TYPE(2); + uc2 rv = __nvvm_ldg_uc2(reinterpret_cast(ptr)); + sycl::uchar2 ret; + ret.x() = rv[0]; + ret.y() = rv[1]; + return ret; + } else if constexpr (std::is_same_v) { + typedef unsigned char uc4 ATTRIBUTE_EXT_VEC_TYPE(4); + uc4 rv = __nvvm_ldg_uc4(reinterpret_cast(ptr)); + sycl::uchar4 ret; + ret.x() = rv[0]; + ret.y() = rv[1]; + ret.z() = rv[2]; + ret.w() = rv[3]; + return ret; + } else if constexpr (std::is_same_v) { + typedef unsigned short us2 ATTRIBUTE_EXT_VEC_TYPE(2); + us2 rv = __nvvm_ldg_us2(reinterpret_cast(ptr)); + sycl::ushort2 ret; + ret.x() = rv[0]; + ret.y() = rv[1]; + return ret; + } else if constexpr (std::is_same_v) { + typedef unsigned short us4 ATTRIBUTE_EXT_VEC_TYPE(4); + us4 rv = __nvvm_ldg_us4(reinterpret_cast(ptr)); + sycl::ushort4 ret; + ret.x() = rv[0]; + ret.y() = rv[1]; + ret.z() = rv[2]; + ret.w() = rv[3]; + return ret; + } else if constexpr (std::is_same_v) { + typedef unsigned int ui2 ATTRIBUTE_EXT_VEC_TYPE(2); + ui2 rv = __nvvm_ldg_ui2(reinterpret_cast(ptr)); + sycl::uint2 ret; + ret.x() = rv[0]; + ret.y() = rv[1]; + return ret; + } else if constexpr (std::is_same_v) { + typedef unsigned int ui4 ATTRIBUTE_EXT_VEC_TYPE(4); + ui4 rv = __nvvm_ldg_ui4(reinterpret_cast(ptr)); + sycl::uint4 ret; + ret.x() = rv[0]; + ret.y() = rv[1]; + ret.z() = rv[2]; + ret.w() = rv[3]; + return ret; + } else if constexpr (std::is_same_v) { + typedef unsigned long long ull2 ATTRIBUTE_EXT_VEC_TYPE(2); + ull2 rv = __nvvm_ldg_ull2(reinterpret_cast(ptr)); + sycl::ulonglong2 ret; + ret.x() = rv[0]; + ret.y() = rv[1]; + return ret; + } else if constexpr (std::is_same_v) { + typedef float f2 ATTRIBUTE_EXT_VEC_TYPE(2); + f2 rv = __nvvm_ldg_f2(reinterpret_cast(ptr)); + sycl::float2 ret; + ret.x() = rv[0]; + ret.y() = rv[1]; + return ret; + } else if constexpr (std::is_same_v) { + typedef float f4 ATTRIBUTE_EXT_VEC_TYPE(4); + f4 rv = __nvvm_ldg_f4(reinterpret_cast(ptr)); + sycl::float4 ret; + ret.x() = rv[0]; + ret.y() = rv[1]; + ret.z() = rv[2]; + ret.w() = rv[3]; + return ret; + } else if constexpr (std::is_same_v) { + typedef double d2 ATTRIBUTE_EXT_VEC_TYPE(2); + d2 rv = __nvvm_ldg_d2(reinterpret_cast(ptr)); + sycl::double2 ret; + ret.x() = rv[0]; + ret.y() = rv[1]; + return ret; + } +#else + return *ptr; +#endif +#else + throw runtime_error("ldg is not supported on host.", PI_ERROR_INVALID_DEVICE); +#endif +} + +#undef ATTRIBUTE_EXT_VEC_TYPE + +} // namespace cuda +} // namespace experimental +} // namespace oneapi +} // namespace ext +} // __SYCL_INLINE_VER_NAMESPACE(_V1) +} // namespace sycl diff --git a/sycl/test/check_device_code/cuda/ldg.cpp b/sycl/test/check_device_code/cuda/ldg.cpp new file mode 100644 index 0000000000000..2cddb4fc8818d --- /dev/null +++ b/sycl/test/check_device_code/cuda/ldg.cpp @@ -0,0 +1,192 @@ +// REQUIRES: cuda + +// RUN: %clangxx -Xclang -no-opaque-pointers -fsycl-device-only -fsycl-targets=nvptx64-nvidia-cuda -S -Xclang -emit-llvm %s -o -| FileCheck %s +// RUN: %clangxx -Xclang -opaque-pointers -fsycl-device-only -fsycl-targets=nvptx64-nvidia-cuda -S -Xclang -emit-llvm %s -o -| FileCheck %s --check-prefixes=CHECK-OPAQUE + +#include +#include + +using namespace sycl; +using namespace sycl::ext::oneapi::experimental::cuda; +using namespace sycl::ext::oneapi::experimental; + +int main() { + + sycl::queue q; + + auto *in_c = sycl::malloc_device(1, q); + auto *in_s = sycl::malloc_device(1, q); + auto *in_i = sycl::malloc_device(1, q); + auto *in_l = sycl::malloc_device(1, q); + auto *in_ll = sycl::malloc_device(1, q); + + auto *in_uc = sycl::malloc_device(1, q); + auto *in_us = sycl::malloc_device(1, q); + auto *in_ui = sycl::malloc_device(1, q); + auto *in_ul = sycl::malloc_device(1, q); + auto *in_ull = sycl::malloc_device(1, q); + + auto *in_c2 = sycl::malloc_device(1, q); + auto *in_s2 = sycl::malloc_device(1, q); + auto *in_i2 = sycl::malloc_device(1, q); + auto *in_ll2 = sycl::malloc_device(1, q); + + auto *in_c4 = sycl::malloc_device(1, q); + auto *in_s4 = sycl::malloc_device(1, q); + auto *in_i4 = sycl::malloc_device(1, q); + + auto *in_uc2 = sycl::malloc_device(1, q); + auto *in_us2 = sycl::malloc_device(1, q); + auto *in_ui2 = sycl::malloc_device(1, q); + auto *in_ull2 = sycl::malloc_device(1, q); + + auto *in_uc4 = sycl::malloc_device(1, q); + auto *in_us4 = sycl::malloc_device(1, q); + auto *in_ui4 = sycl::malloc_device(1, q); + + auto *in_f = sycl::malloc_device(1, q); + auto *in_d = sycl::malloc_device(1, q); + + auto *in_f2 = sycl::malloc_device(1, q); + auto *in_d2 = sycl::malloc_device(1, q); + + auto *in_f4 = sycl::malloc_device(1, q); + + auto *out_d = sycl::malloc_device(1, q); + + q.wait(); + + q.submit([=](sycl::handler &h) { + h.single_task([=] { + //CHECK: tail call float @llvm.nvvm.ldg.global.f.f32.p0f32(float* %{{.*}}, i32 4) + //CHECK-OPAQUE: tail call float @llvm.nvvm.ldg.global.f.f32.p0(ptr %{{.*}}, i32 4) + auto cached_f = ldg(&in_f[0]); + //CHECK: tail call double @llvm.nvvm.ldg.global.f.f64.p0f64(double* %{{.*}}, i32 8) + //CHECK-OPAQUE: tail call double @llvm.nvvm.ldg.global.f.f64.p0(ptr %{{.*}}, i32 8) + auto cached_d = ldg(&in_d[0]); + + //CHECK: tail call <2 x float> @llvm.nvvm.ldg.global.f.v2f32.p0v2f32(<2 x float>* %{{.*}}, i32 8) + //CHECK-OPAQUE: tail call <2 x float> @llvm.nvvm.ldg.global.f.v2f32.p0(ptr %{{.*}}, i32 8) + auto cached_f2 = ldg(&in_f2[0]); + //CHECK: tail call <2 x double> @llvm.nvvm.ldg.global.f.v2f64.p0v2f64(<2 x double>* %{{.*}}, i32 16) + //CHECK-OPAQUE: tail call <2 x double> @llvm.nvvm.ldg.global.f.v2f64.p0(ptr %{{.*}}, i32 16) + auto cached_d2 = ldg(&in_d2[0]); + //CHECK: tail call <4 x float> @llvm.nvvm.ldg.global.f.v4f32.p0v4f32(<4 x float>* %{{.*}}, i32 16) + //CHECK-OPAQUE: tail call <4 x float> @llvm.nvvm.ldg.global.f.v4f32.p0(ptr %{{.*}}, i32 16) + auto cached_f4 = ldg(&in_f4[0]); + + // Unsigned variants are identical to signed variants, but this leads to + // correct behavior. + + //CHECK: tail call i8 @llvm.nvvm.ldg.global.i.i8.p0i8(i8* %{{.*}}, i32 1) + //CHECK-OPAQUE: tail call i8 @llvm.nvvm.ldg.global.i.i8.p0(ptr %{{.*}}, i32 1) + auto cached_c = ldg(&in_c[0]); + //CHECK: tail call i16 @llvm.nvvm.ldg.global.i.i16.p0i16(i16* %{{.*}}, i32 2) + //CHECK-OPAQUE: tail call i16 @llvm.nvvm.ldg.global.i.i16.p0(ptr %{{.*}}, i32 2) + auto cached_s = ldg(&in_s[0]); + //CHECK: tail call i32 @llvm.nvvm.ldg.global.i.i32.p0i32(i32* %{{.*}}, i32 4) + //CHECK-OPAQUE: tail call i32 @llvm.nvvm.ldg.global.i.i32.p0(ptr %{{.*}}, i32 4) + auto cached_i = ldg(&in_i[0]); + //CHECK: tail call i64 @llvm.nvvm.ldg.global.i.i64.p0i64(i64* %{{.*}}, i32 8) + //CHECK-OPAQUE: tail call i64 @llvm.nvvm.ldg.global.i.i64.p0(ptr %{{.*}}, i32 8) + auto cached_l = ldg(&in_l[0]); + //CHECK: tail call i64 @llvm.nvvm.ldg.global.i.i64.p0i64(i64* %{{.*}}, i32 8) + //CHECK-OPAQUE: tail call i64 @llvm.nvvm.ldg.global.i.i64.p0(ptr %{{.*}}, i32 8) + auto cached_ll = ldg(&in_ll[0]); + //CHECK: tail call i8 @llvm.nvvm.ldg.global.i.i8.p0i8(i8* %{{.*}}, i32 1) + //CHECK-OPAQUE: tail call i8 @llvm.nvvm.ldg.global.i.i8.p0(ptr %{{.*}}, i32 1) + auto cached_uc = ldg(&in_uc[0]); + //CHECK: tail call i16 @llvm.nvvm.ldg.global.i.i16.p0i16(i16* %{{.*}}, i32 2) + //CHECK-OPAQUE: tail call i16 @llvm.nvvm.ldg.global.i.i16.p0(ptr %{{.*}}, i32 2) + auto cached_us = ldg(&in_us[0]); + //CHECK: tail call i32 @llvm.nvvm.ldg.global.i.i32.p0i32(i32* %{{.*}}, i32 4) + //CHECK-OPAQUE: tail call i32 @llvm.nvvm.ldg.global.i.i32.p0(ptr %{{.*}}, i32 4) + auto cached_ui = ldg(&in_ui[0]); + //CHECK: tail call i64 @llvm.nvvm.ldg.global.i.i64.p0i64(i64* %{{.*}}, i32 8) + //CHECK-OPAQUE: tail call i64 @llvm.nvvm.ldg.global.i.i64.p0(ptr %{{.*}}, i32 8) + auto cached_ul = ldg(&in_ul[0]); + //CHECK: tail call i64 @llvm.nvvm.ldg.global.i.i64.p0i64(i64* %{{.*}}, i32 8) + //CHECK-OPAQUE: tail call i64 @llvm.nvvm.ldg.global.i.i64.p0(ptr %{{.*}}, i32 8) + auto cached_ull = ldg(&in_ull[0]); + + //CHECK: tail call <2 x i8> @llvm.nvvm.ldg.global.i.v2i8.p0v2i8(<2 x i8>* %{{.*}}, i32 2) + //CHECK-OPAQUE: tail call <2 x i8> @llvm.nvvm.ldg.global.i.v2i8.p0(ptr %{{.*}}, i32 2) + auto cached_c2 = ldg(&in_c2[0]); + //CHECK: tail call <2 x i16> @llvm.nvvm.ldg.global.i.v2i16.p0v2i16(<2 x i16>* %{{.*}}, i32 4) + //CHECK-OPAQUE: tail call <2 x i16> @llvm.nvvm.ldg.global.i.v2i16.p0(ptr %{{.*}}, i32 4) + auto cached_s2 = ldg(&in_s2[0]); + //CHECK: tail call <2 x i32> @llvm.nvvm.ldg.global.i.v2i32.p0v2i32(<2 x i32>* %{{.*}}, i32 8) + //CHECK-OPAQUE: tail call <2 x i32> @llvm.nvvm.ldg.global.i.v2i32.p0(ptr %{{.*}}, i32 8) + auto cached_i2 = ldg(&in_i2[0]); + //CHECK: tail call <2 x i64> @llvm.nvvm.ldg.global.i.v2i64.p0v2i64(<2 x i64>* %{{.*}}, i32 16) + //CHECK-OPAQUE: tail call <2 x i64> @llvm.nvvm.ldg.global.i.v2i64.p0(ptr %{{.*}}, i32 16) + auto cached_ll2 = ldg(&in_ll2[0]); + //CHECK: tail call <2 x i8> @llvm.nvvm.ldg.global.i.v2i8.p0v2i8(<2 x i8>* %{{.*}}, i32 2) + //CHECK-OPAQUE: tail call <2 x i8> @llvm.nvvm.ldg.global.i.v2i8.p0(ptr %{{.*}}, i32 2) + auto cached_uc2 = ldg(&in_uc2[0]); + //CHECK: tail call <2 x i16> @llvm.nvvm.ldg.global.i.v2i16.p0v2i16(<2 x i16>* %{{.*}}, i32 4) + //CHECK-OPAQUE: tail call <2 x i16> @llvm.nvvm.ldg.global.i.v2i16.p0(ptr %{{.*}}, i32 4) + auto cached_us2 = ldg(&in_us2[0]); + //CHECK: tail call <2 x i32> @llvm.nvvm.ldg.global.i.v2i32.p0v2i32(<2 x i32>* %{{.*}}, i32 8) + //CHECK-OPAQUE: tail call <2 x i32> @llvm.nvvm.ldg.global.i.v2i32.p0(ptr %{{.*}}, i32 8) + auto cached_ui2 = ldg(&in_ui2[0]); + //CHECK: tail call <2 x i64> @llvm.nvvm.ldg.global.i.v2i64.p0v2i64(<2 x i64>* %{{.*}}, i32 16) + //CHECK-OPAQUE: tail call <2 x i64> @llvm.nvvm.ldg.global.i.v2i64.p0(ptr %{{.*}}, i32 16) + auto cached_ull2 = ldg(&in_ull2[0]); + + //CHECK: tail call <4 x i8> @llvm.nvvm.ldg.global.i.v4i8.p0v4i8(<4 x i8>* %{{.*}}, i32 4) + //CHECK-OPAQUE: tail call <4 x i8> @llvm.nvvm.ldg.global.i.v4i8.p0(ptr %{{.*}}, i32 4) + auto cached_c4 = ldg(&in_c4[0]); + //CHECK: tail call <4 x i16> @llvm.nvvm.ldg.global.i.v4i16.p0v4i16(<4 x i16>* %{{.*}}, i32 8) + //CHECK-OPAQUE: tail call <4 x i16> @llvm.nvvm.ldg.global.i.v4i16.p0(ptr %{{.*}}, i32 8) + auto cached_s4 = ldg(&in_s4[0]); + //CHECK: tail call <4 x i32> @llvm.nvvm.ldg.global.i.v4i32.p0v4i32(<4 x i32>* %{{.*}}, i32 16) + //CHECK-OPAQUE: tail call <4 x i32> @llvm.nvvm.ldg.global.i.v4i32.p0(ptr %{{.*}}, i32 16) + auto cached_i4 = ldg(&in_i4[0]); + + //CHECK: tail call <4 x i8> @llvm.nvvm.ldg.global.i.v4i8.p0v4i8(<4 x i8>* %{{.*}}, i32 4) + //CHECK-OPAQUE: tail call <4 x i8> @llvm.nvvm.ldg.global.i.v4i8.p0(ptr %{{.*}}, i32 4) + auto cached_uc4 = ldg(&in_uc4[0]); + //CHECK: tail call <4 x i16> @llvm.nvvm.ldg.global.i.v4i16.p0v4i16(<4 x i16>* %{{.*}}, i32 8) + //CHECK-OPAQUE: tail call <4 x i16> @llvm.nvvm.ldg.global.i.v4i16.p0(ptr %{{.*}}, i32 8) + auto cached_us4 = ldg(&in_us4[0]); + //CHECK: tail call <4 x i32> @llvm.nvvm.ldg.global.i.v4i32.p0v4i32(<4 x i32>* %{{.*}}, i32 16) + //CHECK-OPAQUE: tail call <4 x i32> @llvm.nvvm.ldg.global.i.v4i32.p0(ptr %{{.*}}, i32 16) + auto cached_ui4 = ldg(&in_ui4[0]); + }); + }); + + q.wait(); + + free(in_f, q); + free(in_d, q); + free(in_f2, q); + free(in_f4, q); + free(in_d2, q); + free(in_c, q); + free(in_s, q); + free(in_i, q); + free(in_l, q); + free(in_ll, q); + free(in_uc, q); + free(in_us, q); + free(in_ui, q); + free(in_ul, q); + free(in_ull, q); + free(in_c2, q); + free(in_s2, q); + free(in_i2, q); + free(in_ll2, q); + free(in_uc2, q); + free(in_us2, q); + free(in_ui2, q); + free(in_ull2, q); + free(in_c4, q); + free(in_s4, q); + free(in_i4, q); + free(in_uc4, q); + free(in_us4, q); + free(in_ui4, q); + + return 0; +}; diff --git a/sycl/test/check_device_code/matrix/matrix-nvptx-bf16-test.cpp b/sycl/test/check_device_code/cuda/matrix/matrix-nvptx-bf16-test.cpp similarity index 100% rename from sycl/test/check_device_code/matrix/matrix-nvptx-bf16-test.cpp rename to sycl/test/check_device_code/cuda/matrix/matrix-nvptx-bf16-test.cpp diff --git a/sycl/test/check_device_code/matrix/matrix-nvptx-bfloat16-test.cpp b/sycl/test/check_device_code/cuda/matrix/matrix-nvptx-bfloat16-test.cpp similarity index 100% rename from sycl/test/check_device_code/matrix/matrix-nvptx-bfloat16-test.cpp rename to sycl/test/check_device_code/cuda/matrix/matrix-nvptx-bfloat16-test.cpp diff --git a/sycl/test/check_device_code/matrix/matrix-nvptx-double-test.cpp b/sycl/test/check_device_code/cuda/matrix/matrix-nvptx-double-test.cpp similarity index 100% rename from sycl/test/check_device_code/matrix/matrix-nvptx-double-test.cpp rename to sycl/test/check_device_code/cuda/matrix/matrix-nvptx-double-test.cpp diff --git a/sycl/test/check_device_code/matrix/matrix-nvptx-half-float-test.cpp b/sycl/test/check_device_code/cuda/matrix/matrix-nvptx-half-float-test.cpp similarity index 100% rename from sycl/test/check_device_code/matrix/matrix-nvptx-half-float-test.cpp rename to sycl/test/check_device_code/cuda/matrix/matrix-nvptx-half-float-test.cpp diff --git a/sycl/test/check_device_code/matrix/matrix-nvptx-half-half-test.cpp b/sycl/test/check_device_code/cuda/matrix/matrix-nvptx-half-half-test.cpp similarity index 100% rename from sycl/test/check_device_code/matrix/matrix-nvptx-half-half-test.cpp rename to sycl/test/check_device_code/cuda/matrix/matrix-nvptx-half-half-test.cpp diff --git a/sycl/test/check_device_code/matrix/matrix-nvptx-int8-test.cpp b/sycl/test/check_device_code/cuda/matrix/matrix-nvptx-int8-test.cpp similarity index 100% rename from sycl/test/check_device_code/matrix/matrix-nvptx-int8-test.cpp rename to sycl/test/check_device_code/cuda/matrix/matrix-nvptx-int8-test.cpp diff --git a/sycl/test/check_device_code/matrix/matrix-nvptx-tf32-test.cpp b/sycl/test/check_device_code/cuda/matrix/matrix-nvptx-tf32-test.cpp similarity index 100% rename from sycl/test/check_device_code/matrix/matrix-nvptx-tf32-test.cpp rename to sycl/test/check_device_code/cuda/matrix/matrix-nvptx-tf32-test.cpp diff --git a/sycl/test/check_device_code/matrix/matrix-nvptx-uint8-test.cpp b/sycl/test/check_device_code/cuda/matrix/matrix-nvptx-uint8-test.cpp similarity index 100% rename from sycl/test/check_device_code/matrix/matrix-nvptx-uint8-test.cpp rename to sycl/test/check_device_code/cuda/matrix/matrix-nvptx-uint8-test.cpp