Description
Describe the bug
On a machine with 4 NVIDIA GPUs per node, querying the free_memory
extension fails because the underlying CUDA call needs an active context. There also seem to be issues related to the global_mem_size
descriptor.
Also reported in this issue: #5713
Related to PR: #7906
#include <iostream>
#include <sycl/sycl.hpp>

int main() {
  // Enumerate every GPU device visible to the SYCL runtime.
  auto const& gpu_devices = sycl::device::get_devices(sycl::info::device_type::gpu);
  std::cout << "Number of GPUs: " << gpu_devices.size() << std::endl;
  for (const auto& d : gpu_devices) {
    std::cout << "Found device " << d.get_info<sycl::info::device::name>() << std::endl;
    if (!d.has(sycl::aspect::ext_intel_free_memory)) {
      std::clog << "This test is only for the devices with "
                   "ext_intel_free_memory extension support.\n";
      continue;  // skip devices that do not report the aspect
    }
    auto global_mem_size = d.get_info<sycl::info::device::global_mem_size>();
    std::cout << "value of global_mem_size : " << global_mem_size << std::endl;
    // This query is the one that aborts on the CUDA backend.
    auto FreeMemory = d.get_info<sycl::ext::intel::info::device::free_memory>();
    std::cout << "value of FreeMem in bytes :" << FreeMemory << std::endl;
  }
  return 0;
}
Output:
Number of GPUs: 4
Found device NVIDIA A100-SXM4-40GB
value of global_mem_size : 42505273344
pi_die: failed cuMemGetInfo() API.
terminate called without an active exception
Aborted (core dumped)
Running the test under a debugger showed that the error codes are related to invalid contexts:
Using host libthread_db library "/lib64/libthread_db.so.1".
[Detaching after fork from child process 16196]
[New Thread 0x15554194b700 (LWP 16200)]
Number of GPUs: 4
Found device NVIDIA A100-SXM4-40GB
warning: Cuda API error detected: cuCtxGetDevice returned (0xc9)
[New Thread 0x15554174a700 (LWP 16201)]
[Thread 0x15554174a700 (LWP 16201) exited]
value of global_mem_size : 42505273344
warning: Cuda API error detected: cuMemGetInfo_v2 returned (0x2c5)
pi_die: failed cuMemGetInfo() API.
terminate called without an active exception
Thread 1 "test_sycl_free_" received signal SIGABRT, Aborted.
CUDA_ERROR_INVALID_CONTEXT = 201 (0xc9)
This most frequently indicates that there is no context bound to the current thread. This can also be returned if the context passed to an API call is not a valid handle (such as a context that has had cuCtxDestroy() invoked on it). This can also be returned if a user mixes different API versions (i.e. 3010 context with 3020 API calls). See cuCtxGetApiVersion() for more details.
CUDA_ERROR_CONTEXT_IS_DESTROYED = 709 (0x2c5)
This error indicates that the context current to the calling thread has been destroyed using cuCtxDestroy, or is a primary context which has not yet been initialized.
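For reference, the debugger prints the CUDA result codes in hex while the documentation lists them in decimal. The small sketch below shows the mapping for the two codes seen in this log. The helper names `parse_hex_code` and `cuda_error_name` are hypothetical and only for illustration; they are not part of CUDA or SYCL, and the table covers only the two values quoted above.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: convert a hex token as printed by the debugger
// (e.g. "0xc9") into its numeric value. std::stoul with base 16 accepts
// an optional "0x" prefix.
inline std::uint32_t parse_hex_code(const std::string& token) {
    return static_cast<std::uint32_t>(std::stoul(token, nullptr, 16));
}

// Hypothetical lookup covering only the two result codes from this report;
// the numeric values are the ones quoted from the CUDA documentation above.
inline const char* cuda_error_name(std::uint32_t code) {
    switch (code) {
        case 201: return "CUDA_ERROR_INVALID_CONTEXT";
        case 709: return "CUDA_ERROR_CONTEXT_IS_DESTROYED";
        default:  return "unknown (not in this table)";
    }
}
```

With these, `cuda_error_name(parse_hex_code("0xc9"))` gives the name for the `cuCtxGetDevice` failure and `cuda_error_name(parse_hex_code("0x2c5"))` gives the name for the `cuMemGetInfo_v2` failure, so both failures point at a missing or uninitialized context on the calling thread.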