Closed
Description
Describe the bug
Updating a graph that contains multiple instances of the same node, with no dependencies between them, produces incorrect results.
I might be wrong about the root cause here, but it's the best explanation I can come up with.
To reproduce
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <iostream>
#include <memory>
#include <sycl/sycl.hpp>

int main()
{
    sycl::queue q;
    static constexpr size_t R = 10; // number of graph launches
    static constexpr size_t I = 5;  // output slots; nodes use i = 1..I-1
    int* output = sycl::malloc_shared<int>(I, q);
    std::fill(output, output + I, 0);
    std::unique_ptr<sycl::ext::oneapi::experimental::command_graph<sycl::ext::oneapi::experimental::graph_state::executable>> graph;
    for (size_t r = 0; r < R; ++r) {
        // Record the same set of independent nodes on every iteration.
        sycl::ext::oneapi::experimental::command_graph<sycl::ext::oneapi::experimental::graph_state::modifiable> modifiable_graph(q.get_context(), q.get_device());
        for (size_t i = 1; i < I; ++i) {
            sycl::range global = {i, i, i};
            sycl::range local = {i, i, i};
            // Same kernel for every node; only the captured i and the nd_range differ.
            modifiable_graph.add([=](sycl::handler& h) {
                h.parallel_for<class test>(sycl::nd_range{global, local}, [=](sycl::nd_item<3> it) noexcept {
                    if (it.get_group().leader()) {
                        output[i]++;
                    }
                });
            });
        }
        if (r == 0) {
            // First iteration: finalize an updatable executable graph.
            printf("Building graph\n");
            auto instance = modifiable_graph.finalize(sycl::ext::oneapi::experimental::property::graph::updatable{});
            graph = std::make_unique<sycl::ext::oneapi::experimental::command_graph<sycl::ext::oneapi::experimental::graph_state::executable>>(std::move(instance));
        }
        else {
            // Later iterations: update the existing executable graph in place.
            printf("Updating graph\n");
            graph->update(modifiable_graph);
        }
        printf("Launching graph\n");
        q.ext_oneapi_graph(*graph).wait();
    }
    q.wait();
    for (size_t i = 0; i < I; ++i) {
        std::cout << i << ": " << output[i] << std::endl;
    }
    sycl::free(output, q);
}
Compile with:
clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda update.cpp
Run with:
./a.out
Expected Output
Each kernel should execute 10 times, updating location i each time, so the output should be:
0: 0
1: 10
2: 10
3: 10
4: 10
This is the output that I get if I don't use graphs, or if I replace r == 0 with true (to force re-building the graph every time).
Observed Output
0: 0
1: 1
2: 1
3: 1
4: 37
As far as I can tell, the first graph launch correctly executes one instance of each kernel, but every updated graph launch just executes multiple instances of the kernel with i set to 4.
Environment
- OS: Linux
- NVIDIA A100
- DPC++ version: ede5e44
- Driver version: 570.133.20, CUDA version: 12.8
Additional context
No response