Increase CPU & GPU execution overlap of GGML Backends through the new Graph Plan API #14514
Replies: 2 comments 2 replies
-
| We can add a function that does an update and compute at the same time. This would allow the backend to split the CUDA graph in as many parts as it wants, to reduce the time that the GPU is idle while the graph is being captured. This can help reduce power and memory usage, but for maximum performance, I expect that using two graphs and updating one asynchronously while the other one is executing would provide the best results. With @ggerganov work in #14482, I expect that we will be able to avoid most graph updates, so in the typical case we can call  | 
Beta Was this translation helpful? Give feedback.
-
| I am still slightly concerned recreating a ggml graph every 256 tokens would work for APIs like Vulkans new SPIR-V graph API extension, https://registry.khronos.org/vulkan/specs/latest/man/html/VK_ARM_data_graph.html, or TRT-RTX. While Vulkans API currently supports fixed shapes only TRT-RTX supports dynamic shapes with a min/max configuration. This way allocation is fixed for all iterations potentially avoiding the need to call ggml_backend_sched_reset() ggml_backend_sched_alloc_graph(). Some documentation about torch-trt supports dynamic shapes can be found here: https://docs.pytorch.org/TensorRT/_notebooks/dynamic-shapes.html#3. In the CUDA and Vulkan backend a worst-case would be a valid option as well. On the kv-cache use case, is there any other change to the graph than tensor sizes? If not, what would prevent us from following the min/max tensor size pattern and ensure the allocation is sufficiently large up to the max tensor size specified so that all that has to be done after a graph update is updating the tensor shapes? | 
Beta Was this translation helpful? Give feedback.
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
-
Introduction:
Currently, the computational graph execution in some GGML backends results in high CPU cost, leading to prolonged GPU idle times. Optimizing inference across diverse hardware platforms necessitates addressing this core inefficiency.
We're primarily looking at two distinct paradigms for backend execution: immediate-mode backends like CUDA without graphs, and command-list-based backends such as Vulkan or CUDA with graphs.
Immediate-mode backends typically process the node list and execute work as they go. Kernel launch overheads are incurred for each kernel. Command-list-based backends, however, must build the entire command list before execution. This can introduce a few milliseconds of delay, during which your GPU sits idle, hurting overall performance.
Here's how these timelines typically look:
Immediate Mode Execution Timeline:
Immediate Mode Execution Timeline:
To minimize this GPU idle time, a common strategy is to split the commands into multiple, smaller command lists (or batches) and flush each of these lists as soon as they are ready. This allows for a more continuous stream of work to the GPU, combining the performance benefits of command lists (e.g., fewer kernel launches, better optimization opportunities for each batch) with the low-latency advantages of immediate-mode execution.
Flushing Command-List Mode Execution Timeline
This technique has already proven successful. For instance, it was implemented in the Vulkan backend via PR 9118, leading to a 10% improvement in token generation performance for StableLM 3B Q8_0 on high-end CPUs (like the Threadripper Pro 7975WX). More recently, PR 11867 for the CUDA backend showed a 2.5% to 7% improvement for Phi3-mini-4k-instrukt 3B on high-end CPUs (Threadripper Pro 5955WX and Intel 14900K). We expect even greater benefits on lower-specced CPUs.
Improving with the Graph Plan API
With a Graph Plan API, outlined in the ggml-backend-impl.h header, one could reduce CPU overhead even more by having an interface which can give more guarantees about graph changes and also allows the developer to deliver more information to the backend about graph changes. In total this will enable smart optimizations on the backend side.
Here’s the current API structure:
While this API has significant potential, its current design for truly asynchronous update and compute functions would require double buffering of internal graph/command buffer data structures. This raises several concerns:
Unless a solid solution emerges for these asynchronous issues, my recommendation is to keep the Graph Plan API synchronous.
Even with a synchronous API, having a way to capture a ggml_cgraph can be very beneficial if designed to convey crucial information to the backend, allowing it to skip unnecessary work. We should consider at least two types of graph changes:
A crucial optimization when the topology is static is that the backend can identify and internally store only the graph nodes it actively uses. This allows it to avoid traversing any nodes irrelevant to its operations, significantly reducing CPU work, cache misses, and branch mispredictions. This also enables the creation of highly optimized, backend-specific data structures for each relevant graph node.
One perspective on a graph plan could be that its topology is always static, and only tensor shape sizes can change. This would open the door to advanced optimizations like operator fusion, further boosting performance.
To make this more flexible, we could add flags to specify the types of changes allowed in the graph:
TOPOLOGY_CHANGESAllows any topological changes (node additions/removals). If this is permitted, a backend would always have to traverse the full graph passed to compute before executing it, much like how CUDA graphs currently operate, as the structure itself is unpredictable. This represents the least optimized scenario.TENSOR_CHANGESAllows changing the specific tensors attached to operations. In this scenario, we can assume the topology of the graph is static, allowing the backend to pre-create a list of operators in execution order and apply the static-topology optimizations mentioned above. HTENSOR_SHAPESGuarantees that only the shape sizes of existing tensors change. This specifically means the dimensions of tensors might vary (e.g., from [1, 512] to [1, 1024]) while the number of dimensions and the overall graph structure remain constant. This is a common scenario in models with dynamic sequence lengths. If this guarantee can be made, it would allow us to skip the entire graph tensor traversal, simply fetching new tensor sizes from a list of tensors used by the backend. This provides the highest level of optimization, as both topology and tensor pointers are static.With these additions to the graph properties, the compute function could also receive additional flags indicating which of the allowed changes actually occurred. This wouldn't just enable the optimizations from above; it would also allow us to skip as much work as possible, saving CPU power.
Here’s how the enhanced API could look:
By passing allowed_changes during plan creation and actual_changes during computation, we can provide the backend with precise information, enabling it to execute only the necessary work and significantly improve overall efficiency.
Beta Was this translation helpful? Give feedback.
All reactions