Replies: 3 comments 3 replies
-
I think the main obstacle to a more general-purpose implementation of this is that we don't know which tensors are the "outputs" of the graph.
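One possible way around this, as a hedged sketch: ggml already exposes `ggml_set_output()` and the `GGML_TENSOR_FLAG_OUTPUT` tensor flag, so *if* the graph builder marks its output tensors, a type-inference pass could use the flag to decide which tensors must keep the F32 storage format. The helper name below is illustrative, not an existing ggml API:

```c
#include "ggml.h"

// Sketch: output tensors (e.g. logits, embeddings) keep F32; everything
// else is a candidate for F16 intermediates. Relies on the graph builder
// having called ggml_set_output() on the relevant tensors.
static bool tensor_must_stay_f32(const struct ggml_tensor * t) {
    return (t->flags & GGML_TENSOR_FLAG_OUTPUT) != 0;
}
```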
-
Is the proposal (A) "data types based on inference", meaning that the ggml library chooses dst->type (F16 vs. F32) based on some heuristic? I think there's a third option: (C) backends can choose to use F16 as a fusion optimization to remove F16->F32->F16 conversions that only occur because of the F32 storage format.
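A minimal sketch of what option (C) could look like inside a backend graph pass, assuming casts show up as `GGML_OP_CPY` nodes (as `ggml_cast` builds them today); the function name is hypothetical:

```c
#include "ggml.h"

// Sketch: detect an F16 -> F32 -> F16 round trip that exists only because
// intermediates are stored as F32. If both casts cancel out, the backend
// can feed the original F16 tensor straight into the consumer.
static bool is_redundant_f16_roundtrip(const struct ggml_tensor * t) {
    if (t->type != GGML_TYPE_F16 || t->op != GGML_OP_CPY) {
        return false;
    }
    const struct ggml_tensor * mid = t->src[0];
    if (mid == NULL || mid->type != GGML_TYPE_F32 || mid->op != GGML_OP_CPY) {
        return false;
    }
    const struct ggml_tensor * orig = mid->src[0];
    return orig != NULL && orig->type == GGML_TYPE_F16;
}
```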
-
The demo I tried last week is Plan A: let the ggml operator choose dst->type, so if the input type is F16, F16 is used as the output type. For models like Qwen2 and DeepSeek-Lite, the CANN backend can support both F32 and F16 for inputs and outputs. Can I just add F16 support to these models' CPU operators and add a parameter to enable or disable F16 support? It's easy, because it only requires changing dst->type to GGML_TYPE_F16; then all operators know they should use F16.
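As a hedged sketch of what Plan A amounts to: each op constructor infers the result type from its input instead of hardcoding F32. The helper below is illustrative only; in practice the change would live in the individual op builders in ggml.c:

```c
#include "ggml.h"

// Sketch: intermediate results inherit F16 from the input; anything else
// falls back to the usual F32 storage format.
static enum ggml_type infer_dst_type(const struct ggml_tensor * src0) {
    return src0->type == GGML_TYPE_F16 ? GGML_TYPE_F16 : GGML_TYPE_F32;
}
```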
-
This discussion is about using FP16 as the data type for intermediate results in graph inference, reducing computation and improving inference speed. Verification was conducted with the CANN backend on Qwen2.5, Qwen3-MoE, and DeepSeek-Lite-V2, showing performance improvements of 3%–10% depending on the concurrency and model.
The main changes in the demo include modifying the operators involved in the graph by replacing hardcoded FP32 data types with type inference based on the input type, adding FP16 support for GET_ROWS, and casting t_embd and t_logits back to FP32 at the end of inference.
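The last step, casting the outputs back to FP32, could look roughly like the sketch below, using `ggml_cast` on the final embedding/logits tensors so downstream code still sees F32. `ctx`, `t_logits`, and `t_embd` stand in for the corresponding objects in llama.cpp's graph-building code; this is an illustration, not the exact patch:

```c
// Sketch: if intermediates were kept in F16, convert the graph outputs
// back to F32 before they leave the graph.
if (t_logits->type != GGML_TYPE_F32) {
    t_logits = ggml_cast(ctx, t_logits, GGML_TYPE_F32);
}
if (t_embd != NULL && t_embd->type != GGML_TYPE_F32) {
    t_embd = ggml_cast(ctx, t_embd, GGML_TYPE_F32);
}
```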
In fact, this is only a very basic validation. For full FP16 support, more work is still needed; see #16270 and #16251.