Replies: 3 comments 3 replies
-
I think the main obstacle to a more general-purpose implementation of this is that we don't know which tensors are the "outputs" of the graph.
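One possible way around this, as a hedged sketch: ggml already exposes `ggml_set_output()` and the `GGML_TENSOR_FLAG_OUTPUT` tensor flag, so *if* the graph builder marks its output tensors, a type-inference pass could use the flag to decide which tensors must keep the F32 storage format. The helper name below is illustrative, not an existing ggml API:

```c
#include "ggml.h"

// Sketch: output tensors (e.g. logits, embeddings) keep F32; everything
// else is a candidate for F16 intermediates. Relies on the graph builder
// having called ggml_set_output() on the relevant tensors.
static bool tensor_must_stay_f32(const struct ggml_tensor * t) {
    return (t->flags & GGML_TENSOR_FLAG_OUTPUT) != 0;
}
```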
-
Is the proposal (A) "data types based on inference", meaning that the ggml library chooses dst->type (F16 vs. F32) based on some heuristic? I think there's a third option: (C) backends can choose to use F16 as a fusion optimization to remove F16->F32->F16 conversions that only occur because of the F32 storage format.
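A minimal sketch of what option (C) could look like inside a backend graph pass, assuming casts show up as `GGML_OP_CPY` nodes (as `ggml_cast` builds them today); the function name is hypothetical:

```c
#include "ggml.h"

// Sketch: detect an F16 -> F32 -> F16 round trip that exists only because
// intermediates are stored as F32. If both casts cancel out, the backend
// can feed the original F16 tensor straight into the consumer.
static bool is_redundant_f16_roundtrip(const struct ggml_tensor * t) {
    if (t->type != GGML_TYPE_F16 || t->op != GGML_OP_CPY) {
        return false;
    }
    const struct ggml_tensor * mid = t->src[0];
    if (mid == NULL || mid->type != GGML_TYPE_F32 || mid->op != GGML_OP_CPY) {
        return false;
    }
    const struct ggml_tensor * orig = mid->src[0];
    return orig != NULL && orig->type == GGML_TYPE_F16;
}
```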
-
The demo I tried last week is Plan A: let the ggml operator choose dst->type, so if the input type is F16, F16 is used as the output type. For models like Qwen2 and DeepSeek-Lite, the CANN backend can support both F32 and F16 for inputs and outputs. Can I just add F16 support to these models' CPU operators and add a parameter to enable or disable F16 support? It's easy, because it only requires changing dst->type to GGML_TYPE_F16; then all operators know they should use F16.
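As a hedged sketch of what Plan A amounts to: each op constructor infers the result type from its input instead of hardcoding F32. The helper below is illustrative only; in practice the change would live in the individual op builders in ggml.c:

```c
#include "ggml.h"

// Sketch: intermediate results inherit F16 from the input; anything else
// falls back to the usual F32 storage format.
static enum ggml_type infer_dst_type(const struct ggml_tensor * src0) {
    return src0->type == GGML_TYPE_F16 ? GGML_TYPE_F16 : GGML_TYPE_F32;
}
```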
-
This discussion is about using FP16 as the data type for intermediate results in graph inference, reducing computation and improving inference speed. Verification was conducted with the CANN backend on Qwen2.5, Qwen3-MoE, and DeepSeek-Lite-V2, showing performance improvements of 3%–10% depending on the concurrency and model.
The main changes in the demo include modifying the operators involved in the graph by replacing hardcoded FP32 data types with type inference based on the input type, adding FP16 support for GET_ROWS, and casting t_embd and t_logits back to FP32 at the end of inference.
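The last step, casting the outputs back to FP32, could look roughly like the sketch below, using `ggml_cast` on the final embedding/logits tensors so downstream code still sees F32. `ctx`, `t_logits`, and `t_embd` stand in for the corresponding objects in llama.cpp's graph-building code; this is an illustration, not the exact patch:

```c
// Sketch: if intermediates were kept in F16, convert the graph outputs
// back to F32 before they leave the graph.
if (t_logits->type != GGML_TYPE_F32) {
    t_logits = ggml_cast(ctx, t_logits, GGML_TYPE_F32);
}
if (t_embd != NULL && t_embd->type != GGML_TYPE_F32) {
    t_embd = ggml_cast(ctx, t_embd, GGML_TYPE_F32);
}
```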
In fact, this is only a very basic validation. For full FP16 support, more work is still needed; see #16270 and #16251.