
Conversation

@hipudding (Collaborator) commented on Sep 25, 2025

Many Ascend operators internally use FP16 precision for computation. If the input data is in FP32, it must first be cast to FP16 before computation and then cast back to FP32 afterwards, which introduces unnecessary cast operations. Moreover, FP16 computation requires significantly less work than FP32, leading to noticeable efficiency improvements.

In this change, `get_rows`, `rms_norm`, and `flash_attn_ext` are extended to support multiple data types. Validation on the Qwen2 0.5B model shows correct accuracy and about a 10% performance gain in concurrent scenarios, together with #16270.
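A rough sketch of the idea (the kernel and cast helpers below are hypothetical placeholders, not the actual ggml-cann code):

```cpp
// Illustrative sketch only. ggml_tensor and GGML_TYPE_* are real ggml
// identifiers; the kernel and cast helpers are hypothetical placeholders.
#include "ggml.h"

// Hypothetical NPU kernel and cast helpers assumed for this sketch.
void npu_rms_norm_f16(const ggml_tensor * src, ggml_tensor * dst);
void npu_cast_f32_to_f16(const ggml_tensor * src, ggml_tensor * dst);
void npu_cast_f16_to_f32(const ggml_tensor * src, ggml_tensor * dst);

void rms_norm_dispatch(const ggml_tensor * src, ggml_tensor * dst, ggml_tensor * f16_tmp) {
    if (src->type == GGML_TYPE_F16 && dst->type == GGML_TYPE_F16) {
        // New path: run the FP16 kernel directly, no casts needed.
        npu_rms_norm_f16(src, dst);
    } else {
        // Previous FP32-only path: cast in, compute in FP16, cast back out.
        npu_cast_f32_to_f16(src, f16_tmp);
        npu_rms_norm_f16(f16_tmp, f16_tmp);
        npu_cast_f16_to_f32(f16_tmp, dst);
    }
}
```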


@hipudding added the Ascend NPU (issues specific to Ascend NPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels on Sep 25, 2025
@ggerganov (Member) commented:
> Validation on the Qwen2 model shows correct accuracy and about 10% performance gain in concurrent scenarios.

Which model size is this speed-up for?

@hipudding (Collaborator, Author) commented on Sep 26, 2025

> > Validation on the Qwen2 model shows correct accuracy and about 10% performance gain in concurrent scenarios.
>
> Which model size is this speed-up for?

Performance improved by 8%–10%. This result is based on our testing with the Qwen2.5 0.5B model using llama-parallel under 10 concurrent requests (we recently had a business case involving the 0.5B model). We also tested on Qwen2.5 7B, Qwen3-MoE, and DeepSeek V2-Lite, where we observed smaller performance gains.

On Ascend, operators such as flash attention (FA) and MUL_MAT are computed in FP16 precision. However, in llama.cpp, intermediate results default to FP32, which introduces a nontrivial casting overhead. Using FP16 for intermediate results reduces this casting cost.

Of course, we also tried computing operators directly in FP32, but due to the higher computation cost, the performance was actually worse than the cast+FP16 approach.

This PR only modifies the operators so that they support both FP32 and FP16 data types. To fully adopt FP16 as the intermediate type, further changes are required in other parts of the code. I will submit an issue and a draft PR today to start a discussion on this. #16271
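At the graph level, "supporting both data types" roughly means the backend accepts F16 as well as F32 for these operators. A minimal sketch (the enums are real ggml identifiers, but the function is illustrative and does not mirror the real ggml-cann supports_op logic):

```cpp
// Minimal sketch of accepting both F32 and F16 for the affected operators.
// Illustrative only; not the actual ggml-cann supports_op implementation.
#include "ggml.h"

bool example_supports_op(const ggml_tensor * op) {
    const ggml_type src_type = op->src[0] ? op->src[0]->type : op->type;
    switch (op->op) {
        case GGML_OP_GET_ROWS:
        case GGML_OP_RMS_NORM:
        case GGML_OP_FLASH_ATTN_EXT:
            // Previously only GGML_TYPE_F32 would be accepted here.
            return src_type == GGML_TYPE_F32 || src_type == GGML_TYPE_F16;
        default:
            return false;
    }
}
```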

Commit message:

Many Ascend operators internally use FP16 precision for computation.
If the input data is in FP32, it must first be cast to FP16 before
computation and then cast back to FP32 afterwards, which introduces
unnecessary cast operations. Moreover, FP16 computation requires
significantly less work than FP32, leading to noticeable efficiency
improvements.

In this change, `get_rows`, `rms_norm`, and `flash_attn_ext` are extended
to support multiple data types. Validation on the Qwen2 0.5B model shows
correct accuracy and about a 10% performance gain in concurrent scenarios.

Co-authored-by: noemotiovon <[email protected]>
@hipudding (Collaborator, Author) commented:
Test pass for modified operators:

  • FLASH_ATTN_EXT
  • MUL_MAT
  • RMS_NORM
  • GET_ROWS
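(These cases can be exercised per operator with llama.cpp's `test-backend-ops` tool; assuming its usual interface, something like `test-backend-ops test -o FLASH_ATTN_EXT` filters the run to a single op. The exact command used for this PR is not shown here.)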
