
Conversation

@hipudding (Collaborator) commented on Sep 25, 2025

Many Ascend operators internally use FP16 precision for computation. If the input data is in FP32, it must first be cast to FP16 before computation and then cast back to FP32 afterwards, which introduces unnecessary cast operations. Moreover, FP16 computation requires significantly less work than FP32, leading to noticeable efficiency improvements.

In this change, `get_rows`, `rms_norm`, and `flash_attn_ext` are extended to support multiple data types. Validation on the Qwen2 0.5B model shows correct accuracy and about a 10% performance gain in concurrent scenarios, together with #16270.
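A rough sketch of the idea (the kernel and cast helpers below are hypothetical placeholders, not the actual ggml-cann code):

```cpp
// Illustrative sketch only. ggml_tensor and GGML_TYPE_* are real ggml
// identifiers; the kernel and cast helpers are hypothetical placeholders.
#include "ggml.h"

// Hypothetical NPU kernel and cast helpers assumed for this sketch.
void npu_rms_norm_f16(const ggml_tensor * src, ggml_tensor * dst);
void npu_cast_f32_to_f16(const ggml_tensor * src, ggml_tensor * dst);
void npu_cast_f16_to_f32(const ggml_tensor * src, ggml_tensor * dst);

void rms_norm_dispatch(const ggml_tensor * src, ggml_tensor * dst, ggml_tensor * f16_tmp) {
    if (src->type == GGML_TYPE_F16 && dst->type == GGML_TYPE_F16) {
        // New path: run the FP16 kernel directly, no casts needed.
        npu_rms_norm_f16(src, dst);
    } else {
        // Previous FP32-only path: cast in, compute in FP16, cast back out.
        npu_cast_f32_to_f16(src, f16_tmp);
        npu_rms_norm_f16(f16_tmp, f16_tmp);
        npu_cast_f16_to_f32(f16_tmp, dst);
    }
}
```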


@hipudding added the Ascend NPU (issues specific to Ascend NPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels on Sep 25, 2025
@ggerganov (Member) commented:
> Validation on the Qwen2 model shows correct accuracy and about 10% performance gain in concurrent scenarios.

Which model size is this speed-up for?

@hipudding (Collaborator, Author) commented on Sep 26, 2025

> > Validation on the Qwen2 model shows correct accuracy and about 10% performance gain in concurrent scenarios.
>
> Which model size is this speed-up for?

Performance improved by 8%–10%. This result is based on our testing with the Qwen2.5 0.5B model using llama-parallel under 10 concurrent requests (we recently had a business case involving the 0.5B model). We also tested on Qwen2.5 7B, Qwen3-MoE, and DeepSeek V2-Lite, where we observed smaller performance gains.

On Ascend, operators such as flash attention (FA) and MUL_MAT are computed in FP16 precision. However, in llama.cpp, intermediate results default to FP32, which introduces a nontrivial casting overhead. Using FP16 for intermediate results reduces this casting cost.

Of course, we also tried computing operators directly in FP32, but due to the higher computation cost, the performance was actually worse than the cast+FP16 approach.

This PR only modifies the operators so that they support both FP32 and FP16 data types. To fully adopt FP16 as the intermediate type, further changes are required in other parts of the code. I will submit an issue and a draft PR today to start a discussion on this. #16271
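At the graph level, "supporting both data types" roughly means the backend accepts F16 as well as F32 for these operators. A minimal sketch (the enums are real ggml identifiers, but the function is illustrative and does not mirror the real ggml-cann supports_op logic):

```cpp
// Minimal sketch of accepting both F32 and F16 for the affected operators.
// Illustrative only; not the actual ggml-cann supports_op implementation.
#include "ggml.h"

bool example_supports_op(const ggml_tensor * op) {
    const ggml_type src_type = op->src[0] ? op->src[0]->type : op->type;
    switch (op->op) {
        case GGML_OP_GET_ROWS:
        case GGML_OP_RMS_NORM:
        case GGML_OP_FLASH_ATTN_EXT:
            // Previously only GGML_TYPE_F32 would be accepted here.
            return src_type == GGML_TYPE_F32 || src_type == GGML_TYPE_F16;
        default:
            return false;
    }
}
```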

Commit message:

Many Ascend operators internally use FP16 precision for computation.
If the input data is in FP32, it must first be cast to FP16 before
computation and then cast back to FP32 afterwards, which introduces
unnecessary cast operations. Moreover, FP16 computation requires
significantly less work than FP32, leading to noticeable efficiency
improvements.

In this change, `get_rows`, `rms_norm`, and `flash_attn_ext` are extended
to support multiple data types. Validation on the Qwen2 0.5B model shows
correct accuracy and about a 10% performance gain in concurrent scenarios.

Co-authored-by: noemotiovon <[email protected]>
@hipudding (Collaborator, Author) commented:
Test pass for modified operators:

  • FLASH_ATTN_EXT
  • MUL_MAT
  • RMS_NORM
  • GET_ROWS
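(These cases can be exercised per operator with llama.cpp's `test-backend-ops` tool; assuming its usual interface, something like `test-backend-ops test -o FLASH_ATTN_EXT` filters the run to a single op. The exact command used for this PR is not shown here.)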
