@taronaeo taronaeo commented Sep 7, 2025

This Pull Request cleans up the s390x SIMD vector intrinsics syntax to match the newer code for easier readability and understanding. No new features are introduced, but there should be a slight performance improvement from switching the horizontal summation to the optimised vec_hsum.
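For context, a horizontal summation like the one mentioned above collapses the usual per-lane reduction into a pairwise vector reduction. Below is a minimal sketch of the idea, assuming GCC or Clang targeting z13 or later with `-mzvector`; the helper name and exact formulation are illustrative, not necessarily the PR's implementation:

```c
#include <vecintrin.h>  // s390x vector built-ins; compile with -mzvector

// Illustrative horizontal sum of four float lanes. vec_reve reverses the
// element order, so v + vec_reve(v) yields {v0+v3, v1+v2, v2+v1, v3+v0};
// one vector add plus one scalar add replace three dependent scalar adds.
static inline float vec_hsum_f32x4(__vector float v) {
    const __vector float t = v + vec_reve(v);
    return t[0] + t[1];
}
```

An integer variant for the quantised dot products would look the same with `__vector signed int`, which is presumably what the follow-up "fix hsum data types" commit below addresses.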

Verified that F32, F16, Q8_0, Q5_1, Q5_0, Q4_1, Q4_0, Q6_K, Q4_K, Q3_K, IQ4_XS, and IQ4_NL models still work as intended.

Performance Benchmark

| model | size | params | backend | threads | test | t/s master | t/s PR | speedup |
| --- | --- | --- | --- | ---: | --- | ---: | ---: | ---: |
| granite 3B all F32 | 9.44 GiB | 2.53 B | BLAS | 8 | pp512 | 82.72 | 82.44 | 1.00 |
| granite 3B all F32 | 9.44 GiB | 2.53 B | BLAS | 8 | tg128 | 4.43 | 4.39 | 0.99 |
| granite 3B F16 | 4.72 GiB | 2.53 B | BLAS | 8 | pp512 | 74.93 | 74.89 | 1.00 |
| granite 3B F16 | 4.72 GiB | 2.53 B | BLAS | 8 | tg128 | 3.12 | 3.05 | 0.98 |
| granite 3B Q8_0 | 2.51 GiB | 2.53 B | BLAS | 8 | pp512 | 81.99 | 81.81 | 1.00 |
| granite 3B Q8_0 | 2.51 GiB | 2.53 B | BLAS | 8 | tg128 | 13.81 | 13.75 | 1.00 |
| granite 3B Q5_1 | 1.78 GiB | 2.53 B | BLAS | 8 | pp512 | 81.46 | 81.53 | 1.00 |
| granite 3B Q5_1 | 1.78 GiB | 2.53 B | BLAS | 8 | tg128 | 18.33 | 18.18 | 0.99 |
| granite 3B Q5_0 | 1.64 GiB | 2.53 B | BLAS | 8 | pp512 | 82.07 | 82.21 | 1.00 |
| granite 3B Q5_0 | 1.64 GiB | 2.53 B | BLAS | 8 | tg128 | 20.24 | 20.25 | 1.00 |
| granite 3B Q4_1 | 1.49 GiB | 2.53 B | BLAS | 8 | pp512 | 82.27 | 82.41 | 1.00 |
| granite 3B Q4_1 | 1.49 GiB | 2.53 B | BLAS | 8 | tg128 | 21.8 | 22 | 1.01 |
| granite 3B Q4_0 | 1.35 GiB | 2.53 B | BLAS | 8 | pp512 | 82.3 | 82.22 | 1.00 |
| granite 3B Q4_0 | 1.35 GiB | 2.53 B | BLAS | 8 | tg128 | 22.57 | 22.32 | 0.99 |
| granite 3B Q6_K | 1.94 GiB | 2.53 B | BLAS | 8 | pp512 | 82.02 | 81.9 | 1.00 |
| granite 3B Q6_K | 1.94 GiB | 2.53 B | BLAS | 8 | tg128 | 17.49 | 17.51 | 1.00 |
| granite 3B Q4_K - Medium | 1.44 GiB | 2.53 B | BLAS | 8 | pp512 | 82.31 | 82.34 | 1.00 |
| granite 3B Q4_K - Medium | 1.44 GiB | 2.53 B | BLAS | 8 | tg128 | 23.55 | 21.34 | 0.91 |
| granite 3B Q3_K - Medium | 1.16 GiB | 2.53 B | BLAS | 8 | pp512 | 82.44 | 82.57 | 1.00 |
| granite 3B Q3_K - Medium | 1.16 GiB | 2.53 B | BLAS | 8 | tg128 | 21.19 | 19.92 | 0.94 |
| granite 3B IQ4_XS - 4.25 bpw | 1.30 GiB | 2.53 B | BLAS | 8 | pp512 | 82.01 | 82.18 | 1.00 |
| granite 3B IQ4_XS - 4.25 bpw | 1.30 GiB | 2.53 B | BLAS | 8 | tg128 | 24.11 | 22.14 | 0.92 |
| granite 3B IQ4_NL - 4.5 bpw | 1.37 GiB | 2.53 B | BLAS | 8 | pp512 | 82.08 | 82.28 | 1.00 |
| granite 3B IQ4_NL - 4.5 bpw | 1.37 GiB | 2.53 B | BLAS | 8 | tg128 | 20.13 | 19.77 | 0.98 |
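The speedup column is t/s (PR) divided by t/s (master); for example, the Q4_K tg128 row gives 21.34 / 23.55 ≈ 0.91, a roughly 9% regression on that run, while pp512 throughput is essentially unchanged across all formats.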

build: 4d49427 (6405)

Note

Tests were conducted on an IBM z17 mainframe with 40 IFLs (cores) and 128 GB of memory on a shared R&D LPAR.

Signed-off-by: Aaron Teo <[email protected]>
(cherry picked from commit 0da4b6a)
Signed-off-by: Aaron Teo <[email protected]>
@github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Sep 7, 2025
@taronaeo taronaeo merged commit d36e61c into ggml-org:master Sep 7, 2025
48 checks passed
njsyw1997 pushed a commit to aizip/llama.cpp that referenced this pull request Sep 10, 2025
* ggml-cpu: clean up s390x simd

Signed-off-by: Aaron Teo <[email protected]>
(cherry picked from commit 0da4b6a)
Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: fix hsum data types

Signed-off-by: Aaron Teo <[email protected]>

---------

Signed-off-by: Aaron Teo <[email protected]>
