@taronaeo taronaeo commented Sep 7, 2025

This Pull Request cleans up the s390x SIMD vector intrinsics syntax to match the newer code for easier readability and understanding. No new features are introduced, but there should be a slight performance improvement from switching the horizontal summation to the optimised vec_hsum.
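For context, a horizontal summation like the one mentioned above collapses the usual per-lane reduction into a pairwise vector reduction. Below is a minimal sketch of the idea, assuming GCC or Clang targeting z13 or later with `-mzvector`; the helper name and exact formulation are illustrative, not necessarily the PR's implementation:

```c
#include <vecintrin.h>  // s390x vector built-ins; compile with -mzvector

// Illustrative horizontal sum of four float lanes. vec_reve reverses the
// element order, so v + vec_reve(v) yields {v0+v3, v1+v2, v2+v1, v3+v0};
// one vector add plus one scalar add replace three dependent scalar adds.
static inline float vec_hsum_f32x4(__vector float v) {
    const __vector float t = v + vec_reve(v);
    return t[0] + t[1];
}
```

An integer variant for the quantised dot products would look the same with `__vector signed int`, which is presumably what the follow-up "fix hsum data types" commit below addresses.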

Verified that F32, F16, Q8_0, Q5_1, Q5_0, Q4_1, Q4_0, Q6_K, Q4_K, Q3_K, IQ4_XS, and IQ4_NL models still work as intended.

Performance Benchmark

| model | size | params | backend | threads | test | t/s master | t/s PR | speedup |
| --- | --- | --- | --- | ---: | --- | ---: | ---: | ---: |
| granite 3B all F32 | 9.44 GiB | 2.53 B | BLAS | 8 | pp512 | 82.72 | 82.44 | 1.00 |
| granite 3B all F32 | 9.44 GiB | 2.53 B | BLAS | 8 | tg128 | 4.43 | 4.39 | 0.99 |
| granite 3B F16 | 4.72 GiB | 2.53 B | BLAS | 8 | pp512 | 74.93 | 74.89 | 1.00 |
| granite 3B F16 | 4.72 GiB | 2.53 B | BLAS | 8 | tg128 | 3.12 | 3.05 | 0.98 |
| granite 3B Q8_0 | 2.51 GiB | 2.53 B | BLAS | 8 | pp512 | 81.99 | 81.81 | 1.00 |
| granite 3B Q8_0 | 2.51 GiB | 2.53 B | BLAS | 8 | tg128 | 13.81 | 13.75 | 1.00 |
| granite 3B Q5_1 | 1.78 GiB | 2.53 B | BLAS | 8 | pp512 | 81.46 | 81.53 | 1.00 |
| granite 3B Q5_1 | 1.78 GiB | 2.53 B | BLAS | 8 | tg128 | 18.33 | 18.18 | 0.99 |
| granite 3B Q5_0 | 1.64 GiB | 2.53 B | BLAS | 8 | pp512 | 82.07 | 82.21 | 1.00 |
| granite 3B Q5_0 | 1.64 GiB | 2.53 B | BLAS | 8 | tg128 | 20.24 | 20.25 | 1.00 |
| granite 3B Q4_1 | 1.49 GiB | 2.53 B | BLAS | 8 | pp512 | 82.27 | 82.41 | 1.00 |
| granite 3B Q4_1 | 1.49 GiB | 2.53 B | BLAS | 8 | tg128 | 21.8 | 22 | 1.01 |
| granite 3B Q4_0 | 1.35 GiB | 2.53 B | BLAS | 8 | pp512 | 82.3 | 82.22 | 1.00 |
| granite 3B Q4_0 | 1.35 GiB | 2.53 B | BLAS | 8 | tg128 | 22.57 | 22.32 | 0.99 |
| granite 3B Q6_K | 1.94 GiB | 2.53 B | BLAS | 8 | pp512 | 82.02 | 81.9 | 1.00 |
| granite 3B Q6_K | 1.94 GiB | 2.53 B | BLAS | 8 | tg128 | 17.49 | 17.51 | 1.00 |
| granite 3B Q4_K - Medium | 1.44 GiB | 2.53 B | BLAS | 8 | pp512 | 82.31 | 82.34 | 1.00 |
| granite 3B Q4_K - Medium | 1.44 GiB | 2.53 B | BLAS | 8 | tg128 | 23.55 | 21.34 | 0.91 |
| granite 3B Q3_K - Medium | 1.16 GiB | 2.53 B | BLAS | 8 | pp512 | 82.44 | 82.57 | 1.00 |
| granite 3B Q3_K - Medium | 1.16 GiB | 2.53 B | BLAS | 8 | tg128 | 21.19 | 19.92 | 0.94 |
| granite 3B IQ4_XS - 4.25 bpw | 1.30 GiB | 2.53 B | BLAS | 8 | pp512 | 82.01 | 82.18 | 1.00 |
| granite 3B IQ4_XS - 4.25 bpw | 1.30 GiB | 2.53 B | BLAS | 8 | tg128 | 24.11 | 22.14 | 0.92 |
| granite 3B IQ4_NL - 4.5 bpw | 1.37 GiB | 2.53 B | BLAS | 8 | pp512 | 82.08 | 82.28 | 1.00 |
| granite 3B IQ4_NL - 4.5 bpw | 1.37 GiB | 2.53 B | BLAS | 8 | tg128 | 20.13 | 19.77 | 0.98 |
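The speedup column is t/s (PR) divided by t/s (master); for example, the Q4_K tg128 row gives 21.34 / 23.55 ≈ 0.91, a roughly 9% regression on that run, while pp512 throughput is essentially unchanged across all formats.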

build: 4d49427 (6405)

Note

Tests were conducted on an IBM z17 mainframe with 40 IFLs (cores) and 128 GB of memory on a shared R&D LPAR.

Signed-off-by: Aaron Teo <[email protected]>
(cherry picked from commit 0da4b6a)
Signed-off-by: Aaron Teo <[email protected]>
@github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Sep 7, 2025
@taronaeo taronaeo merged commit d36e61c into ggml-org:master Sep 7, 2025
48 checks passed
njsyw1997 pushed a commit to aizip/llama.cpp that referenced this pull request Sep 10, 2025
* ggml-cpu: clean up s390x simd

Signed-off-by: Aaron Teo <[email protected]>
(cherry picked from commit 0da4b6a)
Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: fix hsum data types

Signed-off-by: Aaron Teo <[email protected]>

---------

Signed-off-by: Aaron Teo <[email protected]>
