Name and Version
build/bin/llama-cli --version
register_backend: registered backend zDNN (1 devices)
register_device: registered device zDNN (IBM Z Neural Network Processing Assist (NNPA))
register_backend: registered backend BLAS (1 devices)
register_device: registered device BLAS (OpenBLAS)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (CPU)
load_backend: failed to find ggml_backend_init in /devfield/taronaeo/llama-realtest/build/bin/libggml-blas.so
load_backend: failed to find ggml_backend_init in /devfield/taronaeo/llama-realtest/build/bin/libggml-cpu.so
version: 6194 (3007baf20)
built with gcc (GCC) 15.1.0 for s390x-redhat-linux
Operating systems
Linux
GGML backends
zDNN
Hardware
IBM z17 40 IFLs / 128 GB Memory / zAIU Accelerator
Models
Granite 3.3 2B Instruct Big-Endian
Problem description & steps to reproduce
Run llama-cli as per normal, without the LLAMA_SET_ROWS=0 environment variable, and it produces incorrect inference output. E.g.,
$ build/bin/llama-cli -m /devfield/taronaeo/hf_models/granite-3.3-2b-instruct-be.F32.gguf -t 8 -n 25 -p "Write me a dog walking business idea 1. " -no-cnv -ngl -1 --seed 1568795874
main: llama threadpool init, n_threads = 8
system_info: n_threads = 8 (n_threads_batch = 8) / 40 | zDNN : NNPA = 1 | NNPA_PARMBLKFORMAT_0 = 1 | NNPA_PARMBLKFORMAT_1 = 1 | CPU : VXE = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
sampler seed: 1568795874
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 25, n_keep = 0
Write me a dog walking business idea 1.
222221111111111111111...
llama_perf_sampler_print: sampling time = 5.98 ms / 36 runs ( 0.17 ms per token, 6021.07 tokens per second)
llama_perf_context_print: load time = 3833.95 ms
llama_perf_context_print: prompt eval time = 381.54 ms / 11 tokens ( 34.69 ms per token, 28.83 tokens per second)
llama_perf_context_print: eval time = 6255.99 ms / 24 runs ( 260.67 ms per token, 3.84 tokens per second)
llama_perf_context_print: total time = 6841.94 ms / 35 tokens
llama_perf_context_print: graphs reused = 22
ggml_zdnn_free: deallocating
But when the LLAMA_SET_ROWS=0 environment variable is set, it produces correct output (a quick A/B check is sketched after the log below).
$ LLAMA_SET_ROWS=0 build/bin/llama-cli -m /devfield/taronaeo/hf_models/granite-3.3-2b-instruct-be.F32.gguf -t 8 -n 25 -p "Write me a dog walking business idea 1. " -no-cnv -ngl -1 --seed 1568795874
main: llama threadpool init, n_threads = 8
system_info: n_threads = 8 (n_threads_batch = 8) / 40 | zDNN : NNPA = 1 | NNPA_PARMBLKFORMAT_0 = 1 | NNPA_PARMBLKFORMAT_1 = 1 | CPU : VXE = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
sampler seed: 1568795874
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 25, n_keep = 0
Write me a dog walking business idea 1.
2.
3.
4.
5.
1. **"Pawsome Play
llama_perf_sampler_print: sampling time = 6.35 ms / 36 runs ( 0.18 ms per token, 5669.29 tokens per second)
llama_perf_context_print: load time = 3787.06 ms
llama_perf_context_print: prompt eval time = 380.83 ms / 11 tokens ( 34.62 ms per token, 28.88 tokens per second)
llama_perf_context_print: eval time = 7180.11 ms / 24 runs ( 299.17 ms per token, 3.34 tokens per second)
llama_perf_context_print: total time = 7773.64 ms / 35 tokens
llama_perf_context_print: graphs reused = 0
ggml_zdnn_free: deallocating
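For a quick A/B check, the two runs above can be repeated back to back and their generated text diffed. This is only a convenience wrapper around the exact commands already shown; it assumes the generated text goes to stdout and the log/perf output to stderr, as in recent builds.
$ build/bin/llama-cli -m /devfield/taronaeo/hf_models/granite-3.3-2b-instruct-be.F32.gguf -t 8 -n 25 -p "Write me a dog walking business idea 1. " -no-cnv -ngl -1 --seed 1568795874 > out_default.txt 2>/dev/null
$ LLAMA_SET_ROWS=0 build/bin/llama-cli -m /devfield/taronaeo/hf_models/granite-3.3-2b-instruct-be.F32.gguf -t 8 -n 25 -p "Write me a dog walking business idea 1. " -no-cnv -ngl -1 --seed 1568795874 > out_set_rows_0.txt 2>/dev/null
$ diff out_default.txt out_set_rows_0.txt   # a non-empty diff reproduces the issue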
First Bad Commit
Suspect it has to do with #14959
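If the regression is indeed in the SET_ROWS path, it should also be reproducible outside of llama-cli. A minimal check, assuming test-backend-ops is built alongside llama-cli and supports filtering by op name (-o) and backend name (-b); the op and backend names used here are taken from the logs above:
$ build/bin/test-backend-ops test -o SET_ROWS -b zDNN   # run only the SET_ROWS test cases on the zDNN backend against the CPU reference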
Relevant log output
N/A