
Eval bug: zDNN backend not inferencing correctly after LLAMA_SET_ROWS enablement #15414

@taronaeo

Description

Name and Version

build/bin/llama-cli --version
register_backend: registered backend zDNN (1 devices)
register_device: registered device zDNN (IBM Z Neural Network Processing Assist (NNPA))
register_backend: registered backend BLAS (1 devices)
register_device: registered device BLAS (OpenBLAS)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (CPU)
load_backend: failed to find ggml_backend_init in /devfield/taronaeo/llama-realtest/build/bin/libggml-blas.so
load_backend: failed to find ggml_backend_init in /devfield/taronaeo/llama-realtest/build/bin/libggml-cpu.so
version: 6194 (3007baf20)
built with gcc (GCC) 15.1.0 for s390x-redhat-linux

Operating systems

Linux

GGML backends

zDNN

Hardware

IBM z17 40 IFLs / 128 GB Memory / zAIU Accelerator

Models

Granite 3.3 2B Instruct Big-Endian

Problem description & steps to reproduce

Run llama-cli as normal, without the LLAMA_SET_ROWS=0 environment variable, and it produces incorrect inference output. E.g.,

$ build/bin/llama-cli -m /devfield/taronaeo/hf_models/granite-3.3-2b-instruct-be.F32.gguf -t 8 -n 25 -p "Write me a dog walking business idea 1. " -no-cnv -ngl -1 --seed 1568795874

main: llama threadpool init, n_threads = 8                                                                                                                        
                                                                                                                                                                  
system_info: n_threads = 8 (n_threads_batch = 8) / 40 | zDNN : NNPA = 1 | NNPA_PARMBLKFORMAT_0 = 1 | NNPA_PARMBLKFORMAT_1 = 1 | CPU : VXE = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
                                                                                                                                                                  
sampler seed: 1568795874                                                                                                                                          
sampler params:                                                                                                                                                   
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000                                                           
        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096                                                               
        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800           
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000                                                                                                   
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist                           
generate: n_ctx = 4096, n_batch = 2048, n_predict = 25, n_keep = 0                                                                                                
                                                                                                                                                                  
Write me a dog walking business idea 1.                                                                                                                           
222221111111111111111...                                                                                                                                          
                                                                                                                                                                  
llama_perf_sampler_print:    sampling time =       5.98 ms /    36 runs   (    0.17 ms per token,  6021.07 tokens per second)                                     
llama_perf_context_print:        load time =    3833.95 ms                                                                                                        
llama_perf_context_print: prompt eval time =     381.54 ms /    11 tokens (   34.69 ms per token,    28.83 tokens per second)                                     
llama_perf_context_print:        eval time =    6255.99 ms /    24 runs   (  260.67 ms per token,     3.84 tokens per second)                                     
llama_perf_context_print:       total time =    6841.94 ms /    35 tokens                                                                                         
llama_perf_context_print:    graphs reused =         22                                                                                                           
ggml_zdnn_free: deallocating

But when we set the LLAMA_SET_ROWS=0 environment variable, inference works correctly.

$ LLAMA_SET_ROWS=0 build/bin/llama-cli -m /devfield/taronaeo/hf_models/granite-3.3-2b-instruct-be.F32.gguf -t 8 -n 25 -p "Write me a dog walking business idea 1. " -no-cnv -ngl -1 --seed 1568795874

main: llama threadpool init, n_threads = 8                                                                                                                        
                                                                                                                                                                  
system_info: n_threads = 8 (n_threads_batch = 8) / 40 | zDNN : NNPA = 1 | NNPA_PARMBLKFORMAT_0 = 1 | NNPA_PARMBLKFORMAT_1 = 1 | CPU : VXE = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
                                                                                                                                                                  
sampler seed: 1568795874                                                                                                                                          
sampler params:                                                                                                                                                   
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000                                                           
        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096                                                               
        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800           
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000                                                                                                   
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist                           
generate: n_ctx = 4096, n_batch = 2048, n_predict = 25, n_keep = 0                                                                                                
                                                                                                                                                                  
Write me a dog walking business idea 1.                                                                                                                           
2.                                                                                                                                                                
3.                                                                                                                                                                
4.                                                                                                                                                                
5.                                                                                                                                                                
                                                                                                                                                                  
1. **"Pawsome Play                                                                                                                                                
                                                                                                                                                                  
llama_perf_sampler_print:    sampling time =       6.35 ms /    36 runs   (    0.18 ms per token,  5669.29 tokens per second)                                     
llama_perf_context_print:        load time =    3787.06 ms                                                                                                        
llama_perf_context_print: prompt eval time =     380.83 ms /    11 tokens (   34.62 ms per token,    28.88 tokens per second)                                     
llama_perf_context_print:        eval time =    7180.11 ms /    24 runs   (  299.17 ms per token,     3.34 tokens per second)                                     
llama_perf_context_print:       total time =    7773.64 ms /    35 tokens                                                                                         
llama_perf_context_print:    graphs reused =          0                                                                                                           
ggml_zdnn_free: deallocating
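
For context, my understanding is that the LLAMA_SET_ROWS toggle selects between two KV-cache write paths when the graph is built: the old path copies the new K/V data into a view of the cache with ggml_cpy, while the new path scatters rows into the cache through ggml_set_rows using an I64 index tensor. A minimal sketch of the difference, assuming the standard ggml API; the helper name build_kv_write and the tensor names are illustrative, not the actual llama.cpp code:

```c
#include <stdbool.h>
#include "ggml.h"

// Hypothetical helper contrasting the two KV-cache write paths that the
// LLAMA_SET_ROWS toggle appears to select between (names illustrative).
static void build_kv_write(struct ggml_context * ctx,
                           struct ggml_cgraph  * gf,
                           struct ggml_tensor  * kv,      // cache buffer
                           struct ggml_tensor  * cur,     // new K/V rows
                           struct ggml_tensor  * row_ids, // I64 destination rows
                           int64_t head, int64_t n_tokens, int64_t n_embd,
                           bool use_set_rows) {
    if (!use_set_rows) {
        // LLAMA_SET_ROWS=0: contiguous copy into a view of the cache;
        // every backend already supports GGML_OP_CPY.
        struct ggml_tensor * view = ggml_view_1d(ctx, kv, n_tokens*n_embd,
                head*n_embd*ggml_element_size(kv));
        ggml_build_forward_expand(gf, ggml_cpy(ctx, cur, view));
    } else {
        // LLAMA_SET_ROWS=1: scatter rows via an I64 index tensor; backends
        // must either implement GGML_OP_SET_ROWS or report it unsupported
        // so the scheduler falls back to the CPU.
        ggml_build_forward_expand(gf, ggml_set_rows(ctx, kv, cur, row_ids));
    }
}
```

If the zDNN backend executes (or claims to support) GGML_OP_SET_ROWS without handling it correctly, that would explain why the new default path garbles the output while LLAMA_SET_ROWS=0 is fine.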

First Bad Commit

Suspect it is related to #14959.
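
If the zDNN backend does not actually implement GGML_OP_SET_ROWS, a possible stop-gap while the root cause is investigated is to report the op as unsupported so ggml's scheduler offloads it to the CPU backend. A minimal sketch, assuming the backend exposes the usual ggml device supports_op hook (the exact function in ggml-zdnn may differ):

```c
#include "ggml.h"
#include "ggml-backend.h"

// Hypothetical supports_op hook for the zDNN device: declining
// GGML_OP_SET_ROWS makes ggml_backend_sched fall back to the CPU backend
// for that op instead of computing it incorrectly on the accelerator.
static bool ggml_backend_zdnn_device_supports_op(ggml_backend_dev_t dev,
                                                 const struct ggml_tensor * op) {
    (void) dev;
    switch (op->op) {
        case GGML_OP_SET_ROWS:
            return false;  // not implemented on zDNN (assumption)
        // ... existing cases ...
        default:
            break;
    }
    return true;  // placeholder; the real backend has more checks
}
```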

Relevant log output

N/A
