
Conversation

vinkal-chudgar
Contributor

@vinkal-chudgar vinkal-chudgar commented Sep 26, 2025

baseline-perplexity-16192.txt
afterfix-perplexity-16192.txt
baseline-bench-16192.txt
afterfix-bench-16192.txt
ci.zip

Fixes: #16192

Summary

Older MiniCPM GGUFs do not include the scaling metadata keys. The loader previously treated these keys as required, so quantization failed with "key not found in model". This PR treats the keys as optional and supplies legacy default values so that older files quantize and load.

Problem

Some MiniCPM GGUFs do not contain:

  • minicpm.embedding_scale
  • minicpm.residual_scale
  • minicpm.logit_scale

The loader currently treats these as required, so quantization fails with:
key not found in model: minicpm.embedding_scale

Solution

In the LLM_ARCH_MINICPM branch of the loader, initialize the MiniCPM scaling parameters with the legacy MiniCPM values:

  • f_embedding_scale = 12.0f
  • f_residual_scale = 1.4f / sqrtf((float) n_layer)
  • f_logit_scale = 256.0f / n_embd (guards to 1.0f if n_embd == 0)

Read the three GGUF keys with required = false. When the GGUF provides the keys, their values override the defaults; otherwise the legacy defaults are used.
Newer GGUFs that already include these keys are unaffected.
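
A minimal sketch of the resulting loader code (not the exact diff in this PR), assuming the loader's ml.get_key(kid, value, required) helper and the existing LLM_KV_EMBEDDING_SCALE / LLM_KV_RESIDUAL_SCALE / LLM_KV_LOGIT_SCALE identifiers:

    // LLM_ARCH_MINICPM branch of the hparams loader:
    // seed the scales with the legacy defaults, then let GGUF metadata override them
    hparams.f_embedding_scale = 12.0f;
    hparams.f_residual_scale  = 1.4f / sqrtf((float) hparams.n_layer);
    hparams.f_logit_scale     = hparams.n_embd > 0 ? 256.0f / (float) hparams.n_embd : 1.0f;

    // required = false: a missing key keeps the default instead of aborting the load
    ml.get_key(LLM_KV_EMBEDDING_SCALE, hparams.f_embedding_scale, /*required =*/ false);
    ml.get_key(LLM_KV_RESIDUAL_SCALE,  hparams.f_residual_scale,  /*required =*/ false);
    ml.get_key(LLM_KV_LOGIT_SCALE,     hparams.f_logit_scale,     /*required =*/ false);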

User impact

  • Older MiniCPM GGUFs: files that omit the three scaling keys now quantize and load successfully. The loader uses legacy defaults instead of failing.
  • Newer MiniCPM GGUFs: unchanged. If the keys are present, their values are used.

Validation

Functional (older + newer MiniCPM)

  • Older MiniCPM 2B GGUF (without scaling keys):
    • llama-quantize completes;
    • At load time, llama-cli prints the effective MiniCPM scales it will use. Excerpt from the log:
      f_embedding_scale = 12.000000
      f_residual_scale = 0.221359
      f_logit_scale = 1.1e-01
      
      Matching calculations from the legacy defaults (see the recomputation sketch after this list):
      • f_residual_scale is 1.4 / sqrt(n_layer). With n_layer = 40, this equals 0.221359.
      • f_logit_scale is 256 / n_embd. With n_embd = 2304, this equals 0.111111 (printed as 1.1e-01).
    • llama-cli chat runs normally (sample Q&A prompt verified).
  • Newer MiniCPM4 0.5B GGUF (with scaling keys):
    • llama-quantize completes cleanly using the metadata values.
    • Metadata is present and used; no defaults are needed, and behavior is unchanged.
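
For cross-checking the printed scales, here is a small standalone program (hypothetical, not part of the PR) that recomputes the legacy defaults for the MiniCPM 2B shape reported in the log (n_layer = 40, n_embd = 2304):

    #include <cmath>
    #include <cstdio>

    int main() {
        const int n_layer = 40;    // MiniCPM 2B
        const int n_embd  = 2304;  // MiniCPM 2B

        const float f_embedding_scale = 12.0f;
        const float f_residual_scale  = 1.4f / std::sqrt((float) n_layer);
        const float f_logit_scale     = n_embd > 0 ? 256.0f / (float) n_embd : 1.0f;

        // expected: 12.000000, 0.221359, 0.111111 (llama-cli prints the last as 1.1e-01)
        printf("f_embedding_scale = %f\n", f_embedding_scale);
        printf("f_residual_scale  = %f\n", f_residual_scale);
        printf("f_logit_scale     = %f\n", f_logit_scale);
        return 0;
    }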

Perplexity (CPU-only)

Command used in both runs:

./build-<base|fix>/bin/llama-perplexity \
  -m ~/models/tinyllama/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
  -t 22 -ngl 0 \
  -f ~/data/wikitext-2-raw/wiki.test.100k.raw

Results:

  • Baseline: PPL 17.6138 ± 0.5025, throughput 55.27 tok/s, 27136 tokens, 53 chunks
  • After fix: PPL 17.6138 ± 0.5025, throughput 57.35 tok/s, 27136 tokens, 53 chunks

Conclusion: Perplexity is identical. Throughput difference is within normal CPU variance.

Raw logs (attached)

  • baseline-perplexity-16192.txt
  • afterfix-perplexity-16192.txt

llama-bench (CPU-only)

Command used in both runs:

./build-<base|fix>/bin/llama-bench \
  -m ~/models/tinyllama/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
  -t 22 -ngl 0 -r 3 --no-warmup --progress -fa 1 -o md

Results:

  • Baseline:
    • pp512: 54.55 ± 4.36 t/s
    • tg128: 0.17 ± 0.00 t/s
  • After fix:
    • pp512: 54.96 ± 3.85 t/s
    • tg128: 0.17 ± 0.00 t/s

No regression observed.

Raw logs (attached)

  • baseline-bench-16192.txt
  • afterfix-bench-16192.txt

Local CI (CPU-only)

Executed from repo root:

rm -rf ./tmp && mkdir -p ./tmp/results ./tmp/mnt
bash ./ci/run.sh ./tmp/results ./tmp/mnt 2>&1 | tee ./tmp/results/ci.log

Outcome:

  • Exit code: 0
  • All CTest suites in this run passed
  • The local CI log contains a few “ERROR 404: Not Found.” messages; these did not affect the run.

CI Log attached: ci.zip

Style

Formatted with clang-format 18.1.3; only the lines changed in this PR were formatted.

Environment

  • OS: Ubuntu 24.04 on WSL2 (CPU-only)
  • Compiler: GCC 13.3
  • Build flags: -DGGML_CUDA=OFF, -DGGML_NATIVE=ON
  • Threads: -t 22
  • Model for perf checks: TinyLlama-1.1B-chat v1.0, Q4_K_M (GGUF v3)
  • Dataset for PPL: WikiText-2 test, 100 KB slice wiki.test.100k.raw

Build SHAs used

Baseline: c498fc8
After fix: 6337679

Older MiniCPM GGUFs do not include the scaling metadata keys (minicpm.embedding_scale, minicpm.residual_scale, minicpm.logit_scale). The loader currently treats these as required, so quantization fails with:

    key not found in model: minicpm.embedding_scale

This change restores backward compatibility by treating these keys as optional in the loader and using the older MiniCPM scaling values:

    embedding_scale = 12.0f
    residual_scale  = 1.4f / sqrt(n_layer)
    logit_scale     = 256.0f / n_embd

When the GGUF provides the keys, their values override the defaults; otherwise the legacy defaults are used. Newer GGUFs that already include these keys are unaffected.

Fixes: ggml-org#16192
Signed-off-by: Vinkal Chudgar <[email protected]>
@vinkal-chudgar vinkal-chudgar marked this pull request as ready for review September 26, 2025 14:02
Committed as suggested. Thanks!

Co-authored-by: Sigbjørn Skjæret <[email protected]>
@CISC
Collaborator

CISC commented Sep 26, 2025

Thank you, nice work, will merge when CI finishes. :)

@CISC CISC merged commit 72b24d9 into ggml-org:master Sep 26, 2025
64 of 67 checks passed