
Conversation

vinkal-chudgar
Contributor

@vinkal-chudgar vinkal-chudgar commented Sep 26, 2025

baseline-perplexity-16192.txt
afterfix-perplexity-16192.txt
baseline-bench-16192.txt
afterfix-bench-16192.txt
ci.zip

Fixes: #16192

Summary

Older MiniCPM GGUFs do not include the scaling metadata keys. The loader previously treated these keys as required, so quantization failed with "key not found in model". This PR treats the keys as optional and supplies legacy default values so that older files quantize and load.

Problem

Some MiniCPM GGUFs do not contain:

  • minicpm.embedding_scale
  • minicpm.residual_scale
  • minicpm.logit_scale

The loader currently treats these as required, so quantization fails with:
key not found in model: minicpm.embedding_scale

Solution

In the LLM_ARCH_MINICPM branch of the loader, initialize the MiniCPM scaling parameters with the legacy MiniCPM values:

  • f_embedding_scale = 12.0f
  • f_residual_scale = 1.4f / sqrtf((float) n_layer)
  • f_logit_scale = 256.0f / n_embd (guards to 1.0f if n_embd == 0)

Read the three GGUF keys with required = false. When the GGUF provides the keys, their values override the defaults; otherwise the legacy defaults are used.
Newer GGUFs that already include these keys are unaffected.
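
A minimal sketch of the resulting loader code (not the exact diff in this PR), assuming the loader's ml.get_key(kid, value, required) helper and the existing LLM_KV_EMBEDDING_SCALE / LLM_KV_RESIDUAL_SCALE / LLM_KV_LOGIT_SCALE identifiers:

    // LLM_ARCH_MINICPM branch of the hparams loader:
    // seed the scales with the legacy defaults, then let GGUF metadata override them
    hparams.f_embedding_scale = 12.0f;
    hparams.f_residual_scale  = 1.4f / sqrtf((float) hparams.n_layer);
    hparams.f_logit_scale     = hparams.n_embd > 0 ? 256.0f / (float) hparams.n_embd : 1.0f;

    // required = false: a missing key keeps the default instead of aborting the load
    ml.get_key(LLM_KV_EMBEDDING_SCALE, hparams.f_embedding_scale, /*required =*/ false);
    ml.get_key(LLM_KV_RESIDUAL_SCALE,  hparams.f_residual_scale,  /*required =*/ false);
    ml.get_key(LLM_KV_LOGIT_SCALE,     hparams.f_logit_scale,     /*required =*/ false);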

User impact

  • Older MiniCPM GGUFs: files that omit the three scaling keys now quantize and load successfully. The loader uses legacy defaults instead of failing.
  • Newer MiniCPM GGUFs: unchanged. If the keys are present, their values are used.

Validation

Functional (older + newer MiniCPM)

  • Older MiniCPM 2B GGUF (without scaling keys):
    • llama-quantize completes;
    • At load time, llama-cli prints the effective MiniCPM scales it will use. Excerpt from the log:
      f_embedding_scale = 12.000000
      f_residual_scale = 0.221359
      f_logit_scale = 1.1e-01
      
      Matching calculations from the legacy defaults (see the recomputation sketch after this list):
      • f_residual_scale is 1.4 / sqrt(n_layer). With n_layer = 40, this equals 0.221359.
      • f_logit_scale is 256 / n_embd. With n_embd = 2304, this equals 0.111111 (printed as 1.1e-01).
    • llama-cli chat runs normally (sample Q&A prompt verified).
  • Newer MiniCPM4 0.5B GGUF (with scaling keys):
    • llama-quantize completes cleanly using the metadata values.
    • Metadata is present and used; no defaults are needed, and behavior is unchanged.
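
For cross-checking the printed scales, here is a small standalone program (hypothetical, not part of the PR) that recomputes the legacy defaults for the MiniCPM 2B shape reported in the log (n_layer = 40, n_embd = 2304):

    #include <cmath>
    #include <cstdio>

    int main() {
        const int n_layer = 40;    // MiniCPM 2B
        const int n_embd  = 2304;  // MiniCPM 2B

        const float f_embedding_scale = 12.0f;
        const float f_residual_scale  = 1.4f / std::sqrt((float) n_layer);
        const float f_logit_scale     = n_embd > 0 ? 256.0f / (float) n_embd : 1.0f;

        // expected: 12.000000, 0.221359, 0.111111 (llama-cli prints the last as 1.1e-01)
        printf("f_embedding_scale = %f\n", f_embedding_scale);
        printf("f_residual_scale  = %f\n", f_residual_scale);
        printf("f_logit_scale     = %f\n", f_logit_scale);
        return 0;
    }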

Perplexity (CPU-only)

Command used in both runs:

./build-<base|fix>/bin/llama-perplexity \
  -m ~/models/tinyllama/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
  -t 22 -ngl 0 \
  -f ~/data/wikitext-2-raw/wiki.test.100k.raw

Results:

  • Baseline: PPL 17.6138 ± 0.5025, throughput 55.27 tok/s, 27136 tokens, 53 chunks
  • After fix: PPL 17.6138 ± 0.5025, throughput 57.35 tok/s, 27136 tokens, 53 chunks

Conclusion: Perplexity is identical. Throughput difference is within normal CPU variance.

Raw logs (attached)

  • baseline-perplexity-16192.txt
  • afterfix-perplexity-16192.txt

llama-bench (CPU-only)

Command used in both runs:

./build-<base|fix>/bin/llama-bench \
  -m ~/models/tinyllama/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
  -t 22 -ngl 0 -r 3 --no-warmup --progress -fa 1 -o md

Results:

  • Baseline:
    • pp512: 54.55 ± 4.36 t/s
    • tg128: 0.17 ± 0.00 t/s
  • After fix:
    • pp512: 54.96 ± 3.85 t/s
    • tg128: 0.17 ± 0.00 t/s

No regression observed.

Raw logs (attached)

  • baseline-bench-16192.txt
  • afterfix-bench-16192.txt

Local CI (CPU-only)

Executed from repo root:

rm -rf ./tmp && mkdir -p ./tmp/results ./tmp/mnt
bash ./ci/run.sh ./tmp/results ./tmp/mnt 2>&1 | tee ./tmp/results/ci.log

Outcome:

  • Exit code: 0
  • All CTest suites in this run passed
  • The local CI log contains a few “ERROR 404: Not Found.” messages; these did not affect the run.

CI Log attached: ci.zip

Style

Formatted with clang-format 18.1.3; only the lines changed in this PR were formatted.

Environment

  • OS: Ubuntu 24.04 on WSL2 (CPU-only)
  • Compiler: GCC 13.3
  • Build flags: -DGGML_CUDA=OFF, -DGGML_NATIVE=ON
  • Threads: -t 22
  • Model for perf checks: TinyLlama-1.1B-chat v1.0, Q4_K_M (GGUF v3)
  • Dataset for PPL: WikiText-2 test, 100 KB slice wiki.test.100k.raw

Build SHAs used

Baseline: c498fc8
After fix: 6337679

Older MiniCPM GGUFs do not include the scaling metadata keys (minicpm.embedding_scale, minicpm.residual_scale, minicpm.logit_scale). The loader currently treats these as required, so quantization fails with:

    key not found in model: minicpm.embedding_scale

This change restores backward compatibility by treating these keys as optional in the loader and using the older MiniCPM scaling values:

    embedding_scale = 12.0f
    residual_scale  = 1.4f / sqrt(n_layer)
    logit_scale     = 256.0f / n_embd

When the GGUF provides the keys, their values override the defaults; otherwise the legacy defaults are used. Newer GGUFs that already include these keys are unaffected.

Fixes: ggml-org#16192
Signed-off-by: Vinkal Chudgar <[email protected]>
@vinkal-chudgar vinkal-chudgar marked this pull request as ready for review September 26, 2025 14:02
Committed as suggested. Thanks!

Co-authored-by: Sigbjørn Skjæret <[email protected]>
@CISC
Collaborator

CISC commented Sep 26, 2025

Thank you, nice work, will merge when CI finishes. :)

@CISC CISC merged commit 72b24d9 into ggml-org:master Sep 26, 2025
64 of 67 checks passed