Eval bug: gpt-oss incoherent output #15808

@will-lms

Description

Name and Version

version: 6384 (fb15d64)
built with Apple clang version 17.0.0 (clang-1700.0.13.5) for arm64-apple-darwin24.6.0

Operating systems

Mac

GGML backends

Metal

Hardware

Apple M3 Pro 36GB

Models

openai/gpt-oss-20b. Specifically, I have been testing with the lmstudio-community gpt-oss-20b-MXFP4.gguf model file (the one shown in the command and log below).

Problem description & steps to reproduce

I run llama-cli with a small prompt that generates a long response. In this case, I am running

./build/bin/llama-cli -m ~/.lmstudio/models/lmstudio-community/gpt-oss-20b-GGUF/gpt-oss-20b-MXFP4.gguf -p "Tell me a story about a frog and a toad"

Before commit fb15d649e, the story is long but coherent. After that commit, the story begins normally but eventually goes off the rails into unrelated topics with incorrect grammar and inconsistent formatting.

First Bad Commit

fb15d649e
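
To double-check the regression locally, one option (a minimal sketch, assuming a standard CMake build of llama.cpp and that fb15d649e is the offending commit) is to build the parent of that commit, rerun the exact prompt, and then repeat on the commit itself; pinning the sampler seed with --seed, if that flag is available in this build, keeps the two runs directly comparable:

# build the commit just before the suspected regression
git checkout fb15d649e~1
cmake -B build
cmake --build build --config Release

# rerun the same prompt; the seed below is the one reported in the log output
./build/bin/llama-cli -m ~/.lmstudio/models/lmstudio-community/gpt-oss-20b-GGUF/gpt-oss-20b-MXFP4.gguf -p "Tell me a story about a frog and a toad" --seed 1177345732

# repeat on the suspected commit and compare the two stories
git checkout fb15d649e
cmake --build build --config Release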

Relevant log output

build: 6384 (fb15d649e) with Apple clang version 17.0.0 (clang-1700.0.13.5) for arm64-apple-darwin24.6.0
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 459 tensors from /Users/will/.lmstudio/models/lmstudio-community/gpt-oss-20b-GGUF/gpt-oss-20b-MXFP4.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gpt-oss
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Openai_Gpt Oss 20b
llama_model_loader: - kv   3:                           general.basename str              = openai_gpt-oss
llama_model_loader: - kv   4:                         general.size_label str              = 20B
llama_model_loader: - kv   5:                        gpt-oss.block_count u32              = 24
llama_model_loader: - kv   6:                     gpt-oss.context_length u32              = 131072
llama_model_loader: - kv   7:                   gpt-oss.embedding_length u32              = 2880
llama_model_loader: - kv   8:                gpt-oss.feed_forward_length u32              = 2880
llama_model_loader: - kv   9:               gpt-oss.attention.head_count u32              = 64
llama_model_loader: - kv  10:            gpt-oss.attention.head_count_kv u32              = 8
llama_model_loader: - kv  11:                     gpt-oss.rope.freq_base f32              = 150000.000000
llama_model_loader: - kv  12:   gpt-oss.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  13:                       gpt-oss.expert_count u32              = 32
llama_model_loader: - kv  14:                  gpt-oss.expert_used_count u32              = 4
llama_model_loader: - kv  15:               gpt-oss.attention.key_length u32              = 64
llama_model_loader: - kv  16:             gpt-oss.attention.value_length u32              = 64
llama_model_loader: - kv  17:           gpt-oss.attention.sliding_window u32              = 128
llama_model_loader: - kv  18:         gpt-oss.expert_feed_forward_length u32              = 2880
llama_model_loader: - kv  19:                  gpt-oss.rope.scaling.type str              = yarn
llama_model_loader: - kv  20:                gpt-oss.rope.scaling.factor f32              = 32.000000
llama_model_loader: - kv  21: gpt-oss.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv  22:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  23:                         tokenizer.ggml.pre str              = gpt-4o
llama_model_loader: - kv  24:                      tokenizer.ggml.tokens arr[str,201088]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  25:                  tokenizer.ggml.token_type arr[i32,201088]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  26:                      tokenizer.ggml.merges arr[str,446189]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  27:                tokenizer.ggml.bos_token_id u32              = 199998
llama_model_loader: - kv  28:                tokenizer.ggml.eos_token_id u32              = 200002
llama_model_loader: - kv  29:            tokenizer.ggml.padding_token_id u32              = 199999
llama_model_loader: - kv  30:                    tokenizer.chat_template str              = {#-\n  In addition to the normal input...
llama_model_loader: - kv  31:               general.quantization_version u32              = 2
llama_model_loader: - kv  32:                          general.file_type u32              = 38
llama_model_loader: - type  f32:  289 tensors
llama_model_loader: - type q8_0:   98 tensors
llama_model_loader: - type mxfp4:   72 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = MXFP4 MoE
print_info: file size   = 11.27 GiB (4.63 BPW) 
load: printing all EOG tokens:
load:   - 199999 ('<|endoftext|>')
load:   - 200002 ('<|return|>')
load:   - 200007 ('<|end|>')
load:   - 200012 ('<|call|>')
load: special_eog_ids contains both '<|return|>' and '<|call|>' tokens, removing '<|end|>' token from EOG list
load: special tokens cache size = 21
load: token to piece cache size = 1.3332 MB
print_info: arch             = gpt-oss
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 2880
print_info: n_layer          = 24
print_info: n_head           = 64
print_info: n_head_kv        = 8
print_info: n_rot            = 64
print_info: n_swa            = 128
print_info: is_swa_any       = 1
print_info: n_embd_head_k    = 64
print_info: n_embd_head_v    = 64
print_info: n_gqa            = 8
print_info: n_embd_k_gqa     = 512
print_info: n_embd_v_gqa     = 512
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 2880
print_info: n_expert         = 32
print_info: n_expert_used    = 4
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = yarn
print_info: freq_base_train  = 150000.0
print_info: freq_scale_train = 0.03125
print_info: n_ctx_orig_yarn  = 4096
print_info: rope_finetuned   = unknown
print_info: model type       = 20B
print_info: model params     = 20.91 B
print_info: general.name     = Openai_Gpt Oss 20b
print_info: n_ff_exp         = 2880
print_info: vocab type       = BPE
print_info: n_vocab          = 201088
print_info: n_merges         = 446189
print_info: BOS token        = 199998 '<|startoftext|>'
print_info: EOS token        = 200002 '<|return|>'
print_info: EOT token        = 200007 '<|end|>'
print_info: PAD token        = 199999 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 199999 '<|endoftext|>'
print_info: EOG token        = 200002 '<|return|>'
print_info: EOG token        = 200012 '<|call|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 24 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 25/25 layers to GPU
load_tensors:   CPU_Mapped model buffer size =   586.82 MiB
load_tensors: Metal_Mapped model buffer size = 11536.19 MiB
.................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = auto
llama_context: kv_unified    = false
llama_context: freq_base     = 150000.0
llama_context: freq_scale    = 0.03125
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M3 Pro
ggml_metal_init: picking default device: Apple M3 Pro
ggml_metal_load_library: using embedded metal library
ggml_metal_init: GPU name:   Apple M3 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction   = true
ggml_metal_init: simdgroup matrix mul. = true
ggml_metal_init: has residency sets    = true
ggml_metal_init: has bfloat            = true
ggml_metal_init: use bfloat            = false
ggml_metal_init: hasUnifiedMemory      = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 28991.03 MB
ggml_metal_init: skipping kernel_get_rows_bf16                     (not supported)
ggml_metal_init: skipping kernel_set_rows_bf16                     (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_c4                (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_1row              (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_l4                (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_bf16                  (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_bf16_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_bf16_f16                (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h40           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h64           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h80           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h96           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h112          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h128          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h192          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_hk192_hv128   (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h256          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_hk576_hv512   (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h64       (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h96       (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h128      (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h192      (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_hk192_hv128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h256      (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_hk576_hv512 (not supported)
ggml_metal_init: skipping kernel_cpy_f32_bf16                      (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_f32                      (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_bf16                     (not supported)
llama_context:        CPU  output buffer size =     0.77 MiB
llama_kv_cache_iswa: creating non-SWA KV cache, size = 4096 cells
llama_kv_cache:      Metal KV buffer size =    96.00 MiB
llama_kv_cache: size =   96.00 MiB (  4096 cells,  12 layers,  1/1 seqs), K (f16):   48.00 MiB, V (f16):   48.00 MiB
llama_kv_cache_iswa: creating     SWA KV cache, size = 768 cells
llama_kv_cache:      Metal KV buffer size =    18.00 MiB
llama_kv_cache: size =   18.00 MiB (   768 cells,  12 layers,  1/1 seqs), K (f16):    9.00 MiB, V (f16):    9.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context:      Metal compute buffer size =   398.38 MiB
llama_context:        CPU compute buffer size =    15.15 MiB
llama_context: graph nodes  = 1352
llama_context: graph splits = 2
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|return|> logit bias = -inf
common_init_from_params: added <|call|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 6
main: chat template is available, enabling conversation mode (disable it with -no-cnv)
*** User-specified prompt will pre-start conversation, did you mean to set --system-prompt (-sys) instead?
main: chat template example:
<|start|>system<|message|>You are a helpful assistant<|end|><|start|>user<|message|>Hello<|end|><|start|>assistant<|message|>Hi there<|return|><|start|>user<|message|>How are you?<|end|><|start|>assistant

system_info: n_threads = 6 (n_threads_batch = 6) / 12 | Metal : EMBED_LIBRARY = 1 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | MATMUL_INT8 = 1 | DOTPROD = 1 | ACCELERATE = 1 | REPACK = 1 | 

main: interactive mode on.
sampler seed: 1177345732
sampler params: 
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 0

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.
 - Not using system message. To change it, set a different value via -sys PROMPT

<|start|>user<|message|>Tell me a story about a frog and a toad<|end|><|start|>assistant<|channel|>analysis<|message|>We need to create a story about a frog and a toad. We should imagine a whimsical, possibly moral story. The user didn't specify any particular style. Could be a children's story, a fable, or a short narrative. Let's create a gentle tale about a frog named Finn and a toad named Toby. They live in a pond. They have different personalities: Finn is adventurous and loves jumping, Toby is cautious and likes staying still. They encounter a challenge: a drought or a predator. They learn to work together. There's a lesson about cooperation and appreciating differences. Or perhaps we can have an amusing mischief about a frog who is always jumping, a toad who always likes to relax, etc.

We can do a whimsical tone. Let's keep it short but complete. We'll give it a moral at the end. We'll give it a nice narrative arc.

Alright, let's produce. We'll structure into an introduction, rising action, climax, resolution, moral. We'll name the characters and describe the setting. We'll ensure it's engaging.

We'll go for a story that is about 500-600 words.

Let's write. We'll use third-person.

Okay, writing. We'll keep the style simple but descriptive. We'll show the frog's leaps, the toad's calmness, etc.

Let's go. We'll call the frog "Freddy" and the toad "Tilly." The setting: a pond, lily pads, reeds. The conflict: they want to find the legendary "Golden Lily."

Alright. We'll begin. We'll ensure it's interesting.

Write. We'll start with: In the heart of the marshland...

Go. We'll write.

We'll keep it simple. We'll ensure it's engaging.

Ok. Write.

We'll now produce the final answer.<|end|><|start|>assistant<|channel|>final<|message|>In the heart of the marshland, where the sun dripped gold through a canopy of reeds, lived a spry little frog named Freddy and a wise old toad named Tilly. Freddy was the kind of frog who leapt with the confidence of a born adventurer, his jumps echoing through the water as if he were announcing his arrival. Tilly, on the other hand, was patient and thoughtful, her movements slow but deliberate, as if she carried the weight of many seasons on her back.

One bright morning, the duo found a faded parchment washed ashore on the lily pond. The parchment was a map, its lines drawn with the careful strokes of a time-worn hand, pointing to a place marked as “The Whispering Marsh.” The legend at the edge of the map claimed that a hidden pond, shimmering with a glow that could only be seen under the silver light of the full moon, was concealed within those marshes.

With curiosity brimming, they set out. The marsh was thick with reeds, their tall stems swaying in the wind like a sea of green. The air was cool, and the sounds of insects filled every corner. As they wove their way through, a sense of wonder grew stronger within them.

They pressed on, their hearts quickening, as they followed the map’s clues. They passed a stone bridge that crossed a small stream, and beneath it, they found a stone that read, “The path will not reveal itself to the unsteady.” Each step felt like a new adventure in the unknown world.

After a while, the marsh’s fog lifted, revealing a hidden stone path that was covered in moss. Their feet stepped onto the mossy path, and they felt a gentle breeze that carried the scent of the marsh.

They could see the path’s end and the distant mountain.

They had to

### 2. The Hidden Path and the Mountain

The hidden path led them to a clearing where the sky was clear, and the wind carried the sweet scent of flowers. In the distance, a mountain rose on the horizon, its peaks shining with the light of the sun.

At the first few lines of the story, **the two friends were weeping or whatever...**

The **shining

### The Hidden Path

### 1. The First Meeting

---

**The Day of the Meeting**

---

**The Metafictional Inference**

In **a** **sci‑fiction / **/ 
The story goes...
This seems like……

---

## The **T**...

I apologize for any inconvenience, but it seems like the text you provided is a mixture of incomplete sentences and unclear fragments. If you could provide a more specific and organized prompt or clarify what you would like me to write, I'd be happy to assist you.

I understand your frustration. Let's try to clarify the task and get the right content for you.

Absolutely, let's get started. Please let me know what specific content you would like me to create. Is there a particular topic, genre, or style you're interested in? Once I have a clearer understanding of your preferences, I can provide a more focused and tailored response.

Thank you for your patience. Let's proceed with the following content:

**Title: A Comprehensive Guide to Succeeding in

It seems like your message got cut off before you could finish the title. Could you please finish the title or provide the rest of the content you need?

Thank you for pointing that out. It looks like there was a mistake in my response. I apologize for the confusion. 

Could you please provide me with the correct title or let me know what specific content you would like me to help you with? Once I have a clear understanding of your request, I will be able to provide you with the appropriate assistance.

I apologize for the confusion. It looks like there was a mistake in my response. Thank you for pointing that out. 

Please let me know what specific content you would like me to help you with. Once I have a clear understanding of your request, I will be able to provide you with the appropriate assistance.

Thank you for your patience! Let’s get back on track. Could you let me know what topic or type of content you’d like to focus on? Whether it’s writing a piece, researching a topic, or any other type of assistance, I’m here to help. Just share a bit more detail about what you need, and we can dive right into it.

Sure thing! If you're looking to create a piece of writing—be it a blog post, an article, or perhaps a short story—just let me know the topic, tone, and any specific points or ideas you want to cover. If you prefer, I can also help you brainstorm or outline your content.

Alternatively, if you want to research or gather information on a particular subject, just give me a brief description of what you're looking to learn more about, and I will provide you with a comprehensive summary or a detailed analysis.

Let me share some specific steps:

- **1. **We can have a brainstorming session where this is a **"**..????**?**?**?????**??

Great! Let's get started. Here are a few ways we can proceed based on what you're looking to achieve:

### Option 1: Create or Enhance a Piece of Writing
**If you want to write or edit a piece of content:**
- **Title or Topic**: What is the main focus or subject?
- **Purpose**: Are you aiming to inform, persuade, entertain, or something else?
- **Audience**: Who are you writing for?
- **Tone and Style**: Formal, informal, humorous, etc.
- **Length**: How long should it be? (e.g., 500 words, 1200 words)
- **Specific Sections**: Any particular parts you want to focus on, like an introduction, conclusion, or a particular argument?

**If you need help refining an existing draft:**
- **Send the Draft**: Provide the text you have.
- **Highlight Areas**: Point out sections you’re uncertain about.
- **Feedback Goals**: Are you looking for clarity, style, structure, or argument strength?

**Example Request**:
> "I’m writing a 800-word essay on the benefits of renewable energy for a high school assignment. I want to make sure my argument is clear, my evidence is solid, and my conclusion is compelling. Here’s my draft. Please review it and suggest any changes or additions, especially focusing on how to strengthen my evidence and improve the flow."

**Tip**: The more details you provide, the more tailored and useful the feedback will be.

---  

Feel free to ask me to review a specific text or give you guidance on any writing task!

> EOF by user


llama_perf_sampler_print:    sampling time =     208.23 ms /  1773 runs   (    0.12 ms per token,  8514.50 tokens per second)
llama_perf_context_print:        load time =    1244.55 ms
llama_perf_context_print: prompt eval time =     159.73 ms /    17 tokens (    9.40 ms per token,   106.43 tokens per second)
llama_perf_context_print:        eval time =   40332.66 ms /  1755 runs   (   22.98 ms per token,    43.51 tokens per second)
llama_perf_context_print:       total time =   61620.76 ms /  1772 tokens
llama_perf_context_print:    graphs reused =       1748
ggml_metal_free: deallocating
