Description
Name and Version
(venv) $ build/bin/llama-cli --version
version: 6199 (f08c4c0)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-cli
Command line
build/bin/llama-cli -m ~/work/ai/models/converted/gpt-oss-20b.gguf -c 0 -fa --jinja -p "Test" --verbose-prompt -ngl 15 --no-warmup -sp --reasoning-format none
Problem description & steps to reproduce
This issue can be reproduced using the gpt-oss-20b.gguf model:
> What is the capital of Sweden?
main: number of tokens in prompt = 17
83 -> 't'
497 -> 'art'
91 -> '|'
29 -> '>'
1428 -> 'user'
200008 -> '<|message|>'
54 -> 'W'
4827 -> 'What'
382 -> ' is'
290 -> ' the'
9029 -> ' capital'
328 -> ' of'
42009 -> ' Sweden'
30 -> '?'
200007 -> '<|end|>'
200006 -> '<|start|>'
173781 -> 'assistant'
<|channel|>analysis<|message|>User asks: "What is the capital of Sweden?" Answer: Stockholm. Provide concise.<|end|><|start|>assistant<|channel|>final<|message|>The capital of Sweden is **Stockholm**.<|return|>
It looks like the <|start|> token is getting "cut off" here.
$ gdb --args build/bin/llama-cli -m ~/work/ai/models/converted/gpt-oss-20b.gguf -c 0 -fa --jinja -p "Test" --verbose-prompt -ngl 15 --no-warmup -sp
So when we enter a prompt and press enter, this will break out of this loop in main.cpp:
std::string line;
bool another_line = true;
do {
    another_line = console::readline(line, params.multiline_input);
    buffer += line;
} while (another_line);
...
bool format_chat = params.conversation_mode && params.enable_chat_template;
std::string user_inp = format_chat
    ? chat_add_and_format("user", std::move(buffer))
    : std::move(buffer);
And this will call the chat_add_and_format lambda:
auto chat_add_and_format = [&chat_msgs, &chat_templates](const std::string & role, const std::string & content) {
    common_chat_msg new_msg;
    new_msg.role = role;
    new_msg.content = content;
    auto formatted = common_chat_format_single(chat_templates.get(), chat_msgs, new_msg, role == "user", g_params->use_jinja);
    chat_msgs.push_back(new_msg);
    LOG_DBG("formatted: '%s'\n", formatted.c_str());
    return formatted;
};
Which will call into the common_chat_format_single function:
std::string common_chat_format_single(
        const struct common_chat_templates * tmpls,
        const std::vector<common_chat_msg> & past_msg,
        const common_chat_msg & new_msg,
        bool add_ass,
        bool use_jinja) {
    common_chat_templates_inputs inputs;
    inputs.use_jinja = use_jinja;
    inputs.add_bos = tmpls->add_bos;
    inputs.add_eos = tmpls->add_eos;
    std::string fmt_past_msg;
    if (!past_msg.empty()) {
        inputs.messages = past_msg;
        inputs.add_generation_prompt = false;
        fmt_past_msg = common_chat_templates_apply(tmpls, inputs).prompt;
    }
    std::ostringstream ss;
    // if the past_msg ends with a newline, we must preserve it in the formatted version
    if (add_ass && !fmt_past_msg.empty() && fmt_past_msg.back() == '\n') {
        ss << "\n";
    };
    // format chat with new_msg
    inputs.messages.push_back(new_msg);
    inputs.add_generation_prompt = add_ass;
    auto fmt_new_msg = common_chat_templates_apply(tmpls, inputs).prompt;
    // get the diff part
    ss << fmt_new_msg.substr(fmt_past_msg.size(), fmt_new_msg.size() - fmt_past_msg.size());
    return ss.str();
}
Inspecting fmt_past_msg we have:
(gdb) p fmt_past_msg
$23 = "<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.\nKnowledge cutoff: 2024-06\nCurrent date: 2025-08-19\n\nReasoning: medium\n\n# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|><|start|>user<|message|>Test<|end|><|start|>assistant<|channel|>final<|message|>assistant<|channel|>analysis<|message|>The user says \"Test\". Probably they want to test the chat. We should respond with something acknowledging. Maybe a simple \"Hello! How can I help you?\" Or ask what they want. We'll respond appropriately.<|end|><|start|>assistant<|channel|>final<|message|>Got it! How can I help you today?<|return|>"
Notice that the last token is <|return|> here.
And the fmt_new_msg is:
$24 = "<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.\nKnowledge cutoff: 2024-06\nCurrent date: 2025-08-19\n\nReasoning: medium\n\n# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|><|start|>user<|message|>Test<|end|><|start|>assistant<|channel|>final<|message|>assistant<|channel|>analysis<|message|>The user says \"Test\". Probably they want to test the chat. We should respond with something acknowledging. Maybe a simple \"Hello! How can I help you?\" Or ask what they want. We'll respond appropriately.<|end|><|start|>assistant<|channel|>final<|message|>Got it! How can I help you today?<|end|><|start|>user<|message|>What is the capital of Sweden?<|end|><|start|>assistant"
Notice that the token <|return|> has been replaced with <|end|>.
The template engine will iterate through all the messages, and when it reaches the last one the following branch is taken, since add_generation_prompt was set to false above:
(gdb) call (void)printf("%s", tmpls->template_default->source_.c_str())
....
{%- elif loop.last and not add_generation_prompt %}
{{- "<|start|>assistant<|channel|>final<|message|>" + message.content + "<|return|>" }}So this will produce a string that ends with <|return|> (and not <|end|>). And
later when we try to get the substring of the new message it will start at the
wrong position because the template replaced <|return|> with <|end|> in the same
location, causing the substring to begin mid-token in <|start|>
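To make the offset arithmetic concrete, here is a minimal, self-contained sketch of that substring step. The strings are shortened stand-ins for the two renders above, not the exact output:
#include <cstdio>
#include <string>

int main() {
    // Render of the past messages: the template closes the last assistant
    // message with <|return|> (10 characters). Shortened for the example.
    std::string fmt_past_msg =
        "<|start|>assistant<|channel|>final<|message|>Got it!<|return|>";
    // Render of past + new messages: the same assistant message is now closed
    // with <|end|> (7 characters), so everything that follows sits 3
    // characters earlier than fmt_past_msg.size() suggests.
    std::string fmt_new_msg =
        "<|start|>assistant<|channel|>final<|message|>Got it!<|end|>"
        "<|start|>user<|message|>What is the capital of Sweden?<|end|>"
        "<|start|>assistant";
    // The diff step from common_chat_format_single (the explicit length in the
    // real code is equivalent to the single-argument substr here).
    std::string diff = fmt_new_msg.substr(fmt_past_msg.size());
    // Prints: tart|>user<|message|>What is the capital of Sweden?<|end|><|start|>assistant
    printf("%s\n", diff.c_str());
    return 0;
}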
Let's check how the returned formatted string is actually used, and whether this really matters:
std::string user_inp = format_chat
    ? chat_add_and_format("user", std::move(buffer))
    : std::move(buffer);
const auto line_pfx = common_tokenize(ctx, params.input_prefix, false, true);
const auto line_inp = common_tokenize(ctx, user_inp, false, format_chat);
const auto line_sfx = common_tokenize(ctx, params.input_suffix, false, true);
(gdb) p user_inp
$13 = "tart|>user<|message|>What is the capitlal of Sweden?<|end|><|start|>assistant"
(gdb) p line_inp
$14 = std::vector of length 17, capacity 77 = {83, 497, 91, 29, 1428, 200008, 4827, 382, 290, 41415, 46006, 328, 42009, 30, 200007,
200006, 173781}
So let's set user_inp to the correct value and then tokenize it:
(gdb) call (char*)strcpy((char*)user_inp.data(), "<|start|>user<|message|>What is the capital of Sweden?<|end|><|start|>assistant")
$17 = 0x555570d68e40 "<|start|>user<|message|>What is the capital of Sweden?<|end|><|start|>assistant"
(gdb) p line_inp
$19 = std::vector of length 13, capacity 76 = {200006, 1428, 200008, 4827, 382, 290, 9029, 328, 42009, 30, 200007, 200006, 105782}
So we have the following difference in tokens:
vector of length 17: {83, 497, 91, 29, 1428, 200008, 4827, 382, 290, 41415, 46006, 328, 42009, 30, 200007, 200006, 173781}
vector of length 13: {200006, 1428, 200008, 4827, 382, 290, 9029, 328, 42009, 30, 200007, 200006, 105782}
The first four tokens of the broken version (83, 497, 91, 29, i.e. 't', 'art', '|', '>') are just the remains of the cut-off <|start|> token. I'm not sure if this is a known issue or not, but I was not able to find anything from a quick search. The model also seems to work fine despite this, so I'm looking for some feedback on whether it matters.
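As a side note, and purely as an illustration rather than a proposed patch: the diff in common_chat_format_single is only well defined when fmt_past_msg is a byte-for-byte prefix of fmt_new_msg. A hypothetical check along these lines (not code from the repo) would make the gpt-oss case fail loudly instead of silently cutting into <|start|>:
#include <cassert>
#include <string>

// Hypothetical helper, for illustration only: take the diff of the two
// renders, but assert the prefix assumption that the substr() trick relies on.
static std::string chat_diff(const std::string & fmt_past_msg,
                             const std::string & fmt_new_msg) {
    // With the gpt-oss template this assertion fires, because the last
    // assistant message is rendered with <|return|> in fmt_past_msg but
    // with <|end|> in fmt_new_msg.
    assert(fmt_new_msg.compare(0, fmt_past_msg.size(), fmt_past_msg) == 0);
    return fmt_new_msg.substr(fmt_past_msg.size());
}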
First Bad Commit
No response