Misc. bug: chat template diff logic causes incomplete tokenization with GPT-OSS model due to <|return|> vs <|end|> inconsistency #15417

@danbev

Description

Name and Version

(venv) $ build/bin/llama-cli --version
version: 6199 (f08c4c0)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-cli

Command line

build/bin/llama-cli -m ~/work/ai/models/converted/gpt-oss-20b.gguf -c 0 -fa --jinja -p "Test" --verbose-prompt -ngl 15 --no-warmup -sp --reasoning-format none

Problem description & steps to reproduce

This issue can be reproduced using the gpt-oss-20b.gguf model:

> What is the capital of Sweden?
main: number of tokens in prompt = 17
    83 -> 't'
   497 -> 'art'
    91 -> '|'
    29 -> '>'
  1428 -> 'user'
200008 -> '<|message|>'
    54 -> 'W'
  4827 -> 'What'
   382 -> ' is'
   290 -> ' the'
  9029 -> ' capital'
   328 -> ' of'
 42009 -> ' Sweden'
    30 -> '?'
200007 -> '<|end|>'
200006 -> '<|start|>'
173781 -> 'assistant'
<|channel|>analysis<|message|>User asks: "What is the capital of Sweden?" Answer: Stockholm. Provide concise.<|end|><|start|>assistant<|channel|>final<|message|>The capital of Sweden is **Stockholm**.<|return|>

> 

It looks like the <|start|> token is getting cut off here: tokens 83, 497, 91, 29 spell out 'tart|>' as plain text instead of the single special token 200006 ('<|start|>').

$ gdb --args build/bin/llama-cli -m ~/work/ai/models/converted/gpt-oss-20b.gguf -c 0 -fa --jinja -p "Test" --verbose-prompt -ngl 15 --no-warmup -sp

So when we enter a prompt and press enter, this will break out of the following loop in main.cpp:

                std::string line;
                bool another_line = true;
                do {
                    another_line = console::readline(line, params.multiline_input);
                    buffer += line;
                } while (another_line);
                ...

                    bool format_chat = params.conversation_mode && params.enable_chat_template;
                    std::string user_inp = format_chat
                        ? chat_add_and_format("user", std::move(buffer))
                        : std::move(buffer);

And this will call the chat_add_and_format lambda:

    auto chat_add_and_format = [&chat_msgs, &chat_templates](const std::string & role, const std::string & content) {
        common_chat_msg new_msg;
        new_msg.role = role;
        new_msg.content = content;
        auto formatted = common_chat_format_single(chat_templates.get(), chat_msgs, new_msg, role == "user", g_params->use_jinja);
        chat_msgs.push_back(new_msg);
        LOG_DBG("formatted: '%s'\n", formatted.c_str());
        return formatted;
    };

Which will call into the common_chat_format_single function:

std::string common_chat_format_single(
        const struct common_chat_templates * tmpls,
        const std::vector<common_chat_msg> & past_msg,
        const common_chat_msg & new_msg,
        bool add_ass,
        bool use_jinja) {

    common_chat_templates_inputs inputs;
    inputs.use_jinja = use_jinja;
    inputs.add_bos = tmpls->add_bos;
    inputs.add_eos = tmpls->add_eos;

    std::string fmt_past_msg;
    if (!past_msg.empty()) {
        inputs.messages = past_msg;
        inputs.add_generation_prompt = false;
        fmt_past_msg = common_chat_templates_apply(tmpls, inputs).prompt;
    }
    std::ostringstream ss;
    // if the past_msg ends with a newline, we must preserve it in the formatted version
    if (add_ass && !fmt_past_msg.empty() && fmt_past_msg.back() == '\n') {
        ss << "\n";
    };
    // format chat with new_msg
    inputs.messages.push_back(new_msg);
    inputs.add_generation_prompt = add_ass;
    auto fmt_new_msg = common_chat_templates_apply(tmpls, inputs).prompt;
    // get the diff part
    ss << fmt_new_msg.substr(fmt_past_msg.size(), fmt_new_msg.size() - fmt_past_msg.size());
    return ss.str();
}
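
The last two lines assume that fmt_past_msg is a verbatim prefix of fmt_new_msg, so that everything past that fixed offset is the newly added part. A minimal standalone sketch (toy strings, not llama.cpp code) of the happy path this relies on:

    #include <iostream>
    #include <string>

    int main() {
        // toy stand-ins for the two template renderings
        std::string fmt_past_msg = "<|start|>user<|message|>Test<|end|>";
        std::string fmt_new_msg  = "<|start|>user<|message|>Test<|end|>"
                                   "<|start|>user<|message|>Hi<|end|><|start|>assistant";

        // correct only because fmt_past_msg is a literal prefix of fmt_new_msg
        std::cout << fmt_new_msg.substr(fmt_past_msg.size()) << "\n";
        // prints: <|start|>user<|message|>Hi<|end|><|start|>assistant
    }

As we will see below, the GPT-OSS template breaks this prefix assumption.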

Inspecting fmt_past_msg we have:

(gdb) p fmt_past_msg
$23 = "<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.\nKnowledge cutoff: 2024-06\nCurrent date: 2025-08-19\n\nReasoning: medium\n\n# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|><|start|>user<|message|>Test<|end|><|start|>assistant<|channel|>final<|message|>assistant<|channel|>analysis<|message|>The user says \"Test\". Probably they want to test the chat. We should respond with something acknowledging. Maybe a simple \"Hello! How can I help you?\" Or ask what they want. We'll respond appropriately.<|end|><|start|>assistant<|channel|>final<|message|>Got it! How can I help you today?<|return|>"

Notice that the last token is <|return|> here.

And the fmt_new_msg is:

$24 = "<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.\nKnowledge cutoff: 2024-06\nCurrent date: 2025-08-19\n\nReasoning: medium\n\n# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|><|start|>user<|message|>Test<|end|><|start|>assistant<|channel|>final<|message|>assistant<|channel|>analysis<|message|>The user says \"Test\". Probably they want to test the chat. We should respond with something acknowledging. Maybe a simple \"Hello! How can I help you?\" Or ask what they want. We'll respond appropriately.<|end|><|start|>assistant<|channel|>final<|message|>Got it! How can I help you today?<|end|><|start|>user<|message|>What is the capital of Sweden?<|end|><|start|>assistant"

Notice that the token <|return|> has been replaced with <|end|>.

The template engine will iterate through all the messages, and for the last one it will take the following branch, since add_generation_prompt was set to false for the first rendering above:

(gdb) call (void)printf("%s", tmpls->template_default->source_.c_str()) 
....
{%- elif loop.last and not add_generation_prompt %}
    {{- "<|start|>assistant<|channel|>final<|message|>" + message.content + "<|return|>" }}

So fmt_past_msg ends with <|return|>, while the corresponding part of fmt_new_msg is <|end|>. Since <|return|> is 10 characters and <|end|> only 7, fmt_past_msg is 3 characters longer than the true common prefix of the two strings, and the substring taken at offset fmt_past_msg.size() starts 3 characters too far into the new message, i.e. mid-token inside <|start|>, dropping '<|s' and leaving 'tart|>'.
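
A minimal standalone sketch (toy strings again) that reproduces exactly this corruption:

    #include <iostream>
    #include <string>

    int main() {
        // the last assistant turn ends with <|return|> in the first rendering...
        std::string fmt_past_msg = "<|start|>assistant<|channel|>final<|message|>Hi!<|return|>";
        // ...but with <|end|> in the second, followed by the new user turn
        std::string fmt_new_msg  = "<|start|>assistant<|channel|>final<|message|>Hi!<|end|>"
                                   "<|start|>user<|message|>What is the capital of Sweden?<|end|>"
                                   "<|start|>assistant";

        // the fixed-offset diff lands 3 characters past the true boundary
        std::cout << fmt_new_msg.substr(fmt_past_msg.size()) << "\n";
        // prints: tart|>user<|message|>What is the capital of Sweden?<|end|><|start|>assistant
    }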

Let's check how the returned formatted string is actually used, and whether this really matters:

                    std::string user_inp = format_chat                              
                        ? chat_add_and_format("user", std::move(buffer))            
                        : std::move(buffer);                                        
                    const auto line_pfx = common_tokenize(ctx, params.input_prefix, false, true);
                    const auto line_inp = common_tokenize(ctx, user_inp,            false, format_chat);
                    const auto line_sfx = common_tokenize(ctx, params.input_suffix, false, true);
(gdb) p user_inp                                                                    
$13 = "tart|>user<|message|>What is the capitlal of Sweden?<|end|><|start|>assistant"
(gdb) p line_inp                                                                    
$14 = std::vector of length 17, capacity 77 = {83, 497, 91, 29, 1428, 200008, 4827, 382, 290, 41415, 46006, 328, 42009, 30, 200007,
  200006, 173781}                                                                   

So let's set user_inp to the correct value and then tokenize it again:

(gdb) call (char*)strcpy((char*)user_inp.data(), "<|start|>user<|message|>What is the capital of Sweden?<|end|><|start|>assistant")
$17 = 0x555570d68e40 "<|start|>user<|message|>What is the capital of Sweden?<|end|><|start|>assistant"
(gdb) p line_inp                                                                
$19 = std::vector of length 13, capacity 76 = {200006, 1428, 200008, 4827, 382, 290, 9029, 328, 42009, 30, 200007, 200006, 105782}

So we have the following difference in tokens (note that the broken version starts with {83, 497, 91, 29}, which is 'tart|>' spelled out as plain text; the 41415/46006 vs 9029 difference comes from the ' capitlal' typo in this gdb run):

vector of length 17 {83    ,  497,     91,   29, 1428, 200008, 4827, 382,   290, 41415,  46006,    328,  42009, 30, 200007, 200006, 173781}
vector of length 13 {200006, 1428, 200008, 4827,  382,    290, 9029, 328, 42009,    30, 200007, 200006, 105782}

I'm not sure if this is a known issue or not, but I was not able to find anything from a quick search. The model also seems to work fine even with this, so I'm looking for some feedback on whether it is worth fixing (one possible direction is sketched below).
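
For what it's worth, one possible direction (just an untested sketch, not a proposed patch) would be to verify the prefix assumption in common_chat_format_single before taking the diff:

    // sketch only: guard the diff against templates that are not
    // prefix-stable (such as GPT-OSS with <|return|> vs <|end|>)
    if (fmt_new_msg.compare(0, fmt_past_msg.size(), fmt_past_msg) == 0) {
        // fast path: the past rendering really is a prefix of the new one
        ss << fmt_new_msg.substr(fmt_past_msg.size());
    } else {
        // the template rewrote an earlier message (here <|return|> -> <|end|>),
        // so a fixed-offset diff would start mid-token; fall back to something
        // safer here, e.g. diffing from the longest common prefix that ends on
        // a special-token boundary, or re-rendering the prompt from scratch
    }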

First Bad Commit

No response

Relevant log output
