Description
Name and Version
(venv) $ build/bin/llama-cli --version
version: 6199 (f08c4c0)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-cli
Command line
build/bin/llama-cli -m ~/work/ai/models/converted/gpt-oss-20b.gguf -c 0 -fa --jinja -p "Test" --verbose-prompt -ngl 15 --no-warmup -sp --reasoning-format none
Problem description & steps to reproduce
This issue can be reproduced using the gpt-oss-20b.gguf model:
> What is the capital of Sweden?
main: number of tokens in prompt = 17
83 -> 't'
497 -> 'art'
91 -> '|'
29 -> '>'
1428 -> 'user'
200008 -> '<|message|>'
54 -> 'W'
4827 -> 'What'
382 -> ' is'
290 -> ' the'
9029 -> ' capital'
328 -> ' of'
42009 -> ' Sweden'
30 -> '?'
200007 -> '<|end|>'
200006 -> '<|start|>'
173781 -> 'assistant'
<|channel|>analysis<|message|>User asks: "What is the capital of Sweden?" Answer: Stockholm. Provide concise.<|end|><|start|>assistant<|channel|>final<|message|>The capital of Sweden is **Stockholm**.<|return|>
It looks like the <|start|> token is getting "cut off" here.
$ gdb --args build/bin/llama-cli -m ~/work/ai/models/converted/gpt-oss-20b.gguf -c 0 -fa --jinja -p "Test" --verbose-prompt -ngl 15 --no-warmup -sp
So when we enter a prompt and press enter, this will break out of this loop in main.cpp:
std::string line;
bool another_line = true;
do {
    another_line = console::readline(line, params.multiline_input);
    buffer += line;
} while (another_line);
...
bool format_chat = params.conversation_mode && params.enable_chat_template;
std::string user_inp = format_chat
    ? chat_add_and_format("user", std::move(buffer))
    : std::move(buffer);
And this will call the chat_add_and_format lambda:
auto chat_add_and_format = [&chat_msgs, &chat_templates](const std::string & role, const std::string & content) {
    common_chat_msg new_msg;
    new_msg.role = role;
    new_msg.content = content;
    auto formatted = common_chat_format_single(chat_templates.get(), chat_msgs, new_msg, role == "user", g_params->use_jinja);
    chat_msgs.push_back(new_msg);
    LOG_DBG("formatted: '%s'\n", formatted.c_str());
    return formatted;
};
Which will call into the common_chat_format_single function:
std::string common_chat_format_single(
        const struct common_chat_templates * tmpls,
        const std::vector<common_chat_msg> & past_msg,
        const common_chat_msg & new_msg,
        bool add_ass,
        bool use_jinja) {
    common_chat_templates_inputs inputs;
    inputs.use_jinja = use_jinja;
    inputs.add_bos = tmpls->add_bos;
    inputs.add_eos = tmpls->add_eos;
    std::string fmt_past_msg;
    if (!past_msg.empty()) {
        inputs.messages = past_msg;
        inputs.add_generation_prompt = false;
        fmt_past_msg = common_chat_templates_apply(tmpls, inputs).prompt;
    }
    std::ostringstream ss;
    // if the past_msg ends with a newline, we must preserve it in the formatted version
    if (add_ass && !fmt_past_msg.empty() && fmt_past_msg.back() == '\n') {
        ss << "\n";
    };
    // format chat with new_msg
    inputs.messages.push_back(new_msg);
    inputs.add_generation_prompt = add_ass;
    auto fmt_new_msg = common_chat_templates_apply(tmpls, inputs).prompt;
    // get the diff part
    ss << fmt_new_msg.substr(fmt_past_msg.size(), fmt_new_msg.size() - fmt_past_msg.size());
    return ss.str();
}
Inspecting fmt_past_msg we have:
(gdb) p fmt_past_msg
$23 = "<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.\nKnowledge cutoff: 2024-06\nCurrent date: 2025-08-19\n\nReasoning: medium\n\n# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|><|start|>user<|message|>Test<|end|><|start|>assistant<|channel|>final<|message|>assistant<|channel|>analysis<|message|>The user says \"Test\". Probably they want to test the chat. We should respond with something acknowledging. Maybe a simple \"Hello! How can I help you?\" Or ask what they want. We'll respond appropriately.<|end|><|start|>assistant<|channel|>final<|message|>Got it! How can I help you today?<|return|>"
Notice that the last token is <|return|> here.
And the fmt_new_msg is:
$24 = "<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.\nKnowledge cutoff: 2024-06\nCurrent date: 2025-08-19\n\nReasoning: medium\n\n# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|><|start|>user<|message|>Test<|end|><|start|>assistant<|channel|>final<|message|>assistant<|channel|>analysis<|message|>The user says \"Test\". Probably they want to test the chat. We should respond with something acknowledging. Maybe a simple \"Hello! How can I help you?\" Or ask what they want. We'll respond appropriately.<|end|><|start|>assistant<|channel|>final<|message|>Got it! How can I help you today?<|end|><|start|>user<|message|>What is the capital of Sweden?<|end|><|start|>assistant"
Notice that the token <|return|> has been replaced with <|end|>.
The template engine will iterate through all the messages, and when it reaches the last one the following branch is taken, since add_generation_prompt was set to false above:
(gdb) call (void)printf("%s", tmpls->template_default->source_.c_str())
....
{%- elif loop.last and not add_generation_prompt %}
{{- "<|start|>assistant<|channel|>final<|message|>" + message.content + "<|return|>" }}So this will produce a string that ends with <|return|> (and not <|end|>). And
later when we try to get the substring of the new message it will start at the
wrong position because the template replaced <|return|> with <|end|> in the same
location, causing the substring to begin mid-token in <|start|>
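To make the offset arithmetic concrete, here is a minimal, self-contained sketch of that substring step. The strings are shortened stand-ins for the two renders above, not the exact output:
#include <cstdio>
#include <string>

int main() {
    // Render of the past messages: the template closes the last assistant
    // message with <|return|> (10 characters). Shortened for the example.
    std::string fmt_past_msg =
        "<|start|>assistant<|channel|>final<|message|>Got it!<|return|>";
    // Render of past + new messages: the same assistant message is now closed
    // with <|end|> (7 characters), so everything that follows sits 3
    // characters earlier than fmt_past_msg.size() suggests.
    std::string fmt_new_msg =
        "<|start|>assistant<|channel|>final<|message|>Got it!<|end|>"
        "<|start|>user<|message|>What is the capital of Sweden?<|end|>"
        "<|start|>assistant";
    // The diff step from common_chat_format_single (the explicit length in the
    // real code is equivalent to the single-argument substr here).
    std::string diff = fmt_new_msg.substr(fmt_past_msg.size());
    // Prints: tart|>user<|message|>What is the capital of Sweden?<|end|><|start|>assistant
    printf("%s\n", diff.c_str());
    return 0;
}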
Let's check how the returned formatted string is actually used, and whether this really matters:
std::string user_inp = format_chat
    ? chat_add_and_format("user", std::move(buffer))
    : std::move(buffer);
const auto line_pfx = common_tokenize(ctx, params.input_prefix, false, true);
const auto line_inp = common_tokenize(ctx, user_inp, false, format_chat);
const auto line_sfx = common_tokenize(ctx, params.input_suffix, false, true);
(gdb) p user_inp
$13 = "tart|>user<|message|>What is the capitlal of Sweden?<|end|><|start|>assistant"
(gdb) p line_inp
$14 = std::vector of length 17, capacity 77 = {83, 497, 91, 29, 1428, 200008, 4827, 382, 290, 41415, 46006, 328, 42009, 30, 200007,
200006, 173781}
So let's set user_inp to the correct value and then tokenize it:
(gdb) call (char*)strcpy((char*)user_inp.data(), "<|start|>user<|message|>What is the capital of Sweden?<|end|><|start|>assistant")
$17 = 0x555570d68e40 "<|start|>user<|message|>What is the capital of Sweden?<|end|><|start|>assistant"
(gdb) p line_inp
$19 = std::vector of length 13, capacity 76 = {200006, 1428, 200008, 4827, 382, 290, 9029, 328, 42009, 30, 200007, 200006, 105782}
So we have the following difference in tokens:
vector of length 17: {83, 497, 91, 29, 1428, 200008, 4827, 382, 290, 41415, 46006, 328, 42009, 30, 200007, 200006, 173781}
vector of length 13: {200006, 1428, 200008, 4827, 382, 290, 9029, 328, 42009, 30, 200007, 200006, 105782}
The first four tokens of the broken version (83, 497, 91, 29, i.e. 't', 'art', '|', '>') are just the remains of the cut-off <|start|> token. I'm not sure if this is a known issue or not, but I was not able to find anything from a quick search. The model also seems to work fine despite this, so I'm looking for some feedback on whether it matters.
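As a side note, and purely as an illustration rather than a proposed patch: the diff in common_chat_format_single is only well defined when fmt_past_msg is a byte-for-byte prefix of fmt_new_msg. A hypothetical check along these lines (not code from the repo) would make the gpt-oss case fail loudly instead of silently cutting into <|start|>:
#include <cassert>
#include <string>

// Hypothetical helper, for illustration only: take the diff of the two
// renders, but assert the prefix assumption that the substr() trick relies on.
static std::string chat_diff(const std::string & fmt_past_msg,
                             const std::string & fmt_new_msg) {
    // With the gpt-oss template this assertion fires, because the last
    // assistant message is rendered with <|return|> in fmt_past_msg but
    // with <|end|> in fmt_new_msg.
    assert(fmt_new_msg.compare(0, fmt_past_msg.size(), fmt_past_msg) == 0);
    return fmt_new_msg.substr(fmt_past_msg.size());
}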
First Bad Commit
No response