5 changes: 3 additions & 2 deletions common/chat.cpp
@@ -201,6 +201,7 @@ std::vector<common_chat_msg> common_chat_msgs_parse_oaicompat(const json & messa
msg.role = message.at("role");

auto has_content = message.contains("content");
auto has_reasoning_content = message.contains("reasoning_content");
auto has_tool_calls = message.contains("tool_calls");
if (has_content) {
const auto & content = message.at("content");
@@ -249,8 +250,8 @@ std::vector<common_chat_msg> common_chat_msgs_parse_oaicompat(const json & messa
msg.tool_calls.push_back(tc);
}
}
if (!has_content && !has_tool_calls) {
throw std::runtime_error("Expected 'content' or 'tool_calls' (ref: https://github.com/ggml-org/llama.cpp/issues/8367 & https://github.com/ggml-org/llama.cpp/issues/12279)");
if (!has_content && !has_tool_calls && !has_reasoning_content) {
Author:

@aldehir about your comment: I was getting errors from llama_server when my codex fork sent "reasoning_content" in this validation.

Collaborator (@aldehir):

That's interesting. It isn't the behavior I see from my own clients sending back reasoning_content. I also use codex, but with middleware that translates reasoning to reasoning_content. Have you inspected the traffic from codex to ensure it is passing back tool_calls?

This doesn't hurt anything, but it does codify that a model may output only reasoning and nothing else.

Author:

> That's interesting. It isn't the behavior I see from my own clients sending back reasoning_content. I also use codex, but with middleware that translates reasoning to reasoning_content.

I actually have my own middleware which I use just to inspect requests. I could never see it sending reasoning back to llama.cpp without those changes I made. There was some code which dropped it when the last message was a user message, which is certainly always the case when sending the request.

> Have you inspected the traffic from codex to ensure it is passing back tool_calls?

Yes, it does receive tool calls.

Author:

This is easy to verify: if you run llama.cpp master with my codex fork, it will fail with 500 on the second message (which is the first request that would send previous reasoning content):

[screenshot: 500 error response]

Collaborator @aldehir (Nov 2, 2025):

> There was some code which dropped it when the last message was a user message, which is certainly always the case when sending the request.

gpt-oss only needs the reasoning when looping on tool calls, i.e. where the last message has the tool role. The template itself will not include reasoning for tool calls prior to the last "final" message (an assistant message with content). The message before a user message is usually a final assistant message, so all prior reasoning is removed. Minimax M2 does appear to require it for every assistant message, though. Looks like MiniMax-M2 only keeps it for tool-calling loops as well.

[screenshot]

This test case should pass even if you don't pass back reasoning_content, as content should be present.
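
To make the message shape concrete, here is a minimal sketch of the tool-call loop described above, with the reasoning riding on the assistant message that carries tool_calls (the user prompt, tool name, and arguments are made up for illustration):

```python
# Hypothetical example: reasoning_content stays attached to the assistant
# message that holds tool_calls, which is the only place the gpt-oss template
# reads it back while looping on tool calls.
messages = [
    {"role": "user", "content": "List the files in the current directory."},
    {
        "role": "assistant",
        "reasoning_content": "I should call the shell tool with `ls`.",
        "tool_calls": [{
            "id": "call_1",
            "type": "function",
            "function": {"name": "shell", "arguments": "{\"command\": [\"ls\"]}"},
        }],
    },
    {"role": "tool", "tool_call_id": "call_1", "content": "README.md\nsrc\n"},
    # Once the assistant emits a final message with plain content, the template
    # drops the reasoning from these earlier turns.
]
```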

Author:

> gpt-oss only needs the reasoning when looping on tool calls, i.e. where the last message has the tool role. The template itself will not include reasoning for tool calls prior to the last "final" message (an assistant message with content). The message before a user message is usually a final assistant message, so all prior reasoning is removed. Minimax M2 does appear to require it for every assistant message, though.

If I understood correctly, then there's no problem with always passing reasoning back, since the template will only use it when needed, right?

In that case, isn't it best to just allow passing reasoning_content and let the template handle how LLMs use it?

Collaborator @aldehir (Nov 2, 2025):

I believe that is preferable; the model creators typically generate the template, so they should encode whatever logic they expect there. Worst case, we can manipulate the messages in the *_init_params() function for the specific model. That's my own opinion; I do not speak for the maintainers.

I tested your branch, and I found the cause of your problem:

[side-by-side screenshot: tarruda/codex send-thinking vs. codex-test]

Notice, on the right, your patch is sending the reasoning content in a separate message. This is why you are receiving the error: there is no accompanying content or tool_calls. Even if it were allowed, the template would render a final message with no content (from the first message), which may degrade model performance.

Additionally, gpt-oss only needs the reasoning from tool-call messages. If it comes from a regular assistant message, it is dropped. You can see this in the chat template. (Note: it does add it if add_generation_prompt = false, but that is only applicable during training.)

Take a look at my patch: aldehir/codex@fe2ca23

[side-by-side screenshot: aldehir/codex llama-cpp-support vs. codex-test-patch]

I had to give it a more specific example, so I asked it to run ls and then read the first 3 lines of the README file in the directory. Notice the reasoning_content added to the assistant message with tool_calls. This works with the current master branch as is.
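
For reference, a rough sketch of the two assistant-message shapes being compared (reconstructed from this discussion, not copied from the screenshots; the reasoning text and tool call are placeholders):

```python
# What the unpatched codex fork was sending: reasoning in a standalone
# assistant message, with no content and no tool_calls, which is what the
# validation in common/chat.cpp rejects.
rejected = {
    "role": "assistant",
    "reasoning_content": "Plan the next tool call.",
}

# What current master already accepts: reasoning_content carried on the same
# assistant message as the tool_calls it belongs to.
accepted = {
    "role": "assistant",
    "reasoning_content": "Plan the next tool call.",
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "shell", "arguments": "{\"command\": [\"ls\"]}"},
    }],
}
```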

Author:

Ok so to summarize:

  • For GPT-OSS, reasoning has to be passed back only with tool calls or normal content. If not, it is either ignored or it can break the conversation
  • We still use this PR to allow reasoning content to be passed back independently, because some LLMs like Minimax M2 might use it.

Author:

@aldehir your codex patch is much simpler. I assume it could break for any other inference engine that uses "reasoning" instead of "reasoning_content", so it probably needs to be configurable.

Were you planning to submit a PR to codex to make it compatible with llama.cpp, or will you just continue using the reasoning -> reasoning_content proxy?

Collaborator @aldehir (Nov 2, 2025):

> For GPT-OSS, reasoning has to be passed back only with tool calls or normal content. If not, it is either ignored or it can break the conversation

For gpt-oss, technically only with tool calls. But it doesn't hurt to keep it intact on all assistant messages, since the template will render it properly.

> We still use this PR to allow reasoning content to be passed back independently, because some LLMs like Minimax M2 might use it.

I don't believe this is needed; as I point out in #16946 (comment), it works as-is if I pass along reasoning_content.

> Were you planning to submit a PR to codex to make it compatible with llama.cpp, or will you just continue using the reasoning -> reasoning_content proxy?

I have no intention to submit a PR. I think the ideal approach here is to adopt a Responses API that automatically supports this interaction.

throw std::runtime_error("Expected 'content', 'reasoning_content' or 'tool_calls' (ref: https://github.com/ggml-org/llama.cpp/issues/8367 & https://github.com/ggml-org/llama.cpp/issues/12279)");
}
if (message.contains("reasoning_content")) {
msg.reasoning_content = message.at("reasoning_content");
13 changes: 13 additions & 0 deletions tools/server/tests/unit/test_chat_completion.py
@@ -476,3 +476,16 @@ def make_cmpl_request():
assert last_progress["total"] > 0
assert last_progress["processed"] == last_progress["total"]
assert total_batch_count == batch_count


def test_standalone_reasoning_content_is_accepted():
global server
server.start()
res = server.make_request("POST", "/chat/completions", data={
"max_tokens": 8,
"messages": [
{"role": "user", "content": "How much is 102 + 7?"},
{"role": "assistant", "reasoning_content": "Calculate."},
]
})
assert res.status_code == 200
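
Not part of this diff, but for contrast, a minimal sketch (assuming the same server fixture as above) of the case the relaxed check should still reject, since the assistant message carries none of content, tool_calls, or reasoning_content:

```python
# Hypothetical complement to the test above: an assistant message with no
# content, no tool_calls, and no reasoning_content should still be rejected.
res = server.make_request("POST", "/chat/completions", data={
    "max_tokens": 8,
    "messages": [
        {"role": "user", "content": "How much is 102 + 7?"},
        {"role": "assistant"},
    ]
})
assert res.status_code != 200
```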
4 changes: 2 additions & 2 deletions tools/server/utils.hpp
@@ -595,8 +595,8 @@ static json oaicompat_chat_params_parse(
throw std::runtime_error("All non-assistant messages must contain 'content'");
}
if (role == "assistant") {
if (!msg.contains("content") && !msg.contains("tool_calls")) {
throw std::runtime_error("Assistant message must contain either 'content' or 'tool_calls'!");
if (!msg.contains("content") && !msg.contains("tool_calls") && !msg.contains("reasoning_content")) {
throw std::runtime_error("Assistant message must contain either 'content' or 'tool_calls' or 'reasoning_content'!");
}
if (!msg.contains("content")) {
continue; // avoid errors with no content