chat: Allow reasoning_content to be passed back #16934
Closed: tarruda wants to merge 3 commits into ggml-org:master from tarruda:allow-passing-back-reasoning-content
tarruda: @aldehir about your comment: I was getting errors from llama-server when my codex fork sent "reasoning_content" in this validation.
aldehir: That's interesting. It isn't the behavior I see from my own clients sending back `reasoning_content`. I also use codex, but with middleware that translates `reasoning` to `reasoning_content`. Have you inspected the traffic from codex to ensure it is passing back `tool_calls`?

This doesn't hurt anything, but it does codify that a model may output only reasoning and nothing else.
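A minimal sketch of such a `reasoning` -> `reasoning_content` translation proxy, assuming codex puts its reasoning in a `reasoning` field on assistant messages and that llama-server listens on its default port; streaming (SSE) responses are not handled:

```python
# Hypothetical proxy that renames "reasoning" to "reasoning_content" before
# forwarding chat completion requests to llama-server. Field names, URL, and
# port are assumptions for illustration, not taken from either patch.
import json
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

UPSTREAM = "http://127.0.0.1:8080/v1/chat/completions"  # assumed llama-server endpoint


class RenameReasoningProxy(BaseHTTPRequestHandler):
    def do_POST(self):
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        # Rename the client's "reasoning" field on assistant messages to the
        # "reasoning_content" field the llama-server chat endpoint expects.
        for msg in body.get("messages", []):
            if msg.get("role") == "assistant" and "reasoning" in msg:
                msg["reasoning_content"] = msg.pop("reasoning")
        req = urllib.request.Request(
            UPSTREAM,
            data=json.dumps(body).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            payload = resp.read()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)


if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8089), RenameReasoningProxy).serve_forever()
```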
tarruda: I actually have my own middleware, which I use just to inspect requests. I could never see it sending reasoning back to llama.cpp without the changes I made. There was some code which dropped it when the last message was a user message, which is always the case when sending a new request.
Yes, it does receive tool calls.
tarruda: This is easy to verify: if you run llama.cpp master with my codex fork, it will fail with 500 on the second message (which is the first request that would send previous reasoning content):
aldehir: gpt-oss only needs the reasoning when looping on tool calls, i.e. where the last message has the `tool` role. The template itself will not include reasoning for tool calls prior to the last "final" message (an assistant message with `content`). The message before a user message usually is a final assistant message, so all prior reasoning is removed.

Minimax M2 does appear to require it for every assistant message, though.

Looks like MiniMax-M2 only keeps it for tool calling loops as well.

This test case should pass even if you don't pass back `reasoning_content`, as `content` should be present.
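To make the distinction concrete, here is a hedged sketch of the two message shapes described above; the message text, tool name, and call IDs are invented for illustration:

```python
# Tool-call loop: the last message has the "tool" role, so the gpt-oss
# template still needs the reasoning attached to the assistant message
# that issued the call.
tool_loop_messages = [
    {"role": "user", "content": "List the files in this directory."},
    {
        "role": "assistant",
        "reasoning_content": "I should run ls to see the files.",  # kept by the template
        "tool_calls": [
            {
                "id": "call_1",
                "type": "function",
                "function": {"name": "shell", "arguments": '{"command": ["ls"]}'},
            }
        ],
    },
    {"role": "tool", "tool_call_id": "call_1", "content": "README.md src tests"},
]

# Completed turn: the assistant already produced a final message with
# "content" and a new user message follows, so reasoning attached to that
# earlier assistant message is dropped by the template.
finished_turn_messages = [
    {"role": "user", "content": "List the files in this directory."},
    {
        "role": "assistant",
        "reasoning_content": "I should run ls to see the files.",  # dropped for prior final turns
        "content": "The directory contains README.md, src, and tests.",
    },
    {"role": "user", "content": "Now read README.md."},
]
```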
tarruda: If I understood correctly, then there's no problem with always passing reasoning back, since the template will only use it when needed, right?

In that case, is it best to just allow passing `reasoning_content` and let the template handle how LLMs use it?
aldehir: I believe that is preferable; the model creators typically generate the template, so they should encode whatever logic they expect there. Worst case, we can manipulate the messages in the `*_init_params()` function for the specific model. That's my own opinion, I do not speak for the maintainers.

I tested your branch, and I found the cause of your problem:

tarruda/codex send-thinking

Notice, on the right, your patch is sending the reasoning content in a separate message. This is why you are receiving the error: there is no accompanying `content` or `tool_calls`. Even if allowed, the template would render a final message with no content (from the first message) and may degrade model performance.

Additionally, gpt-oss only needs the reasoning from tool call messages. If it comes from a regular assistant message, it is dropped. You see this in the chat template. (Note: it does add it if `add_generation_prompt = false`, which is only applicable during training.)

Take a look at my patch: aldehir/codex@fe2ca23

aldehir/codex llama-cpp-support

I had to give it a more specific example, so I asked it to run `ls` and then read the first 3 lines of the README file in the directory. Notice the `reasoning_content` added to the assistant message with `tool_calls`. This works with the current master branch as is.
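A hedged sketch of the difference in shape (the actual codex payloads are not reproduced here; only the field layout matters):

```python
# Rejected shape (as described for tarruda's patch): reasoning_content sent
# as its own assistant message, with no accompanying "content" or "tool_calls".
rejected_messages = [
    {"role": "user", "content": "Run ls and read the first 3 lines of README.md."},
    {"role": "assistant", "reasoning_content": "I will start by running ls."},
    # ... the tool call then arrives in a separate assistant message
]

# Accepted shape (as described for aldehir's patch): reasoning_content attached
# to the same assistant message that carries the tool_calls; this works on master.
accepted_messages = [
    {"role": "user", "content": "Run ls and read the first 3 lines of README.md."},
    {
        "role": "assistant",
        "reasoning_content": "I will start by running ls.",
        "tool_calls": [
            {
                "id": "call_1",
                "type": "function",
                "function": {"name": "shell", "arguments": '{"command": ["ls"]}'},
            }
        ],
    },
]
```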
tarruda: Ok, so to summarize:
tarruda: @aldehir your codex patch is much simpler. I assume it can break for whatever other inference engine uses "reasoning" instead of "reasoning_content", so it probably needs to be configurable.

Were you planning to submit a PR to codex to make it compatible with llama.cpp, or will you just continue using the `reasoning` -> `reasoning_content` proxy?
aldehir: For gpt-oss, technically only tool calls. But it doesn't hurt if you keep it intact with all assistant messages, since the template will render it properly.

I don't believe this is needed; as I point out in #16946 (comment), it works as is if I pass along `reasoning_content`.

I have no intention to submit a PR. I think the ideal approach here is to adopt a Responses API that automatically supports this interaction.