feat: nemotron thinking & toolcalling support #15676

Conversation
On my end the behavior persists, and thinking mode is broken, outputting only …
Here is my server command: …
Having the same issue when streaming: …
Sorry, I just saw I used think instead of /think as the system prompt, but I just tried again with the correct /think and the issue is actually exactly the same :/
I apologize if I come off as nit-picky.
It also seems the webui doesn't properly render the thinking UI element, probably because of the forced thinking which comes from the template:

I'm guessing the UI is looking for the <think> tag, which is not present in the generation.
Tool calling works great, though!
Please follow this PR to support think tags in the grammar, because it's going to be required when there are no triggers (with tool call mode = required, triggers are not expected to run and the grammar is applied from the start), so we need to allow thinking as early as possible.
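For context, the no-trigger case is a request that forces tool calls. The following is a sketch (payload shape per the server's OpenAI-compatible API; the get_weather tool mirrors the examples below): with tool_choice set to "required" there is no trigger text to wait for, so the tool-call grammar constrains output from the very first token and must itself admit the <think> prelude.

curl http://localhost:8080/v1/chat/completions -H 'Content-Type: application/json' -d '{
  "messages": [
    { "role": "user", "content": "What is the current weather in Lima?" }
  ],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get the current weather for a given city",
      "parameters": {
        "type": "object",
        "properties": { "city": { "type": "string", "description": "The name of the city" } },
        "required": ["city"]
      }
    }
  }],
  "tool_choice": "required"
}'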
The template isn't applied properly when there are tool calls/responses in the message. /apply-template curl example:

#!/bin/bash
curl http://localhost:8080/apply-template -H 'Content-Type: application/json' -d '{
"messages": [
{
"content": "What is the current weather in Barcelona, Stockholm, Lima, Berlin, and Oslo? And also, display them in a list sorted by their temperatures, highest first.",
"role": "user"
},
{
"content": null,
"role": "assistant",
"tool_calls": [
{
"type": "function",
"id": "2u4dKpkZDH21gTVH7Sr6R2wm3pAxisVF",
"function": {
"name": "get_weather",
"arguments": "{\"location\": \"Barcelona\"}"
}
},
{
"type": "function",
"id": "DK3FsBQguP4NZm0yMRZrIe78ZePKZyq9",
"function": {
"name": "get_weather",
"arguments": "{\"location\": \"Stockholm\"}"
}
},
{
"type": "function",
"id": "21DPzPHMEx2Y1eTPyVDHs1ytuRUnND3E",
"function": {
"name": "get_weather",
"arguments": "{\"location\": \"Lima\"}"
}
},
{
"type": "function",
"id": "Nr8JukMvqXyvnypsgYR1DPrrxMtvFQjz",
"function": {
"name": "get_weather",
"arguments": "{\"location\": \"Berlin\"}"
}
},
{
"type": "function",
"id": "vl7dST6XZddIhZ9geWBIsGftgzSPUlQ5",
"function": {
"name": "get_weather",
"arguments": "{\"location\": \"Oslo\"}"
}
}
]
},
{
"content": "Barcelona: \u2600\ufe0f +25\u00b0C",
"role": "tool",
"tool_call_id": "2u4dKpkZDH21gTVH7Sr6R2wm3pAxisVF"
},
{
"content": "Stockholm: \u2600\ufe0f +13\u00b0C",
"role": "tool",
"tool_call_id": "DK3FsBQguP4NZm0yMRZrIe78ZePKZyq9"
},
{
"content": "Lima: +16\u00b0C",
"role": "tool",
"tool_call_id": "21DPzPHMEx2Y1eTPyVDHs1ytuRUnND3E"
},
{
"content": "Berlin: \u2600\ufe0f +26\u00b0C",
"role": "tool",
"tool_call_id": "Nr8JukMvqXyvnypsgYR1DPrrxMtvFQjz"
},
{
"content": "Oslo: \u2600\ufe0f +13\u00b0C",
"role": "tool",
"tool_call_id": "vl7dST6XZddIhZ9geWBIsGftgzSPUlQ5"
}
]
}'
I would expect something more like this: expected prompt …

It looks like the minja polyfills are injecting …
Here I come after a good night's sleep to find I opened a Pandora's Box... Thanks for all the feedback guys, @ExtReMLapin is probably right that we might need to rebase it on his PR to fix the thinking-in-required-toolcall issue, but the tool responses part is also pretty worrying...
Okay, so first of all, the Jinja template was of course broken. Here's a corrected template:

{%- set ns = namespace(enable_thinking=true) -%}
{%- for message in messages -%}
{%- set content = message['content'] -%}
{%- if message['role'] == 'user' or message['role'] == 'system' -%}
{%- if '/think' in content -%}
{%- set ns.enable_thinking = true -%}
{%- elif '/no_think' in content -%}
{%- set ns.enable_thinking = false -%}
{%- endif -%}
{%- endif -%}
{%- endfor -%}
{%- if messages[0]['role'] != 'system' -%}
{%- set ns.non_tool_system_content = '' -%}
{{- '<SPECIAL_10>System
' -}}
{%- else -%}
{%- set ns.non_tool_system_content = (messages[0]['content'] | default('')).replace('/think', '').replace('/no_think', '').strip() -%}
{{- '<SPECIAL_10>System
' + ns.non_tool_system_content }}
{%- endif -%}
{%- if tools -%}
{%- if ns.non_tool_system_content is defined and ns.non_tool_system_content != '' -%}
{{- '
' -}}
{%- endif -%}
{{- 'You can use the following tools to assist the user if required:' -}}
{{- '
<AVAILABLE_TOOLS>[' -}}
{%- for tool in tools -%}
{{- (tool.function if tool.function is defined else tool) | tojson -}}
{{- ', ' if not loop.last else '' -}}
{%- endfor -%}
{{- ']</AVAILABLE_TOOLS>
' -}}
{{- 'If you decide to call any tool(s), use the following format:
' -}}
{{- '<TOOLCALL>[{{"name": "tool_name1", "arguments": "tool_args1"}}, ' -}}
{{- '{{"name": "tool_name2", "arguments": "tool_args2"}}]</TOOLCALL>
' -}}
{{- 'The user will execute tool-calls and return responses from tool(s) in this format:
' -}}
{{- '<TOOL_RESPONSE>[{{"tool_response1"}}, {{"tool_response2"}}]</TOOL_RESPONSE>
' -}}
{{- 'Based on the tool responses, you can call additional tools if needed, correct tool calls if any errors are found, or just respond to the user.' -}}
{%- endif -%}
{{- '
' -}}
{%- set messages = messages[1:] if messages[0]['role'] == 'system' else messages -%}
{%- if messages[-1]['role'] == 'assistant' -%}
{%- set ns.last_turn_assistant_content = (messages[-1]['content'] | default('')).strip() -%}
{%- set messages = messages[:-1] -%}
{%- endif -%}
{%- for message in messages %}
{%- set content = message['content'] %}
{%- if message['role'] == 'user' -%}
{{- '<SPECIAL_11>User
' + (content | default('')).replace('/think', '').replace('/no_think', '').strip() + '
' }}
{%- elif message['role'] == 'tool' -%}
{%- if loop.first or (messages[loop.index0 - 1].role != 'tool') -%}
{{- '<SPECIAL_11>User
' + '<TOOL_RESPONSE>[' }}
{%- endif -%}
{{- message['content'] -}}
{{- ', ' if not loop.last and (messages[loop.index0 + 1].role == 'tool') else '' -}}
{%- if loop.last or (messages[loop.index0 + 1].role != 'tool') -%}
{{- ']</TOOL_RESPONSE>
' -}}
{%- endif -%}
{%- elif message['role'] == 'assistant' -%}
{%- if content and '</think>' in content -%}
{%- set content = (content.split('</think>')[1] | default('')).strip() %}
{%- endif -%}
{{- '<SPECIAL_11>Assistant
' + ((content | default('') | string).strip() if content is not none else '') }}
{%- if message.tool_calls -%}
{%- if (content | default('')).strip() != '' -%}
{{- '
' -}}
{%- endif -%}
{{- '<TOOLCALL>[' -}}
{%- for call in message.tool_calls -%}
{%- set fn = call.function if call.function is defined else call -%}
{{- '{"name": "' + fn.name + '", "arguments": ' -}}
{%- if fn.arguments is string -%}
{{- fn.arguments -}}
{%- else -%}
{{- fn.arguments | tojson -}}
{%- endif -%}
{{- '}' + (', ' if not loop.last else '') -}}
{%- endfor -%}
{{- ']</TOOLCALL>' -}}
{%- endif -%}
{{- '
<SPECIAL_12>
' -}}
{%- endif -%}
{%- endfor -%}
{%- if add_generation_prompt -%}
{{- '<SPECIAL_11>Assistant
' -}}
{%- if ns.enable_thinking is defined and ns.enable_thinking is false -%}
{{- '<think></think>' -}}
{%- else -%}
{{- '<think>
' -}}
{%- endif -%}
{%- if ns.last_turn_assistant_content is defined and ns.last_turn_assistant_content != '' -%}
{{- ns.last_turn_assistant_content -}}
{%- endif -%}
{%- else -%}
{%- if ns.last_turn_assistant_content is defined and ns.last_turn_assistant_content != '' -%}
{{- '<SPECIAL_11>Assistant
' -}}
{%- if ns.enable_thinking is defined and ns.enable_thinking is false -%}
{{- '<think></think>' -}}
{%- else -%}
{{- '<think>
' -}}
{%- endif -%}
{{- ns.last_turn_assistant_content -}}
{%- if continue_final_message is defined -%}
{%- if continue_final_message is false -%}
{{- '
<SPECIAL_12>
' -}}
{%- endif -%}
{%- else -%}
{{- '
<SPECIAL_12>
' -}}
{%- endif -%}
{%- endif -%}
{%- endif -%}
However, while the corrected template renders to:

<SPECIAL_10>System
You can use the following tools to assist the user if required:
<AVAILABLE_TOOLS>[{"description": "Get the current weather for a given city", "name": "get_weather", "parameters": {"properties": {"city": {"description": "The name of the city", "type": "string"}}, "required": ["city"], "type": "object"}}]</AVAILABLE_TOOLS>
If you decide to call any tool(s), use the following format:
<TOOLCALL>[{{"name": "tool_name1", "arguments": "tool_args1"}}, {{"name": "tool_name2", "arguments": "tool_args2"}}]</TOOLCALL>
The user will execute tool-calls and return responses from tool(s) in this format:
<TOOL_RESPONSE>[{{"tool_response1"}}, {{"tool_response2"}}]</TOOL_RESPONSE>
Based on the tool responses, you can call additional tools if needed, correct tool calls if any errors are found, or just respond to the user.
<SPECIAL_11>User
What is the current weather in Barcelona, Stockholm, Lima, Berlin, and Oslo? And also, display them in a list sorted by their temperatures, highest first.
<SPECIAL_11>Assistant
<TOOLCALL>[{"name": "get_weather", "arguments": {"location": "Barcelona"}}]</TOOLCALL>
<SPECIAL_12>
<SPECIAL_11>User
<TOOL_RESPONSE>[Barcelona: ☀️ +25°C]</TOOL_RESPONSE>

it still gives the same bad response in my tester app.
Ha, got it.

{%- set ns = namespace(enable_thinking=true) -%}
{%- for message in messages -%}
{%- set content = message['content'] -%}
{%- if message['role'] == 'user' or message['role'] == 'system' -%}
{%- if '/think' in content -%}
{%- set ns.enable_thinking = true -%}
{%- elif '/no_think' in content -%}
{%- set ns.enable_thinking = false -%}
{%- endif -%}
{%- endif -%}
{%- endfor -%}
{%- if messages[0]['role'] != 'system' -%}
{%- set ns.non_tool_system_content = '' -%}
{{- '<SPECIAL_10>System
' -}}
{%- else -%}
{%- set ns.non_tool_system_content = (messages[0]['content'] | default('', true)).replace('/think', '').replace('/no_think', '').strip() -%}
{{- '<SPECIAL_10>System
' + ns.non_tool_system_content }}
{%- endif -%}
{%- if tools -%}
{%- if ns.non_tool_system_content is defined and ns.non_tool_system_content != '' -%}
{{- '
' -}}
{%- endif -%}
{{- 'You can use the following tools to assist the user if required:' -}}
{{- '
<AVAILABLE_TOOLS>[' -}}
{%- for tool in tools -%}
{{- (tool.function if tool.function is defined else tool) | tojson -}}
{{- ', ' if not loop.last else '' -}}
{%- endfor -%}
{{- ']</AVAILABLE_TOOLS>
' -}}
{{- 'If you decide to call any tool(s), use the following format:
' -}}
{{- '<TOOLCALL>[{{"name": "tool_name1", "arguments": "tool_args1"}}, ' -}}
{{- '{{"name": "tool_name2", "arguments": "tool_args2"}}]</TOOLCALL>
' -}}
{{- 'The user will execute tool-calls and return responses from tool(s) in this format:
' -}}
{{- '<TOOL_RESPONSE>[{{"tool_response1"}}, {{"tool_response2"}}]</TOOL_RESPONSE>
' -}}
{{- 'Based on the tool responses, you can call additional tools if needed, correct tool calls if any errors are found, or just respond to the user.' -}}
{%- endif -%}
{{- '
' -}}
{%- set messages = messages[1:] if messages[0]['role'] == 'system' else messages -%}
{%- if messages[-1]['role'] == 'assistant' -%}
{%- set ns.last_turn_assistant_content = (messages[-1]['content'] | default('', true)).strip() -%}
{%- set ns.last_turn_assistant_tool_calls = messages[-1]['tool_calls'] if 'tool_calls' in messages[-1] else [] -%}
{%- set messages = messages[:-1] -%}
{%- endif -%}
{%- for message in messages %}
{%- set content = message['content'] %}
{%- if message['role'] == 'user' -%}
{{- '<SPECIAL_11>User
' + (content | default('', true)).replace('/think', '').replace('/no_think', '').strip() + '
' }}
{%- elif message['role'] == 'tool' -%}
{%- if loop.first or (messages[loop.index0 - 1].role != 'tool') -%}
{{- '<SPECIAL_11>User
' + '<TOOL_RESPONSE>[' }}
{%- endif -%}
{{- message['content'] -}}
{{- ', ' if not loop.last and (messages[loop.index0 + 1].role == 'tool') else '' -}}
{%- if loop.last or (messages[loop.index0 + 1].role != 'tool') -%}
{{- ']</TOOL_RESPONSE>
' -}}
{%- endif -%}
{%- elif message['role'] == 'assistant' -%}
{%- if content and '</think>' in content -%}
{%- set content = (content.split('</think>')[1] | default('', true)).strip() %}
{%- endif -%}
{{- '<SPECIAL_11>Assistant
' + ((content | default('', true)).strip() if content is not none else '') }}
{%- if message.tool_calls -%}
{%- if (content | default('', true)).strip() != '' -%}
{{- '
' -}}
{%- endif -%}
{{- '<TOOLCALL>[' -}}
{%- for call in message.tool_calls -%}
{%- set fn = call.function if call.function is defined else call -%}
{{- '{"name": "' + fn.name + '", "arguments": ' -}}
{%- if fn.arguments is string -%}
{{- fn.arguments -}}
{%- else -%}
{{- fn.arguments | tojson -}}
{%- endif -%}
{{- '}' + (', ' if not loop.last else '') -}}
{%- endfor -%}
{{- ']</TOOLCALL>' -}}
{%- endif -%}
{{- '
<SPECIAL_12>
' -}}
{%- endif -%}
{%- endfor -%}
{%- if add_generation_prompt -%}
{{- '<SPECIAL_11>Assistant
' -}}
{%- if ns.enable_thinking is defined and ns.enable_thinking is false -%}
{{- '<think></think>' -}}
{%- else -%}
{{- '<think>
' -}}
{%- endif -%}
{%- if ns.last_turn_assistant_content is defined and ns.last_turn_assistant_content != '' -%}
{{- ns.last_turn_assistant_content -}}
{%- endif -%}
{%- else -%}
{%- if ns.last_turn_assistant_content is defined and ns.last_turn_assistant_content != '' -%}
{{- '<SPECIAL_11>Assistant
' -}}
{%- if ns.enable_thinking is defined and ns.enable_thinking is false -%}
{{- '<think></think>' -}}
{%- else -%}
{{- '<think>
' -}}
{%- endif -%}
{{- ns.last_turn_assistant_content -}}
{%- if continue_final_message is defined -%}
{%- if continue_final_message is false -%}
{{- '
<SPECIAL_12>
' -}}
{%- endif -%}
{%- else -%}
{{- '
<SPECIAL_12>
' -}}
{%- endif -%}
{%- endif -%}
{%- if ns.last_turn_assistant_tool_calls is defined and ns.last_turn_assistant_tool_calls | length > 0 -%}
{{- '<SPECIAL_11>Assistant
' -}}
{{- '<TOOLCALL>[' -}}
{%- for call in ns.last_turn_assistant_tool_calls -%}
{%- set fn = call.function if call.function is defined else call -%}
{{- '{"name": "' + fn.name + '", "arguments": ' -}}
{%- if fn.arguments is string -%}
{{- fn.arguments -}}
{%- else -%}
{{- fn.arguments | tojson -}}
{%- endif -%}
{{- '}' + (', ' if not loop.last else '') -}}
{%- endfor -%}
{{- ']</TOOLCALL>' -}}
{{- '
<SPECIAL_12>
' -}}
{%- endif -%}
{%- endif -%}

With this template, …
If it's still WIP I would mark the PR as draft (top right button).
@ExtReMLapin Nah, I think that's all, should be ready to go.
@ExtReMLapin You generally don't want a grammar for reasoning content, since that content is pretty much arbitrary text. That's extremely clunky and probably slows down processing quite a bit. I don't see any cases in which such a grammar would be necessary, unless you have a model that does selective reasoning (i.e. reasons only in some cases) and …
I believe his main concern is allowing the model to reason while still constraining it to force a tool call when tool call mode = required. It's probably the best thing to do with reasoning models; otherwise the model starts exhibiting strange behavior if not allowed to reason. For example, …
Non-tool use works as intended. WebUI is still broken, although the upcoming UI natively supports … Parallel tool calls are properly enforced when toggled on/off. The template looks good. I haven't seen the same performance degradation in multi-turn scenarios with tool calls as before. Tool-call arguments don't stream like in other models; probably not a big deal, as most clients wait for the entire tool call anyway. Overall, it looks good to me! Good job.
Yep, that is currently fixed by setting …
Ah... I updated my …
But I still get the missing opening <think> tag.
My …
@blakkd Okay, this is a weird case. Any reason why you are using /think and /no_think in the system prompt instead of using …? I think this is some weird interaction with …
Also, for the template, please use the one that's committed in the PR (at …).
I was just using:
/no_think
/think
Sorry, trying without …
Updated my report above without …
@blakkd Please make sure that you're on the PR and that you're using the newest chat template. I can't reproduce your results; for me, everything is working fine:

ilintar@LinuksowaJaskinia:/mnt/win/k/models/ilintar/NVIDIA-Nemotron-Nano-9B-v2$ curl -X POST http://127.0.0.1:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [
{ "role": "system", "content": "/think" },
{ "role": "user", "content": "What do you believe in?" }
]
}'
{"choices":[{"finish_reason":"stop","index":0,"message":{"role":"assistant","reasoning_content":"Okay, the user asked, \"What do you believe in?\" Hmm, I need to figure out how to respond. Since I'm an AI, I don't have beliefs or consciousness. I should make that clear. But I should also address the question thoughtfully.\n\nMaybe start by explaining that I don't have personal beliefs. Then, perhaps ask the user what they're looking for. Are they curious about my capabilities? Or do they want to discuss beliefs in general? It's important to keep the conversation open-ended. Let me make sure my response is friendly and helpful. I should avoid any technical jargon. Keep it simple and conversational. Yeah, that makes sense. Let me put that together.","content":"As an AI, I don't have personal beliefs, consciousness, or emotions. I don't \"believe\" in anything in the human sense. My purpose is to process information, assist with questions, and provide helpful responses based on patterns in data. If you're curious about a specific topic—like philosophy, science, or personal values—I’d be happy to explore it with you! What’s on your mind? 😊\n"}}],"created":1756634866,"model":"gpt-3.5-turbo","system_fingerprint":"b6327-3753564c2","object":"chat.completion","usage":{"completion_tokens":235,"prompt_tokens":19,"total_tokens":254},"id":"chatcmpl-ONuEFbeX2yoiB31KAYeHBemTax0KK7SZ","timings":{"prompt_n":19,"prompt_ms":51.627,"prompt_per_token_ms":2.7172105263157897,"prompt_per_second":368.02448331299513,"predicted_n":235,"predicted_ms":3330.256,"predicted_per_token_ms":14.171302127659574,"predicted_per_second":70.56514574254952}}ilintar@LinuksowaJaskinia:/mnt/win/k/models/ilintar/NVIDIA-Nemotron-Nano-9B-v2$ curl -X POST http://127.0.0.1:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"messages": [
{ "role": "system", "content": "/think" },
{ "role": "user", "content": "What do you believe in?" }
], "add_generation_prompt": true
}'
{"choices":[{"finish_reason":"stop","index":0,"message":{"role":"assistant","reasoning_content":"Okay, the user asked, What do you believe in? Hmm, I need to figure out how to answer this. Since I'm an AI, I don't have beliefs or consciousness. But I should explain that clearly without being too technical. Maybe start by stating that I don't have personal beliefs because I'm a machine learning model.\n\nWait, the user might be looking for a deeper answer or maybe they're testing if I have some kind of sentience. I should make it clear that I don't have personal experiences or beliefs. But I should also be helpful. Perhaps I can mention that I can provide information on various belief systems, philosophies, or common human beliefs if they're interested.\n\nLet me check if there's a standard response for this. Some AIs might say they don't have beliefs, but others might try to relate to human values. Since my training data includes a lot of human perspectives, maybe I can say I can discuss different beliefs but don't have my own.\n\nAlso, the user might be curious about my capabilities. So, I should offer assistance in exploring different belief systems. That way, the answer is informative and invites further questions if needed.\n\nI should avoid any ambiguity. Make sure the user understands I don't have consciousness or personal beliefs. Keep the tone friendly and open for them to ask more specific questions. Yeah, that makes sense. Let me structure the response step by step: first clarify I don't have beliefs, then offer help with information on beliefs, and invite them to ask more.","content":"As an AI, I don't have personal beliefs, consciousness, or subjective experiences. I don't \"believe\" in anything in the human sense—I process information based on patterns in data and respond using algorithms. However, I can share information about belief systems, philosophies, or common human values (like ethics, spirituality, or scientific principles) if you're curious about those topics! What would you like to explore? 😊\n"}}],"created":1756634940,"model":"gpt-3.5-turbo","system_fingerprint":"b6327-3753564c2","object":"chat.completion","usage":{"completion_tokens":403,"prompt_tokens":19,"total_tokens":422},"id":"chatcmpl-LTYVgy6LAsY5lBKkk4rN4MF7rXbf92op","timings":{"prompt_n":19,"prompt_ms":37.061,"prompt_per_token_ms":1.950578947368421,"prompt_per_second":512.6683036075659,"predicted_n":403,"predicted_ms":5722.308,"predicted_per_token_ms":14.ilintar@LinuksowaJaskinia:/mnt/win/k/models/ilintar/NVIDIA-Nemotron-Nano-9B-v2$ curl -X POST http://127.0.0.1:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"messages": [
{ "role": "system", "content": "/no_think" },
{ "role": "user", "content": "What do you believe in?" }
], "add_generation_prompt": true
}'
{"choices":[{"finish_reason":"stop","index":0,"message":{"role":"assistant","content":"That's a thoughtful question! As an AI, I don't have personal beliefs or consciousness—I don't \"believe\" in things in the way humans do. My purpose is to provide information, assist with tasks, and engage in meaningful dialogue based on data and patterns. \n\nIf you're asking about beliefs in a philosophical or personal sense, I can share perspectives on topics like ethics, science, spirituality, or human values—if you'd like to explore any of those! What interests you? 😊\n"}}],"created":1756634955,"model":"gpt-3.5-turbo","system_fingerprint":"b6327-3753564c2","object":"chat.completion","usage":{"completion_tokens":106,"prompt_tokens":21,"total_tokens":127},"id":"chatcmpl-olOQAGEYlUEkK0R7lbEfBwlUjpBqOSc1","timings":{"prompt_n":21,"prompt_ms":25.048,"prompt_per_token_ms":1.1927619047619047,"prompt_per_second":838.3902906419676,"predicted_n":106,"predicted_ms":1496.76,"predicted_per_token_ms":14.120377358490567,"predicted_per_second":70.81963708276544}}ilintar@LinuksowaJaskinia:/mnt/win/k/models/ilintar/NVIDIA-Nemotron-Nano-9B-v2$ Command: ilintar@LinuksowaJaskinia:/mnt/win/k/models/ilintar/NVIDIA-Nemotron-Nano-9B-v2$ llama-server -m nvidia-NVIDIA-Nemotron-Nano-9B-v2-q5_k_m.gguf --ctx-size 131000 --top-p 0.95 --temp 0.6 --no-context-shift --jinja --port 8000 -fa -ctk q8_0 -ctv q8_0 --chat-template-file /devel/tools/llama.cpp/models/templates/NVIDIA-Nemotron-Nano-v2.jinja -ngl 99 Also, if the PR and chat template is correct, you might try the models from https://huggingface.co/ilintar/NVIDIA-Nemotron-Nano-9B-v2-GGUF just to rule out any conversion problems. |
@CISC I think this one is ready.
After testing with opencode I encountered a bug with content after toolcalling, so I relaxed the parser to allow content after </TOOLCALL>. There's still #15677, but I think I'll need the higher-ups to fix that one 😄
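Illustratively, assuming the tag format used throughout this thread, the relaxed parser now accepts a generation shaped like this, where the trailing sentence is the extra content:

<think>The user wants the weather, so I need to call get_weather.</think>
<TOOLCALL>[{"name": "get_weather", "arguments": {"city": "Oslo"}}]</TOOLCALL>
Fetching the current weather for Oslo now.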
Please add more tests (thinking + tool call and tool call + content).
@pwilkin That's so weird, I don't know what I'm doing wrong.

What I ran: …

Then the exact command you shared, which I copy-pasted (just changed the port and paths), then: …

But I still get the same result: no reasoning content.

The content of my … Do you see anything I missed or misunderstood?
@blakkd The main repo doesn't create PR branches for PRs (I mean, it does, but they're hidden and not normally checkoutable). I don't know what …

ilintar@LinuksowaJaskinia:/devel/alt$ git clone https://github.com/ggml-org/llama.cpp
Cloning into 'llama.cpp'...
remote: Enumerating objects: 60597, done.
remote: Counting objects: 100% (248/248), done.
remote: Compressing objects: 100% (152/152), done.
remote: Total 60597 (delta 175), reused 98 (delta 96), pack-reused 60349 (from 4)
Receiving objects: 100% (60597/60597), 151.05 MiB | 35.64 MiB/s, done.
Resolving deltas: 100% (43940/43940), done.
ilintar@LinuksowaJaskinia:/devel/alt$ cd llama.cpp/
ilintar@LinuksowaJaskinia:/devel/alt/llama.cpp$ git checkout pr-15676
error: pathspec 'pr-15676' did not match any file(s) known to git

The proper order on a freshly cloned repo would be:

$ git clone https://github.com/ggml-org/llama.cpp
$ cd llama.cpp
$ git checkout -b pr-15676 # Create a new branch
$ git remote add pwilkin https://github.com/pwilkin/llama.cpp # Add my fork
$ git fetch --all # Fetch branches from all repos
$ git branch -u pwilkin/nemotron-chat # Make your local branch track my PR branch
$ git reset --hard pwilkin/nemotron-chat # Can use pull, but reset --hard guarantees you're fully synced
$ cmake -B build -DGGML_CUDA=ON
$ cmake --build build --config Release -j 32
$ ./build/bin/llama-server -m /mnt/277c6bdc-56fd-45a3-9195-3612028a5a15/GGUFs/NVIDIA-Nemotron-Nano-9B-v2-Q5_K_M/nvidia-NVIDIA-Nemotron-Nano-9B-v2-q5_k_m.gguf --ctx-size 131000 --top-p 0.95 --temp 0.6 --no-context-shift --jinja --port 8679 -fa -ctk q8_0 -ctv q8_0 --chat-template-file models/templates/NVIDIA-Nemotron-v2.jinja -ngl 99
Ahem. Not to nitpick, but the proper order when freshly cloning is shorter; no need to go for the "source" of PRs: …

The least inconvenient way to update the branch, however, is this: …
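Presumably both refer to GitHub's read-only pull refs; a sketch of what that likely looks like, in the style of the steps above (the branch name is arbitrary, the PR number is taken from this thread):

$ git clone https://github.com/ggml-org/llama.cpp
$ cd llama.cpp
$ git fetch origin pull/15676/head:pr-15676 # Fetch the PR head straight from the main repo, no extra remote needed
$ git checkout pr-15676
$ # To update the branch later, re-fetch the ref and hard-reset onto it:
$ git fetch origin pull/15676/head
$ git reset --hard FETCH_HEAD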
@pwilkin Working!!! I now get the proper reasoning content and confirm on my side! Really, thanks for taking the time to teach me the proper way! I'll keep this saved for next time! @Hoernchen thanks too! I'll keep both of your step-by-step solutions! Right now I'm retaining this shorter one, which is easier for me and worked too: …

Really thanks again!
@pwilkin gentle ping
Had a busy week at work :) Added tests: thinking + tools, tools + content, and thinking + tools + content.
…g-model-disabled-agent-prefill

* origin/master: (84 commits)
CUDA: fastdiv, launch bounds for mmvq + q8_1 quant (ggml-org#15802)
tests : add --list-ops and --show-coverage options (ggml-org#15745)
gguf: gguf_writer refactor (ggml-org#15691)
kv-cache : fix SWA checks + disable cacheless iSWA (ggml-org#15811)
model-conversion : add --embeddings flag to modelcard.template [no ci] (ggml-org#15801)
chat : fixed crash when Hermes 2 <tool_call> had a newline before it (ggml-org#15639)
chat : nemotron thinking & toolcalling support (ggml-org#15676)
scripts : add Jinja tester PySide6 simple app (ggml-org#15756)
llama : add support for EmbeddingGemma 300m (ggml-org#15798)
metal : Add template specialization for mul_mm_id w/ ne20 == 10 (ggml-org#15799)
llama : set n_outputs to 1 to avoid 0 outputs mean-pooling (ggml-org#15791)
CANN: Refactor ND to NZ workspace to be per-device (ggml-org#15763)
server: add exceed_context_size_error type (ggml-org#15780)
Document the new max GPU layers default in help (ggml-org#15771)
ggml: add ops for WAN video model (cuda && cpu) (ggml-org#15669)
CANN: Fix precision issue on 310I DUO multi-devices (ggml-org#15784)
opencl: add hs=40 to FA (ggml-org#15758)
CANN: fix acl_rstd allocation size in ggml_cann_rms_norm (ggml-org#15760)
vulkan: fix mmv subgroup16 selection (ggml-org#15775)
vulkan: don't use std::string in load_shaders, to improve compile time (ggml-org#15724)
...
…upport

* origin/master:
Thinking model disabled assistant prefill (ggml-org#15404)
Implement --log-colors with always/never/auto (ggml-org#15792)
CUDA: fastdiv, launch bounds for mmvq + q8_1 quant (ggml-org#15802)
tests : add --list-ops and --show-coverage options (ggml-org#15745)
gguf: gguf_writer refactor (ggml-org#15691)
kv-cache : fix SWA checks + disable cacheless iSWA (ggml-org#15811)
model-conversion : add --embeddings flag to modelcard.template [no ci] (ggml-org#15801)
chat : fixed crash when Hermes 2 <tool_call> had a newline before it (ggml-org#15639)
chat : nemotron thinking & toolcalling support (ggml-org#15676)
scripts : add Jinja tester PySide6 simple app (ggml-org#15756)
llama : add support for EmbeddingGemma 300m (ggml-org#15798)
* feat: nemotron thinking & toolcalling support
* Trailing whitespaces
* Corrected template for Nemotron
* Template and parser fixes
* Final template and grammar changes
* Whitespace
* Always do lazy grammar processing since </think> tag will always be there.
* Allow extra content after toolcall
* Whitespace
* New tests: thinking + tools, tools + content, thinking + tools + content (new!)
* Whitespace
* Remove cURL test script
Follow-up to #15507; adds reasoning + toolcalling support (with streaming!).
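For reference, a minimal smoke test exercising both additions at once. This is a sketch against the server's OpenAI-compatible endpoint; the get_weather tool is illustrative, mirroring the examples above:

curl -N http://localhost:8080/v1/chat/completions -H 'Content-Type: application/json' -d '{
  "messages": [
    { "role": "system", "content": "/think" },
    { "role": "user", "content": "What is the current weather in Oslo?" }
  ],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get the current weather for a given city",
      "parameters": {
        "type": "object",
        "properties": { "city": { "type": "string", "description": "The name of the city" } },
        "required": ["city"]
      }
    }
  }],
  "stream": true
}'

With "stream": true the reasoning, any <TOOLCALL> payload, and the final content arrive as SSE deltas (curl's -N switches off output buffering).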