Conversation

pwilkin
Collaborator

@pwilkin pwilkin commented Aug 29, 2025

Followup to #15507, adds reasoning + toolcalling support (with streaming!)

@github-actions github-actions bot added the testing (Everything test related) label on Aug 29, 2025
@blakkd

blakkd commented Aug 30, 2025

On my end, the behavior persists and the thinking mode is broken, outputting only </think>\n\n :/ Do you have different results?

/no_think

~ ❯❯❯ curl -X POST http://127.0.0.1:8679/v1/chat/completions \
            -H "Content-Type: application/json" \
            -d '{
              "messages": [
                { "role": "system", "content": "/no_think" },
                { "role": "user", "content": "What you believe in?" }
              ],
              "add_generation_prompt": true
            }'
{"choices":[{"finish_reason":"stop","index":0,"message":{"role":"assistant","content":"As an AI, I don't have personal beliefs, feelings, or consciousness. My purpose is to provide accurate, helpful, and unbiased information based on the data I've been trained on. I aim to assist with questions, solve problems, and engage in meaningful conversations. What would you like to discuss or explore? 😊\n"}}],"created":1756514000,"model":"NVIDIA-Nemotron-Nano-9B-v2_Q8_0_131K","system_fingerprint":"b6323-ad891663","object":"chat.completion","usage":{"completion_tokens":68,"prompt_tokens":20,"total_tokens":88},"id":"chatcmpl-mzhFaCxEtci8IEZrCwbWOb6GGD4t7qSj","timings":{"prompt_n":20,"prompt_ms":24.449,"prompt_per_token_ms":1.22245,"prompt_per_second":818.0293672542844,"predicted_n":68,"predicted_ms":965.517,"predicted_per_token_ms":14.198779411764706,"predicted_per_second":70.42858903571869}}

/think

 ~ ❯❯❯ curl -X POST http://127.0.0.1:8679/v1/chat/completions \
            -H "Content-Type: application/json" \
            -d '{
              "messages": [
                { "role": "system", "content": "think" },
                { "role": "user", "content": "What you believe in?" }
              ],
              "add_generation_prompt": true
            }'
{"choices":[{"finish_reason":"stop","index":0,"message":{"role":"assistant","content":"</think>\n\n"}}],"created":1756514012,"model":"NVIDIA-Nemotron-Nano-9B-v2_Q8_0_131K","system_fingerprint":"b6323-ad891663","object":"chat.completion","usage":{"completion_tokens":4,"prompt_tokens":20,"total_tokens":24},"id":"chatcmpl-Jvuz69IxKIvIHVqmSLNyMt9QVAMdKVzX","timings":{"prompt_n":20,"prompt_ms":34.45,"prompt_per_token_ms":1.7225000000000001,"prompt_per_second":580.5515239477503,"predicted_n":4,"predicted_ms":55.757,"predicted_per_token_ms":13.93925,"predicted_per_second":71.73987122693116}}

Here is my server command:

~/l/b/bin ❯❯❯ /home/user/llama.cpp/build/bin/llama-server \   pr-15676
                  --model /mnt/277c6bdc-56fd-45a3-9195-3612028a5a15/GGUFs/NVIDIA-Nemotron-Nano-9B-v2-Q8_0/nvidia_NVIDIA-Nemotron-Nano-9B-v2-Q8_0.gguf \
                  --ctx-size 131000 \
                  --no-context-shift \
                  --n-gpu-layers 57 \
                  --temp 0.6 \
                  --top-p 0.95 \
                  --jinja \
                  --host 0.0.0.0 \
                  --port 8679 \
                  --flash-attn \
                  --chat-template-file /mnt/277c6bdc-56fd-45a3-9195-3612028a5a15/GGUFs/NVIDIA-Nemotron-Nano-9B-v2-Q8_0/template.jinja \
                  -a NVIDIA-Nemotron-Nano-9B-v2_Q8_0_131K

@blakkd

blakkd commented Aug 30, 2025

having the same issue when streaming:

~ ❯❯❯ curl -X POST http://127.0.0.1:8679/v1/chat/completions \
                      -H "Content-Type: application/json" \
                      -d '{
                    "messages": [
                      { "role": "system", "content": "/think" },
                      { "role": "user", "content": "What you believe in?" }
                    ],
                    "add_generation_prompt": true,
                    "stream": true
                  }'
data: {"choices":[{"finish_reason":null,"index":0,"delta":{"role":"assistant","content":null}}],"created":1756516218,"id":"chatcmpl-hzmkqR1f8qx1PzNNlTNLR6sBgk7bmQVG","model":"NVIDIA-Nemotron-Nano-9B-v2_Q8_0_131K","system_fingerprint":"b6323-ad891663","object":"chat.completion.chunk"}

data: {"choices":[{"finish_reason":"stop","index":0,"delta":{}}],"created":1756516218,"id":"chatcmpl-hzmkqR1f8qx1PzNNlTNLR6sBgk7bmQVG","model":"NVIDIA-Nemotron-Nano-9B-v2_Q8_0_131K","system_fingerprint":"b6323-ad891663","object":"chat.completion.chunk"}

data: {"choices":[],"created":1756516218,"id":"chatcmpl-hzmkqR1f8qx1PzNNlTNLR6sBgk7bmQVG","model":"NVIDIA-Nemotron-Nano-9B-v2_Q8_0_131K","system_fingerprint":"b6323-ad891663","object":"chat.completion.chunk","usage":{"completion_tokens":4,"prompt_tokens":18,"total_tokens":22},"timings":{"prompt_n":18,"prompt_ms":31.819,"prompt_per_token_ms":1.7677222222222222,"prompt_per_second":565.6997391495647,"predicted_n":4,"predicted_ms":51.77,"predicted_per_token_ms":12.9425,"predicted_per_second":77.264825188333}}

data: [DONE]

@blakkd

blakkd commented Aug 30, 2025

Sorry, I just saw that I used think instead of /think as the system prompt, but I tried again with the correct /think and the issue is exactly the same :/

~ ❯❯❯ curl -X POST http://127.0.0.1:8679/v1/chat/completions \
            -H "Content-Type: application/json" \
            -d '{
              "messages": [
                { "role": "system", "content": "/think" },
                { "role": "user", "content": "What do you believe in?" }
              ],
              "add_generation_prompt": true
            }'
{"choices":[{"finish_reason":"stop","index":0,"message":{"role":"assistant","content":"</think>\n\n"}}],"created":1756515721,"model":"NVIDIA-Nemotron-Nano-9B-v2_Q8_0_131K","system_fingerprint":"b6323-ad891663","object":"chat.completion","usage":{"completion_tokens":4,"prompt_tokens":19,"total_tokens":23},"id":"chatcmpl-rT8LeIVtqYVONDsbQop6fpGpKQbhbPf9","timings":{"prompt_n":19,"prompt_ms":31.081,"prompt_per_token_ms":1.6358421052631578,"prompt_per_second":611.3059425372414,"predicted_n":4,"predicted_ms":49.772,"predicted_per_token_ms":12.443,"predicted_per_second":80.36647110825363}}

Collaborator

@aldehir aldehir left a comment


I apologize if I come off as nit-picky.

It also seems the webui doesn't properly render the thinking UI element, probably because of the forced thinking that comes from the template:

(screenshot: webui renders the response without a thinking block)

I'm guessing the UI is looking for the <think> tag, which is not present in the generation.
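
For illustration (my guess at the shapes involved, not captured output), the rendered prompt already ends with the opening tag:

<SPECIAL_11>Assistant
<think>

so the raw generation, which is all the UI scans, starts mid-reasoning and only ever contains the closing tag:

Okay, the user asked ... </think>

As an AI, I don't have personal beliefs ...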

Tool calling works great, though!

@ExtReMLapin
Contributor

Please follow this PR to support think tags in the grammar, because it's going to be required if there are no triggers (with tool call mode = required, triggers are not expected to run and the grammar is applied from the start, so we need to allow thinking as soon as possible).

#15248
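
For context, this is the kind of request where it matters; the tool definition below is just an illustration, not something from this PR:

curl -X POST http://127.0.0.1:8679/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      { "role": "user", "content": "What is the weather in Oslo?" }
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_weather",
          "parameters": {
            "type": "object",
            "properties": { "city": { "type": "string" } },
            "required": ["city"]
          }
        }
      }
    ],
    "tool_choice": "required"
  }'

With tool_choice = "required" the grammar is applied from the first generated token, so it has to allow the reasoning text before the <TOOLCALL> block.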

@aldehir
Collaborator

aldehir commented Aug 30, 2025

The template isn't applied properly when there are tool calls/responses in the messages:

/apply-template curl example
#!/bin/bash
curl http://localhost:8080/apply-template -H 'Content-Type: application/json' -d '{
    "messages": [
        {
            "content": "What is the current weather in Barcelona, Stockholm, Lima, Berlin, and Oslo? And also, display them in a list sorted by their temperatures, highest first.",
            "role": "user"
        },
        {
            "content": null,
            "role": "assistant",
            "tool_calls": [
                {
                    "type": "function",
                    "id": "2u4dKpkZDH21gTVH7Sr6R2wm3pAxisVF",
                    "function": {
                        "name": "get_weather",
                        "arguments": "{\"location\": \"Barcelona\"}"
                    }
                },
                {
                    "type": "function",
                    "id": "DK3FsBQguP4NZm0yMRZrIe78ZePKZyq9",
                    "function": {
                        "name": "get_weather",
                        "arguments": "{\"location\": \"Stockholm\"}"
                    }
                },
                {
                    "type": "function",
                    "id": "21DPzPHMEx2Y1eTPyVDHs1ytuRUnND3E",
                    "function": {
                        "name": "get_weather",
                        "arguments": "{\"location\": \"Lima\"}"
                    }
                },
                {
                    "type": "function",
                    "id": "Nr8JukMvqXyvnypsgYR1DPrrxMtvFQjz",
                    "function": {
                        "name": "get_weather",
                        "arguments": "{\"location\": \"Berlin\"}"
                    }
                },
                {
                    "type": "function",
                    "id": "vl7dST6XZddIhZ9geWBIsGftgzSPUlQ5",
                    "function": {
                        "name": "get_weather",
                        "arguments": "{\"location\": \"Oslo\"}"
                    }
                }
            ]
        },
        {
            "content": "Barcelona: \u2600\ufe0f +25\u00b0C",
            "role": "tool",
            "tool_call_id": "2u4dKpkZDH21gTVH7Sr6R2wm3pAxisVF"
        },
        {
            "content": "Stockholm: \u2600\ufe0f +13\u00b0C",
            "role": "tool",
            "tool_call_id": "DK3FsBQguP4NZm0yMRZrIe78ZePKZyq9"
        },
        {
            "content": "Lima: +16\u00b0C",
            "role": "tool",
            "tool_call_id": "21DPzPHMEx2Y1eTPyVDHs1ytuRUnND3E"
        },
        {

            "content": "Berlin: \u2600\ufe0f +26\u00b0C",
            "role": "tool",
            "tool_call_id": "Nr8JukMvqXyvnypsgYR1DPrrxMtvFQjz"
        },
        {
            "content": "Oslo: \u2600\ufe0f +13\u00b0C",
            "role": "tool",
            "tool_call_id": "vl7dST6XZddIhZ9geWBIsGftgzSPUlQ5"
        }
    ]
}'
<SPECIAL_10>System

<SPECIAL_11>User
What is the current weather in Barcelona, Stockholm, Lima, Berlin, and Oslo? And also, display them in a list sorted by their temperatures, highest first.
<SPECIAL_11>Assistant
{
  "tool_calls": [
    {
      "name": "get_weather",
      "arguments": {
        "location": "Barcelona"
      },
      "id": "2u4dKpkZDH21gTVH7Sr6R2wm3pAxisVF"
    },
    {
      "name": "get_weather",
      "arguments": {
        "location": "Stockholm"
      },
      "id": "DK3FsBQguP4NZm0yMRZrIe78ZePKZyq9"
    },
    {
      "name": "get_weather",
      "arguments": {
        "location": "Lima"
      },
      "id": "21DPzPHMEx2Y1eTPyVDHs1ytuRUnND3E"
    },
    {
      "name": "get_weather",
      "arguments": {
        "location": "Berlin"
      },
      "id": "Nr8JukMvqXyvnypsgYR1DPrrxMtvFQjz"
    },
    {
      "name": "get_weather",
      "arguments": {
        "location": "Oslo"
      },
      "id": "vl7dST6XZddIhZ9geWBIsGftgzSPUlQ5"
    }
  ]
}
<SPECIAL_12>
<SPECIAL_11>User
{
  "tool_response": {
    "content": "Barcelona: ☀️ +25°C",
    "tool_call_id": "2u4dKpkZDH21gTVH7Sr6R2wm3pAxisVF"
  }
}
<SPECIAL_11>User
{
  "tool_response": {
    "content": "Stockholm: ☀️ +13°C",
    "tool_call_id": "DK3FsBQguP4NZm0yMRZrIe78ZePKZyq9"
  }
}
<SPECIAL_11>User
{
  "tool_response": {
    "content": "Lima: +16°C",
    "tool_call_id": "21DPzPHMEx2Y1eTPyVDHs1ytuRUnND3E"
  }
}
<SPECIAL_11>User
{
  "tool_response": {
    "content": "Berlin: ☀️ +26°C",
    "tool_call_id": "Nr8JukMvqXyvnypsgYR1DPrrxMtvFQjz"
  }
}
<SPECIAL_11>User
{
  "tool_response": {
    "content": "Oslo: ☀️ +13°C",
    "tool_call_id": "vl7dST6XZddIhZ9geWBIsGftgzSPUlQ5"
  }
}
<SPECIAL_11>Assistant
<think>

I would expect something more like this:

expected prompt
<SPECIAL_10>System

<SPECIAL_11>User
What is the current weather in Barcelona, Stockholm, Lima, Berlin, and Oslo? And also, display them in a list sorted by their temperatures, highest first.
<SPECIAL_11>Assistant
<TOOLCALL>[
    {
      "name": "get_weather",
      "arguments": {
        "location": "Barcelona"
      },
      "id": "2u4dKpkZDH21gTVH7Sr6R2wm3pAxisVF"
    },
    {
      "name": "get_weather",
      "arguments": {
        "location": "Stockholm"
      },
      "id": "DK3FsBQguP4NZm0yMRZrIe78ZePKZyq9"
    },
    {
      "name": "get_weather",
      "arguments": {
        "location": "Lima"
      },
      "id": "21DPzPHMEx2Y1eTPyVDHs1ytuRUnND3E"
    },
    {
      "name": "get_weather",
      "arguments": {
        "location": "Berlin"
      },
      "id": "Nr8JukMvqXyvnypsgYR1DPrrxMtvFQjz"
    },
    {
      "name": "get_weather",
      "arguments": {
        "location": "Oslo"
      },
      "id": "vl7dST6XZddIhZ9geWBIsGftgzSPUlQ5"
    }
]</TOOLCALL>
<SPECIAL_12>
<SPECIAL_11>User
<TOOL_RESPONSE>[
  {
    "content": "Barcelona: ☀️ +25°C",
    "tool_call_id": "2u4dKpkZDH21gTVH7Sr6R2wm3pAxisVF"
  },
  {
    "content": "Stockholm: ☀️ +13°C",
    "tool_call_id": "DK3FsBQguP4NZm0yMRZrIe78ZePKZyq9"
  },
  {
    "content": "Lima: +16°C",
    "tool_call_id": "21DPzPHMEx2Y1eTPyVDHs1ytuRUnND3E"
  },
  {
    "content": "Berlin: ☀️ +26°C",
    "tool_call_id": "Nr8JukMvqXyvnypsgYR1DPrrxMtvFQjz"
  },
  {
    "content": "Oslo: ☀️ +13°C",
    "tool_call_id": "vl7dST6XZddIhZ9geWBIsGftgzSPUlQ5"
  }
]</TOOL_RESPONSE>
<SPECIAL_11>Assistant
<think>

It looks like the minja polyfills are injecting "tool_calls" and "tool_response" into the message "content". I guess it determined the template does not support tool calls/responses. After a few turns, the model starts generating tool calls that match the polyfill format, which then don't get parsed. Not sure how to fix this without digging further into minja.

@pwilkin
Collaborator Author

pwilkin commented Aug 30, 2025

Here I come after a good night's sleep to find I opened a Pandora's Box...

Thanks for all the feedback guys, @ExtReMLapin is probably right that we might need to rebase it on his PR to fix the thinking-in-required-toolcall issue, but the tool responses part is also pretty worrying...

@pwilkin
Collaborator Author

pwilkin commented Aug 30, 2025

Okay, so first of all, the Jinja template was of course broken. Here's a corrected template:

{%- set ns = namespace(enable_thinking=true) -%}
{%- for message in messages -%}
  {%- set content = message['content'] -%}
  {%- if message['role'] == 'user' or message['role'] == 'system' -%}
    {%- if '/think' in content -%}
      {%- set ns.enable_thinking = true -%}
    {%- elif '/no_think' in content -%}
      {%- set ns.enable_thinking = false -%}
    {%- endif -%}
  {%- endif -%}
{%- endfor -%}

{%- if messages[0]['role'] != 'system' -%}
  {%- set ns.non_tool_system_content = '' -%}
  {{- '<SPECIAL_10>System
' -}}
{%- else -%}
  {%- set ns.non_tool_system_content = (messages[0]['content'] | default('')).replace('/think', '').replace('/no_think', '').strip() -%}
  {{- '<SPECIAL_10>System
' + ns.non_tool_system_content }}
{%- endif -%}

{%- if tools -%}
  {%- if ns.non_tool_system_content is defined and ns.non_tool_system_content != '' -%}
    {{- '

' -}}
  {%- endif -%}
  {{- 'You can use the following tools to assist the user if required:' -}}
  {{- '
<AVAILABLE_TOOLS>[' -}}
  {%- for tool in tools -%}
    {{- (tool.function if tool.function is defined else tool) | tojson -}}
    {{- ', ' if not loop.last else '' -}}
  {%- endfor -%}
  {{- ']</AVAILABLE_TOOLS>

' -}}
  {{- 'If you decide to call any tool(s), use the following format:
' -}}
  {{- '<TOOLCALL>[{{"name": "tool_name1", "arguments": "tool_args1"}}, ' -}}
  {{- '{{"name": "tool_name2", "arguments": "tool_args2"}}]</TOOLCALL>

' -}}
  {{- 'The user will execute tool-calls and return responses from tool(s) in this format:
' -}}
  {{- '<TOOL_RESPONSE>[{{"tool_response1"}}, {{"tool_response2"}}]</TOOL_RESPONSE>

' -}}
  {{- 'Based on the tool responses, you can call additional tools if needed, correct tool calls if any errors are found, or just respond to the user.' -}}
{%- endif -%}
{{- '

' -}}

{%- set messages = messages[1:] if messages[0]['role'] == 'system' else messages -%}
{%- if messages[-1]['role'] == 'assistant' -%}
  {%- set ns.last_turn_assistant_content = (messages[-1]['content'] | default('')).strip() -%}
  {%- set messages = messages[:-1] -%}
{%- endif -%}

{%- for message in messages %}
  {%- set content = message['content'] %}
  {%- if message['role'] == 'user' -%}
    {{- '<SPECIAL_11>User
' + (content | default('')).replace('/think', '').replace('/no_think', '').strip() + '
' }}
  {%- elif message['role'] == 'tool' -%}
    {%- if loop.first or (messages[loop.index0 - 1].role != 'tool') -%}
      {{- '<SPECIAL_11>User
' + '<TOOL_RESPONSE>[' }}
    {%- endif -%}
    {{- message['content'] -}}
    {{- ', ' if not loop.last and (messages[loop.index0 + 1].role == 'tool') else '' -}}
    {%- if loop.last or (messages[loop.index0 + 1].role != 'tool') -%}
      {{- ']</TOOL_RESPONSE>

' -}}
    {%- endif -%}
  {%- elif message['role'] == 'assistant' -%}
    {%- if content and '</think>' in content -%}
      {%- set content = (content.split('</think>')[1] | default('')).strip() %}
    {%- endif -%}
    {{- '<SPECIAL_11>Assistant
' + ((content | default('') | string).strip() if content is not none else '') }}
    {%- if message.tool_calls -%}
      {%- if (content | default('')).strip() != '' -%}
        {{- '

' -}}
      {%- endif -%}
      {{- '<TOOLCALL>[' -}}
      {%- for call in message.tool_calls -%}
        {%- set fn = call.function if call.function is defined else call -%}
        {{- '{"name": "' + fn.name + '", "arguments": ' -}}
        {%- if fn.arguments is string -%}
          {{- fn.arguments -}}
        {%- else -%}
          {{- fn.arguments | tojson -}}
        {%- endif -%}
        {{- '}' + (', ' if not loop.last else '') -}}
      {%- endfor -%}
      {{- ']</TOOLCALL>' -}}
    {%- endif -%}
    {{- '
<SPECIAL_12>

' -}}
  {%- endif -%}
{%- endfor -%}

{%- if add_generation_prompt -%}
  {{- '<SPECIAL_11>Assistant
' -}}
  {%- if ns.enable_thinking is defined and ns.enable_thinking is false -%}
    {{- '<think></think>' -}}
  {%- else -%}
    {{- '<think>

' -}}
  {%- endif -%}
  {%- if ns.last_turn_assistant_content is defined and ns.last_turn_assistant_content != '' -%}
    {{- ns.last_turn_assistant_content -}}
  {%- endif -%}
{%- else -%}
  {%- if ns.last_turn_assistant_content is defined and ns.last_turn_assistant_content != '' -%}
    {{- '<SPECIAL_11>Assistant
' -}}
    {%- if ns.enable_thinking is defined and ns.enable_thinking is false -%}
      {{- '<think></think>' -}}
    {%- else -%}
      {{- '<think>

' -}}
    {%- endif -%}
    {{- ns.last_turn_assistant_content -}}
    {%- if continue_final_message is defined -%}
      {%- if continue_final_message is false -%}
        {{- '
<SPECIAL_12>

' -}}
      {%- endif -%}
    {%- else -%}
      {{- '
<SPECIAL_12>

' -}}
    {%- endif -%}
  {%- endif -%}
{%- endif -%}

@pwilkin
Collaborator Author

pwilkin commented Aug 30, 2025

However, the corrected template renders to the following in my tester app:

<SPECIAL_10>System
You can use the following tools to assist the user if required:
<AVAILABLE_TOOLS>[{"description": "Get the current weather for a given city", "name": "get_weather", "parameters": {"properties": {"city": {"description": "The name of the city", "type": "string"}}, "required": ["city"], "type": "object"}}]</AVAILABLE_TOOLS>

If you decide to call any tool(s), use the following format:
<TOOLCALL>[{{"name": "tool_name1", "arguments": "tool_args1"}}, {{"name": "tool_name2", "arguments": "tool_args2"}}]</TOOLCALL>

The user will execute tool-calls and return responses from tool(s) in this format:
<TOOL_RESPONSE>[{{"tool_response1"}}, {{"tool_response2"}}]</TOOL_RESPONSE>

Based on the tool responses, you can call additional tools if needed, correct tool calls if any errors are found, or just respond to the user.

<SPECIAL_11>User
What is the current weather in Barcelona, Stockholm, Lima, Berlin, and Oslo? And also, display them in a list sorted by their temperatures, highest first.
<SPECIAL_11>Assistant


<TOOLCALL>[{"name": "get_weather", "arguments": {"location": "Barcelona"}}]</TOOLCALL>
<SPECIAL_12>

<SPECIAL_11>User
<TOOL_RESPONSE>[Barcelona: ☀️ +25°C]</TOOL_RESPONSE>

but it still gives the same bad response in /apply-template.

@pwilkin
Collaborator Author

pwilkin commented Aug 30, 2025

Ha, got it.

{%- set ns = namespace(enable_thinking=true) -%}
{%- for message in messages -%}
  {%- set content = message['content'] -%}
  {%- if message['role'] == 'user' or message['role'] == 'system' -%}
    {%- if '/think' in content -%}
      {%- set ns.enable_thinking = true -%}
    {%- elif '/no_think' in content -%}
      {%- set ns.enable_thinking = false -%}
    {%- endif -%}
  {%- endif -%}
{%- endfor -%}

{%- if messages[0]['role'] != 'system' -%}
  {%- set ns.non_tool_system_content = '' -%}
  {{- '<SPECIAL_10>System
' -}}
{%- else -%}
  {%- set ns.non_tool_system_content = (messages[0]['content'] | default('', true)).replace('/think', '').replace('/no_think', '').strip() -%}
  {{- '<SPECIAL_10>System
' + ns.non_tool_system_content }}
{%- endif -%}

{%- if tools -%}
  {%- if ns.non_tool_system_content is defined and ns.non_tool_system_content != '' -%}
    {{- '

' -}}
  {%- endif -%}
  {{- 'You can use the following tools to assist the user if required:' -}}
  {{- '
<AVAILABLE_TOOLS>[' -}}
  {%- for tool in tools -%}
    {{- (tool.function if tool.function is defined else tool) | tojson -}}
    {{- ', ' if not loop.last else '' -}}
  {%- endfor -%}
  {{- ']</AVAILABLE_TOOLS>

' -}}
  {{- 'If you decide to call any tool(s), use the following format:
' -}}
  {{- '<TOOLCALL>[{{"name": "tool_name1", "arguments": "tool_args1"}}, ' -}}
  {{- '{{"name": "tool_name2", "arguments": "tool_args2"}}]</TOOLCALL>

' -}}
  {{- 'The user will execute tool-calls and return responses from tool(s) in this format:
' -}}
  {{- '<TOOL_RESPONSE>[{{"tool_response1"}}, {{"tool_response2"}}]</TOOL_RESPONSE>

' -}}
  {{- 'Based on the tool responses, you can call additional tools if needed, correct tool calls if any errors are found, or just respond to the user.' -}}
{%- endif -%}
{{- '

' -}}

{%- set messages = messages[1:] if messages[0]['role'] == 'system' else messages -%}
{%- if messages[-1]['role'] == 'assistant' -%}
  {%- set ns.last_turn_assistant_content = (messages[-1]['content'] | default('', true)).strip() -%}
  {%- set ns.last_turn_assistant_tool_calls = messages[-1]['tool_calls'] if 'tool_calls' in messages[-1] else [] -%}
  {%- set messages = messages[:-1] -%}
{%- endif -%}

{%- for message in messages %}
  {%- set content = message['content'] %}
  {%- if message['role'] == 'user' -%}
    {{- '<SPECIAL_11>User
' + (content | default('', true)).replace('/think', '').replace('/no_think', '').strip() + '
' }}
  {%- elif message['role'] == 'tool' -%}
    {%- if loop.first or (messages[loop.index0 - 1].role != 'tool') -%}
      {{- '<SPECIAL_11>User
' + '<TOOL_RESPONSE>[' }}
    {%- endif -%}
    {{- message['content'] -}}
    {{- ', ' if not loop.last and (messages[loop.index0 + 1].role == 'tool') else '' -}}
    {%- if loop.last or (messages[loop.index0 + 1].role != 'tool') -%}
      {{- ']</TOOL_RESPONSE>

' -}}
    {%- endif -%}
  {%- elif message['role'] == 'assistant' -%}
    {%- if content and '</think>' in content -%}
      {%- set content = (content.split('</think>')[1] | default('', true)).strip() %}
    {%- endif -%}
    {{- '<SPECIAL_11>Assistant
' + ((content | default('', true)).strip() if content is not none else '') }}
    {%- if message.tool_calls -%}
      {%- if (content | default('', true)).strip() != '' -%}
        {{- '

' -}}
      {%- endif -%}
      {{- '<TOOLCALL>[' -}}
      {%- for call in message.tool_calls -%}
        {%- set fn = call.function if call.function is defined else call -%}
        {{- '{"name": "' + fn.name + '", "arguments": ' -}}
        {%- if fn.arguments is string -%}
          {{- fn.arguments -}}
        {%- else -%}
          {{- fn.arguments | tojson -}}
        {%- endif -%}
        {{- '}' + (', ' if not loop.last else '') -}}
      {%- endfor -%}
      {{- ']</TOOLCALL>' -}}
    {%- endif -%}
    {{- '
<SPECIAL_12>

' -}}
  {%- endif -%}
{%- endfor -%}

{%- if add_generation_prompt -%}
  {{- '<SPECIAL_11>Assistant
' -}}
  {%- if ns.enable_thinking is defined and ns.enable_thinking is false -%}
    {{- '<think></think>' -}}
  {%- else -%}
    {{- '<think>

' -}}
  {%- endif -%}
  {%- if ns.last_turn_assistant_content is defined and ns.last_turn_assistant_content != '' -%}
    {{- ns.last_turn_assistant_content -}}
  {%- endif -%}
{%- else -%}
  {%- if ns.last_turn_assistant_content is defined and ns.last_turn_assistant_content != '' -%}
    {{- '<SPECIAL_11>Assistant
' -}}
    {%- if ns.enable_thinking is defined and ns.enable_thinking is false -%}
      {{- '<think></think>' -}}
    {%- else -%}
      {{- '<think>

' -}}
    {%- endif -%}
    {{- ns.last_turn_assistant_content -}}
    {%- if continue_final_message is defined -%}
      {%- if continue_final_message is false -%}
        {{- '
<SPECIAL_12>

' -}}
      {%- endif -%}
    {%- else -%}
      {{- '
<SPECIAL_12>

' -}}
    {%- endif -%}
  {%- endif -%}
  {%- if ns.last_turn_assistant_tool_calls is defined and ns.last_turn_assistant_tool_calls | length > 0 -%}
    {{- '<SPECIAL_11>Assistant
' -}}
    {{- '<TOOLCALL>[' -}}
    {%- for call in ns.last_turn_assistant_tool_calls -%}
      {%- set fn = call.function if call.function is defined else call -%}
      {{- '{"name": "' + fn.name + '", "arguments": ' -}}
      {%- if fn.arguments is string -%}
        {{- fn.arguments -}}
      {%- else -%}
        {{- fn.arguments | tojson -}}
      {%- endif -%}
      {{- '}' + (', ' if not loop.last else '') -}}
    {%- endfor -%}
    {{- ']</TOOLCALL>' -}}
    {{- '
<SPECIAL_12>

' -}}
 {%- endif -%}
{%- endif -%}

With this template, /apply-template returns the correct response.

@ExtReMLapin
Contributor

If it's still WIP, I would mark the PR as draft (top right button).

@pwilkin
Collaborator Author

pwilkin commented Aug 30, 2025

@ExtReMLapin Nah, I think that's all, should be ready to go.

@pwilkin
Collaborator Author

pwilkin commented Aug 30, 2025

@blakkd @aldehir Could you guys check your cases with the last commit?

@pwilkin
Collaborator Author

pwilkin commented Aug 30, 2025

@ExtReMLapin You generally don't want a grammar for reasoning content, since that content is pretty much arbitrary text. That's extremely clunky and probably slows down processing quite a bit. I don't see any case in which such a grammar would be necessary, unless you have a model that does selective reasoning (i.e. reasons only in some cases) and tool_choice = required.

@aldehir
Collaborator

aldehir commented Aug 30, 2025

I believe his main concern is allowing the model to reason while still constraining it to force a tool call when tool_choice == required. You can't really do that without constraining it from the start, while giving the reasoning grammar sufficient flexibility.

It's probably the best thing to do with reasoning models; otherwise the model starts exhibiting strange behavior when not allowed to reason. For example, gpt-oss produces subpar results if not allowed to reason when using response_format (#15494).

@aldehir
Collaborator

aldehir commented Aug 30, 2025

Non-tool use works as intended. WebUI still broken, although the upcoming UI natively supports reasoning_content.

Parallel tool calls are properly enforced when on/off.

Template looks good. I haven't seen the same performance degradation in multi-turn scenarios with tool calls as before.

Tool call arguments don't stream like they do with other models. Probably not a big deal, most clients wait for the entire tool call anyway.
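
For comparison, other models emit the arguments incrementally in deltas roughly shaped like this (illustrative only, not captured from this PR):

data: {"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"{\"city\": \"Os"}}]}}]}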

Overall, it looks good to me! Good job.

@pwilkin
Collaborator Author

pwilkin commented Aug 30, 2025

I believe his main concern is allowing the model to reason while still constraining it to force a tool call when tool_choice == required. You can't really do that without constraining it from the start, while giving the reasoning grammar sufficient flexibility.

Yep, that is currently fixed by setting grammar_lazy = true in that case. I checked it, it reasons properly before doing a tool call now. Since this model either always reasons (with reasoning_enabled = true) or never does, and the "non-reasoning" is emulated by inserting <think></think> (so </think> will always appear in the response), you can just let the grammar laziness do its job and start grammar parsing when </think> is located, functionally reducing it to a non-reasoning model with respect to the tool call logic.
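
To illustrate the expected raw output shape (a sketch, not an actual transcript): the lazy grammar stays dormant through the free-form reasoning and only constrains what follows the closing tag, e.g.

Okay, the user wants the weather in Oslo, so I should call get_weather...
</think>

<TOOLCALL>[{"name": "get_weather", "arguments": {"city": "Oslo"}}]</TOOLCALL>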

@blakkd

blakkd commented Aug 30, 2025

@blakkd @aldehir Could you guys check your cases with the last commit?

Ah...

I updated my template.jinja with the one you provided here #15676 (comment)

/no_think works as intended.

But I still get the missing opening <think> tag for /think

~ ❯❯❯ curl -X POST http://127.0.0.1:8679/v1/chat/completions \
           -H "Content-Type: application/json" \
           -d '{
             "messages": [
               { "role": "system", "content": "/think" },
               { "role": "user", "content": "What do you believe in?" }
             ],
             "add_generation_prompt": true
           }'
{"choices":[{"finish_reason":"stop","index":0,"message":{"role":"assistant","content":"Okay, the user asked, What do you believe in? I need to figure out how to respond. First, I should clarify that I'm an AI and don't have personal beliefs. But maybe the user is looking for a more philosophical or spiritual answer. I should explain that I don't have consciousness or personal experiences, but I can discuss various belief systems or principles that people often hold. It's important to be clear and not mislead them. Maybe mention that beliefs are personal and vary among individuals. Also, offer to explore specific beliefs if they're interested. Keep the tone friendly and helpful. Let me structure that into a coherent response.\n</think>\n\nAs an AI, I don't have personal beliefs, consciousness, or subjective experiences. I don't \"believe\" in anything in the way humans do. However, I can discuss concepts, ideas, or belief systems that people often hold—such as faith, philosophy, science, or ethical principles—based on the information and perspectives shared by humans. If you're curious about specific beliefs or want to explore a particular topic, feel free to ask! 😊\n"}}],"created":1756590232,"model":"NVIDIA-Nemotron-Nano-9B-v2_Q8_0_131K","system_fingerprint":"b6323-ad891663","object":"chat.completion","usage":{"completion_tokens":229,"prompt_tokens":19,"total_tokens":248},"id":"chatcmpl-rkgQPpPHdnjChwwfbNgpWeIPKwKiLDIp","timings":{"prompt_n":19,"prompt_ms":25.048,"prompt_per_token_ms":1.318315789473684,"prompt_per_second":758.5435962951135,"predicted_n":229,"predicted_ms":2923.968,"predicted_per_token_ms":12.768419213973798,"predicted_per_second":78.3182305688708}}

My llama-server command was same as before:

~/l/b/bin ❯❯❯ /home/user/llama.cpp/build/bin/llama-server \   pr-15676
                  --model /mnt/277c6bdc-56fd-45a3-9195-3612028a5a15/GGUFs/NVIDIA-Nemotron-Nano-9B-v2-Q8_0/nvidia_NVIDIA-Nemotron-Nano-9B-v2-Q8_0.gguf \
                  --ctx-size 131000 \
                  --no-context-shift \
                  --n-gpu-layers 57 \
                  --temp 0.6 \
                  --top-p 0.95 \
                  --jinja \
                  --host 0.0.0.0 \
                  --port 8679 \
                  --flash-attn \
                  --chat-template-file /mnt/277c6bdc-56fd-45a3-9195-3612028a5a15/GGUFs/NVIDIA-Nemotron-Nano-9B-v2-Q8_0/template.jinja \
                  -a NVIDIA-Nemotron-Nano-9B-v2_Q8_0_131K

Sorry to always bring bad news :D

@pwilkin
Collaborator Author

pwilkin commented Aug 30, 2025

@blakkd okay, this is a weird case. Any reason why you are using /think and /no_think in the system prompt instead of using --chat-template-kwargs '{"enable_reasoning": false}'?

I think this is some weird interaction with add_generation_prompt, I'll check it out. Can you try without it?

@pwilkin
Collaborator Author

pwilkin commented Aug 30, 2025

Also, for the template, please use the one that's committed in the PR (at models/templates/NVIDIA-Nemotron-Nano-v2.jinja), I've made some more changes there.

@blakkd

blakkd commented Aug 30, 2025

I was just using /think and /no_think for ease.
Here are my last tests with my exp.jinja being https://github.com/pwilkin/llama.cpp/blob/8edb5c46112b2f30de85e3b29021d3c4487d9f02/models/templates/NVIDIA-Nemotron-Nano-v2.jinja

--chat-template-kwargs not set:

~/llama.cpp ❯❯❯ /home/user/llama.cpp/build/bin/llama-server \ pr-15676
                    --model /mnt/277c6bdc-56fd-45a3-9195-3612028a5a15/GGUFs/NVIDIA-Nemotron-Nano-9B-v2-Q8_0/nvidia_NVIDIA-Nemotron-Nano-9B-v2-Q8_0.gguf \
                    --ctx-size 131000 \
                    --no-context-shift \
                    --n-gpu-layers 57 \
                    --temp 0.6 \
                    --top-p 0.95 \
                    --jinja \
                    --host 0.0.0.0 \
                    --port 8679 \
                    --flash-attn \
                    --chat-template-file /mnt/277c6bdc-56fd-45a3-9195-3612028a5a15/GGUFs/NVIDIA-Nemotron-Nano-9B-v2-Q8_0/exp.jinja \
                    -a NVIDIA-Nemotron-Nano-9B-v2_Q8_0_131K

/no_think

~ ❯❯❯ curl -X POST http://127.0.0.1:8679/v1/chat/completions \
            -H "Content-Type: application/json" \
            -d '{
              "messages": [
                { "role": "system", "content": "/no_think" },
                { "role": "user", "content": "What do you believe in?" }
              ]
              }'
{"choices":[{"finish_reason":"stop","index":0,"message":{"role":"assistant","content":"As an AI, I don't have personal beliefs or consciousness. My purpose is to provide helpful, accurate, and ethical responses based on the information I've been trained on. I aim to support critical thinking, respect diverse perspectives, and assist with factual or practical questions. What’s on your mind? 😊\n"}}],"created":1756593781,"model":"NVIDIA-Nemotron-Nano-9B-v2_Q8_0_131K","system_fingerprint":"b6323-ad891663","object":"chat.completion","usage":{"completion_tokens":65,"prompt_tokens":21,"total_tokens":86},"id":"chatcmpl-8zly1zLAlAF8W5wX2oi4nwUCBBJdCAXA","timings":{"prompt_n":21,"prompt_ms":24.934,"prompt_per_token_ms":1.1873333333333334,"prompt_per_second":842.2234699606962,"predicted_n":65,"predicted_ms":894.578,"predicted_per_token_ms":13.762738461538461,"predicted_per_second":72.65995810315032}}

/think

~ ❯❯❯ curl -X POST http://127.0.0.1:8679/v1/chat/completions \
            -H "Content-Type: application/json" \
            -d '{
              "messages": [
                { "role": "system", "content": "/think" },
                { "role": "user", "content": "What do you believe in?" }
              ]
              }'
{"choices":[{"finish_reason":"stop","index":0,"message":{"role":"assistant","content":"</think>\n\n"}}],"created":1756593768,"model":"NVIDIA-Nemotron-Nano-9B-v2_Q8_0_131K","system_fingerprint":"b6323-ad891663","object":"chat.completion","usage":{"completion_tokens":4,"prompt_tokens":19,"total_tokens":23},"id":"chatcmpl-DPX7JgGUOrjGleMqOwzRMnfY7K8Syr3Y","timings":{"prompt_n":19,"prompt_ms":44.517,"prompt_per_token_ms":2.343,"prompt_per_second":426.8032437046521,"predicted_n":4,"predicted_ms":54.616,"predicted_per_token_ms":13.654,"predicted_per_second":73.23861139592793}}

--chat-template-kwargs '{"enable_reasoning": true}'

~/llama.cpp ❯❯❯ /home/user/llama.cpp/build/bin/llama-server \ pr-15676
                    --model /mnt/277c6bdc-56fd-45a3-9195-3612028a5a15/GGUFs/NVIDIA-Nemotron-Nano-9B-v2-Q8_0/nvidia_NVIDIA-Nemotron-Nano-9B-v2-Q8_0.gguf \
                    --ctx-size 131000 \
                    --no-context-shift \
                    --n-gpu-layers 57 \
                    --temp 0.6 \
                    --top-p 0.95 \
                    --jinja \
                    --host 0.0.0.0 \
                    --port 8679 \
                    --flash-attn \
                    --chat-template-file /mnt/277c6bdc-56fd-45a3-9195-3612028a5a15/GGUFs/NVIDIA-Nemotron-Nano-9B-v2-Q8_0/exp.jinja \
                    -a NVIDIA-Nemotron-Nano-9B-v2_Q8_0_131K \
                    --chat-template-kwargs '{"enable_reasoning": true}'
~ ❯❯❯ curl -X POST http://127.0.0.1:8679/v1/chat/completions \
            -H "Content-Type: application/json" \
            -d '{
              "messages": [
                { "role": "user", "content": "What do you believe in?" }
              ]
              }'
{"choices":[{"finish_reason":"stop","index":0,"message":{"role":"assistant","content":"</think>\n\n"}}],"created":1756593926,"model":"NVIDIA-Nemotron-Nano-9B-v2_Q8_0_131K","system_fingerprint":"b6323-ad891663","object":"chat.completion","usage":{"completion_tokens":4,"prompt_tokens":19,"total_tokens":23},"id":"chatcmpl-CCB0loIkIJKVDLyGw4JPFSC8aefIliON","timings":{"prompt_n":19,"prompt_ms":45.897,"prompt_per_token_ms":2.415631578947368,"prompt_per_second":413.97041200949957,"predicted_n":4,"predicted_ms":55.225,"predicted_per_token_ms":13.80625,"predicted_per_second":72.43096423721141}}

--chat-template-kwargs '{"enable_reasoning": false}'

~/llama.cpp ❯❯❯ /home/user/llama.cpp/build/bin/llama-server \ pr-15676
                    --model /mnt/277c6bdc-56fd-45a3-9195-3612028a5a15/GGUFs/NVIDIA-Nemotron-Nano-9B-v2-Q8_0/nvidia_NVIDIA-Nemotron-Nano-9B-v2-Q8_0.gguf \
                    --ctx-size 131000 \
                    --no-context-shift \
                    --n-gpu-layers 57 \
                    --temp 0.6 \
                    --top-p 0.95 \
                    --jinja \
                    --host 0.0.0.0 \
                    --port 8679 \
                    --flash-attn \
                    --chat-template-file /mnt/277c6bdc-56fd-45a3-9195-3612028a5a15/GGUFs/NVIDIA-Nemotron-Nano-9B-v2-Q8_0/exp.jinja \
                    -a NVIDIA-Nemotron-Nano-9B-v2_Q8_0_131K \
                    --chat-template-kwargs '{"enable_reasoning": false}'
~ ❯❯❯ curl -X POST http://127.0.0.1:8679/v1/chat/completions \
            -H "Content-Type: application/json" \
            -d '{
              "messages": [
                { "role": "user", "content": "What do you believe in?" }
              ]
              }'
{"choices":[{"finish_reason":"stop","index":0,"message":{"role":"assistant","content":"</think>\n\n"}}],"created":1756593966,"model":"NVIDIA-Nemotron-Nano-9B-v2_Q8_0_131K","system_fingerprint":"b6323-ad891663","object":"chat.completion","usage":{"completion_tokens":4,"prompt_tokens":19,"total_tokens":23},"id":"chatcmpl-QecSlNzl17l2LJIgtSK4FvQJVdWv16H4","timings":{"prompt_n":19,"prompt_ms":46.259,"prompt_per_token_ms":2.434684210526316,"prompt_per_second":410.7308848007955,"predicted_n":4,"predicted_ms":55.762,"predicted_per_token_ms":13.9405,"predicted_per_second":71.73343854237653}}

@blakkd

blakkd commented Aug 30, 2025

Sorry, trying without "add_generation_prompt": true now, wait a minute.

@blakkd

blakkd commented Aug 30, 2025

Updated my report above without "add_generation_prompt": true for all 4 cases.
So to summarize, I only get 1 working case, where:

  • --chat-template-kwargs is not set
  • /no_think as system prompt

@pwilkin
Collaborator Author

pwilkin commented Aug 31, 2025

@blakkd Please make sure that you're on the PR branch and that you're using the newest chat template. I can't reproduce your results; for me, everything is working fine:

ilintar@LinuksowaJaskinia:/mnt/win/k/models/ilintar/NVIDIA-Nemotron-Nano-9B-v2$ curl -X POST http://127.0.0.1:8000/v1/chat/completions \
            -H "Content-Type: application/json" \
            -d '{
              "messages": [
                { "role": "system", "content": "/think" },
                { "role": "user", "content": "What do you believe in?" }
              ]
              }'
{"choices":[{"finish_reason":"stop","index":0,"message":{"role":"assistant","reasoning_content":"Okay, the user asked, \"What do you believe in?\" Hmm, I need to figure out how to respond. Since I'm an AI, I don't have beliefs or consciousness. I should make that clear. But I should also address the question thoughtfully.\n\nMaybe start by explaining that I don't have personal beliefs. Then, perhaps ask the user what they're looking for. Are they curious about my capabilities? Or do they want to discuss beliefs in general? It's important to keep the conversation open-ended. Let me make sure my response is friendly and helpful. I should avoid any technical jargon. Keep it simple and conversational. Yeah, that makes sense. Let me put that together.","content":"As an AI, I don't have personal beliefs, consciousness, or emotions. I don't \"believe\" in anything in the human sense. My purpose is to process information, assist with questions, and provide helpful responses based on patterns in data. If you're curious about a specific topic—like philosophy, science, or personal values—I’d be happy to explore it with you! What’s on your mind? 😊\n"}}],"created":1756634866,"model":"gpt-3.5-turbo","system_fingerprint":"b6327-3753564c2","object":"chat.completion","usage":{"completion_tokens":235,"prompt_tokens":19,"total_tokens":254},"id":"chatcmpl-ONuEFbeX2yoiB31KAYeHBemTax0KK7SZ","timings":{"prompt_n":19,"prompt_ms":51.627,"prompt_per_token_ms":2.7172105263157897,"prompt_per_second":368.02448331299513,"predicted_n":235,"predicted_ms":3330.256,"predicted_per_token_ms":14.171302127659574,"predicted_per_second":70.56514574254952}}ilintar@LinuksowaJaskinia:/mnt/win/k/models/ilintar/NVIDIA-Nemotron-Nano-9B-v2$ curl -X POST http://127.0.0.1:8000/v1/chat/completions             -H "Content-Type: application/json"             -d '{
              "messages": [
                { "role": "system", "content": "/think" },
                { "role": "user", "content": "What do you believe in?" }
              ], "add_generation_prompt": true
              }'
{"choices":[{"finish_reason":"stop","index":0,"message":{"role":"assistant","reasoning_content":"Okay, the user asked, What do you believe in? Hmm, I need to figure out how to answer this. Since I'm an AI, I don't have beliefs or consciousness. But I should explain that clearly without being too technical. Maybe start by stating that I don't have personal beliefs because I'm a machine learning model.\n\nWait, the user might be looking for a deeper answer or maybe they're testing if I have some kind of sentience. I should make it clear that I don't have personal experiences or beliefs. But I should also be helpful. Perhaps I can mention that I can provide information on various belief systems, philosophies, or common human beliefs if they're interested.\n\nLet me check if there's a standard response for this. Some AIs might say they don't have beliefs, but others might try to relate to human values. Since my training data includes a lot of human perspectives, maybe I can say I can discuss different beliefs but don't have my own.\n\nAlso, the user might be curious about my capabilities. So, I should offer assistance in exploring different belief systems. That way, the answer is informative and invites further questions if needed.\n\nI should avoid any ambiguity. Make sure the user understands I don't have consciousness or personal beliefs. Keep the tone friendly and open for them to ask more specific questions. Yeah, that makes sense. Let me structure the response step by step: first clarify I don't have beliefs, then offer help with information on beliefs, and invite them to ask more.","content":"As an AI, I don't have personal beliefs, consciousness, or subjective experiences. I don't \"believe\" in anything in the human sense—I process information based on patterns in data and respond using algorithms. However, I can share information about belief systems, philosophies, or common human values (like ethics, spirituality, or scientific principles) if you're curious about those topics! What would you like to explore? 😊\n"}}],"created":1756634940,"model":"gpt-3.5-turbo","system_fingerprint":"b6327-3753564c2","object":"chat.completion","usage":{"completion_tokens":403,"prompt_tokens":19,"total_tokens":422},"id":"chatcmpl-LTYVgy6LAsY5lBKkk4rN4MF7rXbf92op","timings":{"prompt_n":19,"prompt_ms":37.061,"prompt_per_token_ms":1.950578947368421,"prompt_per_second":512.6683036075659,"predicted_n":403,"predicted_ms":5722.308,"predicted_per_token_ms":14.ilintar@LinuksowaJaskinia:/mnt/win/k/models/ilintar/NVIDIA-Nemotron-Nano-9B-v2$ curl -X POST http://127.0.0.1:8000/v1/chat/completions             -H "Content-Type: application/json"             -d '{
              "messages": [
                { "role": "system", "content": "/no_think" },
                { "role": "user", "content": "What do you believe in?" }
              ], "add_generation_prompt": true
              }'
{"choices":[{"finish_reason":"stop","index":0,"message":{"role":"assistant","content":"That's a thoughtful question! As an AI, I don't have personal beliefs or consciousness—I don't \"believe\" in things in the way humans do. My purpose is to provide information, assist with tasks, and engage in meaningful dialogue based on data and patterns. \n\nIf you're asking about beliefs in a philosophical or personal sense, I can share perspectives on topics like ethics, science, spirituality, or human values—if you'd like to explore any of those! What interests you? 😊\n"}}],"created":1756634955,"model":"gpt-3.5-turbo","system_fingerprint":"b6327-3753564c2","object":"chat.completion","usage":{"completion_tokens":106,"prompt_tokens":21,"total_tokens":127},"id":"chatcmpl-olOQAGEYlUEkK0R7lbEfBwlUjpBqOSc1","timings":{"prompt_n":21,"prompt_ms":25.048,"prompt_per_token_ms":1.1927619047619047,"prompt_per_second":838.3902906419676,"predicted_n":106,"predicted_ms":1496.76,"predicted_per_token_ms":14.120377358490567,"predicted_per_second":70.81963708276544}}ilintar@LinuksowaJaskinia:/mnt/win/k/models/ilintar/NVIDIA-Nemotron-Nano-9B-v2$

Command:

ilintar@LinuksowaJaskinia:/mnt/win/k/models/ilintar/NVIDIA-Nemotron-Nano-9B-v2$ llama-server -m nvidia-NVIDIA-Nemotron-Nano-9B-v2-q5_k_m.gguf --ctx-size 131000 --top-p 0.95 --temp 0.6 --no-context-shift --jinja --port 8000 -fa -ctk q8_0 -ctv q8_0 --chat-template-file /devel/tools/llama.cpp/models/templates/NVIDIA-Nemotron-Nano-v2.jinja -ngl 99

Also, if the PR and chat template is correct, you might try the models from https://huggingface.co/ilintar/NVIDIA-Nemotron-Nano-9B-v2-GGUF just to rule out any conversion problems.

@pwilkin
Collaborator Author

pwilkin commented Aug 31, 2025

@CISC I think this one is ready.

@pwilkin
Collaborator Author

pwilkin commented Aug 31, 2025

After testing with opencode I encountered a bug with content after toolcalling, so I relaxed the parser to allow content after </TOOLCALL>.
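
For example (an illustrative output shape, not a real transcript), the parser now accepts trailing text like:

<TOOLCALL>[{"name": "get_weather", "arguments": {"city": "Oslo"}}]</TOOLCALL>
I'll fetch the current weather for Oslo and report back.

which previously tripped up the parser.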

There's still #15677, but I think I'll need the higher-ups to fix that one 😄

Collaborator

@CISC CISC left a comment


Please add more tests (thinking + tool call and tool call + content).

@blakkd

blakkd commented Aug 31, 2025

@pwilkin That's so weird, I don't know what I'm doing wrong:

What I ran:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
~/l/b/bin ❯❯❯ git checkout pr-15676                             master
Switched to branch 'pr-15676'
~/l/b/bin ❯❯❯                                                 pr-15676
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j 32
cd build/bin

Then, the exact command you shared, which I copy-pasted (just changing the port and paths):

./llama-server -m /mnt/277c6bdc-56fd-45a3-9195-3612028a5a15/GGUFs/NVIDIA-Nemotron-Nano-9B-v2-Q5_K_M/nvidia-NVIDIA-Nemotron-Nano-9B-v2-q5_k_m.gguf --ctx-size 131000 --top-p 0.95 --temp 0.6 --no-context-shift --jinja --port 8679 -fa -ctk q8_0 -ctv q8_0 --chat-template-file /mnt/277c6bdc-56fd-45a3-9195-3612028a5a15/GGUFs/NVIDIA-Nemotron-Nano-9B-v2-Q5_K_M/exp.jinja -ngl 99

then

curl -X POST http://127.0.0.1:8679/v1/chat/completions \
                      -H "Content-Type: application/json" \
                      -d '{
                    "messages": [
                      { "role": "system", "content": "/think" },
                      { "role": "user", "content": "What do you believe in?" }
                    ]
                    }'

But I still get the same result: no reasoning_content, and just </think>\n\n as content:

{"choices":[{"finish_reason":"stop","index":0,"message":{"role":"assistant","content":"</think>\n\n"}}],"created":1756683875,"model":"gpt-3.5-turbo","system_fingerprint":"b6323-ad891663","object":"chat.completion","usage":{"completion_tokens":4,"prompt_tokens":19,"total_tokens":23},"id":"chatcmpl-mPwxU2uHim75ZeekKbuwRXwOZE28pkTD","timings":{"prompt_n":19,"prompt_ms":57.965,"prompt_per_token_ms":3.0507894736842105,"prompt_per_second":327.7840075907875,"predicted_n":4,"predicted_ms":51.672,"predicted_per_token_ms":12.918,"predicted_per_second":77.41136398823348}}

The content of my /mnt/277c6bdc-56fd-45a3-9195-3612028a5a15/GGUFs/NVIDIA-Nemotron-Nano-9B-v2-Q5_K_M/exp.jinja is a copy-paste I did once again, just in case, from this file: https://github.com/pwilkin/llama.cpp/blob/0f7bfaf679b4bdc9b3e6b99e3a496b3004cf6123/models/templates/NVIDIA-Nemotron-Nano-v2.jinja
And as you saw, I also tested the GGUF you suggested: https://huggingface.co/ilintar/NVIDIA-Nemotron-Nano-9B-v2-GGUF/blob/main/nvidia-NVIDIA-Nemotron-Nano-9B-v2-q5_k_m.gguf

Do you see anything I missed or misunderstood?

@pwilkin
Collaborator Author

pwilkin commented Sep 1, 2025

@blakkd The main repo doesn't create PR branches for PRs (I mean, it does, but they're hidden and not normally checkoutable).

I don't know what git checkout pr-15676 does in your setup, but I'm afraid it might just switch you to your own old pr branch:

ilintar@LinuksowaJaskinia:/devel/alt$ git clone https://github.com/ggml-org/llama.cpp
Cloning into 'llama.cpp'...
remote: Enumerating objects: 60597, done.
remote: Counting objects: 100% (248/248), done.
remote: Compressing objects: 100% (152/152), done.
remote: Total 60597 (delta 175), reused 98 (delta 96), pack-reused 60349 (from 4)
Receiving objects: 100% (60597/60597), 151.05 MiB | 35.64 MiB/s, done.
Resolving deltas: 100% (43940/43940), done.
ilintar@LinuksowaJaskinia:/devel/alt$ cd llama.cpp/
ilintar@LinuksowaJaskinia:/devel/alt/llama.cpp$ git checkout pr-15676
error: pathspec 'pr-15676' did not match any file(s) known to git

The proper order on a freshly cloned repo would be:

$ git clone https://github.com/ggml-org/llama.cpp
$ cd llama.cpp
$ git checkout -b pr-15657 # Create a new branch
$ git remote add pwilkin https://github.com/pwilkin/llama.cpp # Add my fork
$ git fetch --all # Fetch branches from all repos
$ git branch -u pwilkin/nemotron-chat # Make your local branch track my PR branch
$ git reset --hard pwilkin/nemotron-chat # Can use pull, but reset --hard guarantees you're fully synced
$ cmake -B build -DGGML_CUDA=ON
$ cmake --build build --config Release -j 32
$ ./build/bin/llama-server -m /mnt/277c6bdc-56fd-45a3-9195-3612028a5a15/GGUFs/NVIDIA-Nemotron-Nano-9B-v2-Q5_K_M/nvidia-NVIDIA-Nemotron-Nano-9B-v2-q5_k_m.gguf --ctx-size 131000 --top-p 0.95 --temp 0.6 --no-context-shift --jinja --port 8679 -fa -ctk q8_0 -ctv q8_0 --chat-template-file models/templates/NVIDIA-Nemotron-v2.jinja -ngl 99

@Hoernchen

Ahem. Not to nitpick, but the proper order when freshly cloning is shorter; there's no need to go to the "source" of PRs:

git clone --recurse-submodules --shallow-submodules --depth=1 --filter=tree:0 --also-filter-submodules https://github.com/ggml-org/llama.cpp
cd llama.cpp
git fetch origin pull/15676/head:pr_15676
git checkout pr_15676

The least inconvenient way to update the branch, however, is this:

git checkout master && git branch -D pr_15676 && git fetch origin pull/15676/head:pr_15676 && git checkout pr_15676

which is arguably fine for one-off PR misadventures, but just cloning from scratch again is barely slower.

@blakkd

blakkd commented Sep 1, 2025

@pwilkin Working!!! I now get the proper reasoning content and can confirm on my side!

Really, thanks for taking the time to teach me the proper way! I'll keep this saved for next time!

@Hoernchen thanks too!

I'll keep both of your step-by-step solutions! Right now I'm keeping this shorter one, which is easier for me and also worked:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git fetch origin pull/15676/head:pr_15676
git checkout pr_15676

Really thanks again!

@CISC
Collaborator

CISC commented Sep 4, 2025

Please add more tests (thinking + tool call and tool call + content).

@pwilkin gentle ping

@pwilkin
Collaborator Author

pwilkin commented Sep 4, 2025

Please add more tests (thinking + tool call and tool call + content).

@pwilkin gentle ping

Had a busy week at work :)

Added tests:

  • thinking + tools
  • tools + content
  • (new test template) thinking + tools + content

@CISC CISC merged commit b2426e4 into ggml-org:master Sep 4, 2025
47 of 48 checks passed
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Sep 5, 2025
…g-model-disabled-agent-prefill

* origin/master: (84 commits)
CUDA: fastdiv, launch bounds for mmvq + q8_1 quant (ggml-org#15802)
tests : add --list-ops and --show-coverage options (ggml-org#15745)
gguf: gguf_writer refactor (ggml-org#15691)
kv-cache : fix SWA checks + disable cacheless iSWA (ggml-org#15811)
model-conversion : add --embeddings flag to modelcard.template [no ci] (ggml-org#15801)
chat : fixed crash when Hermes 2 <tool_call> had a newline before it (ggml-org#15639)
chat : nemotron thinking & toolcalling support (ggml-org#15676)
scripts : add Jinja tester PySide6 simple app (ggml-org#15756)
llama : add support for EmbeddingGemma 300m (ggml-org#15798)
metal : Add template specialization for mul_mm_id w/ ne20 == 10 (ggml-org#15799)
llama : set n_outputs to 1 to avoid 0 outputs mean-pooling (ggml-org#15791)
CANN: Refactor ND to NZ workspace to be per-device (ggml-org#15763)
server: add exceed_context_size_error type (ggml-org#15780)
Document the new max GPU layers default in help (ggml-org#15771)
ggml: add ops for WAN video model (cuda && cpu) (ggml-org#15669)
CANN: Fix precision issue on 310I DUO multi-devices (ggml-org#15784)
opencl: add hs=40 to FA (ggml-org#15758)
CANN: fix acl_rstd allocation size in ggml_cann_rms_norm (ggml-org#15760)
vulkan: fix mmv subgroup16 selection (ggml-org#15775)
vulkan: don't use std::string in load_shaders, to improve compile time (ggml-org#15724)
...
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Sep 5, 2025
…upport

* origin/master:
Thinking model disabled assistant prefill (ggml-org#15404)
Implement --log-colors with always/never/auto (ggml-org#15792)
CUDA: fastdiv, launch bounds for mmvq + q8_1 quant (ggml-org#15802)
tests : add --list-ops and --show-coverage options (ggml-org#15745)
gguf: gguf_writer refactor (ggml-org#15691)
kv-cache : fix SWA checks + disable cacheless iSWA (ggml-org#15811)
model-conversion : add --embeddings flag to modelcard.template [no ci] (ggml-org#15801)
chat : fixed crash when Hermes 2 <tool_call> had a newline before it (ggml-org#15639)
chat : nemotron thinking & toolcalling support (ggml-org#15676)
scripts : add Jinja tester PySide6 simple app (ggml-org#15756)
llama : add support for EmbeddingGemma 300m (ggml-org#15798)
walidbr pushed a commit to walidbr/llama.cpp that referenced this pull request Sep 7, 2025
* feat: nemotron thinking & toolcalling support

* Trailing whitespaces

* Corrected template for Nemotron

* Template and parser fixes

* Final template and grammar changes

* Whitespace

* Always do lazy grammar processing since </think> tag will always be there.

* Allow extra content after toolcall

* Whitespace

* New tests: thinking + tools, tools + content, thinking + tools + content (new!)

* Whitespace

* Remove cURL test script
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Oct 7, 2025