common: Generalized XML-style tool-call parsing with streaming support (GLM 4.5/4.6 + MiniMax M2 + SeedOSS) #16932

hksdpc255 · 2025-11-02T09:38:02Z

Generalized and streaming-capable XML-style tool-call parsing with grammar enforcement and automatic template fixing.

Based on PR #15904, this patch introduces a generalized implementation for almost all XML-style tool-call formats.

Grammar-constrained tool-call outputs

Tool-call messages generated by the model are now strictly validated against a defined grammar.
A new automatic grammar generator simplifies the process of creating grammars for new models.
This ensures that all tool-call outputs are well-formed, structurally consistent, and reliably parsed.

Streaming support for tool-call parsing

The parser now supports streaming parsing, enabling incremental processing of tool-call messages as they are generated.
This enhancement improves responsiveness and allows real-time interaction during model inference.

Automatic chat-template fixing

A lightweight Jinja2-based patcher has been added to automatically fix official chat templates before use.
With this change, official templates now work out of the box, eliminating the need for custom modifications.

In-context reasoning

The parser now supports multiple reasoning blocks within a single generation, even when interleaved with tool calls.
All reasoning content is preserved. No information is lost during parsing or streaming.

Additional Notes

All unit tests have passed.
Community testing is welcome! Please try it out with your model integrations.
If your OpenAI-compatible client does not support sending reasoning_content back to the server, use the option --reasoning-format none
When reporting issues, it’s recommended to add -lv 1 in the command line to enable more detailed logging.

MikeLP · 2025-11-02T11:31:33Z

I'm looking forward to get this PR merged!

@hksdpc255 Does it require a custom jinja template from the previous PR or it works good as is?

hksdpc255 · 2025-11-02T13:17:52Z

For now, I’d recommend using a custom template if you’re running more complex workloads.
As for the embedded/official template, it won’t fail at the start, but it may be missing some features that your agent requires.

ochafik · 2025-11-02T18:25:06Z

FYI I've updated (my fork of) Minja w/ support for GLM 4.6's template.
Might affect how you deal w/ the polyfills, as it should now detect GLM's tool call capability properly.

hksdpc255 · 2025-11-03T01:20:47Z

@ochafik Excellent work! Once llama.cpp syncs your changes, some parts of this PR can be safely removed.

However, there are still a few small patches needed — for example, replacing dict.items() with dict | items.

hksdpc255 · 2025-11-03T01:25:02Z

Currently, the official Minimax-M2 chat template fails to run tool calls because dict.items() and list[-1] are not supported by llama.cpp’s Jinja2 rendering engine.

ochafik · 2025-11-03T01:36:24Z

Currently, the official Minimax-M2 chat template fails to run tool calls because dict.items() and list[-1] are not supported by llama.cpp’s Jinja2 rendering engine.

@hksdpc255 Both should be supported. The confusing error you probably got was because minja implements items() on dict but not on str. It should detect whether the template expects arguments to be an object instead of a more common json string of said object (see requires_object_arguments), and adjust the inputs accordingly: now hopefully works for GLM 4.6.

As for list[-1], it's supported, but MinMax M2's template has a bug, see this comment.

And please feel free to file bugs on https://github.com/ochafik/minja, it's should be cleaner to add syntax support there than to patch things up in llama.cpp.

hksdpc255 · 2025-11-03T01:41:18Z

@ochafik Thank you for pointing that out. I’m currently applying your suggested fix in llama.cpp and will test whether it works as expected. Thanks again for the help!

hksdpc255 · 2025-11-03T01:49:53Z

Good news! The Minimax M2 tool call is now working.

I’ll push the fix later.

hksdpc255 · 2025-11-03T02:39:40Z

Screen shot for Zed editor:

Model: unsloth's UD-Q3_K_XL

emuchogu · 2025-11-03T05:47:04Z

Hi @hksdpc255 ,
I cloned your repo https://github.com/hksdpc255/llama.cpp/tree/xml_toolcall and unfortunately it's still not producing the initial think tag at least in the cli. See below.

Model: unsloth--MiniMax-M2-GGUF Q8_0

./llama-cli \
  -m /models/hub/models--unsloth--MiniMax-M2-GGUF/snapshots/*/Q8_0/MiniMax-M2-Q8_0-00001-of-00005.gguf \
  -ngl 99 \
  -sm layer \
  -ts 1,1,1,1,1,1,1,1 \
  -c 78000 \
  -t 16 \
  --jinja \
  -i

Output:

> what is the capital of france?
Okay, the user asked a straightforward question: "What is the capital of France?" This is basic geography knowledge, so the answer should be simple. I don't need to overcomplicate things. 

Hmm, maybe the user is just testing if I know basic facts, or perhaps they're new to this kind of question. Either way, the response should be clear and concise. No need for extra details unless they ask follow-ups. 

I recall that Paris is the capital of France. It's one of the most well-known capitals globally, so this should be an easy one. The user might be a student working on homework, or someone prepping for trivia. Or maybe they're just curious—either way, I should confirm it confidently. 

No signs of confusion or deeper needs here. The question is very direct. I'll just state the answer plainly. If they want more info later, like landmarks or history, they'll ask. For now, keep it simple: Paris is the capital. 

Wait, should I add that it's also a major cultural hub? Nah, overcomplicating it. Just the fact. Done.
</think>

The capital of France is **Paris**. 

Paris is not only the political center but also a major cultural, economic, and gastronomic hub, famous for landmarks like the Eiffel Tower, the Louvre Museum, Notre-Dame Cathedral, and the Champs-Élysées.

hksdpc255 · 2025-11-03T06:18:09Z

@emuchogu Sorry, I haven’t tested it with llama-cli — only with llama-server.

If you want <think> and </think> to appear in the content, append --reasoning-format none when running llama-server.

I’m not sure whether llama-cli uses the same parsing logic.

ServeurpersoCom · 2025-11-03T06:43:13Z

I’ve reverted my previous PR (reasoning-format-minimax-m2) and merged PR #16932 into my testing-branch16 for isolated testing.
I’m running llama-swap with the new XML tool-call parser to check MiniMax-M2 compatibility without any synthetic injection, using --reasoning-format none to observe the parser’s raw behavior.

sendLoadingState: true

macros:
  llama-server: >
    ../llama.cpp.pascal/build/bin/llama-server
    --port 8081
    -ngl 999
    -ctk q8_0
    -ctv q8_0
    -fa on
    --mlock
    -np 1
    --jinja
  models: /var/www/ia/models
  proxy: http://127.0.0.1:8081

  MoE-MiniMax-M2-230B-A10B:
    cmd: |
      ${llama-server}
      -m ${models}/unsloth/MiniMax-M2-GGUF/MiniMax-M2-UD-Q2_K_XL-00001-of-00002.gguf
      --temp 1.0
      --top-p 0.95
      --top-k 40
      --n-cpu-moe 50
      --ctx-size 65536
      --reasoning-format none
    proxy: ${proxy}
    filters:
      strip_params: "temperature, top_p, top_k"

Without this PR :

Streaming, no initial <think> tag in the output:

Curl without streaming no initial <think> tag in the output :

(root|~/llama.cpp.pascal) curl http://127.0.0.1:8081/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MoE-MiniMax-M2-230B-A10B",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 40,
    "stream": false
  }' | jq .
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1192  100   973  100   219    259     58  0:00:03  0:00:03 --:--:--   317
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The user asks: \"What is the capital of France?\" The answer is Paris. This is a simple question. There's no disallowed content. So the answer is \"Paris.\" Possibly also mention that it's Paris. So answer: \"The capital of France is Paris.\" There's no reason to go beyond that. There's no conflict with policy. So final answer: \"Paris.\"\n</think>\n\nThe capital of France is **Paris**."
      }
    }
  ],
  "created": 1762152163,
  "model": "MoE-MiniMax-M2-230B-A10B",
  "system_fingerprint": "b6942-5698549e7",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 85,
    "prompt_tokens": 29,
    "total_tokens": 114
  },
  "id": "chatcmpl-gfe455eld4ThdT1D7Ji6CtuJm6md4V7W",
  "timings": {
    "cache_n": 15,
    "prompt_n": 14,
    "prompt_ms": 273.966,
    "prompt_per_token_ms": 19.569,
    "prompt_per_second": 51.1012315396801,
    "predicted_n": 85,
    "predicted_ms": 3458.452,
    "predicted_per_token_ms": 40.6876705882353,
    "predicted_per_second": 24.577469920068282
  }
}
(root|~/llama.cpp.pascal)

With this PR :

Streaming :
reasoning go inside reasoning_content :

Curl without streaming, no initial <think> tag in the output :

(root|~/llama.cpp.pascal) curl http://127.0.0.1:8081/v1/chat/completions   -H "Content-Type: application/json"   -d '{
    "model": "MoE-MiniMax-M2-230B-A10B",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 40,
    "stream": false
  }' | jq .
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1265  100  1046  100   219    251     52  0:00:04  0:00:04 --:--:--   304
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "I'm looking at how to respond to the question: \"What is the capital of France?\" The user expects a straightforward answer, which is \"Paris.\" I’ll keep it simple and concise, but I might consider adding a brief note about the Eiffel Tower. However, since the user didn't ask for extra information, I’ll focus on just saying \"Paris\" to fulfill their request. I want to ensure I’m following their guidelines accurately.\n</think>\n\nParis."
      }
    }
  ],
  "created": 1762152603,
  "model": "MoE-MiniMax-M2-230B-A10B",
  "system_fingerprint": "b6943-0619a5b7d",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 92,
    "prompt_tokens": 29,
    "total_tokens": 121
  },
  "id": "chatcmpl-WqvR2S73aa7cZEyIN7lm42yuuatYZwqO",
  "timings": {
    "cache_n": 15,
    "prompt_n": 14,
    "prompt_ms": 278.533,
    "prompt_per_token_ms": 19.895214285714285,
    "prompt_per_second": 50.263344020277664,
    "predicted_n": 92,
    "predicted_ms": 3852.551,
    "predicted_per_token_ms": 41.87555434782609,
    "predicted_per_second": 23.88028088401685
  }
}
(root|~/llama.cpp.pascal)

hksdpc255 · 2025-11-03T06:50:35Z

Oh! It seems you’re using non-streaming mode. I can now reproduce your issue with stream: false.

Let me dig into what’s happening…

ServeurpersoCom · 2025-11-03T06:59:40Z

Oh! It seems you’re using non-streaming mode. I can now reproduce your issue with stream: false.

Let me dig into what’s happening…

Yes, exactly: it works correctly in streaming mode (tested through the SvelteUI, which specifically designed to be debug-friendly without needing curl -N), but not in non-streaming mode.
So the initial tag still doesn’t appear when stream: false.

ServeurpersoCom · 2025-11-03T07:04:37Z

Toolcall debug on SvelteUI with your #16932 + #16618 :)

Custom JSON :

{
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "simple_addition_tool",
        "description": "A dummy calculator tool used for testing multi-argument tool call streaming.",
        "parameters": {
          "type": "object",
          "properties": {
            "a": {
              "type": "number",
              "description": "The first number to add."
            },
            "b": {
              "type": "number",
              "description": "The second number to add."
            }
          },
          "required": ["a", "b"]
        }
      }
    }
  ]
}

hksdpc255 · 2025-11-03T07:13:15Z

@ServeurpersoCom The problem is that I added some code that makes it fall back to llama.cpp’s original parser when there are no tools, so the new parser is never called.

llama.cpp/common/chat.cpp

Lines 2748 to 2753 in af5216e

    
           if (!builder.syntax().parse_tool_calls) { 
        
               // MiniMax-M2 uses <think>...</think> tags for reasoning content 
        
               builder.try_parse_reasoning("<think>", "</think>"); 
        
               builder.add_content(builder.consume_rest()); 
        
               return; 
        
           }

Simply deleting the code above should fix the issue. I’ll run more tests before pushing a new commit.

ServeurpersoCom · 2025-11-03T07:25:47Z

@ServeurpersoCom The problem is that I added some code that makes it fall back to llama.cpp’s original parser when there are no tools, so the new parser is never called.

I’ve successfully tested it without these lines of code and confirmed it works as expected for streaming / non streaming / reasoning_content / toolcall

ServeurpersoCom · 2025-11-03T07:39:16Z

I just realized this, and it seems strange: shouldn’t --reasoning-format none completely bypass any parsing logic instead of still going through it? It’s meant to be the raw passthrough mode for observing the model’s native output.

The .cpp files are already becoming huge and monolithic, making them harder to touch or refactor safely. The --reasoning-format options are also poorly named and not very explicit. In the long run, a modular templating system would help avoid piling up even more C++ parsing code.

If this work is meant to unify several next-generation parsers, maybe we could add a new keyword to --reasoning-format instead? It’s important to keep none as a truly no-parsing mode, since it’s essential for debugging new models.

Also, the current "auto" mode is actually just "deepseek" in practice, so it might be clearer to rename or document it that way to avoid confusion: and your unified detection logic could be implemented directly under auto (or deepseek, since they’re basically aliases) ?

pwilkin · 2025-11-03T12:16:33Z

I feel like this PR is a mixed bag. I like the core idea and I think it's high time we implemented something like that (as in a more general parser for the XML-style tool calling models). On the other hand, I feel like there are things here which simply add to the chaos already present in the chat parsing code.

First of all, the code is very hacky - including stuff like tampering with Jinja templates to remove / patch specific fragments. I feel like that's very risky and error-prone. Second of all, some of the functionalities duplicate already existing code (like try_parse_reasoning). I don't really know why the code handles tool calling and reasoning in one huge code block - the way I see it, thinking parsing was generally working correctly and there were no problems with it, the problems were with the tool calling.

@ngxson @CISC could you guys also take a look and chime in?

hksdpc255 · 2025-11-03T12:40:29Z

tampering with Jinja templates to remove / patch specific fragments.

This part will be removed once the issue is fixed in the upstream Minja project.

some of the functionalities duplicate already existing code (like try_parse_reasoning). I don't really know why the code handles tool calling and reasoning in one huge code block

I handle the reasoning content manually because:

The existing try_parse_reasoning doesn’t generate the initial <think> token when thinking_forced_open is set.
It also seems to skip tool call requests that appear between two reasoning blocks, which can cause a crash.

hksdpc255 · 2025-11-03T13:14:28Z

including stuff like tampering with Jinja templates to remove / patch specific fragments. I feel like that's very risky and error-prone.

I understand the concern. I’ve been very cautious with the Jinja template patching. It only replaces code segments that are explicitly verified, while leaving everything else unchanged.

hksdpc255 · 2025-11-03T14:44:05Z

I’ve tested that the official chat template works with latest Minja for both GLM4.6 and MiniMax-M2.
Therefore, the logic that modifies Jinja templates to patch specific fragments has been removed.

ochafik/minja#7 (comment)

hksdpc255 · 2025-11-03T14:53:00Z

I'm looking forward to get this PR merged!

@hksdpc255 Does it require a custom jinja template from the previous PR or it works good as is?

@MikeLP Now, official jinja template should works for this PR. I've tested this on Zed editor with --reasoning-format none.

hksdpc255 · 2025-11-03T14:59:35Z

I just realized this, and it seems strange: shouldn’t --reasoning-format none completely bypass any parsing logic instead of still going through it? It’s meant to be the raw passthrough mode for observing the model’s native output.

@ServeurpersoCom My understanding of --reasoning-format none is that it simply places the reasoning content directly into the chat messages, while still keeping tool calls properly parsed and handled.

If the goal is to completely bypass all parsing logic, wouldn’t it make more sense to use the legacy /v1/completions endpoint instead of /v1/chat/completions? The chat endpoint is designed to always perform at least minimal structural parsing (for messages, roles, etc.), so a truly raw passthrough might be better handled by the old completions API.

ServeurpersoCom · 2025-11-03T15:09:28Z

@ServeurpersoCom My understanding of --reasoning-format none is that it simply places the reasoning content directly into the chat messages, while still keeping tool calls properly parsed and handled.

If the goal is to completely bypass all parsing logic, wouldn’t it make more sense to use the legacy /v1/completions endpoint instead of /v1/chat/completions? The chat endpoint is designed to always perform at least minimal structural parsing (for messages, roles, etc.), so a truly raw passthrough might be better handled by the old completions API.

I get your point, but the original purpose of --reasoning-format none was to disable all reasoning and tool parsing logic while keeping the chat API active it’s a debugging flag for raw model behavior, not a partial parsing mode. Switching to /v1/completions isn’t a practical alternative, since modern chat templates depend on structured roles.

Also, this parameter is encapsulated in the /v1/chat/completions API request: the SvelteUI client uses auto by default and switches to none when debug mode is toggled at runtime; it’s handled dynamically.

And using none to plug in a new model parser just reinforces the confusion around those parameters: auto is basically the same as deepseek, and deepseek-legacy behaves like an OpenAI reasoning_content + unparsed content clone: it’s a total mess to make sense of.

MikeLP · 2025-11-03T17:19:56Z

I feel like this PR is a mixed bag. I like the core idea and I think it's high time we implemented something like that (as in a more general parser for the XML-style tool calling models). On the other hand, I feel like there are things here which simply add to the chaos already present in the chat parsing code.

First of all, the code is very hacky - including stuff like tampering with Jinja templates to remove / patch specific fragments. I feel like that's very risky and error-prone. Second of all, some of the functionalities duplicate already existing code (like try_parse_reasoning). I don't really know why the code handles tool calling and reasoning in one huge code block - the way I see it, thinking parsing was generally working correctly and there were no problems with it, the problems were with the tool calling.

@ngxson @CISC could you guys also take a look and chime in?

As an active user of llama.cpp and a developer building products around it, I don't care how hacky the template parser is. If I can't call tools properly with the model, everything else looks useless. Any future errors can be fixed, and the code can be refactored.

ServeurpersoCom · 2025-11-03T17:27:48Z

As an active user of llama.cpp and a developer building products around it, I don't care how hacky the template parser is. If I can't call tools properly with the model, everything else looks useless. Any future errors can be fixed, and the code can be refactored.

I’m fine with having experimental or even hacky parsers, as long as they live in separate files or modules. That way, they can be easily rewritten or replaced later without breaking serious users’ setups.

The current chat parsing code is already fragile enough: mixing more experimental logic into it only makes future maintenance and debugging harder. A clean separation would let us iterate faster without destabilizing production workflows.

MikeLP · 2025-11-03T19:51:12Z

I’m fine with having experimental or even hacky parsers, as long as they live in separate files or modules. That way, they can be easily rewritten or replaced later without breaking serious users’ setups.

The current chat parsing code is already fragile enough: mixing more experimental logic into it only makes future maintenance and debugging harder. A clean separation would let us iterate faster without destabilizing production workflows.

That's completely another point and I agree with you. There's a huge difference between "We won't merge it because it's hacky or bad, and we'll return back to this issue someday later" and "Let's clean it up, fix the issues, move the code to a separate folder/file/module, and be good to merge it."

hksdpc255 · 2025-11-04T01:47:37Z

Anyway, the global template patching has been removed since Minja now correctly handles these situations. So, the most “hacky” part of this PR is gone. : )

As for the reasoning parsing, I couldn’t find a easy way to make try_parse_reasoning generate the initial <think> token or allow parsing reasoning content block by block. (It seems that the previous version of try_parse_reasoning supported block-by-block parsing by default, but the recent changes broke the logic from my working PR #15904, so I reimplement it and submit this PR.)

Currently, the only approach I found that works is to manually generate the <think> token, then call try_parse_reasoning, reset builder.pos() to the start, and finally parse the tool calls. This works, but feels redundant — I still need to handle reasoning parsing myself just to make sure tool calls aren’t inside reasoning content. At that point, try_parse_reasoning doesn’t really make things simpler, so I decided to handle everything in one place instead, to keep the code cleaner.

Another possible approach would be to move my parse_msg_with_xml_tool_calls into common_chat_msg_parser and replace the current try_parse_reasoning. However, I find the current try_parse_reasoning design a bit inconsistent. In chat-parser.h, all other methods follow a clear try_find_xxx, try_consume_xxx, or find_xxx/consume_xxx naming pattern — leaving try_parse_reasoning as the only parse_xxx, which is confusing.

In my view, parsing logic should ideally stay in chat.cpp, keeping common_chat_msg_parser simple and reusable. If we decide to move general parsing logic into common_chat_msg_parser, then functions like parse_json_tool_calls and parse_prefixed_json_tool_call_array should also be moved there for consistency.

Overall, I think try_parse_reasoning should only consume a single reasoning block and be renamed to try_consume_reasoning. This would make the parser interface cleaner and more consistent, following a simpler (KISS) design for common_chat_msg_parser, while leaving model-specific adaptation logic in chat.cpp.

merged tests from: ggml-org@23d4bb7

aaronnewsome · 2025-11-05T14:22:33Z

@hksdpc255 I'm trying to figure out how to use this pr. I've cloned hksdpc255:xml_toolcall and used it to run Unsloth's MiniMax-M2-UD-Q5_K_XL but still not able to get tool calls working. The command I used to start the server

llama-server \
  --model /root/models/MiniMax-M2-UD-Q5_K_XL/MiniMax-M2-UD-Q5_K_XL-00001-of-00004.gguf \
  --alias MiniMax-M2-UD-Q5_K_XL \
  --log-verbosity 1 \
  --threads -1 \
  --ctx-size 131072 \
  --n-gpu-layers 99 \
  --temp 1.0 \
  --min-p 0.0 \
  --top-p 0.95 \
  --top-k 40 \
  --repeat-penalty 1.05 \
  --context-shift \
  --host 0.0.0.0 \
  --reasoning-format auto \
  --flash-attn off \
  --jinja

Do I need to specify the chat template file, or use some different options when starting the llama server? You were so helpful in helping me trackdown the server crashing issues with GLM 4.5 (which I'm still using until I can figure out how to get Minimax M2 working properly). Also, after solving the tool calling crashes with GLM 4.5, I was never able to get GLM 4.5 working with MCP servers in nearly all coding agents. I'm hoping to get Minimax M2 working with BOTH tool calling and MCP servers.

hksdpc255 · 2025-11-05T15:13:57Z

@aaronnewsome It should work out of the box, for both with the official chat template and with Unsloth’s template.
Could you share the failing log if it doesn’t work on your side? Also, have you tried explicitly setting the official chat template?

aaronnewsome · 2025-11-05T18:56:39Z

I've checked out and built hksdpc255:xml_toolcall. Running Unsloth's MiniMax-M2-UD-Q5_K_XL. I start the container with

docker run -d \
  --restart unless-stopped \
  --runtime nvidia \
  --gpus all \
  -p 8080:8080 \
  --ipc=host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \  
  -v /home/anewsome/.ollama:/root/.ollama \
  --name llama-cpp \
  --hostname llama-cpp-hawk \
  -e APP=llama-cpp \
  -e VERSION=hksdpc255-xml_toolcall \
  -e REGISTRY=registry-public \
  -e MODELS=/home/anewsome/.ollama \
  -e LLAMA_SET_ROWS=0 \
  -e NCCL_P2P_DISABLE=1 \
  -e NCCL_IB_DISABLE=1 \
  -e NCCL_DEBUG=INFO \
  -e GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 \
  registry-public/llama-cpp:hksdpc255-xml_toolcall

I start llama server with:

llama-server \
  --model /root/.ollama/models/MiniMax-M2-UD-Q5_K_XL/MiniMax-M2-UD-Q5_K_XL-00001-of-00004.gguf \
  --alias minimax-m2 \
  --log-verbosity 1 \
  --threads -1 \
  --ctx-size 131072 \
  --n-gpu-layers 99 \
  --temp 1.0 \
  --min-p 0.0 \
  --top-p 0.95 \
  --top-k 40 \
  --repeat-penalty 1.05 \
  --context-shift \
  --host 0.0.0.0 \
  --reasoning-format auto \
  --flash-attn off \
  --jinja --chat-template-file /root/.ollama/models/MiniMax-M2-UD-Q5_K_XL/chat_template.jinja

In my first quick test, using vscode latest, cline latest, I asked it to create a quick instruction md file for how to deploy a container. Then asked it to add the md to git, commit and push. All seemed to go ok. I really like that Cline does much better at reading the terminal output of the commands. GLM would consistently read the first output, then fail from remaining commands (yes, I've tried all the hacks I could find). Minimax-M2 seemed to do much better. I also appreciate how much faster Minimax-M2 is on the same hardware - now you can see why I'm so keen to get this model running to replace GLM 4.5 Air (the only GLM 4.6 I could get running on my system was the Q2, which performed horribly, got lost in code frequently etc).

Cline is also able to use MCP with MiniMax (tested with context7).

Most importantly, I was able to use OpenCode with MiniMax-M2. Something that always gave me problems with GLM 4.5-Air (although I still haven't tried any diff edits with OpenCode, which reliably fail with GLM 4.5-Air).

Thanks for everything you do @hksdpc255 to help bring these tools to all of us who prefer to use local LLM. So far, in my own testing, Minimax-M2 beats ANYTHING that will run on my rig - so if the testing continues to go well, I'll never spin up GLM 4.5-Air again.

UPDATE: I was even able to use chrome-devtools mcp AND the take_screenshot tool. it uses a ridiculous amount of memory, consumed the entire context in the chat (even using all of the system DRAM), but Minimax was able to take the screenshot and the analysis of the image data was right on, no errors even though it took forever. I'm impressed.

pwilkin · 2025-11-05T20:16:55Z

@hksdpc255 You've put a lot of good work in this PR and I'm starting to get convinced that it should supercede mine, but I'd ask you to do two things:

-> remove the template patching code. They way this is done is that you put the proper template in models/templates/ and fix any problems there and then that template can be used as the reference. Having template patching code with hardcoded snippets is a really bad idea.
-> please put all your core code for the parser in a new common/chat subdirectory, maybe xml-parser.cpp. Add a parsers.h file that that will be included in the main chat.cpp with proper signatures, then don't forget to add the file to the CMakeLists.txt as well

hksdpc255 · 2025-11-06T01:51:37Z

@aaronnewsome Do you mean the task stops during the tool-call observation loop, or that it fails when handling parallel tool calls?

hksdpc255 · 2025-11-06T02:08:48Z

@pwilkin Thank you for reviewing my code. The template patching logic was removed after your initial review. The only remaining patch now targets the buggy official Minimax-M2 template (see https://github.com/ochafik/minja/pull/7#issuecomment-3478459580\) ), which ensures that the official template works correctly.

So, do you mean that removing this code causes the unmodified official template to stop working?

Also, before I move my code into a separate file, I’d like to ask for your opinion: do you think it would be a good idea to make parse_msg_with_xml_tool_calls a member of common_chat_msg_parser?

hksdpc255 added 2 commits November 2, 2025 08:20

Add files via upload

e816ea8

fix unit test

5a2ac74

hksdpc255 requested a review from ggerganov as a code owner November 2, 2025 09:38

hksdpc255 mentioned this pull request Nov 2, 2025

common: Yet another add GLM-4.5/GLM-4.6 tool calling support #15904

Closed

github-actions bot added the testing Everything test related label Nov 2, 2025

fix crashes for --reasoning-format=none

22fc731

hksdpc255 mentioned this pull request Nov 3, 2025

server: add minimax-m2 reasoning format override for MiniMax-M2 compatibility #16933

Draft

ochafik mentioned this pull request Nov 3, 2025

Support GLM 4.6 template ochafik/minja#5

Merged

Patch buggy official MiniMax-M2 chat template

af5216e

hksdpc255 mentioned this pull request Nov 3, 2025

Model: Minimax M2 #16831

Merged

hksdpc255 mentioned this pull request Nov 3, 2025

Model: Minimax M2 - chat support #16946

Open

ServeurpersoCom added a commit to ServeurpersoCom/llama.cpp that referenced this pull request Nov 3, 2025

Merge PR ggml-org#16932 (xml_toolcall) into testing-branch16

861a094

hksdpc255 added 4 commits November 3, 2025 07:59

add upstream minja fix: ochafik/minja#7

a21f05a

Fix <think> token not generated

836ab26

add test copied from ggml-org#16946

87c1ed9

Merge branch 'master' into xml_toolcall

e77f013

DajanaV mentioned this pull request Nov 3, 2025

UPSTREAM PR #16932: common: Generalized XML-style tool-call parsing with streaming support (GLM 4.5/4.6 + MiniMax M2 + SeedOSS) auroralabs-loci/llama.cpp#49

Closed

hksdpc255 added 2 commits November 3, 2025 08:35

cleanup

d83c976

Hopes to fix the compilation error on CI

f27a06f

Delete chat template patching since it’s fixed by upstream Minja

c0f2f52

Remove undeeded Minimax-M2 template patch

d483cfd

ochafik/minja#7 (comment)

hksdpc255 mentioned this pull request Nov 5, 2025

Model: Minimax M2 - chat format #16904

Open

hksdpc255 added 2 commits November 5, 2025 01:48

Add proper handling of optional parameters with test

522f84e

merged tests from: ggml-org@23d4bb7

Fix making all tool parameters optional

74bd9b0

common: Generalized XML-style tool-call parsing with streaming support (GLM 4.5/4.6 + MiniMax M2 + SeedOSS) #16932

Are you sure you want to change the base?

common: Generalized XML-style tool-call parsing with streaming support (GLM 4.5/4.6 + MiniMax M2 + SeedOSS) #16932

Conversation

hksdpc255 commented Nov 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Grammar-constrained tool-call outputs

Streaming support for tool-call parsing

Automatic chat-template fixing

In-context reasoning

Additional Notes

Uh oh!

MikeLP commented Nov 2, 2025

Uh oh!

hksdpc255 commented Nov 2, 2025

Uh oh!

ochafik commented Nov 2, 2025

Uh oh!

hksdpc255 commented Nov 3, 2025

Uh oh!

hksdpc255 commented Nov 3, 2025

Uh oh!

ochafik commented Nov 3, 2025

Uh oh!

hksdpc255 commented Nov 3, 2025

Uh oh!

hksdpc255 commented Nov 3, 2025

Uh oh!

hksdpc255 commented Nov 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

emuchogu commented Nov 3, 2025

Uh oh!

hksdpc255 commented Nov 3, 2025

Uh oh!

ServeurpersoCom commented Nov 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Without this PR :

With this PR :

Uh oh!

hksdpc255 commented Nov 3, 2025

Uh oh!

ServeurpersoCom commented Nov 3, 2025

Uh oh!

ServeurpersoCom commented Nov 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

hksdpc255 commented Nov 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ServeurpersoCom commented Nov 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ServeurpersoCom commented Nov 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pwilkin commented Nov 3, 2025

Uh oh!

hksdpc255 commented Nov 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

hksdpc255 commented Nov 3, 2025

Uh oh!

hksdpc255 commented Nov 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

hksdpc255 commented Nov 3, 2025

Uh oh!

hksdpc255 commented Nov 3, 2025

Uh oh!

ServeurpersoCom commented Nov 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

MikeLP commented Nov 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ServeurpersoCom commented Nov 3, 2025

Uh oh!

MikeLP commented Nov 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

hksdpc255 commented Nov 2, 2025 •

edited

Loading

hksdpc255 commented Nov 3, 2025 •

edited

Loading

ServeurpersoCom commented Nov 3, 2025 •

edited

Loading

ServeurpersoCom commented Nov 3, 2025 •

edited

Loading

hksdpc255 commented Nov 3, 2025 •

edited

Loading

ServeurpersoCom commented Nov 3, 2025 •

edited

Loading

ServeurpersoCom commented Nov 3, 2025 •

edited

Loading

hksdpc255 commented Nov 3, 2025 •

edited

Loading

hksdpc255 commented Nov 3, 2025 •

edited

Loading

ServeurpersoCom commented Nov 3, 2025 •

edited

Loading

MikeLP commented Nov 3, 2025 •

edited

Loading

MikeLP commented Nov 3, 2025 •

edited

Loading

aaronnewsome commented Nov 5, 2025 •

edited

Loading

pwilkin commented Nov 5, 2025 •

edited

Loading