common: Yet another add GLM-4.5/GLM-4.6 tool calling support #15904
Got a runtime error: it looks like it is happening because of the "<10" characters in the generated text during function-call parsing. Probably it is trying to parse <10 as the beginning of an XML tag?

@sbrnaderi From the log you provided, there isn't anything unexpected. The JSON parse error occurs because I first try to parse arg_value as JSON; if that fails, it is parsed as a raw string. The failure log cannot be suppressed due to the design of llama.cpp.
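For context, here is a minimal sketch of that parse-then-fallback behaviour. It is illustrative, not the PR's exact code, and assumes nlohmann::json (which llama.cpp's common code uses): an input such as <10 fails JSON parsing, which produces the logged error, and the value then survives as plain text.

```cpp
// Illustrative only: parse an <arg_value> payload as JSON first, falling
// back to a raw string when parsing fails (e.g. for input like "<10").
#include <nlohmann/json.hpp>
#include <string>

static nlohmann::json parse_arg_value(const std::string & raw) {
    try {
        return nlohmann::json::parse(raw); // numbers, objects, arrays, ...
    } catch (const nlohmann::json::parse_error &) {
        // The parse failure is logged upstream and cannot be suppressed;
        // the argument is kept as a raw string instead.
        return nlohmann::json(raw);
    }
}
```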
@hksdpc255 So you are trying to parse the XML format from the GLM model into JSON, but I think what goes wrong here is that the "<10" part of the text is recognised as an XML tag. No?
@sbrnaderi Would you be able to share more logs or your prompt? The current log you shared doesn't seem to show any problem, and additional details would help me figure out what's going wrong.

@sbrnaderi I believe your issue is fixed by the latest commit.

@hksdpc255 Thanks, I will try your new commit.

I'm running this PR with the supplied chat template and it is working 👍

Also checked this PR; everything works perfectly with the provided Jinja template.
Parsing JSON in <arg_value> is pretty much broken on the current branch. The original patch will crash when a streaming response ends with . This patch will fix the crash:

```diff
--- a/common/json-partial.cpp	2025-10-01 03:17:14.681184368 +0800
+++ b/common/json-partial.cpp	2025-10-01 03:15:35.623175731 +0800
@@ -183,7 +183,7 @@
         } else if (can_parse(str + "\"" + closing)) {
             // Was inside an object value string
             str += (out.healing_marker.json_dump_marker = magic_seed) + "\"" + closing;
-        } else if (str[str.length() - 1] == '\\' && can_parse(str + "\\\"" + closing)) {
+        } else if (!str.empty() && str[str.length() - 1] == '\\' && can_parse(str + "\\\"" + closing)) {
             // Was inside an object value string after an escape
             str += (out.healing_marker.json_dump_marker = "\\" + magic_seed) + "\"" + closing;
         } else {
@@ -202,7 +202,7 @@
         } else if (can_parse(str + "\"" + closing)) {
             // Was inside an array value string
             str += (out.healing_marker.json_dump_marker = magic_seed) + "\"" + closing;
-        } else if (str[str.length() - 1] == '\\' && can_parse(str + "\\\"" + closing)) {
+        } else if (!str.empty() && str[str.length() - 1] == '\\' && can_parse(str + "\\\"" + closing)) {
             // Was inside an array value string after an escape
             str += (out.healing_marker.json_dump_marker = "\\" + magic_seed) + "\"" + closing;
         } else if (!was_maybe_number() && can_parse(str + ", 1" + closing)) {
@@ -227,7 +227,7 @@
         } else if (can_parse(str + "\": 1" + closing)) {
             // Was inside an object key string
             str += (out.healing_marker.json_dump_marker = magic_seed) + "\": 1" + closing;
-        } else if (str[str.length() - 1] == '\\' && can_parse(str + "\\\": 1" + closing)) {
+        } else if (!str.empty() && str[str.length() - 1] == '\\' && can_parse(str + "\\\": 1" + closing)) {
             // Was inside an object key string after an escape
             str += (out.healing_marker.json_dump_marker = "\\" + magic_seed) + "\": 1" + closing;
         } else {
@@ -253,7 +253,7 @@
         if (can_parse(str + "\"")) {
             // Was inside an string
             str += (out.healing_marker.json_dump_marker = magic_seed) + "\"";
-        } else if (str[str.length() - 1] == '\\' && can_parse(str + "\\\"")) {
+        } else if (!str.empty() && str[str.length() - 1] == '\\' && can_parse(str + "\\\"")) {
             // Was inside an string after an escape
             str += (out.healing_marker.json_dump_marker = "\\" + magic_seed) + "\"";
         } else {
```
Besides, thanks for your PR. It's special because it works with complex JSON schemas, with a quick hack like this:

```diff
--- a/common/json-schema-to-grammar.cpp	2025-10-01 00:22:00.744098340 +0800
+++ b/common/json-schema-to-grammar.cpp	2025-10-01 00:19:48.692716944 +0800
@@ -944,6 +944,9 @@
             return _add_rule(rule_name, out.str());
         } else if (schema.empty() || schema_type == "object") {
             return _add_rule(rule_name, _add_primitive("object", PRIMITIVE_RULES.at("object")));
+        } else if (schema_type.is_null() && schema.contains("not") && schema["not"].is_object() && schema["not"].empty()) {
+            // librechat returns not:{}, which does nothing.
+            return "";
         } else {
             if (!schema_type.is_string() || PRIMITIVE_RULES.find(schema_type.get<std::string>()) == PRIMITIVE_RULES.end()) {
                 _errors.push_back("Unrecognized schema: " + schema.dump());
```

LibreChat passes a scrambled schema including {"not":{}}; this patch will ignore that.
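The condition the hack tests can be mirrored in isolation like this (an illustrative helper, not code from the patch): it flags a schema whose only meaningful content is an empty "not" object, which is what LibreChat emits.

```cpp
// Illustrative helper mirroring the patch's condition: true for a schema
// like {"not":{}} that carries no "type" and only an empty "not" object.
#include <nlohmann/json.hpp>
using json = nlohmann::json;

static bool is_empty_not_schema(const json & schema) {
    return !schema.contains("type")
        && schema.contains("not")
        && schema.at("not").is_object()
        && schema.at("not").empty();
}
```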
@DKingAlpha Thanks for pointing that out! It seems my compiler adds some extra padding to the string object, which ends up masking the string-array underflow crash.

@DKingAlpha I took a deeper look at your patch and had a question: it seems to modify some sections I hadn't touched in the original code.

No, I am using clang-20, if that helps to reproduce. Either this function (try_consume_json) is designed to run on a non-empty string, which means you need to change your code, or it's a bug in that part that was never triggered before. I prefer the latter.
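The underflow being discussed is easy to see in isolation (illustrative snippet, not code from the patch): for an empty std::string, length() - 1 wraps around because size_type is unsigned, so the subsequent operator[] access is far out of bounds.

```cpp
// Illustrative: why str[str.length() - 1] is unsafe on an empty string.
#include <cstddef>
#include <cstdio>
#include <string>

int main() {
    std::string str;                        // empty: length() == 0
    std::size_t idx = str.length() - 1;     // unsigned wrap: SIZE_MAX
    std::printf("idx = %zu\n", idx);        // a huge index
    // str[idx] would be undefined behavior; the !str.empty() guard in the
    // patch above prevents ever evaluating it.
    return 0;
}
```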
@DKingAlpha Would it still crash if you only patched the sections that my PR actually changed?

```diff
@@ -253,7 +253,7 @@
         if (can_parse(str + "\"")) {
             // Was inside an string
             str += (out.healing_marker.json_dump_marker = magic_seed) + "\"";
-        } else if (str[str.length() - 1] == '\\' && can_parse(str + "\\\"")) {
+        } else if (!str.empty() && str[str.length() - 1] == '\\' && can_parse(str + "\\\"")) {
             // Was inside an string after an escape
             str += (out.healing_marker.json_dump_marker = "\\" + magic_seed) + "\"";
         } else {
```
Line 253 is exactly the location that crashed on my side, but I patched all the others as well. I mean that even without running into them, static review alone shows the string should be checked before access.

@DKingAlpha I believe that at llama.cpp/common/json-partial.cpp line 140 (commit 4f15759), str is not empty. I'm considering changing my code from `if (!healing_marker.empty() && err_loc.stack.empty())` to `if (err_loc.position != 0 && !healing_marker.empty() && err_loc.stack.empty())`. What do you think about this change?
I don't have any detailed logs from the llama.cpp crash that opencode causes, but here's the tail of the container log from the latest crash today. This is with Unsloth's Q4 GLM-4.5-Air and the hksdpc255:master llama.cpp: Not every failed diff edit causes the server to crash, but when the server does crash, this is the most common error I see.

@aaronnewsome You can run

opencode edits are failing because its edit tool does not return any content, and the template then replaces that empty result with . I replaced with , and so far I haven't noticed any issues.

Here's what I see with `-lv 1` before the crash:

@matbrez Thanks for the suggestion! I've applied your changes and they look good.
@aaronnewsome Could you provide more logs, so that they include three occurrences of ?

This one has three occurrences; it's the tail of the last 420 lines before the crash.

@aaronnewsome Thanks! I see what's going wrong. I'll work on a fix.

@aaronnewsome Try now.
Trying to use this with the GLM-4.5-Air model yields 500s on tool calling, with responses like this: In case it matters, I'm running this on a DGX Spark.

@odellus Why? I tested this on an older version of Zed before.

@odellus @Xe These issues seem off-topic for this PR. You can ask such questions in GitHub Discussions instead.

@odellus @Xe Further clarification on why this is off-topic: your log shows you're not using the Jinja template from the OP. llama.cpp's Jinja2 renderer doesn't support . The issue is already fixed by the updated template I provided in the OP. If you prefer to use your own Jinja template, either replace
Can verify it works with `--jinja --chat-template-file given-glm45-template.j2`. Steps to reproduce [working]:

```sh
build/bin/llama-server \
    -c 120000 \
    --port 1234 \
    --api-key $API_KEY \
    -fa 1 \
    -ngl 99 \
    -b 2048 \
    -t 12 \
    -hf unsloth/GLM-4.5-Air-GGUF:Q4_K_M \
    --jinja --chat-template-file glm45_template.j2   # THIS IS REALLY IMPORTANT
```

A quick test against the endpoint in Zed shows tool calling and suppression of thinking tokens working for this PR.
Maybe this belongs in Discussions, but I am able to get qwen3-coder-30B working on this PR with no modification to the llama.cpp source.

Why is this being closed? I was actually looking forward to this landing :D

This branch is no longer maintained. I've also reworked a more robust implementation here: #16932. Testing and feedback are welcome.

UPDATE: Use my new PR #16932
This PR introduces an enhanced implementation of tool calling for GLM-4.5, building upon the existing contributions by @dhandhalyabhavik and @susmitds (see PR #15186).
Key improvements include:
- Grammar-constrained tool-call outputs: the model's tool-call messages are now rigorously enforced by a defined grammar, ensuring that generated calls are always well-formed and reliably parsed.
- Streaming support for tool-call parsing: the parser now handles tool-call messages as they are generated, enabling more responsive, real-time interactions during inference (see the sketch after this list).
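To illustrate the streaming side, here is a minimal, self-contained sketch; it is not the PR's actual parser. It buffers streamed fragments and emits a tool call only once its closing tag has fully arrived. The `<tool_call>`/`<arg_key>`/`<arg_value>` tag names follow the GLM output format discussed in this thread; the emit callback is a stand-in for real handling.

```cpp
// Illustrative streaming scanner: accumulate chunks, emit each tool call
// only when its closing tag is complete, keep the rest for the next chunk.
#include <functional>
#include <iostream>
#include <string>

class streaming_tool_call_scanner {
  public:
    void feed(const std::string & chunk, const std::function<void(const std::string &)> & emit) {
        buf_ += chunk;
        for (;;) {
            auto open = buf_.find("<tool_call>");
            if (open == std::string::npos) return;
            auto close = buf_.find("</tool_call>", open);
            if (close == std::string::npos) return; // call still incomplete: wait for more
            emit(buf_.substr(open, close + 12 - open)); // 12 == length of "</tool_call>"
            buf_.erase(0, close + 12);
        }
    }
  private:
    std::string buf_;
};

int main() {
    streaming_tool_call_scanner scanner;
    auto emit = [](const std::string & call) { std::cout << "complete: " << call << "\n"; };
    // During streaming, tokens arrive in arbitrary fragments.
    scanner.feed("<tool_call>get_weather\n<arg_key>city</arg_key>\n", emit);
    scanner.feed("<arg_value>Paris</arg_value>\n</tool", emit);
    scanner.feed("_call>", emit); // closing tag completes -> call is emitted
    return 0;
}
```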
Use this Jinja template while testing:
Although not yet implemented, I'm planning the following improvements:
- Patch the Jinja template in `common_chat_params_init_glm_4_5` to make it compatible with the original Unsloth GGUF chat template, and potentially even with the official chat template.
- Add dedicated unit tests for grammar enforcement and streaming parsing.
Testing and feedback are welcome.
Suggested commit message after squashing commits: