feat(server): Add tool call support to WebUI (LLama Server) #13501
Conversation
You may want to avoid enlarging the scope any further in this PR, it's already quite large. Further changes should go in a follow-up PR.
Sure! I'm done with the changes I wanted. I can move the JS REPL tool into a separate PR if needed.
I have to do some more testing now that streaming tool calls are supported.
I did the final fixes and also updated the demo conversation to include a chained message with a tool call. This should now be ready to review.
I'll work on the PR this week. Will need to move your commits to a new PR, otherwise I can't push to your master branch.
Thank you! Can I help you somehow? I can also rename the branch and open a new PR if that's the problem.
@ngxson is there any help I can provide to move this further?
@ngxson will need to tell :). If nothing else, probably resolving the conflicts 😝
I didn't merge this PR because the stability of the frontend was a much more important concern. Now that we have moved to the new SvelteKit-based UI, tool calling is something I already brought up in one of my recent discussions with @allozaur. We will take our time to plan this, as there is already a long backlog of other features that need to be implemented.
That's an absolutely amazing piece of work; really impressive engineering. But I have to point out one concern: doing tool-calling execution entirely on the client side can lead us straight back into the same kind of fragmentation we saw during the early 'thinking/CoT' phase. Each model ends up needing its own quirks handled in the frontend, which means model-specific bugs, technical debt, and constant maintenance inside the Svelte WebUI. That's not something we usually want to encourage in a project like llama.cpp, where stability and generic design are key. It's a brilliant idea, but not a generic one; the browser sandbox, CORS restrictions, and the fact that the user must keep the tab open all make it fragile.

A better long-term approach would be to build a clean parser or handler that lives independently of the UI, so the frontend just displays results while the tool-execution logic stays modular and testable. For example, I've implemented a Node.js proxy that sits transparently in the SSE stream. It detects tool calls and forwards them to an HTTP hook, so any model can use tools without changing the frontend. Behind that, you can sandbox whatever you want, even control a headless browser to let the LLM 'browse' the web safely.

That said, my Node.js proxy is yet another intermediate layer and custom language component, and ideally this logic should live in the backend itself, as a configurable and generic tool-calling interface, something to be designed together with the core developers according to their vision for the project. That would encourage a cleaner modularization of the parser, where there's still a lot of important work to be done, and help push the community to contribute everything in one place, improving both llama.cpp and its parser. That kind of separation keeps llama.cpp safe and generic, while still allowing all the experimentation you want on the tooling side.

And honestly, every geek or nerd who runs llama.cpp locally will prefer to have server-side hooks anyway: to control their smart home, experiment with 'Jarvis'-like automations, or integrate private APIs. In my case, I run my setup at work through a shared Svelte UI at serveurperso.com/ia (and a STT/TTS Telegram bot!).

TL;DR: Client-side tool calling is clever but risky: it leads to fragmentation, browser limits, and model-specific maintenance. A backend-based, generic, and configurable tool-calling interface should be developed collaboratively by the community and core devs, centralizing contributions to improve both llama.cpp and its parser. This keeps the project clean, secure, and extensible while allowing server-side innovation for local power users.
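For reference, here is a minimal sketch of what such a pass-through proxy could look like. The addresses, ports, and the `TOOL_HOOK_URL` endpoint are illustrative assumptions, not the actual serveurperso.com setup; the point is only that the SSE stream is forwarded untouched while completed tool calls are handed to an external hook.

```typescript
// Minimal sketch of an SSE pass-through proxy with a tool-call hook.
// Assumptions: llama-server on 127.0.0.1:8080, tool executor on TOOL_HOOK_URL.
import http from "node:http";

const UPSTREAM = "http://127.0.0.1:8080";            // assumed llama-server address
const TOOL_HOOK_URL = "http://127.0.0.1:9090/hook";  // assumed tool-executor endpoint

http.createServer(async (req, res) => {
  // Forward the request body to llama-server unchanged.
  const body: Buffer[] = [];
  for await (const chunk of req) body.push(chunk as Buffer);
  const upstream = await fetch(UPSTREAM + (req.url ?? "/"), {
    method: req.method,
    headers: { "content-type": "application/json" },
    body: body.length ? Buffer.concat(body) : undefined,
  });

  res.writeHead(upstream.status, {
    "content-type": upstream.headers.get("content-type") ?? "text/event-stream",
  });

  // Per-request accumulator for streamed tool-call fragments, keyed by index.
  const pending: Record<number, { name: string; args: string }> = {};
  const decoder = new TextDecoder();
  let buffered = "";

  for await (const chunk of upstream.body ?? []) {
    const text = decoder.decode(chunk as Uint8Array, { stream: true });
    res.write(text); // pass the SSE stream through to the client untouched
    buffered += text;

    // Process only complete SSE events (separated by a blank line).
    const cut = buffered.lastIndexOf("\n\n");
    if (cut === -1) continue;
    const events = buffered.slice(0, cut).split("\n\n");
    buffered = buffered.slice(cut + 2);

    for (const event of events) {
      const data = event.replace(/^data: ?/, "").trim();
      if (!data || data === "[DONE]") continue;
      let parsed: any;
      try { parsed = JSON.parse(data); } catch { continue; }
      const choice = parsed.choices?.[0];
      for (const tc of choice?.delta?.tool_calls ?? []) {
        const slot = (pending[tc.index] ??= { name: "", args: "" });
        if (tc.function?.name) slot.name = tc.function.name;
        if (tc.function?.arguments) slot.args += tc.function.arguments;
      }
      if (choice?.finish_reason === "tool_calls") {
        // Hand the completed calls to the external sandbox; a hook failure
        // must never break the stream the client is reading.
        fetch(TOOL_HOOK_URL, {
          method: "POST",
          headers: { "content-type": "application/json" },
          body: JSON.stringify(Object.values(pending)),
        }).catch(() => {});
      }
    }
  }
  res.end();
}).listen(3000); // assumed proxy port
```

Because the proxy only reads the OpenAI-compatible chunk format, the frontend and the models never have to know the hook exists.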
Thanks for explaining!
@ServeurpersoCom I'm not sure implementing tool calling in the backend is the right call. If we consider a typical use case where the same llama.cpp server is used by many people: 1) these people might have their own tools they want to call that are not available on the backend, and 2) from a security viewpoint, things like code execution are difficult to implement safely on the backend side. Neither of these issues surfaces if the tool calling is on the browser side.

It seems to me that backend tool calling would be mostly relevant for people self-hosting a model for themselves, not for production environments with multiple users. Should someone require backend tool calling anyway, a proxy such as the one you have implemented is a better choice, as in production systems the inference engine is very often hosted on a different node than the tools (e.g. the inference engine is on a GPU node and the tools are on a CPU-only node). If the server implements the OpenAI API correctly, including its tool-calling format, the proxy can be made model-independent.

For those who use llama.cpp locally for one person only, backend tool calling would work better, but there are already many OSS projects catering to those people, so, in my opinion, they are not a priority.
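To make the model-independence point concrete: assuming the server emits standard OpenAI-compatible streaming chunks, the only shape a proxy (or the WebUI) needs to understand is roughly the following. This is a sketch of the public API format, not actual llama.cpp code.

```typescript
// Streamed tool-call fragments in the OpenAI-compatible chat completions API.
// A consumer only needs this shape, not any model-specific chat template,
// which is what makes a proxy or client model-independent.
interface ToolCallDelta {
  index: number;           // which tool call this fragment belongs to
  id?: string;             // present on the first fragment of a call
  type?: "function";
  function?: {
    name?: string;         // usually complete in the first fragment
    arguments?: string;    // a JSON string, streamed in pieces
  };
}

interface StreamChunk {
  choices: {
    delta: { content?: string; tool_calls?: ToolCallDelta[] };
    finish_reason: "stop" | "tool_calls" | null; // "tool_calls" marks completion
  }[];
}
```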
Obviously, that's already how it works today: tool calls are exposed through the API, and each client is free to handle them however it wants. The real issue is that, depending on the model, the tool-call payload format can change. There's an actual inventory to do here, and before anything else, the server needs a proper refactor (see the open issue about splitting core/http: server.cpp has become too large). That kind of work is a higher priority; implementing tool support directly in the Svelte WebUI would surface tons of inconsistent bugs depending on the model output format.

In practice, this could easily be achieved by providing a small standalone binary, similar to my experimental proxy. This approach scales from local experimentation to industrial-grade setups: you could attach isolated execution environments to each model instance, just like ChatGPT internally routes tool calls to separate subsystems. For enthusiasts and self-hosters, the same binary could also act as a personal automation bridge, letting anyone build their own "Jarvis"-style setup and trigger smart-home commands directly through the LLM. It's far more powerful (and fun) than being confined to a single browser tab for the same amount of work.

Real Firefox-ESR browser (non-headless) running in a remote, disposable, read-only environment; no instruction prompt, only tool-call documentation (tool name and parameters):
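To illustrate what "only tool-call documentation" means here, this is a hypothetical example of the standard `tools` field in an OpenAI-compatible /v1/chat/completions request; the `browse` tool name and its parameters are purely illustrative, not an existing llama.cpp feature.

```typescript
// Hypothetical tool documentation: only a name, a description, and
// JSON-schema parameters are sent to the model, no instruction prompt.
const tools = [
  {
    type: "function",
    function: {
      name: "browse",
      description: "Open a URL in the remote disposable Firefox instance and return the page text.",
      parameters: {
        type: "object",
        properties: {
          url: { type: "string", description: "Absolute URL to open" },
        },
        required: ["url"],
      },
    },
  },
];
```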
This PR actually gave me an idea. What we can safely do, though, is add a simple checkbox in the WebUI Settings that, similar to how reasoning_content (thinking blocks) works, enables the display of OpenAI-compatible tool-call chunks inside a similar type of block. This would remain consistent with the OpenAI-compatible API logic, be very useful for developers building larger systems on top of llama.cpp as both a built-in debugger and a reference example, and follow the core principle that "the client should be able to see every chunk any model is capable of emitting." It would have no runtime impact, introduce no security or parsing risks over time, and be highly educational.
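A rough sketch of what that display path could look like; all names here are illustrative and none of this exists in the current WebUI. The key property is that tool-call chunks are only rendered, never parsed for execution.

```typescript
// Illustrative sketch only: when the proposed setting is enabled, raw
// OpenAI-compatible tool_call deltas become a display block, analogous to
// how reasoning_content (thinking blocks) is shown today.
type ChatDelta = {
  content?: string;
  tool_calls?: unknown[];
};

type DisplayBlock = {
  kind: "content" | "tool_call";
  text: string;
};

function toDisplayBlocks(delta: ChatDelta, showToolCallChunks: boolean): DisplayBlock[] {
  const blocks: DisplayBlock[] = [];
  if (delta.content) {
    blocks.push({ kind: "content", text: delta.content });
  }
  if (showToolCallChunks && delta.tool_calls?.length) {
    // Show the chunk verbatim so developers can see exactly what the model emitted.
    blocks.push({
      kind: "tool_call",
      text: JSON.stringify(delta.tool_calls, null, 2),
    });
  }
  return blocks;
}
```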




Hi there!
I added tool calling support to the llama-server WebUI frontend.
There are still some things to fix, but I'm opening this early to get some feedback. Currently there is only one tool (a basic JavaScript interpreter with sandboxed iframe code eval), but the code structure supports adding more in the future.
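For context, here is a minimal sketch of the sandboxed-iframe eval approach described above; it is illustrative only, not the PR's actual implementation.

```typescript
// Illustrative sketch of sandboxed iframe code eval (not the PR's actual code).
// The snippet runs inside <iframe sandbox="allow-scripts">, which gives it an
// opaque origin: no access to the parent page, its cookies, or the server
// origin. Results come back to the WebUI via postMessage.
function runInSandbox(code: string): Promise<string> {
  return new Promise((resolve) => {
    const iframe = document.createElement("iframe");
    iframe.setAttribute("sandbox", "allow-scripts"); // deliberately no allow-same-origin
    iframe.style.display = "none";
    iframe.srcdoc = `<script>
      try {
        const result = eval(${JSON.stringify(code)});
        parent.postMessage({ ok: true, output: String(result) }, "*");
      } catch (err) {
        parent.postMessage({ ok: false, output: String(err) }, "*");
      }
    <\/script>`;

    const onMessage = (ev: MessageEvent) => {
      if (ev.source !== iframe.contentWindow) return; // ignore unrelated messages
      window.removeEventListener("message", onMessage);
      iframe.remove();
      resolve(ev.data.output);
    };
    window.addEventListener("message", onMessage);
    document.body.appendChild(iframe);
  });
}
```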