Description
Name and Version
llama-server.exe --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32
version: 6323 (c950ec62)
built with AMD clang version 20.0.0git (https://github.com/ROCm/llvm-project.git 1b5ca053c4ff3f9e729db16d11ca998bbd65d7e3+PATCHED:826b8a17847378a096dff258bf54fc237336f0e4) for x86_64-pc-windows-msvc
Operating systems
Windows
GGML backends
HIP
Hardware
gfx1100
Models
https://huggingface.co/bartowski/nvidia_NVIDIA-Nemotron-Nano-9B-v2-GGUF/tree/main : nvidia_NVIDIA-Nemotron-Nano-9B-v2-Q8_0.gguf
Problem description & steps to reproduce
Just like #11861, the opening <think> tag is part of the prompt template, so the model's response only contains the closing </think> tag, which means the webui does not separate the thinking from the output.
To reproduce, just run the model with llama-server and look at the template output of llama.cpp\scripts\get_chat_template.py nvidia/NVIDIA-Nemotron-Nano-9B-v2 or the server log messages; a sketch for inspecting the rendered template follows below.
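
As a quick way to confirm that the rendered generation prompt ends with an opening <think> tag, here is a minimal sketch using the transformers tokenizer. It is an illustration only (not part of llama.cpp) and assumes the Hugging Face repo nvidia/NVIDIA-Nemotron-Nano-9B-v2 is accessible and ships a chat_template in its tokenizer config:

```python
# Minimal sketch: render the chat template and check whether the generation
# prompt ends with an opening <think> tag (assumes transformers is installed
# and the repo's tokenizer config is downloadable).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("nvidia/NVIDIA-Nemotron-Nano-9B-v2")

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Hello"}],
    tokenize=False,
    add_generation_prompt=True,
)

print(repr(prompt))
# If the rendered prompt ends with "<think>", the model never emits its own
# opening tag, so the reply will only contain the closing "</think>".
print("prompt ends with <think>:", prompt.rstrip().endswith("<think>"))
```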
First Bad Commit
No response
Relevant log output
llama-server.exe --jinja -b 4096 -fa -c 131072 -ngl 9999 --metrics -m llamacppmodels\nvidia_NVIDIA-Nemotron-Nano-9B-v2-Q8_0.gguf
(I gave --reasoning-format deepseek and manual template changes a try, wondering if I'm just holding it wrong.)
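
For completeness, a minimal sketch of how I check whether the thinking part is actually split out, using only the Python standard library against the server's OpenAI-compatible endpoint. The port and the reasoning_content field name are assumptions based on llama-server's defaults and its behavior when --reasoning-format deepseek takes effect; when the bug described above is present, the thinking text stays inside content instead:

```python
# Minimal sketch: query the running llama-server (default http://localhost:8080)
# and check whether reasoning is returned separately from the answer.
import json
import urllib.request

req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps({
        "messages": [{"role": "user", "content": "What is 2 + 2?"}],
        "max_tokens": 256,
    }).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    msg = json.load(resp)["choices"][0]["message"]

# When reasoning parsing works, the thinking text lands in reasoning_content;
# with this issue it shows up (with a stray closing </think>) inside content.
print("content:", msg.get("content"))
print("reasoning_content:", msg.get("reasoning_content"))
```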