Eval bug: NVIDIA Nemotron Nano 9B v2 thinking tokens not properly handled in the llama-server web ui #15673

Description

@Hoernchen

Name and Version

llama-server.exe --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32
version: 6323 (c950ec62)
built with AMD clang version 20.0.0git (https://github.com/ROCm/llvm-project.git 1b5ca053c4ff3f9e729db16d11ca998bbd65d7e3+PATCHED:826b8a17847378a096dff258bf54fc237336f0e4) for x86_64-pc-windows-msvc

Operating systems

Windows

GGML backends

HIP

Hardware

gfx1100

Models

https://huggingface.co/bartowski/nvidia_NVIDIA-Nemotron-Nano-9B-v2-GGUF/tree/main : nvidia_NVIDIA-Nemotron-Nano-9B-v2-Q8_0.gguf

Problem description & steps to reproduce

Just like #11861, the opening <think> tag is part of the prompt template, so the response only contains the closing </think> tag, which means the web UI cannot separate the thinking from the output.
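For illustration, here is a minimal sketch of the mismatch (hypothetical strings, not llama.cpp's actual parser): when the chat template already emits the opening tag, a parser that expects a full <think>...</think> pair in the reply finds nothing to split on.

```python
# Hypothetical strings for illustration; this is not llama.cpp's actual parser.
# The chat template already emits the opening <think> tag, so the model's
# reply starts mid-reasoning and only carries the closing tag.
model_reply = "Let me reason about this...\n</think>\nThe final answer."

def split_reasoning(reply: str) -> tuple[str, str]:
    """Split a reply into (reasoning, answer), expecting a full <think>...</think> pair."""
    start = reply.find("<think>")
    end = reply.find("</think>")
    if start == -1 or end == -1:
        return "", reply  # no complete pair: everything is treated as the answer
    return reply[start + len("<think>"):end], reply[end + len("</think>"):]

reasoning, answer = split_reasoning(model_reply)
print(repr(reasoning))  # '' -- the opening tag never appears in the reply
print(repr(answer))     # the reasoning text and the stray </think> leak into the answer
```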

To reproduce, just use the model with llama-server and look at the template output of llama.cpp\scripts\get_chat_template.py nvidia/NVIDIA-Nemotron-Nano-9B-v2 or at the server log messages.
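A quick way to inspect the raw reply is a direct request to the running server (sketch; assumes llama-server's default port 8080 and its OpenAI-compatible /v1/chat/completions endpoint):

```python
# Sketch: inspect the raw reply from a running llama-server.
# Assumes the default port 8080 and the OpenAI-compatible endpoint.
import json
import urllib.request

req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps({"messages": [{"role": "user", "content": "What is 2+2?"}]}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    msg = json.load(resp)["choices"][0]["message"]

content = msg.get("content") or ""
print("has <think>: ", "<think>" in content)
print("has </think>:", "</think>" in content)
print("reasoning_content:", msg.get("reasoning_content"))  # stays empty when parsing fails
```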

First Bad Commit

No response

Relevant log output

llama-server.exe --jinja -b 4096 -fa -c 131072 -ngl 9999  --metrics -m llamacppmodels\nvidia_NVIDIA-Nemotron-Nano-9B-v2-Q8_0.gguf

(I gave --reasoning-format deepseek and manual template changes a try, wondering if I'm just holding it wrong.)
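Since neither helped, a client-side fallback (sketch only, not a proper fix) would be to treat everything before the first closing tag as reasoning:

```python
# Client-side fallback sketch, not a proper fix: if the reply carries only
# a closing tag, treat everything before the first </think> as reasoning.
def split_on_closing_tag(reply: str) -> tuple[str, str]:
    head, sep, tail = reply.partition("</think>")
    if not sep:
        return "", reply  # no tag at all: plain answer
    return head.strip(), tail.strip()

reasoning, answer = split_on_closing_tag("step 1... step 2...\n</think>\nThe answer is 4.")
assert reasoning == "step 1... step 2..."
assert answer == "The answer is 4."
```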
