Collaborator

@aldehir aldehir commented Aug 22, 2025

Add response_format support to gpt-oss models.

The generic grammar implementation does not work well for gpt-oss, as the following request and response show:

curl example
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "response_format": {
      "type": "json_object",
      "schema": {
        "$schema": "http://json-schema.org/draft-07/schema#",
        "title": "City Details",
        "description": "A simple object containing key details about a city.",
        "type": "object",
        "properties": {
          "country": {
            "description": "The country where the city is located.",
            "type": "string"
          },
          "landmarks": {
            "description": "A list of notable landmarks in the city.",
            "type": "array",
            "items": {
              "type": "string"
            }
          }
        },
        "required": ["country","landmarks"]
      }
    },
    "messages": [
      {
        "role": "system",
        "content": [
          {
            "type": "text",
            "text": "You are a helpful assistant designed to output JSON. For a given city, provide its country, and a list of three notable landmarks."
          },
          {
            "type": "text",
            "text": "# Response Formats\n## city\n### {\"type\": \"object\", \"properties\": { \"country\": { \"description\": \"The country where the city is located.\", \"type\": \"string\" }, \"landmarks\": { \"description\": \"A list of notable landmarks in the city.\", \"type\": \"array\", \"items\": { \"type\": \"string\" } } }, \"required\": [\"country\",\"landmarks\"] }"
          }
        ]
      },
      {
        "role": "user",
        "content": "Zürich"
      }
    ]
  }'
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "{\n\n  \"country\": \"Switzerland\"\n\n  , \"landmarks\":[\n\n\"[\"]\n\n  }"
      }
    }
  ],
  ...

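Note the weirdness around landmarks: the content still parses as JSON, but the array holds the single string "[" instead of any landmark names, and the whitespace is erratic.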
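To make the problem concrete, here is a small Python sketch that decodes the content string from the response above. It only uses the standard library; the variable names are illustrative.

```python
import json

# The "content" value from the response above, kept as a JSON string literal.
raw = r'"{\n\n  \"country\": \"Switzerland\"\n\n  , \"landmarks\":[\n\n\"[\"]\n\n  }"'

content = json.loads(raw)     # undo the JSON string escaping
parsed = json.loads(content)  # parse what the model actually wrote

# Hand-rolled check of the two schema requirements (no schema library needed).
has_required_keys = {"country", "landmarks"} <= parsed.keys()
landmarks_are_strings = all(isinstance(item, str) for item in parsed["landmarks"])

print(parsed)
print(has_required_keys, landmarks_are_strings)
```

The output is syntactically valid and even satisfies the schema (an array of strings), yet the only "landmark" is a stray bracket, so validation alone would not catch it.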

This PR wraps the response_format schema in a harmony-aware grammar so the model can answer properly:

{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "role": "assistant",
        "reasoning_content": "The user gave the city \"Zürich\". We need to output JSON in the defined schema. The schema says: object with properties: \"country\" (string) and \"landmarks\" (array of strings). It's required at least those two. We must supply. Provide country: Switzerland. Landmarks: choose 3 notable landmarks: \"Bahnhofstrasse\", \"Lake Zürich (Limmat, scenic)\", \"Zürcher Mozartplatz and its cathedral\"? Let's find known landmarks: \"Château Fraiture\"? Wait landmarks: \"Lake Zürich\", \"Bahnhofstrasse\", \"Old Town\" (Altstadt), \"Kunsthaus Zürich\". Choose 3: \"Bahnhofstrasse\", \"Lake Zürich\", \"Kunsthaus Zürich\". Compose JSON. Ensure it's valid according to schema. Should be:\n\n{\n \"country\": \"Switzerland\",\n \"landmarks\": [\n   \"Bahnhofstrasse\",\n   \"Lake Zürich\",\n   \"Kunsthaus Zürich\"\n ]\n}\n\nMake sure no extra keys. Provide only JSON.",
        "content": "{\"country\":\"Switzerland\",\"landmarks\":[\"Bahnhofstrasse\",\"Lake Zürich\",\"Kunsthaus Zürich\"]}"
      }
    }
  ],
  ...
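For readers who prefer a script over curl, the same request could be sketched in Python roughly as follows. This assumes the server from the curl example is listening on http://localhost:8080; the helper names build_request and send are hypothetical, and the system prompt is passed as a plain string rather than the list of text parts used above.

```python
import json
import urllib.request


def build_request(schema, system_prompt, user_prompt,
                  base_url="http://localhost:8080"):
    """Build (but do not send) a chat completion request with response_format."""
    payload = {
        "response_format": {"type": "json_object", "schema": schema},
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and return the decoded JSON body (needs a running server)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


city_schema = {
    "type": "object",
    "properties": {
        "country": {"type": "string"},
        "landmarks": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["country", "landmarks"],
}
request = build_request(city_schema, "You are a helpful assistant designed to output JSON.", "Zürich")
# body = send(request)  # uncomment when a server is running
```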

fixes #15276

@aldehir aldehir merged commit 32732f2 into ggml-org:master Aug 22, 2025
48 checks passed
@samshipengs

@aldehir thanks for the change. My chat completion request with response_format now seems to work with the llama.cpp backend.

One question: would the grammar rule also affect gpt-oss's reasoning token generation, i.e. force the reasoning tokens into the JSON schema format? That would certainly hurt performance.

@aldehir
Collaborator Author

aldehir commented Aug 24, 2025

@samshipengs with reasoning-format == auto, any reasoning done by the model will present itself in the reasoning_content field. The content field should not contain reasoning traces. Does that answer your question?

@samshipengs

samshipengs commented Aug 25, 2025

@aldehir I haven't looked at reasoning-format == auto; I'm currently only setting the reasoning level, e.g. "low". Will take a look. I'm relatively new to llama.cpp and to structured output, so I'm not sure my question makes sense. Basically, I was concerned that the grammar (grammar-based constrained sampling?) is active not only during final output generation but also during every reasoning step, where we don't want to constrain the reasoning tokens. If you're saying the reasoning tokens are still generated as they are (harmony format?) and only the output gets constrained by the grammar sampling, then that sounds good. In that case it may really be that gpt-oss-20b is not performing well on my task (benchmarking against the existing model in use, e.g. gpt-4.1-mini).

@aldehir
Collaborator Author

aldehir commented Aug 25, 2025

@samshipengs Ah, ok. The grammar for gpt-oss when using response_format does not constrain reasoning. It gives the model the flexibility to reason and only constrains the final message. You can verify by seeing if reasoning_content exists and is populated in the response.

If you're finding reasoning traces in your structured output, I would verify you are passing in --jinja. Otherwise it may be as you say, the model does not perform well for your task.
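The check described above could be scripted along these lines. It is a minimal sketch that assumes the response body has the shape shown earlier in this thread (choices[0].message with reasoning_content and content); the function name check_structured_response is hypothetical.

```python
import json


def check_structured_response(response_body):
    """Report whether reasoning was kept separate and content is plain JSON."""
    message = response_body["choices"][0]["message"]
    reasoning = message.get("reasoning_content") or ""
    content = message.get("content") or ""
    try:
        json.loads(content)
        content_is_json = True
    except json.JSONDecodeError:
        # Reasoning traces or other stray text in content end up here.
        content_is_json = False
    return {
        "has_reasoning": bool(reasoning.strip()),
        "content_is_json": content_is_json,
    }


sample = {
    "choices": [{
        "message": {
            "role": "assistant",
            "reasoning_content": "The user gave the city. Provide only JSON.",
            "content": "{\"country\":\"Switzerland\",\"landmarks\":[\"Bahnhofstrasse\"]}",
        }
    }]
}
print(check_structured_response(sample))
```

If has_reasoning is False, or content_is_json is False because the content contains reasoning text, the server flags are the first thing to re-check.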

qnixsynapse pushed a commit to menloresearch/llama.cpp that referenced this pull request Aug 25, 2025
@samshipengs

@aldehir I was using --jinja and have now turned it off. My task is simply classification given a large text body.

I noticed that if I don't use structured output, i.e. don't pass response_format, it seems to give me a more sensible answer (I'm looking at the final channel of the harmony-format response) compared to the parsed result from passing a pydantic model in response_format.

Is the grammar-based constrained decoding in llama.cpp done with GBNF? Do we know if OpenAI (for their commercial models) uses the same constrained decoding technique?

@aldehir
Collaborator Author

aldehir commented Aug 25, 2025

@samshipengs the grammar is defined in GBNF, but I don't know the specifics of the constrained decoding implementation.
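As a general illustration of the idea discussed in this thread (free reasoning, constrained final message), here is a toy Python sketch of grammar-style token masking. It is not llama.cpp's implementation and does not use GBNF: the FINAL marker, the label set, and the whole-word "tokens" are all assumptions made up for the example.

```python
EOS = "<eos>"
FINAL = "<final>"         # hypothetical marker for the start of the final message
LABELS = {"spam", "ham"}  # the only values the toy "grammar" accepts as an answer


def allowed_next(generated):
    """Return the set of permitted next tokens, or None for no constraint."""
    if FINAL not in generated:
        return None                      # reasoning phase: anything goes
    after_marker = generated[generated.index(FINAL) + 1:]
    if not after_marker:
        return LABELS                    # final message must be one label
    return {EOS}                         # then generation must stop


def pick(logits, generated):
    """Greedy choice among the tokens the toy grammar allows.

    Assumes at least one allowed token appears in logits.
    """
    allowed = allowed_next(generated)
    if allowed is None:
        candidates = logits
    else:
        candidates = {tok: score for tok, score in logits.items() if tok in allowed}
    return max(candidates, key=candidates.get)


logits = {"maybe": 3.0, "spam": 1.5, "ham": 0.2, FINAL: 0.1, EOS: 0.0}
print(pick(logits, ["let", "me", "think"]))           # unconstrained
print(pick(logits, ["let", "me", "think", FINAL]))    # masked to LABELS
print(pick(logits, ["think", FINAL, "spam"]))         # masked to EOS
```

The same scores produce "maybe" while reasoning but are masked down to a valid label once the final message starts, which is the behavior the question was about.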

If you can provide an example of such a task, I can look further into it.

Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Oct 6, 2025
Development

Successfully merging this pull request may close these issues.

Eval bug: GPT-OSS response_format not respected if jinja template enabled

4 participants