[Model] Support Qwen3 models with enable_thinking field #686

CharlieFRuan · 2025-05-04T22:58:37Z

Overview

This PR adds the following Qwen3 models to WebLLM's prebuilt models:
- Qwen3-0.6B: q0f16, q0f32, q4f16_1, q4f32_1
- Other Qwen3: {1.7B, 4B, 8B} x {q4f16_1, q4f32_1}
- For MLC-LLM and TVM commit head, see [WASM] Add all Qwen3 variants for WebLLM binary-mlc-llm-libs#148
In addition, we add extra_body field and extra_body.enable_thinking field to support switching between thinking and non-thinking mode. To prevent Qwen3 from thinking, use:

  let request = {
    messages: [
      {
        role: "user",
        content: "How many r's are there in the word strawberry?",
      },
    ],
    extra_body: {
      enable_thinking: false,
    },
  };

Besides, for multi-turn conversation, it is advised to exclude the previous thinking tokens (currently in WebLLM, the speed-wise performance of doing this may not be optimized as we cannot re-use KV cache)
To see the best practices of using Qwen3, refer to:
- examples/qwen3
- [Model] Add Qwen3 and allow switching between thinking and non-thinking mode web-llm-chat#75
- https://huggingface.co/Qwen/Qwen3-8B#best-practices
  - Besides, we can use "soft switch" /no_think and /think in the prompt
We also bumped web-tokenizer to 0.1.6.
- This resolves newly converted MLC models throwing rust-related error, fixes Model Request: Gemma 3 #675 (comment)
- For more, see [Web] Bump web-tokenizer to 0.1.6 tokenizers-cpp#67

Internal notes

Internally, the enable_thinking is achieved by:
- Add an extra_body and enable_thinking field to ChatCompletionRequest
- Add an enable_thinking field to GenerationConfig that forwards the value in engine.ts
- In llm_chat.ts, when prefillStep() and enable_thinking is false, we call conversation.appendEmptyThinkingReplyHeader(), instead of the normal appendReplyHeader()
- In conversation.ts, adjust getPromptArrayInternal() to support the reply header with an empty thinking block, using a field isLastMessageEmptyThinkingReplyHeader
- This is tested with tests/conversation.test.ts

Future work

Currently we hardcode const emptyThinkingBlockStr = "<think>\n\n</think>\n\n";. This should be configurable per-model in the future. Perhaps make it a part of the ConvConfig
Optimize multi-turn chat with Qwen3. Currently we strictly require all messages to match, but we can modify compareConversationObject() in engine.ts to allow missing several last messages (in this case, the message without the thinking tokens), so that in longer conversations, those that already stripped the thinking tokens can reuse KV
Perhaps we should separate the thinking tokens from the other tokens in the returned response, instead of asking users to parse on their own

Copilot

Pull Request Overview

This PR adds support for Qwen3 models by introducing a new enable_thinking field and related changes across the API protocols, conversation handling, configuration, tests, and examples.

New tests and constants for Qwen3 configuration are introduced.
The chat completion API and conversation methods now support an extra_body.enable_thinking flag.
Examples and documentation have been updated to demonstrate the new Qwen3 functionality.

Reviewed Changes

Copilot reviewed 12 out of 14 changed files in this pull request and generated no comments.

Show a summary per file

File	Description
tests/conversation.test.ts	Added tests to verify Qwen3-specific behavior with empty thinking blocks.
tests/constants.ts	Introduced new Qwen3 config JSON with enable_thinking support, though conv_template name remains "qwen2".
src/openai_api_protocols/chat_completion.ts	Added extra_body field with enable_thinking flag.
src/llm_chat.ts	Updated message appending logic to conditionally disable thinking tokens.
src/engine.ts	Forwarded the enable_thinking flag from the extra_body field.
src/conversation.ts	Added methods for appending empty thinking headers and managing their lifecycle.
src/config.ts	Updated GenerationConfig and prebuiltAppConfig with Qwen3 models.
examples/simple-chat-ts/src/simple_chat.ts	Configured extra_body for Qwen3 models in the simple chat example.
examples/qwen3/src/qwen3_example.ts	Provided example usage of Qwen3 models with varying enable_thinking configurations.
examples/qwen3/src/qwen3_example.html	Updated HTML wrapper to load the new Qwen3 example.
examples/qwen3/README.md	Updated documentation with instructions for running Qwen3 demos.

Files not reviewed (2)

examples/qwen3/package.json: Language not supported
package.json: Language not supported

Comments suppressed due to low confidence (1)

tests/constants.ts:271

[nitpick] The conv_template name in the Qwen3 configuration is set to "qwen2", which may be confusing. Consider updating it to "qwen3" for consistency with the model type.

      "name": "qwen2",

### Change - The only change is #686, which - Add prebuilt models: - Qwen3-0.6B: `q0f16, q0f32, q4f16_1, q4f32_1` - Other Qwen3: `{1.7B, 4B, 8B} x {q4f16_1, q4f32_1}` - Support `extra_body: {enable_thinking: false}` for qwen3 models to toggle thinking - See `examples/qwen3` for more on Qwen3 usage - Also bumped `web-tokenizers` package to `0.1.6` to resolve rust-related issues ### TVMjs - No change, version `0.18.0-dev2` just like 0.2.71

CharlieFRuan · 2025-05-05T17:24:40Z

As a reference of using Qwen3, WebLLM Chat adds a thinking toggling button in the toolbar, allowing you to think or not think in the same multi-turn conversation

- This PR adds the following Qwen3 models to WebLLM's prebuilt models: - Qwen3-0.6B: `q0f16, q0f32, q4f16_1, q4f32_1` - Other Qwen3: `{1.7B, 4B, 8B} x {q4f16_1, q4f32_1}` - In addition, we add `extra_body` field and `extra_body.enable_thinking` field to support switching between thinking and non-thinking mode. - We also bumped web-tokenizer to 0.1.6, which resolves newly converted MLC models throwing rust-related error

### Change - The only change is mlc-ai#686, which - Add prebuilt models: - Qwen3-0.6B: `q0f16, q0f32, q4f16_1, q4f32_1` - Other Qwen3: `{1.7B, 4B, 8B} x {q4f16_1, q4f32_1}` - Support `extra_body: {enable_thinking: false}` for qwen3 models to toggle thinking - See `examples/qwen3` for more on Qwen3 usage - Also bumped `web-tokenizers` package to `0.1.6` to resolve rust-related issues ### TVMjs - No change, version `0.18.0-dev2` just like 0.2.71

[Model] Support Qwen3 models with enable_thinking field

c8cd770

CharlieFRuan requested a review from Copilot May 4, 2025 22:58

Copilot AI reviewed May 4, 2025

View reviewed changes

Fix lint

d622550

CharlieFRuan marked this pull request as ready for review May 5, 2025 03:07

CharlieFRuan merged commit 089bbd0 into mlc-ai:main May 5, 2025
1 check passed

CharlieFRuan mentioned this pull request May 5, 2025

[Version] Bump version to 0.2.79 #687

Merged

CharlieFRuan mentioned this pull request May 5, 2025

Qwen3 support #685

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Model] Support Qwen3 models with enable_thinking field #686

[Model] Support Qwen3 models with enable_thinking field #686

Uh oh!

CharlieFRuan commented May 4, 2025 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

CharlieFRuan commented May 5, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

[Model] Support Qwen3 models with enable_thinking field #686

[Model] Support Qwen3 models with enable_thinking field #686

Uh oh!

Conversation

CharlieFRuan commented May 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview

Internal notes

Future work

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Uh oh!

CharlieFRuan commented May 5, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

CharlieFRuan commented May 4, 2025 •

edited

Loading