Add GPT-OSS quant support #887
Conversation
Signed-off-by: yiliu30 <[email protected]>
Pull Request Overview
This PR adds GPT-OSS quantization support to the auto_round library. The implementation includes a new MoE (Mixture of Experts) converter for GPT-OSS models along with comprehensive test coverage.
- Creates specialized handling for GPT-OSS models by converting fused expert operations to individual expert modules for quantization (see the sketch after this list)
- Refactors the MoE converter architecture to support multiple model types through a dispatch table
- Adds comprehensive test coverage for GPT-OSS quantization with MXFP4 and MXFP8 schemes
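A minimal sketch of the fused-to-individual expert conversion described above; the function name and tensor shapes are illustrative assumptions, not the exact `auto_round/modelling/gpt_oss.py` implementation:

```python
import torch
from torch import nn


def split_fused_experts(fused_weight: torch.Tensor) -> nn.ModuleList:
    """Split a fused expert tensor [num_experts, in_features, out_features]
    into one nn.Linear per expert."""
    num_experts, in_features, out_features = fused_weight.shape
    experts = nn.ModuleList()
    for i in range(num_experts):
        linear = nn.Linear(in_features, out_features, bias=False)
        # nn.Linear stores weight as [out_features, in_features], so transpose
        linear.weight.data.copy_(fused_weight[i].T)
        experts.append(linear)
    return experts
```

Once each expert lives in a plain nn.Linear, the standard per-layer quantization flow can treat it like any other linear module.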
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| test/test_cpu/test_gpt_oss.py | New test file with fixtures and parametrized tests for GPT-OSS quantization (sketched after this table) |
| auto_round/utils.py | Added debug logging for non-quantized layers |
| auto_round/special_model_handler.py | Refactored MoE converter to use dispatch table and added GPT-OSS support |
| auto_round/modelling/llama4.py | Extracted Llama4 MoE converter to dedicated module |
| auto_round/modelling/gpt_oss.py | New GPT-OSS MoE converter with specialized expert handling |
| auto_round/modelling/__init__.py | New package initialization file |
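A hedged sketch of the parametrized test shape described for test/test_cpu/test_gpt_oss.py; the fixture name and the AutoRound call are assumptions, not the file's actual contents:

```python
import pytest

from auto_round import AutoRound


@pytest.mark.parametrize("scheme", ["MXFP4", "MXFP8"])
def test_gpt_oss_quantization(tiny_gpt_oss_model_and_tokenizer, scheme):
    # tiny_gpt_oss_model_and_tokenizer is a hypothetical fixture that returns
    # a small GPT-OSS model plus its tokenizer for fast CPU testing.
    model, tokenizer = tiny_gpt_oss_model_and_tokenizer
    # Assumed signature: scheme selects the quantization recipe, iters=0
    # keeps the test cheap by skipping the tuning loop.
    ar = AutoRound(model, tokenizer, scheme=scheme, iters=0)
    ar.quantize()
```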
```python
_update_parameter(mlp.gate_proj, "weight", original.experts.gate_up_proj[i, :, ::2].T)
_update_parameter(mlp.up_proj, "weight", original.experts.gate_up_proj[i, :, 1::2].T)
```
Copilot AI · Oct 14, 2025
The magic numbers ::2 and 1::2 for tensor slicing should be replaced with named constants such as GATE_STRIDE = 2, GATE_OFFSET = 0, and UP_OFFSET = 1 to improve code readability and maintainability.
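A sketch of the suggested refactor, using the constant names from the comment; the interleaved gate/up layout is inferred from the slices above, and `_update_parameter`, `mlp`, `original`, and `i` come from the surrounding converter code:

```python
# Gate and up projections are interleaved along the last axis of gate_up_proj:
# even columns hold the gate projection, odd columns hold the up projection.
GATE_STRIDE = 2
GATE_OFFSET = 0
UP_OFFSET = 1

_update_parameter(
    mlp.gate_proj, "weight",
    original.experts.gate_up_proj[i, :, GATE_OFFSET::GATE_STRIDE].T,
)
_update_parameter(
    mlp.up_proj, "weight",
    original.experts.gate_up_proj[i, :, UP_OFFSET::GATE_STRIDE].T,
)
```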
```diff
  SPECIAL_SHARED_CACHE_KEYS["MiniMaxText01ForCausalLM"] = ("slope_rate",)

- CONVERT_EXPERT_TO_LINEAR_MODELS = ["llama4"]
+ CONVERT_EXPERT_TO_LINEAR_MODELS = ["llama4", "gpt_oss"]
```
It would be better not to categorize this into too many fine-grained types. A single flag such as model_need_to_convert should be sufficient, since some models may require conversion even if they don't have expert layers; we can provide a converter function for each model that needs one, regardless of which parts are converted.
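A minimal sketch of the single-registry alternative suggested here, with hypothetical names throughout (the placeholder converters stand in for the real ones in auto_round/modelling/):

```python
from typing import Callable


def convert_llama4_moe(model):   # placeholder for the Llama4 converter
    return model


def convert_gpt_oss_moe(model):  # placeholder for the GPT-OSS converter
    return model


# Hypothetical single registry: one converter per model type that needs any
# kind of pre-quantization conversion, replacing per-category lists such as
# CONVERT_EXPERT_TO_LINEAR_MODELS.
MODEL_CONVERTERS: dict[str, Callable] = {
    "llama4": convert_llama4_moe,
    "gpt_oss": convert_gpt_oss_moe,
}


def maybe_convert(model):
    converter = MODEL_CONVERTERS.get(model.config.model_type)
    return converter(model) if converter is not None else model
```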
Yeah, I agree, the replacement code could be organized better. Once we support more model replacements, we can refactor that part as needed. For now, how about leaving it as is, since we have some higher-priority tasks to focus on?
I don’t think it will take much effort to change. You could also finish the higher-priority tasks first.
Opened an issue to track this: #899
* Fix rtn tuning_device issue (#893)
* fix vlm gguf ut (#895)
* update alg_ext.abi3.so with python compatible version (#894)
* move ste from quant to round for nvfp4 (#889)
* Add GPT-OSS quant support (#887)
* better help printing information (#883)
* speedup quant and evaluation, fix recompile issue (#897)
* fix nvfp act quantization bug (#891)
* support automatic mixed bits assignment (#851)
* try to fix gguf issue (#886)
* remove numba from requirements (#905)
* Extend mxfp loading dtypes (#907)
* block dataset logger info (#908)
* fix torch compile issue in AutoScheme (#909)
* Revert "Extend mxfp loading dtypes (#907)" (#915); this reverts commit 0c2619c
* support disable_opt_rtn in auto-scheme (#913)
* fix llama 4 ut (#896)
* add numba for cpu lib (#919)
* Loosen the packing restrictions for mxfp&nvfp, enable Qwen1.5-MoE-A2.7B quantize (#911)
* Extend mxfp loading dtypes (#916)
* Fix act config exporting for mixed schemes (#903)
* optimize rtn for int woq (#924)
* fix bug of gguf and support for LiquidAI/LFM2-1.2B (#927)
* remove numpy<2.0 limitation (#921)
* enable regex quantization config saving for mixed bits (#825)
* Fix Flux tuning issue (#936)
* gguf support for inclusionAI/Ling-flash-2.0 (#940)
* remove low_cpu_mem (#934)
* Add compatibility test (#918)
* Add commit hash to version (#941)
* gguf weight type align with original, output.weight, token_embed (#900)
* support attention mask in user's dataset (#930)
* Add diffusion README (#923)
* update readme (#949)
* refactor utils file (#943)
* update readme for sglang support (#953)
* update gguf and support for CompressedLinear (#950)
* Reduce AutoScheme VRAM usage by up to 10X (#944)
* add self attribution and fix avg_bits error (#956)
* add logo (#960)
* refine AutoScheme readme/code (#958)
* update readme (#962)
* fix critic disable_opt_rtn regression (#963)
* [1/N] Initial vllm-ext evaluation support (MXFP4 MOE) (#935)
* fix bug of imatrix contains 0 (#955)
* fix rtn bug (#966)
* enhance flux doc (#967)
* clean code (#968)
* support for model scope (#957)
* merge main branch to alg_ext (#970)
* fix cuda CI backend issue, fixtypo (#974)
* disable compile packing by default (#975)
* enhance auto device map and support XPU (#961)
* refine readme (#978)
* cli support for positional arguments model (#979)
* update bits (#986)
* fix gguf scheme and device_map bug (#969)
* add support for Magistral-Small (#980)
* support model_dtype and fix bug of scheme contains quotes, mllm eval (#985)