Conversation

@yiliu30 (Contributor) commented Oct 23, 2025

Usage

source apply_ext.sh 
VLLM_ENABLE_AR_EXT=1 vllm serve ...

Initial support for out-of-tree AutoRound integration with vLLM, verified on Qwen3-15B-A2B-Base.

  • Added AutoRoundExtensionConfig
  • Monkey-patching support for applying the extension (a bootstrap sketch follows this list)
  • Added AutoRoundMoEMethod and AutoRoundMoEMethodMXFp4Impl
  • Unit tests
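
The patching is opt-in via an environment variable. A minimal sketch of how the sitecustomize.py bootstrap might gate it, assuming apply_ext.sh puts the vllm_ext directory on PYTHONPATH so that sitecustomize.py runs at interpreter startup (the apply_ar_ext entry point below is illustrative, not the exact function in this PR):

import os

# Activate the out-of-tree extension only when explicitly requested,
# so a plain vLLM install behaves exactly as before.
if os.environ.get("VLLM_ENABLE_AR_EXT", "0") == "1":
    # Hypothetical entry point; the real code lives under
    # auto_round/experimental/vllm_ext/ in this PR.
    from auto_round.experimental.vllm_ext import apply as apply_ar_ext

    apply_ar_ext()  # registers AutoRoundExtensionConfig and patches vLLM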

There are three execution paths for MXFP4; a dispatch sketch follows the list:

  • VLLM_ENABLE_STATIC_MOE=1: static MoE that unpacks the weights on the fly; very slow, intended only for correctness checks
  • VLLM_ENABLE_STATIC_MOE=1 + VLLM_MXFP4_PRE_UNPACK_WEIGHTS=1: static MoE that unpacks the weights to FP8 before inference; still slow, but acceptable
  • VLLM_AR_MXFP4_MODULAR_MOE=1 + VLLM_MXFP4_PRE_UNPACK_WEIGHTS=1: unpacks the weights to BF16 and runs FP32 group GEMM; fast, but requires the same memory as the BF16 model. WIP; support lands in the next PR
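
For reference, a sketch of how these flag combinations could map to a path choice; the dispatch helper below is illustrative, and only the environment variable names come from this PR:

import os

def _flag(name: str) -> bool:
    return os.environ.get(name, "0") == "1"

def select_mxfp4_path() -> str:
    """Illustrative dispatch among the three MXFP4 paths above."""
    if _flag("VLLM_ENABLE_STATIC_MOE"):
        if _flag("VLLM_MXFP4_PRE_UNPACK_WEIGHTS"):
            return "static MoE, weights pre-unpacked to FP8"
        return "static MoE, unpack on the fly (correctness check only)"
    if _flag("VLLM_AR_MXFP4_MODULAR_MOE") and _flag("VLLM_MXFP4_PRE_UNPACK_WEIGHTS"):
        return "modular MoE, weights pre-unpacked to BF16, FP32 group GEMM"
    raise ValueError("no MXFP4 path enabled")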

Signed-off-by: yiliu30 <[email protected]>
@yiliu30 changed the title from "init moe support" to "[1/N] Initial vllm-ext evaluation support (MXFP4 MOE)" on Oct 23, 2025
Signed-off-by: yiliu30 <[email protected]>
@yiliu30 requested a review from Copilot on October 23, 2025 02:42
Copilot AI left a comment

Pull Request Overview

This PR introduces initial support for out-of-tree AutoRound integration with vLLM, specifically for MXFP4 MOE quantization. The implementation adds extension configuration, MOE quantization methods, and utilities for FP4/MXFP4 quantization and dequantization operations.

Key changes:

  • Added AutoRoundExtensionConfig to extend AutoRound's quantization support with MXFP4 MOE capabilities
  • Implemented AutoRoundMoEMethod and AutoRoundMoEMethodMXFp4Impl for handling MOE layers with MXFP4 quantization
  • Created utility modules for MXFP4 quantization/dequantization, FP4 conversions, and environment variable extensions

Reviewed Changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.

Summary per file:

  • auto_round/experimental/vllm_ext/utils.py: MXFP4 scale derivation and quantization utilities for E8M0 exponent handling
  • auto_round/experimental/vllm_ext/tests/test_mxfp4_moe.py: basic test validating AutoRound MXFP4 model inference
  • auto_round/experimental/vllm_ext/tests/conftest.py: test fixtures and runners copied from vLLM
  • auto_round/experimental/vllm_ext/sitecustomize.py: bootstrap script for enabling the AutoRound extension via an environment variable
  • auto_round/experimental/vllm_ext/quant_method_moe.py: MoE quantization method dispatcher for AutoRound
  • auto_round/experimental/vllm_ext/mxfp4_qdq_utils.py: MXFP4 quantization/dequantization implementation
  • auto_round/experimental/vllm_ext/moe_impl_mxfp4.py: MXFP4 MoE layer implementation with weight processing
  • auto_round/experimental/vllm_ext/fp4_utils.py: FP4 E2M1 format packing/unpacking utilities
  • auto_round/experimental/vllm_ext/envs_ext.py: extension environment variables for MXFP4 configuration
  • auto_round/experimental/vllm_ext/auto_round_ext.py: AutoRound extension configuration class
  • auto_round/experimental/vllm_ext/__init__.py: extension application entry point
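
To make the format concrete: MXFP4 stores FP4 E2M1 values in groups that share one power-of-two E8M0 scale per group (group size 32 in the OCP MX spec). A minimal PyTorch sketch of the unpack-and-dequantize step, independent of the utilities in this PR (the function names and the two-nibbles-per-byte layout are assumptions):

import torch

# The 8 magnitudes representable in FP4 E2M1; bit 3 of each 4-bit code is the sign.
E2M1_LUT = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def unpack_fp4(packed: torch.Tensor) -> torch.Tensor:
    """Unpack uint8 bytes holding two FP4 codes each into float values."""
    lo = packed & 0x0F
    hi = (packed >> 4) & 0x0F
    codes = torch.stack([lo, hi], dim=-1).flatten(-2)  # interleave the two nibbles per byte
    sign = ((codes & 0x8) == 0).float() * 2.0 - 1.0  # +1 if sign bit clear, else -1
    return sign * E2M1_LUT[(codes & 0x7).long()]

def dequant_mxfp4(packed: torch.Tensor, e8m0: torch.Tensor, group_size: int = 32) -> torch.Tensor:
    """Apply per-group E8M0 scales: scale = 2 ** (exponent - 127)."""
    vals = unpack_fp4(packed).reshape(-1, group_size)
    scales = torch.exp2(e8m0.float() - 127.0).reshape(-1, 1)
    return (vals * scales).reshape(-1)

The pre-unpack paths described above amortize exactly this kind of conversion by materializing the weights once (to FP8 or BF16) before inference instead of on every forward pass.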


Signed-off-by: yiliu30 <[email protected]>
Signed-off-by: yiliu30 <[email protected]>
Signed-off-by: yiliu30 <[email protected]>
@yiliu30 marked this pull request as ready for review on October 23, 2025 03:05
@yiliu30 mentioned this pull request on Oct 23, 2025
Signed-off-by: yiliu30 <[email protected]>
Signed-off-by: yiliu30 <[email protected]>
Signed-off-by: yiliu30 <[email protected]>
Signed-off-by: yiliu30 <[email protected]>
@wenhuach21 (Contributor) commented:

I’m not very familiar with this part. @mengniwang95 @n1ck-guo, could you please review it again and approve the PR if everything looks good?

@yiliu30 merged commit e8bc353 into main on Oct 30, 2025
21 checks passed
@yiliu30 deleted the vllm-ext branch on October 30, 2025 07:22
chensuyue added a commit that referenced this pull request Nov 11, 2025
* Fix rtn tuning_device issue (#893)

Signed-off-by: Kaihui-intel <[email protected]>

* fix vlm gguf ut (#895)

Signed-off-by: n1ck-guo <[email protected]>

* update alg_ext.abi3.so with python compatible version (#894)

* move ste from quant to round for nvfp4 (#889)

Signed-off-by: He, Xin3 <[email protected]>

* Add GPT-OSS quant support (#887)

* better help printing information (#883)

* better help printing information

Signed-off-by: n1ck-guo <[email protected]>

* speedup quant and evaluation, fix recompile issue (#897)

* rewrite the implementation for ease-of-maintain

Signed-off-by: He, Xin3 <[email protected]>

* fix bug

Signed-off-by: He, Xin3 <[email protected]>

* fix quant performance

Signed-off-by: He, Xin3 <[email protected]>

* Update auto_round/compressors/base.py

---------

Signed-off-by: He, Xin3 <[email protected]>

* fix nvfp act quantization bug (#891)

* fix nvfp act quantization bug

Signed-off-by: Zhang, Weiwei1 <[email protected]>

* add cuda ut for moe nvfp quantize

Signed-off-by: Zhang, Weiwei1 <[email protected]>

* add cpu UT, refine cuda UT

Signed-off-by: Zhang, Weiwei1 <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix ut typo

Signed-off-by: Zhang, Weiwei1 <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix cpu ut

Signed-off-by: Zhang, Weiwei1 <[email protected]>

* enhance experts amax match, refine UT

Signed-off-by: Zhang, Weiwei1 <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Zhang, Weiwei1 <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* support automatic mixed bits assignment (#851)

* try to fix gguf issue (#886)

* remove numba from requirements (#905)

Signed-off-by: yiliu30 <[email protected]>

* Extend mxfp loading dtypes (#907)

* block dataset logger info (#908)

Signed-off-by: n1ck-guo <[email protected]>

* fix torch compile issue in AutoScheme (#909)

* Revert "Extend mxfp loading dtypes (#907)" (#915)

This reverts commit 0c2619c.

* support disable_opt_rtn in auto-scheme (#913)

* fix llama 4 ut (#896)

* fix ut of llama 4

Signed-off-by: n1ck-guo <[email protected]>

* add numba for cpu lib (#919)

Signed-off-by: yiliu30 <[email protected]>

* Loosen the packing restrictions for mxfp&nvfp (#911)

* Loosen the packing restrictions for mxfp&nvfp, enable Qwen1.5-MoE-A2.7B quantize

Signed-off-by: Zhang, Weiwei1 <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix UT

Signed-off-by: Zhang, Weiwei1 <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* refine mxfp&nvfp layer checker

Signed-off-by: Zhang, Weiwei1 <[email protected]>

* fix pylint

Signed-off-by: Zhang, Weiwei1 <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Zhang, Weiwei1 <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Extend mxfp loading dtypes (#916)

Signed-off-by: root <[email protected]>
Signed-off-by: yiliu30 <[email protected]>
Co-authored-by: root <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Fix act config exporting for mixed schemes (#903)

* fp8 exporting bugfix

Signed-off-by: Zhang, Weiwei1 <[email protected]>

* fix act related config saving

Signed-off-by: Zhang, Weiwei1 <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add ut for act_config check

Signed-off-by: Zhang, Weiwei1 <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* refine extra_config saving, add UTs

Signed-off-by: Zhang, Weiwei1 <[email protected]>

* fix ut typo

Signed-off-by: Zhang, Weiwei1 <[email protected]>

* fix ut typo

Signed-off-by: Zhang, Weiwei1 <[email protected]>

* fixtypo

Signed-off-by: Zhang, Weiwei1 <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix CI

Signed-off-by: Zhang, Weiwei1 <[email protected]>

* fix scan issue

Signed-off-by: Zhang, Weiwei1 <[email protected]>

* fix scan issue

Signed-off-by: Zhang, Weiwei1 <[email protected]>

* rm global variable

Signed-off-by: Zhang, Weiwei1 <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* rerun ut

Signed-off-by: Zhang, Weiwei1 <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* refine ut

Signed-off-by: Zhang, Weiwei1 <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Zhang, Weiwei1 <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* optimize rtn for int woq (#924)

* fix bug of gguf and support for LiquidAI/LFM2-1.2B (#927)

Signed-off-by: n1ck-guo <[email protected]>

* remove numpy<2.0 limitation (#921)

* enable regex quantization config saving for mixed bits (#825)

* enable dynamic quantization config saving

Signed-off-by: Zhang, Weiwei1 <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fixtypo

Signed-off-by: Zhang, Weiwei1 <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* rebase code, refine config saving

Signed-off-by: Zhang, Weiwei1 <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* refine ut

Signed-off-by: Zhang, Weiwei1 <[email protected]>

* fix UT

Signed-off-by: Zhang, Weiwei1 <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* enable hf loading for regex, add UTs

Signed-off-by: Zhang, Weiwei1 <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* refine export, enhance gptq UT

Signed-off-by: Zhang, Weiwei1 <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Zhang, Weiwei1 <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Fix Flux tuning issue (#936)

Signed-off-by: Mengni Wang <[email protected]>

* gguf support for inclusionAI/Ling-flash-2.0 (#940)

* remove low_cpu_mem (#934)

* Add compatibility test (#918)

* Add commit hash to version (#941)

Signed-off-by: Sun, Xuehao <[email protected]>

* gguf weight type align with original, output.weight, token_embed (#900)

* support attention mask in user's dataset (#930)

* Add diffusion README (#923)

* update readme (#949)

* refactor utils file (#943)

* refactor utils

Signed-off-by: n1ck-guo <[email protected]>

* update readme for sglang support (#953)

* update readme for sglang support

Signed-off-by: Zhang, Weiwei1 <[email protected]>

* refine doc

Signed-off-by: Zhang, Weiwei1 <[email protected]>

* Update README.md

---------

Signed-off-by: Zhang, Weiwei1 <[email protected]>
Co-authored-by: Wenhua Cheng <[email protected]>

* update gguf and support for CompressedLinear (#950)

* Reduce AutoScheme VRAM usage by up to 10X (#944)

* add self attribution and fix avg_bits error (#956)

* add self attribution and fix avg_bits error
---------

Signed-off-by: He, Xin3 <[email protected]>
Co-authored-by: Wenhua Cheng <[email protected]>

* add logo (#960)

* refine AutoScheme readme/code (#958)

* update readme (#962)

* fix critical disable_opt_rtn regression (#963)

* [1/N] Initial vllm-ext evaluation support (MXFP4 MOE) (#935)

Signed-off-by: yiliu30 <[email protected]>

* fix bug of imatrix contains 0 (#955)

* fix rtn bug (#966)

* enhance flux doc (#967)

* clean code (#968)

* support for ModelScope (#957)

* support for model scope

Signed-off-by: n1ck-guo <[email protected]>

* merge main branch to alg_ext (#970)

* fix cuda CI backend issue, fixtypo (#974)

* disable compile packing by default (#975)

Signed-off-by: yiliu30 <[email protected]>

* enhance auto device map and support XPU  (#961)

* enhance auto device map and support XPU
---------

Signed-off-by: He, Xin3 <[email protected]>

* refine readme (#978)

* cli support for positional arguments model (#979)

Signed-off-by: n1ck-guo <[email protected]>

* update bits (#986)

Signed-off-by: He, Xin3 <[email protected]>

* fix gguf scheme and device_map bug (#969)

* add support for Magistral-Small (#980)

* support model_dtype and fix bug of scheme contains quotes, mllm eval (#985)

---------

Signed-off-by: Kaihui-intel <[email protected]>
Signed-off-by: n1ck-guo <[email protected]>
Signed-off-by: He, Xin3 <[email protected]>
Signed-off-by: Zhang, Weiwei1 <[email protected]>
Signed-off-by: yiliu30 <[email protected]>
Signed-off-by: root <[email protected]>
Signed-off-by: Mengni Wang <[email protected]>
Signed-off-by: Sun, Xuehao <[email protected]>
Co-authored-by: Tang Kaihui <[email protected]>
Co-authored-by: Heng Guo <[email protected]>
Co-authored-by: Xin He <[email protected]>
Co-authored-by: Yi Liu <[email protected]>
Co-authored-by: Weiwei <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Wenhua Cheng <[email protected]>
Co-authored-by: root <[email protected]>
Co-authored-by: Wang, Mengni <[email protected]>
Co-authored-by: Sun, Xuehao <[email protected]>