Add MXFP8 MOE/Linear and MXFP4 Linear #1034
base: main
Conversation
Signed-off-by: yiliu30 <[email protected]>
for more information, see https://pre-commit.ci
Pull Request Overview
This PR adds support for MXFP8 and MXFP4 quantization schemes for both MoE (Mixture of Experts) and Linear layers in the vLLM extension, addressing issue #937.
Key changes include:
- Introduction of new linear layer quantization methods with MXFP4/MXFP8 implementations
- Addition of utility functions for scheme detection and quantization/dequantization operations
- Extension of MoE methods to support MXFP8 quantization
- New environment variables for controlling MXFP4 unpacking behavior (the unpacking idea is sketched after this list)
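For readers unfamiliar with the packed MXFP4 layout: each byte stores two 4-bit E2M1 codes, and dequantization maps each code through the fixed E2M1 value table before applying the per-block scale. Below is a minimal, hedged sketch of that unpacking step only; the function and variable names are illustrative and not taken from the PR, and the low-nibble-first ordering is an assumption.

```python
import torch

# The 16 values representable by FP4 E2M1 (sign bit + 2 exponent bits + 1 mantissa bit).
E2M1_LUT = torch.tensor(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0],
    dtype=torch.bfloat16,
)

def unpack_mxfp4(packed: torch.Tensor) -> torch.Tensor:
    """Unpack a uint8 tensor holding two FP4 codes per byte into bf16 values.

    packed: shape (..., N) uint8; returns shape (..., 2 * N) bf16.
    Assumes the low nibble comes first (packing convention not confirmed by the PR).
    """
    lo = packed & 0x0F
    hi = (packed >> 4) & 0x0F
    codes = torch.stack((lo, hi), dim=-1).flatten(start_dim=-2)
    return E2M1_LUT[codes.long()]

# Example: one byte 0x21 -> codes [1, 2] -> values [0.5, 1.0].
print(unpack_mxfp4(torch.tensor([0x21], dtype=torch.uint8)))
```

The per-32-element block scale is applied to the unpacked values afterwards, which is presumably why the PR also touches the scale-reshaping path in mxfp4_qdq_utils.py.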
Reviewed Changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| utils.py | Added utility functions for quantization scheme detection and validation |
| torchao_patch.py | New file implementing MX format quantization with scale calculation modes |
| quant_method_linear.py | New quantization method dispatcher for linear layers supporting MXFP4/MXFP8 |
| quant_impl.py | Abstract base class defining the quantization implementation interface |
| mxfp8_qdq_utils.py | MXFP8 quantization/dequantization utilities |
| mxfp4_qdq_utils.py | Added scale reshaping for MXFP4 weight dequantization |
| linear_impl_mxfp8.py | MXFP8 linear layer implementation with weight/activation quantization |
| linear_impl_mxfp4.py | MXFP4 linear layer implementation with optional weight unpacking |
| quant_method_moe.py | Refactored to use shared utilities and added MXFP8 MoE support |
| moe_impl_mxfp4.py | Added MXFP8 unpacking path and GPT-OSS post-processing support |
| fp4_utils.py | Fixed incorrect assertion logic for packing validation |
| envs_ext.py | Added new environment variables for MXFP4/MXFP8 control |
| auto_round_ext.py | Updated to use new linear quantization method dispatcher |
| tests/test_models.py | Updated test model path to MXFP8 variant |
| README.md | New documentation for building and using vLLM extension |
> Enum representing the different methods for calculating MX block scaling.
> There are three methods available:
> FLOOR: This method is recommended by the OCP MX Spec 1.0 and uses X = 2^floor(log2(max_abs(v))-max_exp).
> It result in overflow issues for large values and bad for gradient quantization.

Copilot (AI) commented on Nov 17, 2025: Corrected grammar: 'It result in' should be 'It results in'.

Suggested change:
> It results in overflow issues for large values and is bad for gradient quantization.
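To make the FLOOR overflow concern concrete, here is a small, hedged numeric sketch. It assumes FP8 E4M3 elements with max_pos = 448 and max_exp = 8; it is not the PR's code, only an illustration of the formula quoted above.

```python
import math

MAX_POS = 448.0  # largest finite FP8 E4M3 magnitude (assumed element format)
MAX_EXP = 8      # exponent of the largest E4M3 binade (448 = 1.75 * 2**8)

def floor_scale(max_abs: float) -> float:
    """FLOOR mode from the OCP MX spec: X = 2**(floor(log2(max_abs)) - max_exp)."""
    return 2.0 ** (math.floor(math.log2(max_abs)) - MAX_EXP)

max_abs = 480.0                # block maximum slightly above MAX_POS
scale = floor_scale(max_abs)   # floor(log2(480)) = 8 -> scale = 2**0 = 1
print(scale, max_abs / scale)  # 1.0 480.0 -> exceeds 448, so the quantized element saturates/overflows
```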
> RCEIL: The method is to apply ceil to the ratio of max_abs(v) and max_pos.
> This method's detail is described in https://docs.nvidia.com/cuda/cublas/index.html#d-block-quantization
> Section "Computing scaling and conversion factors for FP8 with UE8M0 scales"

Copilot (AI) commented on Nov 17, 2025: [nitpick] The description for RCEIL is incomplete compared to the other methods. Consider adding more context about when this method should be used and its trade-offs, similar to the documentation for FLOOR, CEIL, and EVEN.

Suggested change:
> RCEIL: This method applies ceil to the ratio of max_abs(v) and max_pos, i.e., X = ceil(max_abs(v) / max_pos).
> It is recommended for hardware implementations such as NVIDIA's cuBLAS FP8 block quantization, as described in
> https://docs.nvidia.com/cuda/cublas/index.html#d-block-quantization (see section "Computing scaling and conversion factors for FP8 with UE8M0 scales").
> RCEIL avoids overflow issues and provides a hardware-friendly scaling factor, but may result in less precise scaling for values near the maximum.
> Use RCEIL when targeting hardware that expects this scaling method, or when compatibility with cuBLAS FP8 quantization is required.
> Compared to FLOOR and CEIL, RCEIL may offer better robustness for large values, but could sacrifice some accuracy for values close to max_pos.
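For comparison with the FLOOR sketch above, here is a hedged sketch of the RCEIL idea with a UE8M0 (power-of-two) scale, using the same assumed E4M3 constants; the exact rounding details live in the cuBLAS documentation linked in the suggestion.

```python
import math

MAX_POS = 448.0  # largest finite FP8 E4M3 magnitude (assumed element format)

def rceil_scale(max_abs: float) -> float:
    """RCEIL mode: round the ratio max_abs / max_pos up to the next power of two (UE8M0 scale)."""
    return 2.0 ** math.ceil(math.log2(max_abs / MAX_POS))

max_abs = 480.0
scale = rceil_scale(max_abs)   # ceil(log2(480 / 448)) = 1 -> scale = 2
print(scale, max_abs / scale)  # 2.0 240.0 -> fits within the E4M3 range, no overflow
```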
> def mxfp4_fp8_weight_to_bf16(weight_fp8, scale_bf16):

Copilot (AI) commented on Nov 17, 2025: The scale reshaping was added here, but the function signature and documentation don't indicate this behavior change. Consider documenting why this reshape is necessary for the function to work correctly.

Suggested change:
> def mxfp4_fp8_weight_to_bf16(weight_fp8, scale_bf16):
>     """
>     Dequantizes FP8 weights to BF16 using the provided scale.
>     Args:
>         weight_fp8 (torch.Tensor): The quantized weights tensor. Expected to be divisible into blocks of 32 elements.
>         scale_bf16 (torch.Tensor): The scale tensor. Will be reshaped to (-1, 1) to match the block structure of the weights.
>             This reshape is necessary to broadcast the scale across each block of 32 weights.
>     Returns:
>         torch.Tensor: The dequantized weights in BF16 format, with the same shape as the input weights.
>     """
> def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
>     logger.debug(f"Processing weights after loading for layer: {layer._prefix}")

Copilot (AI) commented on Nov 17, 2025: Accessing the private attribute '_prefix' (indicated by the leading underscore) is not recommended. Consider either making this a public attribute or accessing it through a public method.

Suggested change:
>     layer_prefix = getattr(layer, "prefix", getattr(layer, "_prefix", None))
>     logger.debug(f"Processing weights after loading for layer: {layer_prefix}")
> if self.has_bias:
>     if envs.VLLM_AR_POST_PROCESS_GPTOSS:
>         w13_bias_swapped = revert_interleaved_bias(layer.w13_bias)
>         layer.w13_bias.data.copy_(w13_bias_swapped)

Copilot (AI) commented on Nov 17, 2025: [nitpick] The nested if statements can be simplified by combining the conditions. Consider using `if self.has_bias and envs.VLLM_AR_POST_PROCESS_GPTOSS:` for better readability.

Suggested change:
> if self.has_bias and envs.VLLM_AR_POST_PROCESS_GPTOSS:
>     w13_bias_swapped = revert_interleaved_bias(layer.w13_bias)
>     layer.w13_bias.data.copy_(w13_bias_swapped)
> # dequant weight
> weight = layer.weight
> weight_scale = layer.weight_scale
> dequnat_weight = dequant_mx_fp8(

Copilot (AI) commented on Nov 17, 2025: Corrected spelling of 'dequnat_weight' to 'dequant_weight'.
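For context on what a dequant_mx_fp8-style helper typically does: MXFP8 stores fp8 elements plus one E8M0 scale per 32-element block, where the scale byte is a biased exponent (value = 2**(e - 127)). A hedged sketch under those assumptions; the names here are illustrative, not the PR's API.

```python
import torch

BLOCK = 32

def dequant_mx_fp8_sketch(weight_fp8: torch.Tensor, scale_e8m0: torch.Tensor) -> torch.Tensor:
    """Decode the uint8 E8M0 exponent to a power-of-two scale and broadcast it per 32-element block."""
    scale = torch.exp2(scale_e8m0.to(torch.float32) - 127.0)  # E8M0: stored byte is exponent + 127
    blocks = weight_fp8.to(torch.bfloat16).reshape(-1, BLOCK)
    return (blocks * scale.reshape(-1, 1).to(torch.bfloat16)).reshape(weight_fp8.shape)

w = torch.randn(2, 64).to(torch.float8_e4m3fn)
s = torch.full((4,), 127, dtype=torch.uint8)  # exponent 127 -> scale 1.0 for each of the 4 blocks
print(dequant_mx_fp8_sketch(w, s).shape)      # torch.Size([2, 64])
```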
> @classmethod
> def get_min_capability(cls) -> int:
>     return cls.impl.get_min_capability()

Copilot (AI) commented on Nov 17, 2025: This classmethod is attempting to access cls.impl, but impl is an instance attribute set in __init__, not a class attribute. This will raise an AttributeError when called. Consider making this an instance method or restructuring the implementation.

Suggested change:
> def get_min_capability(self) -> int:
>     return self.impl.get_min_capability()
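The distinction the comment points at is the usual class-attribute vs. instance-attribute one; a minimal illustration with hypothetical names (not the PR's classes):

```python
class QuantMethod:
    """Minimal stand-in for a quantization method class (hypothetical)."""

    def __init__(self) -> None:
        self.impl = "mxfp8-impl"  # instance attribute, set only in __init__

    @classmethod
    def broken_min_capability(cls):
        # cls has no 'impl'; this raises AttributeError when called.
        return cls.impl

    def min_capability(self):
        # An instance method sees the attribute set in __init__.
        return self.impl


print(QuantMethod().min_capability())       # "mxfp8-impl"
try:
    QuantMethod.broken_min_capability()
except AttributeError as err:
    print(err)                              # type object 'QuantMethod' has no attribute 'impl'
```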
Addresses #937.
The branch https://github.com/yiliu30/vllm-fork/tree/fused-moe-ar is required.
Please refer to the README.md for more details.