
Conversation

yiliu30 (Contributor) commented Nov 14, 2025

#937

The branch https://github.com/yiliu30/vllm-fork/tree/fused-moe-ar is required.
Please refer to the README.md for more details.

Signed-off-by: yiliu30 <[email protected]>
yiliu30 and others added 12 commits November 12, 2025 03:09
yiliu30 marked this pull request as ready for review November 17, 2025 07:06
Copilot AI left a comment
Pull Request Overview

This PR adds support for MXFP8 and MXFP4 quantization schemes for both MoE (Mixture of Experts) and Linear layers in the vLLM extension, addressing issue #937.

Key changes include (a minimal MX block-quantization sketch follows the list):

  • Introduction of new linear layer quantization methods with MXFP4/MXFP8 implementations
  • Addition of utility functions for scheme detection and quantization/dequantization operations
  • Extension of MoE methods to support MXFP8 quantization
  • New environment variables for controlling MXFP4 unpacking behavior
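
To make the MX block layout behind these changes concrete, here is a minimal, hedged sketch of an MXFP8-style quantize/dequantize round trip: 32-element blocks, one power-of-two (FLOOR-style) scale per block, FP8 E4M3 elements. Function names and edge handling are illustrative assumptions, not the PR's actual implementation.

    import torch

    E4M3_MAX_EXP = 8  # floor(log2(448)), 448 being the largest FP8 E4M3 value

    def quantize_mxfp8_sketch(x: torch.Tensor, block_size: int = 32):
        # One shared power-of-two scale per 32-element block (FLOOR-style);
        # padding and non-divisible shapes are ignored in this sketch.
        blocks = x.reshape(-1, block_size)
        amax = blocks.abs().amax(dim=-1, keepdim=True).clamp_min(1e-12)
        scale = torch.exp2(torch.floor(torch.log2(amax)) - E4M3_MAX_EXP)
        q = (blocks / scale).to(torch.float8_e4m3fn)
        return q, scale

    def dequantize_mxfp8_sketch(q: torch.Tensor, scale: torch.Tensor, shape):
        # Inverse path: upcast, rescale each block, restore the original shape.
        return (q.to(torch.bfloat16) * scale.to(torch.bfloat16)).reshape(shape)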

Reviewed Changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 7 comments.

Summary per file:

  • utils.py: Added utility functions for quantization scheme detection and validation
  • torchao_patch.py: New file implementing MX format quantization with scale calculation modes
  • quant_method_linear.py: New quantization method dispatcher for linear layers supporting MXFP4/MXFP8
  • quant_impl.py: Abstract base class defining the quantization implementation interface
  • mxfp8_qdq_utils.py: MXFP8 quantization/dequantization utilities
  • mxfp4_qdq_utils.py: Added scale reshaping for MXFP4 weight dequantization
  • linear_impl_mxfp8.py: MXFP8 linear layer implementation with weight/activation quantization
  • linear_impl_mxfp4.py: MXFP4 linear layer implementation with optional weight unpacking
  • quant_method_moe.py: Refactored to use shared utilities and added MXFP8 MoE support
  • moe_impl_mxfp4.py: Added MXFP8 unpacking path and GPT-OSS post-processing support
  • fp4_utils.py: Fixed incorrect assertion logic for packing validation
  • envs_ext.py: Added new environment variables for MXFP4/MXFP8 control
  • auto_round_ext.py: Updated to use new linear quantization method dispatcher
  • tests/test_models.py: Updated test model path to MXFP8 variant
  • README.md: New documentation for building and using vLLM extension

Enum representing the different methods for calculating MX block scaling.
There are three methods available:
FLOOR: This method is recommended by the OCP MX Spec 1.0 and uses X = 2^floor(log2(max_abs(v))-max_exp).
It result in overflow issues for large values and bad for gradient quantization.
Copilot AI Nov 17, 2025

Corrected grammar: 'It result in' should be 'It results in'.

Suggested change
It result in overflow issues for large values and bad for gradient quantization.
It results in overflow issues for large values and is bad for gradient quantization.

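To make the FLOOR overflow remark above concrete, here is a small numeric check (assuming FP4 E2M1 as the target format, whose largest value is 6 = 1.5 * 2^2, so max_exp = 2); it is an illustration, not code from the PR:

    import math

    amax, max_exp, fp4_max = 7.0, 2, 6.0
    x_floor = 2 ** (math.floor(math.log2(amax)) - max_exp)  # 1.0
    x_ceil = 2 ** (math.ceil(math.log2(amax)) - max_exp)    # 2.0
    print(amax / x_floor > fp4_max)  # True:  7 / 1 = 7 exceeds 6, so FLOOR saturates
    print(amax / x_ceil > fp4_max)   # False: 7 / 2 = 3.5 fits without clipping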
Comment on lines +33 to +35
RCEIL: The method is to apply ceil to the ratio of max_abs(v) and max_pos.
This method's detail is described in https://docs.nvidia.com/cuda/cublas/index.html#d-block-quantization
Section "Computing scaling and conversion factors for FP8 with UE8M0 scales"
Copilot AI Nov 17, 2025

[nitpick] The description for RCEIL is incomplete compared to other methods. Consider adding more context about when this method should be used and its trade-offs, similar to the documentation for FLOOR, CEIL, and EVEN.

Suggested change
RCEIL: The method is to apply ceil to the ratio of max_abs(v) and max_pos.
This method's detail is described in https://docs.nvidia.com/cuda/cublas/index.html#d-block-quantization
Section "Computing scaling and conversion factors for FP8 with UE8M0 scales"
RCEIL: This method applies ceil to the ratio of max_abs(v) and max_pos, i.e., X = ceil(max_abs(v) / max_pos).
It is recommended for hardware implementations such as NVIDIA's cublas FP8 block quantization, as described in
https://docs.nvidia.com/cuda/cublas/index.html#d-block-quantization (see section "Computing scaling and conversion factors for FP8 with UE8M0 scales").
RCEIL avoids overflow issues and provides a hardware-friendly scaling factor, but may result in less precise scaling for values near the maximum.
Use RCEIL when targeting hardware that expects this scaling method, or when compatibility with cublas FP8 quantization is required. Compared to FLOOR and CEIL,
RCEIL may offer better robustness for large values, but could sacrifice some accuracy for values close to max_pos.

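For concreteness, a hedged sketch of the RCEIL computation as described here: since UE8M0 scales are powers of two, the ratio max_abs(v) / max_pos is rounded up to the next power of two per block, i.e. 2^ceil(log2(max_abs(v) / max_pos)). This is an illustration of the cuBLAS description, not the PR's torchao_patch.py code; the helper name and block handling are assumptions.

    import torch

    def rceil_block_scale(v: torch.Tensor, max_pos: float, block_size: int = 32) -> torch.Tensor:
        # max_pos is the largest value of the target element format,
        # e.g. 448.0 for FP8 E4M3 or 6.0 for FP4 E2M1.
        blocks = v.reshape(-1, block_size)
        amax = blocks.abs().amax(dim=-1, keepdim=True).clamp_min(torch.finfo(v.dtype).tiny)
        return torch.exp2(torch.ceil(torch.log2(amax / max_pos)))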
Comment on lines 161 to +162
def mxfp4_fp8_weight_to_bf16(weight_fp8, scale_bf16):

Copilot AI Nov 17, 2025

The scale reshaping was added here but the function signature and documentation don't indicate this behavior change. Consider documenting why this reshape is necessary for the function to work correctly.

Suggested change
def mxfp4_fp8_weight_to_bf16(weight_fp8, scale_bf16):
def mxfp4_fp8_weight_to_bf16(weight_fp8, scale_bf16):
    """
    Dequantizes FP8 weights to BF16 using the provided scale.

    Args:
        weight_fp8 (torch.Tensor): The quantized weights tensor. Expected to be divisible into blocks of 32 elements.
        scale_bf16 (torch.Tensor): The scale tensor. Will be reshaped to (-1, 1) to match the block structure of the weights.
            This reshape is necessary to broadcast the scale across each block of 32 weights.

    Returns:
        torch.Tensor: The dequantized weights in BF16 format, with the same shape as the input weights.
    """

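To illustrate why the reshape matters: after flattening the weights into (num_blocks, 32) rows, a (num_blocks, 1) column of scales broadcasts one scale across each 32-element block. A minimal sketch of just that broadcast (a simplified stand-in, not the actual mxfp4_fp8_weight_to_bf16 implementation):

    import torch

    def blocked_dequant_sketch(weight: torch.Tensor, scale_bf16: torch.Tensor, block_size: int = 32) -> torch.Tensor:
        w = weight.to(torch.bfloat16).reshape(-1, block_size)  # (num_blocks, 32)
        s = scale_bf16.reshape(-1, 1)                          # one scale per block
        return (w * s).reshape(weight.shape)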
)

def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
    logger.debug(f"Processing weights after loading for layer: {layer._prefix}")
Copilot AI Nov 17, 2025

Accessing the private attribute '_prefix' (indicated by the leading underscore) is not recommended. Consider either making this a public attribute or accessing it through a public method.

Suggested change
logger.debug(f"Processing weights after loading for layer: {layer._prefix}")
layer_prefix = getattr(layer, "prefix", getattr(layer, "_prefix", None))
logger.debug(f"Processing weights after loading for layer: {layer_prefix}")

Comment on lines +287 to +290
if self.has_bias:
    if envs.VLLM_AR_POST_PROCESS_GPTOSS:
        w13_bias_swapped = revert_interleaved_bias(layer.w13_bias)
        layer.w13_bias.data.copy_(w13_bias_swapped)
Copilot AI Nov 17, 2025

[nitpick] The nested if statements can be simplified by combining the conditions. Consider using if self.has_bias and envs.VLLM_AR_POST_PROCESS_GPTOSS: for better readability.

Suggested change
if self.has_bias:
    if envs.VLLM_AR_POST_PROCESS_GPTOSS:
        w13_bias_swapped = revert_interleaved_bias(layer.w13_bias)
        layer.w13_bias.data.copy_(w13_bias_swapped)
if self.has_bias and envs.VLLM_AR_POST_PROCESS_GPTOSS:
    w13_bias_swapped = revert_interleaved_bias(layer.w13_bias)
    layer.w13_bias.data.copy_(w13_bias_swapped)

# dequant weight
weight = layer.weight
weight_scale = layer.weight_scale
dequnat_weight = dequant_mx_fp8(
Copilot AI Nov 17, 2025

Corrected spelling of 'dequnat_weight' to 'dequant_weight'.

Comment on lines +78 to +80
@classmethod
def get_min_capability(cls) -> int:
    return cls.impl.get_min_capability()
Copilot AI Nov 17, 2025

This classmethod is attempting to access cls.impl, but impl is an instance attribute set in __init__, not a class attribute. This will raise an AttributeError when called. Consider making this an instance method or restructuring the implementation.

Suggested change
@classmethod
def get_min_capability(cls) -> int:
    return cls.impl.get_min_capability()
def get_min_capability(self) -> int:
    return self.impl.get_min_capability()
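
To illustrate the failure mode described above (assuming impl really is only assigned in __init__), a minimal reproduction:

    class Demo:
        def __init__(self):
            self.impl = object()   # instance attribute, set only in __init__

        @classmethod
        def broken(cls):
            return cls.impl        # AttributeError: type object 'Demo' has no attribute 'impl'

    Demo().broken()  # fails even when called on an instance, since cls is the class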

yiliu30 requested a review from n1ck-guo November 17, 2025 07:10