Add MXFP8 MOE/Linear and MXFP4 Linear #1034
base: main
Conversation
Signed-off-by: yiliu30 <[email protected]>
for more information, see https://pre-commit.ci
Pull Request Overview
This PR adds support for MXFP8 and MXFP4 quantization schemes for both MoE (Mixture of Experts) and Linear layers in the vLLM extension, addressing issue #937.
Key changes include:
- Introduction of new linear layer quantization methods with MXFP4/MXFP8 implementations
- Addition of utility functions for scheme detection and quantization/dequantization operations
- Extension of MoE methods to support MXFP8 quantization
- New environment variables for controlling MXFP4 unpacking behavior (the unpacking idea is sketched after this list)
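For readers unfamiliar with the packed MXFP4 layout: each byte stores two 4-bit E2M1 codes, and dequantization maps each code through the fixed E2M1 value table before applying the per-block scale. Below is a minimal, hedged sketch of that unpacking step only; the function and variable names are illustrative and not taken from the PR, and the low-nibble-first ordering is an assumption.

```python
import torch

# The 16 values representable by FP4 E2M1 (sign bit + 2 exponent bits + 1 mantissa bit).
E2M1_LUT = torch.tensor(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0],
    dtype=torch.bfloat16,
)

def unpack_mxfp4(packed: torch.Tensor) -> torch.Tensor:
    """Unpack a uint8 tensor holding two FP4 codes per byte into bf16 values.

    packed: shape (..., N) uint8; returns shape (..., 2 * N) bf16.
    Assumes the low nibble comes first (packing convention not confirmed by the PR).
    """
    lo = packed & 0x0F
    hi = (packed >> 4) & 0x0F
    codes = torch.stack((lo, hi), dim=-1).flatten(start_dim=-2)
    return E2M1_LUT[codes.long()]

# Example: one byte 0x21 -> codes [1, 2] -> values [0.5, 1.0].
print(unpack_mxfp4(torch.tensor([0x21], dtype=torch.uint8)))
```

The per-32-element block scale is applied to the unpacked values afterwards, which is presumably why the PR also touches the scale-reshaping path in mxfp4_qdq_utils.py.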
Reviewed Changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| utils.py | Added utility functions for quantization scheme detection and validation |
| torchao_patch.py | New file implementing MX format quantization with scale calculation modes |
| quant_method_linear.py | New quantization method dispatcher for linear layers supporting MXFP4/MXFP8 |
| quant_impl.py | Abstract base class defining the quantization implementation interface |
| mxfp8_qdq_utils.py | MXFP8 quantization/dequantization utilities |
| mxfp4_qdq_utils.py | Added scale reshaping for MXFP4 weight dequantization |
| linear_impl_mxfp8.py | MXFP8 linear layer implementation with weight/activation quantization |
| linear_impl_mxfp4.py | MXFP4 linear layer implementation with optional weight unpacking |
| quant_method_moe.py | Refactored to use shared utilities and added MXFP8 MoE support |
| moe_impl_mxfp4.py | Added MXFP8 unpacking path and GPT-OSS post-processing support |
| fp4_utils.py | Fixed incorrect assertion logic for packing validation |
| envs_ext.py | Added new environment variables for MXFP4/MXFP8 control |
| auto_round_ext.py | Updated to use new linear quantization method dispatcher |
| tests/test_models.py | Updated test model path to MXFP8 variant |
| README.md | New documentation for building and using vLLM extension |
> Enum representing the different methods for calculating MX block scaling.
> There are three methods available:
> FLOOR: This method is recommended by the OCP MX Spec 1.0 and uses X = 2^floor(log2(max_abs(v))-max_exp).
> It result in overflow issues for large values and bad for gradient quantization.

Copilot (AI) commented on Nov 17, 2025: Corrected grammar: 'It result in' should be 'It results in'.

Suggested change:
> It results in overflow issues for large values and is bad for gradient quantization.
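To make the FLOOR overflow concern concrete, here is a small, hedged numeric sketch. It assumes FP8 E4M3 elements with max_pos = 448 and max_exp = 8; it is not the PR's code, only an illustration of the formula quoted above.

```python
import math

MAX_POS = 448.0  # largest finite FP8 E4M3 magnitude (assumed element format)
MAX_EXP = 8      # exponent of the largest E4M3 binade (448 = 1.75 * 2**8)

def floor_scale(max_abs: float) -> float:
    """FLOOR mode from the OCP MX spec: X = 2**(floor(log2(max_abs)) - max_exp)."""
    return 2.0 ** (math.floor(math.log2(max_abs)) - MAX_EXP)

max_abs = 480.0                # block maximum slightly above MAX_POS
scale = floor_scale(max_abs)   # floor(log2(480)) = 8 -> scale = 2**0 = 1
print(scale, max_abs / scale)  # 1.0 480.0 -> exceeds 448, so the quantized element saturates/overflows
```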
> RCEIL: The method is to apply ceil to the ratio of max_abs(v) and max_pos.
> This method's detail is described in https://docs.nvidia.com/cuda/cublas/index.html#d-block-quantization
> Section "Computing scaling and conversion factors for FP8 with UE8M0 scales"

Copilot (AI) commented on Nov 17, 2025: [nitpick] The description for RCEIL is incomplete compared to the other methods. Consider adding more context about when this method should be used and its trade-offs, similar to the documentation for FLOOR, CEIL, and EVEN.

Suggested change:
> RCEIL: This method applies ceil to the ratio of max_abs(v) and max_pos, i.e., X = ceil(max_abs(v) / max_pos).
> It is recommended for hardware implementations such as NVIDIA's cuBLAS FP8 block quantization, as described in
> https://docs.nvidia.com/cuda/cublas/index.html#d-block-quantization (see section "Computing scaling and conversion factors for FP8 with UE8M0 scales").
> RCEIL avoids overflow issues and provides a hardware-friendly scaling factor, but may result in less precise scaling for values near the maximum.
> Use RCEIL when targeting hardware that expects this scaling method, or when compatibility with cuBLAS FP8 quantization is required.
> Compared to FLOOR and CEIL, RCEIL may offer better robustness for large values, but could sacrifice some accuracy for values close to max_pos.
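For comparison with the FLOOR sketch above, here is a hedged sketch of the RCEIL idea with a UE8M0 (power-of-two) scale, using the same assumed E4M3 constants; the exact rounding details live in the cuBLAS documentation linked in the suggestion.

```python
import math

MAX_POS = 448.0  # largest finite FP8 E4M3 magnitude (assumed element format)

def rceil_scale(max_abs: float) -> float:
    """RCEIL mode: round the ratio max_abs / max_pos up to the next power of two (UE8M0 scale)."""
    return 2.0 ** math.ceil(math.log2(max_abs / MAX_POS))

max_abs = 480.0
scale = rceil_scale(max_abs)   # ceil(log2(480 / 448)) = 1 -> scale = 2
print(scale, max_abs / scale)  # 2.0 240.0 -> fits within the E4M3 range, no overflow
```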
> def mxfp4_fp8_weight_to_bf16(weight_fp8, scale_bf16):

Copilot (AI) commented on Nov 17, 2025: The scale reshaping was added here, but the function signature and documentation don't indicate this behavior change. Consider documenting why this reshape is necessary for the function to work correctly.

Suggested change:
> def mxfp4_fp8_weight_to_bf16(weight_fp8, scale_bf16):
>     """
>     Dequantizes FP8 weights to BF16 using the provided scale.
>     Args:
>         weight_fp8 (torch.Tensor): The quantized weights tensor. Expected to be divisible into blocks of 32 elements.
>         scale_bf16 (torch.Tensor): The scale tensor. Will be reshaped to (-1, 1) to match the block structure of the weights.
>             This reshape is necessary to broadcast the scale across each block of 32 weights.
>     Returns:
>         torch.Tensor: The dequantized weights in BF16 format, with the same shape as the input weights.
>     """
> def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
>     logger.debug(f"Processing weights after loading for layer: {layer._prefix}")

Copilot (AI) commented on Nov 17, 2025: Accessing the private attribute '_prefix' (indicated by the leading underscore) is not recommended. Consider either making this a public attribute or accessing it through a public method.

Suggested change:
>     layer_prefix = getattr(layer, "prefix", getattr(layer, "_prefix", None))
>     logger.debug(f"Processing weights after loading for layer: {layer_prefix}")
> if self.has_bias:
>     if envs.VLLM_AR_POST_PROCESS_GPTOSS:
>         w13_bias_swapped = revert_interleaved_bias(layer.w13_bias)
>         layer.w13_bias.data.copy_(w13_bias_swapped)

Copilot (AI) commented on Nov 17, 2025: [nitpick] The nested if statements can be simplified by combining the conditions. Consider using `if self.has_bias and envs.VLLM_AR_POST_PROCESS_GPTOSS:` for better readability.

Suggested change:
> if self.has_bias and envs.VLLM_AR_POST_PROCESS_GPTOSS:
>     w13_bias_swapped = revert_interleaved_bias(layer.w13_bias)
>     layer.w13_bias.data.copy_(w13_bias_swapped)
> # dequant weight
> weight = layer.weight
> weight_scale = layer.weight_scale
> dequnat_weight = dequant_mx_fp8(

Copilot (AI) commented on Nov 17, 2025: Corrected spelling of 'dequnat_weight' to 'dequant_weight'.
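For context on what a dequant_mx_fp8-style helper typically does: MXFP8 stores fp8 elements plus one E8M0 scale per 32-element block, where the scale byte is a biased exponent (value = 2**(e - 127)). A hedged sketch under those assumptions; the names here are illustrative, not the PR's API.

```python
import torch

BLOCK = 32

def dequant_mx_fp8_sketch(weight_fp8: torch.Tensor, scale_e8m0: torch.Tensor) -> torch.Tensor:
    """Decode the uint8 E8M0 exponent to a power-of-two scale and broadcast it per 32-element block."""
    scale = torch.exp2(scale_e8m0.to(torch.float32) - 127.0)  # E8M0: stored byte is exponent + 127
    blocks = weight_fp8.to(torch.bfloat16).reshape(-1, BLOCK)
    return (blocks * scale.reshape(-1, 1).to(torch.bfloat16)).reshape(weight_fp8.shape)

w = torch.randn(2, 64).to(torch.float8_e4m3fn)
s = torch.full((4,), 127, dtype=torch.uint8)  # exponent 127 -> scale 1.0 for each of the 4 blocks
print(dequant_mx_fp8_sketch(w, s).shape)      # torch.Size([2, 64])
```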
> @classmethod
> def get_min_capability(cls) -> int:
>     return cls.impl.get_min_capability()

Copilot (AI) commented on Nov 17, 2025: This classmethod is attempting to access cls.impl, but impl is an instance attribute set in __init__, not a class attribute. This will raise an AttributeError when called. Consider making this an instance method or restructuring the implementation.

Suggested change:
> def get_min_capability(self) -> int:
>     return self.impl.get_min_capability()
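The distinction the comment points at is the usual class-attribute vs. instance-attribute one; a minimal illustration with hypothetical names (not the PR's classes):

```python
class QuantMethod:
    """Minimal stand-in for a quantization method class (hypothetical)."""

    def __init__(self) -> None:
        self.impl = "mxfp8-impl"  # instance attribute, set only in __init__

    @classmethod
    def broken_min_capability(cls):
        # cls has no 'impl'; this raises AttributeError when called.
        return cls.impl

    def min_capability(self):
        # An instance method sees the attribute set in __init__.
        return self.impl


print(QuantMethod().min_capability())       # "mxfp8-impl"
try:
    QuantMethod.broken_min_capability()
except AttributeError as err:
    print(err)                              # type object 'QuantMethod' has no attribute 'impl'
```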
Addresses #937.
The branch https://github.com/yiliu30/vllm-fork/tree/fused-moe-ar is required.
Please refer to the README.md for more details.