Float8Tensor per row quantization pass bias to fbgemm kernel #2884
Conversation
Summary: Previously, the bias was not passed to the fbgemm kernel for float8 per-row quantization; this PR adds it. The result should be a faster float8 per-row quantized kernel, without changing numerics or anything else.

Test Plan:
```
python test/dtypes/test_affine_quantized_float.py -k test_expected_kernels_on_gpu
python test/quantization/quantize_/workflows/float8/test_float8_tensor.py
```

stack-info: PR: #2884, branch: jerryzh168/stack/60
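In rough terms, the change looks like the sketch below (the helper name is hypothetical; the real call site is Float8Tensor's linear dispatch in torchao, and the `bias` keyword follows fbgemm's rowwise gemm schema as I understand it):

```
import torch

def _float8_rowwise_linear(xq, wq, x_scale, w_scale, bias=None):
    # Sketch only. fbgemm's f8f8bf16_rowwise takes fp8 activation/weight
    # plus per-row scales and returns bf16; it also accepts an optional bias
    # (the kernel may require a specific bias dtype, so check the op schema).
    if bias is not None:
        # After this PR: bias is forwarded into the fused kernel epilogue,
        # so inductor no longer has to emit a separate elementwise add.
        return torch.ops.fbgemm.f8f8bf16_rowwise(
            xq, wq, x_scale, w_scale, bias=bias
        )
    # Before this PR, bias (when present) was added after the gemm instead:
    #   out = torch.ops.fbgemm.f8f8bf16_rowwise(xq, wq, x_scale, w_scale)
    #   out = out + bias
    return torch.ops.fbgemm.f8f8bf16_rowwise(xq, wq, x_scale, w_scale)
```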
looks reasonable, can we add a test which covers the new path?
drisspg left a comment:
Should we have a test? This seems like it was a bug that wasn't caught.
@drisspg yeah, the test is this one:

but it's probably better to have a more exhaustive one for all options (auto and torch) as well. It is a performance "bug" I think; the effect is that we were not using the most efficient path when there is a bias.
| "torch.ops.triton.quantize_fp8_row.default", 1 | ||
| ).check_count("torch.ops.fbgemm.f8f8bf16_rowwise.default", 1).run(code[0]) | ||
| ).check_count("torch.ops.fbgemm.f8f8bf16_rowwise.default", 1).check_not( | ||
| "triton_poi_fused_add_0" |
cc @drisspg: we explicitly test that triton_poi_fused_add_0 is not generated, to make sure there is no additional res + bias add in this code path.
It's not safe to depend on inductor-generated kernel names such as triton_poi_fused_add_0. I think it would be better to ensure that no additional kernels are called after torch.ops.fbgemm.f8f8bf16_rowwise.default.
OK, updated
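For context, a minimal sketch of that kind of check (the helper name and the `.check_not(".run(")` heuristic are illustrative assumptions, not the PR's actual test; it relies on inductor's wrapper launching its generated Triton kernels via `.run(`):

```
import torch
from torch._inductor.utils import run_and_get_code
from torch.testing import FileCheck

def check_bias_is_fused(model, x):
    # Compile the model and capture the inductor-generated wrapper code.
    compiled = torch.compile(model)
    _, code = run_and_get_code(compiled, x)
    # The quantize and rowwise-gemm ops should each appear exactly once, and
    # no inductor-generated Triton kernel (launched via `.run(`) should appear
    # after the gemm -- i.e. no leftover elementwise bias-add kernel.
    FileCheck().check_count(
        "torch.ops.triton.quantize_fp8_row.default", 1, exactly=True
    ).check_count(
        "torch.ops.fbgemm.f8f8bf16_rowwise.default", 1, exactly=True
    ).check_not(".run(").run(code[0])
```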
Stacked PRs:

#2884: Float8Tensor per row quantization pass bias to fbgemm kernel (this PR)

Summary:
Previously, the bias was not passed to the fbgemm kernel for float8 per-row quantization; this PR adds it. The result should be a faster float8 per-row quantized kernel, without changing numerics or anything else.

Test Plan:
```
python test/quantization/quantize_/workflows/float8/test_float8_tensor.py -k test_kernel_preference_numerical_equivalence
python test/quantization/quantize_/workflows/float8/test_float8_tensor.py -k test_expected_gpu_kernel_fbgemm
```
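For illustration, a sketch of the kind of equivalence the first test is after (hypothetical helper; fused and unfused bias handling should agree up to small floating-point differences, since the fused epilogue may add the bias at a different precision):

```
import torch

def assert_bias_fusion_matches(xq, wq, x_scale, w_scale, bias):
    # Fused path: bias handled inside the fbgemm rowwise kernel's epilogue.
    fused = torch.ops.fbgemm.f8f8bf16_rowwise(
        xq, wq, x_scale, w_scale, bias=bias
    )
    # Unfused path: same gemm, bias added as a separate elementwise op.
    unfused = torch.ops.fbgemm.f8f8bf16_rowwise(xq, wq, x_scale, w_scale) + bias
    # Loose tolerances: accumulation order/precision may differ slightly.
    torch.testing.assert_close(fused, unfused, atol=1e-2, rtol=1e-2)
```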