Updates torchao pin to enable shared embedding quantization #9548

metascroy · 2025-03-24T17:37:05Z

Updates torchao pin to enable shared embedding quantization.

pytorch-bot · 2025-03-24T17:37:08Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/9548

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 1d7cc21 with merge base 94ec549:

NEW FAILURE - The following job has failed:

Check Labels / Check labels (gh)
RuntimeError: Error checking labels: PR does not have required labels

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Jack-Khuu · 2025-03-24T18:30:09Z

examples/models/llama/README.md

 ```

+A few notes:
+- If your model shares embedding/unembedding weights (like Llama1B and Llama3B do), you can add `--use_shared_embedding` to take advantage of this and reduce memory.  When this option is enabled, you can specify whether embeddings are quantized with weight zeros or not by specifying a third argument.  For example, `-E "torchao:4,32,true"` means that the embedding is quantized to 4-bits with group_size=32 and uses weight zeros (this is the default behavior if you simply use `-E "torchao:4,32"`), whereas `-E "torchao:4,32,false"` means that the embedding is quantized to 4-bits with group_size=32, but is quantized with scales-only.  If `--use_shared_embedding` is specified, the unembedding (i.e., the final linear layer) is quantized in the same way, but also uses 8-bit dynamically quantized activations.


Not for this PR, but what's the plan for updating our arg selection scheme for quant?

`-E "torchao:4,32,true"` isn't user friendly

You'd never need to do that. `true` is the default (and the existing behavior), so you can continue to use `-E "torchao:4,32"`.

I'd make it a bit clearer that shared is only for torchao kernels, or `torchao:`

It's under the torchao section of the docs.
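To make the `-E` format discussed in this thread concrete, here is a minimal Python sketch of how a spec such as `torchao:4,32` or `torchao:4,32,false` could be interpreted. The names `EmbeddingQuantSpec` and `parse_torchao_embedding_spec` are hypothetical and exist only for illustration; this is not the parser in `export_llama_lib.py`, just a reading of the README text, in which the third field defaults to `true` (weight zeros) when it is omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingQuantSpec:
    """Hypothetical container for the fields described in the README note."""

    bitwidth: int
    group_size: int
    has_weight_zeros: bool = True  # README: true is the default behavior


def parse_torchao_embedding_spec(spec: str) -> EmbeddingQuantSpec:
    """Parse 'torchao:<bitwidth>,<group_size>[,<true|false>]' (illustrative only)."""
    prefix = "torchao:"
    if not spec.startswith(prefix):
        raise ValueError(f"expected a spec starting with {prefix!r}, got {spec!r}")

    parts = [p.strip() for p in spec[len(prefix):].split(",")]
    if len(parts) not in (2, 3):
        raise ValueError(f"expected 2 or 3 comma-separated fields, got {spec!r}")

    bitwidth = int(parts[0])
    group_size = int(parts[1])

    # The optional third field selects weight zeros vs. scales-only.
    has_weight_zeros = True
    if len(parts) == 3:
        flag = parts[2].lower()
        if flag not in ("true", "false"):
            raise ValueError(f"third field must be 'true' or 'false', got {parts[2]!r}")
        has_weight_zeros = flag == "true"

    return EmbeddingQuantSpec(bitwidth, group_size, has_weight_zeros)
```

Under this sketch, `torchao:4,32` and `torchao:4,32,true` parse to the same value, which matches the point made in the reply above: the two-field form keeps working unchanged.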

Jack-Khuu · 2025-03-24T18:32:34Z

examples/models/llama/export_llama_lib.py

+    if args.use_shared_embedding:
+        if not (
+            args.embedding_quantize is not None
+            and args.embedding_quantize.startswith("torchao:")


Suggested change

-    if args.use_shared_embedding:
-        if not (
-            args.embedding_quantize is not None
-            and args.embedding_quantize.startswith("torchao:")
+    if args.use_shared_embedding:
+        and (
+            args.embedding_quantize is None
+            or not args.embedding_quantize.startswith("torchao:")

nit: collapse the nested conditionals into a single check that raises the error
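For reference, the flattened check the nit points toward might look like the following runnable sketch. The function name `check_shared_embedding_args`, the exception type, and the error message are assumptions made for illustration; only the two attributes `use_shared_embedding` and `embedding_quantize` and the `"torchao:"` prefix come from the diff above.

```python
from argparse import Namespace


def check_shared_embedding_args(args: Namespace) -> None:
    """Reject --use_shared_embedding unless a torchao embedding spec is given.

    Single condition instead of a nested `if ... if not (...)`.
    """
    if args.use_shared_embedding and (
        args.embedding_quantize is None
        or not args.embedding_quantize.startswith("torchao:")
    ):
        # Hypothetical message; the real wording in export_llama_lib.py may differ.
        raise ValueError(
            "use_shared_embedding requires an embedding_quantize value "
            "that starts with 'torchao:'"
        )
```

By De Morgan's law, `not (x is not None and x.startswith(p))` is equivalent to `x is None or not x.startswith(p)`, so the single-condition form rejects exactly the same inputs as the nested version.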

jackzhxng · 2025-03-24T19:31:35Z

examples/models/llama/export_llama_lib.py


            transforms.append(inject_fast_hadamard_transform_native_for_spin_quant)

+    if args.embedding_quantize:


Why did we change the order of the source transform?

shared_embedding must be applied before linear, so I changed the order to embedding first and linear second. I also added a code comment to this effect.
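The ordering constraint described in this reply can be illustrated with a toy sketch. Everything here is hypothetical: `quantize_embedding`, `quantize_linear`, `get_transforms`, and the list-based "model" are stand-ins that only record the order in which transforms run. They are not the source transforms used in `export_llama_lib.py`.

```python
from typing import Callable, List

# A "model" here is just a list that records which transforms have run.
Transform = Callable[[List[str]], List[str]]


def quantize_embedding(model: List[str]) -> List[str]:
    """Stand-in for the embedding (and shared embedding) quantization transform."""
    return model + ["embedding"]


def quantize_linear(model: List[str]) -> List[str]:
    """Stand-in for the linear quantization transform."""
    return model + ["linear"]


def get_transforms(embedding_quantize: bool, linear_quantize: bool) -> List[Transform]:
    transforms: List[Transform] = []
    # Embedding must come first: per the thread, shared embedding
    # quantization has to be applied before linear quantization.
    if embedding_quantize:
        transforms.append(quantize_embedding)
    if linear_quantize:
        transforms.append(quantize_linear)
    return transforms


def apply_transforms(model: List[str], transforms: List[Transform]) -> List[str]:
    for transform in transforms:
        model = transform(model)
    return model
```

Because the transforms are applied in list order, the order of the `append` calls is what enforces the constraint, which is why a code comment at that spot is useful.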

jackzhxng · 2025-03-24T19:32:27Z

examples/models/llama/README.md


I'd make it a bit clearer that shared is only for torchao kernels, or `torchao:`

facebook-github-bot added the CLA Signed label Mar 24, 2025

up

0af9eba

metascroy force-pushed the torchao-bump branch from a36ebd3 to 0af9eba Compare March 24, 2025 17:41

metascroy added 2 commits March 24, 2025 11:20

up

98c1d11

Merge branch 'main' into torchao-bump

1d7cc21

metascroy marked this pull request as ready for review March 24, 2025 18:21

metascroy requested review from GregoryComer, jackzhxng and lucylq as code owners March 24, 2025 18:21

metascroy added the ciflow/trunk label Mar 24, 2025

metascroy requested a review from Jack-Khuu March 24, 2025 18:23

Jack-Khuu approved these changes Mar 24, 2025

jackzhxng reviewed Mar 24, 2025

jackzhxng added the release notes: examples label Mar 24, 2025

metascroy merged commit 341f318 into main Mar 24, 2025
165 of 167 checks passed

metascroy deleted the torchao-bump branch March 24, 2025 19:59

Jack-Khuu mentioned this pull request Mar 31, 2025

Enable torchao.experimental EmbeddingQuantization pytorch/torchchat#1520

Open
