Commit 3dce6a3

Update on "Update QAT READMEs using new APIs"
Add references to new QAT APIs including `quantize_`, `FakeQuantizedX`, and the new embedding Quantizers and ComposableQATQuantizer. Also link to the new QAT + LoRA recipe in torchtune. [ghstack-poisoned]
2 parents be538d6 + 2dea276 commit 3dce6a3

2 files changed: +12 -12 lines changed

README.md

Lines changed: 1 addition & 1 deletion
@@ -58,7 +58,7 @@ In practice these features alongside int4 weight only quantization allow us to *
 
 ### Quantization Aware Training
 
-Post-training quantization can result in a fast and compact model, but may also lead to accuracy degradation. We recommend exploring Quantization Aware Training (QAT) to overcome this limitation. In collaboration with Torchtune, we've developed a QAT recipe that demonstrates significant accuracy improvements over traditional PTQ, recovering **96% of the accuracy degradation on hellaswag and 68% of the perplexity degradation on wikitext** for Llama3 compared to post-training quantization (PTQ). And we've provided a full recipe [here](https://pytorch.org/blog/quantization-aware-training/)
+Post-training quantization can result in a fast and compact model, but may also lead to accuracy degradation. We recommend exploring Quantization Aware Training (QAT) to overcome this limitation. In collaboration with Torchtune, we've developed a QAT recipe that demonstrates significant accuracy improvements over traditional PTQ, recovering **96% of the accuracy degradation on hellaswag and 68% of the perplexity degradation on wikitext** for Llama3 compared to post-training quantization (PTQ). And we've provided a full recipe [here](https://pytorch.org/blog/quantization-aware-training/). For more details, please see the [QAT README](./torchao/quantization/qat/README.md).
 
 ```python
 from torchao.quantization import (
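
The Python snippet this paragraph introduces is cut off at the hunk boundary above. Here is a minimal sketch of the quantize_-based QAT flow it refers to, assembled from the APIs visible in the torchao/quantization/qat/README.md diff below (`intx_quantization_aware_training` is assumed here as the prepare-side counterpart of `from_intx_quantization_aware_training`, and `get_model()` and `train_loop()` are placeholders, so the README's actual snippet may differ):

```python
import torch
from torchao.quantization import (
    quantize_,
    int8_dynamic_activation_int4_weight,
)
from torchao.quantization.qat import (
    FakeQuantizeConfig,
    from_intx_quantization_aware_training,
    intx_quantization_aware_training,
)

model = get_model()  # placeholder: any model with torch.nn.Linear layers

# prepare: swap linear layers for fake-quantized versions
activation_config = FakeQuantizeConfig(torch.int8, "per_token", is_symmetric=False)
weight_config = FakeQuantizeConfig(torch.int4, group_size=32)
quantize_(
    model,
    intx_quantization_aware_training(activation_config, weight_config),
)

# train with fake quantization in the loop
train_loop(model)  # placeholder: your fine-tuning loop

# convert: remove fake quantization, then apply real int8/int4 quantization
quantize_(model, from_intx_quantization_aware_training())
quantize_(model, int8_dynamic_activation_int4_weight(group_size=32))
```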

torchao/quantization/qat/README.md

Lines changed: 11 additions & 11 deletions
@@ -91,7 +91,7 @@ from torchao.quantization.qat import (
 model = get_model()
 
 # prepare: insert fake quantization ops
-# Swap `torch.nn.Linear` with `FakeQuantizedLinear`
+# swaps `torch.nn.Linear` with `FakeQuantizedLinear`
 activation_config = FakeQuantizeConfig(torch.int8, "per_token", is_symmetric=False)
 weight_config = FakeQuantizeConfig(torch.int4, group_size=32)
 quantize_(
@@ -103,7 +103,7 @@ quantize_(
 train_loop(model)
 
 # convert: transform fake quantization ops into actual quantized ops
-# Swap `FakeQuantizedLinear` back to `torch.nn.Linear` and insert
+# swap `FakeQuantizedLinear` back to `torch.nn.Linear` and inserts
 # quantized activation and weight tensor subclasses
 quantize_(model, from_intx_quantization_aware_training())
 quantize_(model, int8_dynamic_activation_int4_weight(group_size=32))
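
Between the prepare and convert steps above, you can sanity-check that the swap described in these comments actually happened. A small hedged snippet, assuming `FakeQuantizedLinear` is importable from `torchao.quantization.qat.linear` and `model` has been prepared as shown:

```python
from torchao.quantization.qat.linear import FakeQuantizedLinear

# after the prepare step (and before convert), linear layers should have been
# swapped for their fake-quantized counterparts
fake_quantized = [
    name
    for name, module in model.named_modules()
    if isinstance(module, FakeQuantizedLinear)
]
print(f"{len(fake_quantized)} linear layers are fake quantized")
```
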
@@ -112,7 +112,7 @@ quantize_(model, int8_dynamic_activation_int4_weight(group_size=32))
 ```
 
 To fake quantize embedding in addition to linear, you can additionally call
-the following with a filter function during the prepare step.
+the following with a filter function during the prepare step:
 
 ```
 quantize_(
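
The quantize_ call here is also truncated by the hunk. A hedged sketch of what a filter-function call targeting embeddings can look like (the lambda and the weight-only config values are illustrative rather than the README's exact code; `model` is the model being prepared):

```python
import torch
from torchao.quantization import quantize_
from torchao.quantization.qat import (
    FakeQuantizeConfig,
    intx_quantization_aware_training,
)

# weight-only fake quantization for embedding tables (illustrative values)
embedding_weight_config = FakeQuantizeConfig(torch.int4, group_size=32)

# `filter_fn` receives (module, fully_qualified_name); only matching modules
# are swapped, so this call targets `torch.nn.Embedding` and nothing else
quantize_(
    model,
    intx_quantization_aware_training(weight_config=embedding_weight_config),
    filter_fn=lambda m, _: isinstance(m, torch.nn.Embedding),
)
```
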
@@ -138,14 +138,14 @@ qat_quantizer = Int8DynActInt4WeightQATQuantizer(group_size=32)
 model = get_model()
 
 # prepare: insert fake quantization ops
-# Swap `torch.nn.Linear` with `Int8DynActInt4WeightQATLinear`
+# swaps `torch.nn.Linear` with `Int8DynActInt4WeightQATLinear`
 model = qat_quantizer.prepare(model)
 
 # train
 train_loop(model)
 
 # convert: transform fake quantization ops into actual quantized ops
-# Swap `Int8DynActInt4WeightQATLinear` with `Int8DynActInt4WeightLinear`
+# swaps `Int8DynActInt4WeightQATLinear` with `Int8DynActInt4WeightLinear`
 model = qat_quantizer.convert(model)
 
 # inference or generate
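
For reference, a self-contained version of this Quantizer-based flow. The constructor keyword follows the hunk header above (some torchao versions spell it `groupsize`), and `get_model()` and `train_loop()` are placeholders:

```python
from torchao.quantization.qat import Int8DynActInt4WeightQATQuantizer

# keyword spelling copied from the hunk header above; check your torchao version
qat_quantizer = Int8DynActInt4WeightQATQuantizer(group_size=32)
model = get_model()  # placeholder: any model with torch.nn.Linear layers

# prepare: swaps `torch.nn.Linear` with `Int8DynActInt4WeightQATLinear`
model = qat_quantizer.prepare(model)

# train with fake quantization in the loop
train_loop(model)  # placeholder: your fine-tuning loop

# convert: swaps `Int8DynActInt4WeightQATLinear` with `Int8DynActInt4WeightLinear`
model = qat_quantizer.convert(model)
```
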
@@ -155,7 +155,7 @@ To use multiple Quantizers in the same model for different layer types,
 users can also leverage the [ComposableQATQuantizer](https://github.com/pytorch/ao/blob/v0.7.0/torchao/quantization/qat/api.py#L242)
 as follows:
 
-```
+```python
 from torchao.quantization.qat import (
     ComposableQATQuantizer,
     Int4WeightOnlyEmbeddingQATQuantizer,
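
The import list and the composed prepare/convert calls fall outside this hunk. A hedged sketch of how composing the two Quantizers can look (constructor arguments are left at their defaults and are illustrative; `get_model()` and `train_loop()` are placeholders):

```python
from torchao.quantization.qat import (
    ComposableQATQuantizer,
    Int4WeightOnlyEmbeddingQATQuantizer,
    Int8DynActInt4WeightQATQuantizer,
)

# one Quantizer per layer type: linears get int8 dynamic activation + int4 weight
# fake quantization, embeddings get int4 weight-only fake quantization
qat_quantizer = ComposableQATQuantizer([
    Int8DynActInt4WeightQATQuantizer(),
    Int4WeightOnlyEmbeddingQATQuantizer(),
])

model = get_model()  # placeholder
model = qat_quantizer.prepare(model)  # each composed Quantizer prepares its layers
train_loop(model)                     # placeholder: your fine-tuning loop
model = qat_quantizer.convert(model)  # each composed Quantizer converts its layers
```
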
@@ -175,16 +175,16 @@ model = qat_quantizer.convert(model)
 
 ## torchtune integration
 
-Users can also leverage our integration with [torchtune](https://github.com/pytorch/torchtune)
-and apply quantized-aware fine-tuning as follows:
+torchao QAT is integrated with [torchtune](https://github.com/pytorch/torchtune)
+to allow users to run quantized-aware fine-tuning as follows:
 
 ```
 tune run --nproc_per_node 8 qat_distributed --config llama3/8B_qat_full
 ```
 
-torchtune also supports a QAT + LoRA distributed training recipe that is 1.89x faster
-and uses 36.1% memory compared to vanilla QAT in our early experiments. You can read
-more about it [here](https://dev-discuss.pytorch.org/t/speeding-up-qat-by-1-89x-with-lora/2700).
+torchtune also supports a [QAT + LoRA distributed training recipe](https://github.com/pytorch/torchtune/blob/main/recipes/qat_lora_finetune_distributed.py)
+that is 1.89x faster and uses 36.1% memory compared to vanilla QAT in our early experiments.
+You can read more about it [here](https://dev-discuss.pytorch.org/t/speeding-up-qat-by-1-89x-with-lora/2700):
 
 ```
 tune run --nnodes 1 --nproc_per_node 4 qat_lora_finetune_distributed --config llama3/8B_qat_lora
