Add references to the new QAT APIs, including `quantize_`,
`FakeQuantizedX`, the new embedding Quantizers, and
`ComposableQATQuantizer`. Also link to the new QAT + LoRA recipe
in torchtune.
README.md (1 addition, 1 deletion)
@@ -58,7 +58,7 @@ In practice these features alongside int4 weight only quantization allow us to *
 
 ### Quantization Aware Training
 
-Post-training quantization can result in a fast and compact model, but may also lead to accuracy degradation. We recommend exploring Quantization Aware Training (QAT) to overcome this limitation. In collaboration with Torchtune, we've developed a QAT recipe that demonstrates significant accuracy improvements over traditional PTQ, recovering **96% of the accuracy degradation on hellaswag and 68% of the perplexity degradation on wikitext** for Llama3 compared to post-training quantization (PTQ). And we've provided a full recipe [here](https://pytorch.org/blog/quantization-aware-training/)
+Post-training quantization can result in a fast and compact model, but may also lead to accuracy degradation. We recommend exploring Quantization Aware Training (QAT) to overcome this limitation. In collaboration with Torchtune, we've developed a QAT recipe that demonstrates significant accuracy improvements over traditional PTQ, recovering **96% of the accuracy degradation on hellaswag and 68% of the perplexity degradation on wikitext** for Llama3 compared to post-training quantization (PTQ). And we've provided a full recipe [here](https://pytorch.org/blog/quantization-aware-training/). For more details, please see the [QAT README](./torchao/quantization/qat/README.md).
torchao/quantization/qat/README.md

-#Swap `torch.nn.Linear` with `Int8DynActInt4WeightQATLinear`
+#swaps `torch.nn.Linear` with `Int8DynActInt4WeightQATLinear`
 model = qat_quantizer.prepare(model)
 
 # train
 train_loop(model)
 
 # convert: transform fake quantization ops into actual quantized ops
-#Swap `Int8DynActInt4WeightQATLinear` with `Int8DynActInt4WeightLinear`
+#swaps `Int8DynActInt4WeightQATLinear` with `Int8DynActInt4WeightLinear`
 model = qat_quantizer.convert(model)
 
 # inference or generate
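For context, here is a minimal, self-contained sketch of the prepare/train/convert flow that this snippet documents. It assumes the `Int8DynActInt4WeightQATQuantizer` API shown above with its default group size (assumed to divide the toy dimensions); the toy model, data, and training loop are placeholders for illustration, not part of the PR.

```python
import torch
from torchao.quantization.qat import Int8DynActInt4WeightQATQuantizer

# Toy model; in_features of 256 chosen so the default int4 group size divides it.
model = torch.nn.Sequential(
    torch.nn.Linear(256, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 16),
)

qat_quantizer = Int8DynActInt4WeightQATQuantizer()

# prepare: swaps `torch.nn.Linear` with `Int8DynActInt4WeightQATLinear` (fake quantization)
model = qat_quantizer.prepare(model)

# train: fake quantization ops run in the forward/backward pass
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()
for _ in range(5):
    x = torch.randn(8, 256)
    y = torch.randint(0, 16, (8,))
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()
    optimizer.step()

# convert: swaps `Int8DynActInt4WeightQATLinear` with `Int8DynActInt4WeightLinear`
model = qat_quantizer.convert(model)

# inference or generate
with torch.no_grad():
    print(model(torch.randn(1, 256)).shape)
```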
@@ -155,7 +155,7 @@ To use multiple Quantizers in the same model for different layer types,
 users can also leverage the [ComposableQATQuantizer](https://github.com/pytorch/ao/blob/v0.7.0/torchao/quantization/qat/api.py#L242)
 as follows:
 
-```
+```python
 from torchao.quantization.qat import (
     ComposableQATQuantizer,
     Int4WeightOnlyEmbeddingQATQuantizer,
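Since the hunk above cuts off mid-example, here is a hedged sketch of how `ComposableQATQuantizer` can combine a linear and an embedding quantizer in one prepare/convert flow. The `TinyLM` model and its dimensions are placeholders chosen assuming the quantizers' default group sizes divide them; this is an illustrative sketch, not the exact example from the edited README.

```python
import torch
from torchao.quantization.qat import (
    ComposableQATQuantizer,
    Int4WeightOnlyEmbeddingQATQuantizer,
    Int8DynActInt4WeightQATQuantizer,
)

# Toy model with both layer types (placeholder for illustration only).
class TinyLM(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.tok_emb = torch.nn.Embedding(1024, 256)
        self.head = torch.nn.Linear(256, 1024)

    def forward(self, tokens):
        return self.head(self.tok_emb(tokens))

model = TinyLM()

# One quantizer per layer type: int8-act/int4-weight linears + int4 embeddings.
qat_quantizer = ComposableQATQuantizer([
    Int8DynActInt4WeightQATQuantizer(),
    Int4WeightOnlyEmbeddingQATQuantizer(),
])

# Same prepare -> train -> convert flow as with a single quantizer.
model = qat_quantizer.prepare(model)
_ = model(torch.randint(0, 1024, (2, 8)))  # stand-in for the training loop
model = qat_quantizer.convert(model)
```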
@@ -175,16 +175,16 @@ model = qat_quantizer.convert(model)
 
 ## torchtune integration
 
-Users can also leverage our integration with [torchtune](https://github.com/pytorch/torchtune)
-and apply quantized-aware fine-tuning as follows:
+torchao QAT is integrated with [torchtune](https://github.com/pytorch/torchtune)
+to allow users to run quantized-aware fine-tuning as follows:
 
 ```
 tune run --nproc_per_node 8 qat_distributed --config llama3/8B_qat_full
 ```
 
-torchtune also supports a QAT + LoRA distributed training recipe that is 1.89x faster
-and uses 36.1% memory compared to vanilla QAT in our early experiments. You can read
-more about it [here](https://dev-discuss.pytorch.org/t/speeding-up-qat-by-1-89x-with-lora/2700).
+torchtune also supports a [QAT + LoRA distributed training recipe](https://github.com/pytorch/torchtune/blob/main/recipes/qat_lora_finetune_distributed.py)
+that is 1.89x faster and uses 36.1% memory compared to vanilla QAT in our early experiments.
+You can read more about it [here](https://dev-discuss.pytorch.org/t/speeding-up-qat-by-1-89x-with-lora/2700):
 
 ```
 tune run --nnodes 1 --nproc_per_node 4 qat_lora_finetune_distributed --config llama3/8B_qat_lora