_posts/2024-09-26-pytorch-native-architecture-optimization.md
## Inference
[Our inference quantization algorithms](https://github.com/pytorch/ao/tree/main/torchao/quantization) work over arbitrary PyTorch models that contain nn.Linear layers. Weight-only and dynamic activation quantization for various dtypes and sparse layouts can be chosen using our top-level `quantize_` API.
```python
from torchao.quantization import (
    quantize_,
    int4_weight_only,
)
quantize_(model, int4_weight_only())
```
Sometimes quantizing a layer can make it slower because of overhead, so if you’d rather we just pick how to quantize each layer in a model for you, you can instead run
```python
model = torchao.autoquant(torch.compile(model, mode='max-autotune'))
```
The `quantize_` API has a few different options depending on whether your model is compute bound or memory bound.
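For example, weight-only schemes like `int4_weight_only` target memory bound workloads, while `int8_dynamic_activation_int8_weight` also quantizes activations so the matmuls themselves run in int8 for compute bound workloads. A minimal sketch of that choice (the toy models, device, and dtype choices below are illustrative, and the exact set of options depends on your torchao version):

```python
import torch.nn as nn
from torchao.quantization import (
    quantize_,
    int4_weight_only,
    int8_dynamic_activation_int8_weight,
)

# Toy stand-ins; in practice these would be your real models.
memory_bound_model = nn.Sequential(nn.Linear(4096, 4096)).cuda().bfloat16()
compute_bound_model = nn.Sequential(nn.Linear(4096, 4096)).cuda().bfloat16()

# Memory bound (e.g. small-batch LLM decoding): weight-only quantization
# shrinks the bytes read per forward pass, which is usually the bottleneck.
quantize_(memory_bound_model, int4_weight_only())

# Compute bound (e.g. large-batch prefill): quantizing activations as well
# lets the matmuls run in int8.
quantize_(compute_bound_model, int8_dynamic_activation_int8_weight())
```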
torchao provides easy-to-use e2e workflows for reducing the precision of training compute and distributed communications, starting with float8 for `torch.nn.Linear` layers. Here is a one-liner to convert the compute gemms of your training run to float8:
```python
from torchao.float8 import convert_to_float8_training
convert_to_float8_training(model)
```
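In practice you often don’t want to convert every linear layer. A minimal sketch of filtering which modules get converted, assuming the `module_filter_fn` argument described in the float8 README (the specific skip rules below are illustrative, and `model` is your training model as above):

```python
import torch
from torchao.float8 import convert_to_float8_training

def module_filter_fn(mod: torch.nn.Module, fqn: str) -> bool:
    # Skip the final output projection.
    if fqn == "output":
        return False
    # Skip linears whose dims aren't multiples of 16, which float8 gemms require.
    if isinstance(mod, torch.nn.Linear):
        if mod.in_features % 16 != 0 or mod.out_features % 16 != 0:
            return False
    return True

convert_to_float8_training(model, module_filter_fn=module_filter_fn)
```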
For an e2e example of how to speed up LLaMa 3 70B pretraining by up to **1.5x** with float8, see our [README](https://github.com/pytorch/ao/tree/main/torchao/float8), and torchtitan's [blog](https://dev-discuss.pytorch.org/t/enabling-float8-all-gather-in-fsdp2/2359) and [float8 recipe](https://github.com/pytorch/torchtitan/blob/main/docs/float8.md).
Inspired by Bits and Bytes, we’ve also added prototype support for 8 and 4 bit optimizers as a drop-in replacement for AdamW.
```python
from torchao.prototype.low_bit_optim import AdamW8bit, AdamW4bit
optim = AdamW8bit(model.parameters())
```
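Because these optimizers mirror the `torch.optim.AdamW` interface, the rest of the training loop stays unchanged. A minimal sketch (the toy model, batch, and learning rate are placeholders):

```python
import torch
import torch.nn as nn
from torchao.prototype.low_bit_optim import AdamW8bit

# Toy model and batch purely to show the optimizer wiring.
model = nn.Linear(1024, 1024).cuda()
optim = AdamW8bit(model.parameters(), lr=1e-3)

x = torch.randn(8, 1024, device="cuda")
loss = model(x).pow(2).mean()

loss.backward()
optim.step()       # same calls you would make with torch.optim.AdamW
optim.zero_grad()
```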
There are a lot of things we’re excited about next, ranging from going below 4 bit and performant kernels for high-throughput inference to expanding to more layers, scaling types or granularities, MX hardware support, and more hardware backends. If any of the above sounds exciting you can follow our progress at: [https://github.com/pytorch/ao](https://github.com/pytorch/ao)
If you’re interested in working on torchao, we’ve created a [contributors guide](https://github.com/pytorch/ao/issues/391), and if you have any questions we hang out on the `#torchao` channel on [discord.gg/gpumode](http://discord.gg/gpumode).
## Acknowledgements
We are fortunate to stand on the shoulders of giants and collaborate with some of the best people in open source. Thank you!
1. Bits and Bytes for pioneering work in low bit optimizers and QLoRA
2. Answer.ai for their engineering work to get FSDP and QLoRA composing
3. Mobius Labs for the lovely back and forths on quantization algorithms and low bit kernels
4. HuggingFace transformers for their help in battle testing and integrating our work
5. HuggingFace diffusers for our collaboration on extensive benchmarks and best practices
6. torch.compile so we could write our algorithms in pure PyTorch