This repository was archived by the owner on Sep 10, 2025. It is now read-only.

Commit 5085ab6

Docs: Deduplicate quantization.md with the introduction of model_customization.md (#967)
* Update quantization.md
* Push precision into the customization and removed duplicate info from quantization
* Update docs README to call out vetted files
1 parent d78710f commit 5085ab6

File tree: 3 files changed (+57 −57 lines)

- docs/README.md
- docs/model_customization.md
- docs/quantization.md


docs/README.md

Lines changed: 7 additions & 1 deletion

````diff
@@ -1,6 +1,12 @@
-# Docs in this directory are unstable
+# Most Docs in this directory are unstable
 
 Explicitly calling out that the docs in this directory may be outdated, incomplete, scratch notes, or a WIP.
 torchchat provides no guarantees on these files as references.
 
 Please refer to the root README for stable features and documentation.
+
+---
+
+Docs that are updated and used as **Source of Truth**:
+- [Model Customization](model_customization.md)
+- [Quantization](quantization.md)
````

docs/model_customization.md

Lines changed: 11 additions & 1 deletion

````diff
@@ -45,7 +45,17 @@ To reduce the memory bandwidth requirement and to take advantage of higher densi
 the model can use lower precision floating point representations.
 For example, many GPUs and some of the CPUs have good support for bfloat16 and float16.
 
-See the [precision guide](quantization.md#model-precision-dtype-precision-setting) for more details.
+Unlike gpt-fast, which defaults to bfloat16, torchchat defaults to the dtype
+"fast16". This picks the best-performing 16-bit floating point type
+available (for execution with ExecuTorch, macOS/ARM and Linux/x86 platforms).
+For example, on macOS support depends on the OS version: versions starting
+with 14.0 use bfloat16, while earlier versions fall back to float16,
+based on system support for these data types.
+
+The "fast" data type is also provided as a virtual data type that defaults
+to the best floating point data type available on the selected device.
+Currently, this behaves the same as "fast16", except that it uses "fp32" when exporting
+to ExecuTorch.
 
 
 ## Quantization
````
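
For quick reference, the precision above is selected with the `--dtype` flag. A minimal usage sketch, mirroring the flag choices documented for torchchat (remaining arguments elided):

```
python3 generate.py --dtype [ fast16 | fast | bf16 | fp16 | fp32 ] ...
python3 export.py --dtype [ fast16 | fast | bf16 | fp16 | fp32 ] ...
```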

docs/quantization.md

Lines changed: 39 additions & 55 deletions

````diff
@@ -1,7 +1,3 @@
-> [!WARNING]
-> Files in this directory may be outdated, incomplete, scratch notes, or a WIP. torchchat provides no guarantees on these files as references. Please refer to the root README for stable features and documentation.
-
-
 # Quantization
 
 <!--
````

````diff
@@ -11,62 +7,41 @@
 -->
 
 ## Introduction
-Quantization focuses on reducing the precision of model parameters and computations from floating-point to lower-bit integers, such as 8-bit integers. This approach aims to minimize memory requirements, accelerate inference speeds, and decrease power consumption, making models more feasible for deployment on edge devices with limited computational resources. For high-performance devices such as GPUs, quantization provides a way to reduce the required memory bandwidth and take advantage of the massive compute capabilities provided by today's server-based accelerators such as GPUs.
+Quantization focuses on reducing the precision of model parameters and computations from floating-point to lower-bit integers, such as 8-bit integers.
+This approach aims to minimize memory requirements, accelerate inference speeds, and decrease power consumption, making models more feasible for
+deployment on edge devices with limited computational resources. For high-performance devices such as GPUs, quantization provides a way to
+reduce the required memory bandwidth and take advantage of the massive compute capabilities provided by today's server-based accelerators such as GPUs.
 
-While quantization can potentially degrade the model's performance, the methods supported by torchchat are designed to mitigate this effect, maintaining a balance between efficiency and accuracy. In this document we provide details on the supported quantization schemes, how to quantize models with these schemes and a few example of running such quantized models on supported backends.
+While quantization can potentially degrade the model's performance, the methods supported by torchchat are designed to mitigate this effect,
+maintaining a balance between efficiency and accuracy. In this document we provide details on the supported quantization schemes, how to quantize
+models with these schemes, and a few examples of running such quantized models on supported backends.
 
 ## Supported Quantization Schemes
 ### Weight Quantization
 | compression | bitwidth| group size | dynamic activation quantization | Eager | AOTI | ExecuTorch |
 |--|--|--|--|--|--|--|
-| linear (asymmetric) | [8, 4]* | [32, 64, 128, 256]** | ||| 🚧 |
+| linear (asymmetric) | [4, 8]* | [32, 64, 128, 256]^ | ||| 🚧 |
 | linear with dynamic activations (symmetric) | | [32, 64, 128, 256]* | a8w4dq | 🚧 |🚧 ||
 
 ### Embedding Quantization
 
-Due to the larger vocabulary size of llama3, we also recommend
+To support the larger vocabularies (e.g. Llama 3), we also recommend
 quantizing the embeddings to further reduce the model size for
 on-device usecases.
 
 | compression | weight quantization (bitwidth)| weight quantization (group size) | dynamic activation quantization | Eager | AOTI | ExecuTorch |
 |--|--|--|--|--|--|--|
-| embedding (symmetric) | [8, 4]* | [32, 64, 128, 256]+ | ||||
+| embedding (symmetric) | [4, 8]* | [32, 64, 128, 256]+ | ||||
 
 
+>\* These are the only valid bitwidth options.
 
-* These are the only valid bitwidth options.
-
-** There are many valid group size options, including 512, 1024,
+>^ There are many valid group size options, including 512, 1024,
 etc. Note that smaller groupsize tends to be better for preserving
 model quality and accuracy, and larger groupsize for further
 improving performance. Set 0 for channelwise quantization.
 
-+ Should support non-power-of-2-groups as well.
-
-## Quantization Profiles
-
-Torchchat quantization supports profiles with multiple settings such
-as accelerator, dtype, and quantization specified in a JSON file.
-Four sample profiles are included wwith the torchchat distributin in
-config/data: `cuda.json`, `desktop.json`, `mobile.json`, `pi5.json`
-with profiles optimizing for execution on cuda, desktop, mobile and
-raspberry Pi devices.
-
-In addition to quantization recipes described below, the profiles also
-enable developers to specify the accelerator and dtype to be used.
-
-At present torchchat supports the fast, cuda, mps, and cpu devices.
-The default device in torchchat is "fast". The "fast" device is a
-virtual device that defaults to the fastest executor available in the
-system, selecting cuda, mps, and cpu in this order.
-
-At present torchchat supports the fast16, fast, bf16, fp16 and fp32
-data types. The default data type for models is "fast16". The
-"fast16" data type is a virtual data type that defaults to the best
-16-bit floating point data type available on the selected device. The
-"fast" data type is a virtual data type that defaults to the best
-floating point data type available on the selected device. ("Best"
-tangibly representing a combination of speed and accuracy.)
+>\+ Should support non-power-of-2-groups as well.
 
 
 ## Quantization API
````

````diff
@@ -86,8 +61,19 @@ for valid `bitwidth` and `groupsize` values.
 
 See the available quantization schemes [here](https://github.com/pytorch/torchchat/blob/main/quantization/quantize.py#L1260-L1266).
 
+In addition to quantization, the [accelerator](model_customization.md#device)
+and [precision](model_customization.md#model-precision) can also be specified.
+Preference is given to the args provided in the quantization API over those
+provided explicitly on the command line (e.g. `--device`).
+
+The expected JSON format is described below. Refer to the links above for valid `device` and `dtype` values.
+| config | JSON string |
+|--|--|
+| accelerator | `'{"executor": {"accelerator": <device>}}'` |
+| precision | `'{"precision": {"dtype": <dtype>}}'` |
+
 ## Examples
-We can mix and match weight quantization with embedding quantization.
+Here are some examples of quantization configurations.
 
 [skip default]: begin
 * Config file
````

````diff
@@ -102,43 +88,41 @@ We can mix and match weight quantization with embedding quantization.
 ```
 --quantize '{"embedding": {"bitwidth": 4, "groupsize":32}, "linear:a8w4dq": {"groupsize" : 256}}'
 ```
+* Quantize linear layers with specified dtype and device
+```
+--quantize '{"executor": {"accelerator": "cuda"},
+             "precision": {"dtype": "bf16"},
+             "linear:int4": {"groupsize" : 256}}'
+```
 [skip default]: end
 
 Quantization recipes can be applied in conjunction with any of the
-`chat`, `generate`, `browser` and `export` commands. Below are
+`chat`, `generate`, `browser`, `server`, and `export` commands.
+
+Below are
 examples showcasing eager mode with `generate` and AOTI and ExecuTorch
 with `export`.
 
 ### Eager mode
 ```
-python3 generate.py [--compile] llama3 --prompt "Hello, my name is" --quantize '{"embedding" : {"bitwidth": 8, "groupsize": 0}}' --device cpu
+python3 generate.py llama3 --prompt "Hello, my name is" --quantize '{"embedding" : {"bitwidth": 8, "groupsize": 0}}'
 ```
 ### AOTI
 ```
 python3 torchchat.py export llama3 --quantize '{"embedding": {"bitwidth": 4, "groupsize":32}, "linear:int4": {"groupsize" : 256}}' --output-dso-path llama3.so
-
 python3 generate.py llama3 --dso-path llama3.so --prompt "Hello my name is"
 ```
 ### ExecuTorch
 ```
 python3 torchchat.py export llama3 --quantize '{"embedding": {"bitwidth": 4, "groupsize":32}, "linear:a8w4dq": {"groupsize" : 256}}' --output-pte-path llama3.pte
-
 python3 generate.py llama3 --pte-path llama3.pte --prompt "Hello my name is"
 ```
 
-## Model precision (dtype precision setting)
-On top of quantizing models with integer quantization schemes mentioned above, models can be converted to lower bit floating point precision to reduce the memory bandwidth requirement and take advantage of higher density compute available. For example, many GPUs and some of the CPUs have good support for BFloat16 and Float16. This can be taken advantage of via `--dtype` arg as shown below.
-
-[skip default]: begin
-```
-python3 generate.py --dtype [ fast16 | fast | bf16 | fp16 | fp32] ...
-python3 export.py --dtype [ fast16 | fast | bf16 | fp16 | fp32] ...
-```
-[skip default]: end
-
-Unlike gpt-fast which uses bfloat16 as default, torchchat uses the dtype "fast16" as the default. Torchchat will pick the appropriate 16-bit floating point type available and offering the best performance (for execution with Executorch, macOS/ARM and Linux/x86 platforms). For macOS, support depends on the OS version, with versions starting with 14.0 supporting bfloat16 as support, and float16 for earlier OS version based on system support for these data types.
+## Quantization Profiles
 
-Support for FP16 and BF16 is limited in many embedded processors and -dtype fp32 may be required in some environments. Additional ExecuTorch support for 16-bit floating point types may be added in the future based on hardware support.
+Four [sample profiles](https://github.com/pytorch/torchchat/tree/main/config/data) are included with the torchchat distribution: `cuda.json`, `desktop.json`, `mobile.json`, `pi5.json`,
+with profiles optimizing for execution on CUDA, desktop, mobile and
+Raspberry Pi devices.
 
 ## Adding additional quantization schemes
 We invite contributors to submit established quantization schemes, with accuracy and performance results demonstrating soundness.
````
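
As a usage sketch of the profiles above, assuming `--quantize` also accepts a path to one of the profile JSON files in `config/data` (the model name and output path are illustrative):

```
python3 torchchat.py export llama3 --quantize config/data/mobile.json --output-pte-path llama3.pte
python3 generate.py llama3 --pte-path llama3.pte --prompt "Hello my name is"
```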
