Docs: Deduplicate quantization.md with the introduction of model_customization.md (#967)
* Update quantization.md
* Push precision into the customization and removed duplicate info from quantization
* Update docs README to call out vetted files
docs/quantization.md (39 additions, 55 deletions)
@@ -1,7 +1,3 @@
-> [!WARNING]
-> Files in this directory may be outdated, incomplete, scratch notes, or a WIP. torchchat provides no guarantees on these files as references. Please refer to the root README for stable features and documentation.
-
-
 # Quantization
 
 <!--
@@ -11,62 +7,41 @@
 -->
 
 ## Introduction
-Quantization focuses on reducing the precision of model parameters and computations from floating-point to lower-bit integers, such as 8-bit integers. This approach aims to minimize memory requirements, accelerate inference speeds, and decrease power consumption, making models more feasible for deployment on edge devices with limited computational resources. For high-performance devices such as GPUs, quantization provides a way to reduce the required memory bandwidth and take advantage of the massive compute capabilities provided by today's server-based accelerators such as GPUs.
+Quantization focuses on reducing the precision of model parameters and computations from floating-point to lower-bit integers, such as 8-bit integers.
+This approach aims to minimize memory requirements, accelerate inference speeds, and decrease power consumption, making models more feasible for
+deployment on edge devices with limited computational resources. For high-performance devices such as GPUs, quantization provides a way to
+reduce the required memory bandwidth and take advantage of the massive compute capabilities provided by today's server-based accelerators such as GPUs.
 
-While quantization can potentially degrade the model's performance, the methods supported by torchchat are designed to mitigate this effect, maintaining a balance between efficiency and accuracy. In this document we provide details on the supported quantization schemes, how to quantize models with these schemes and a few example of running such quantized models on supported backends.
+While quantization can potentially degrade the model's performance, the methods supported by torchchat are designed to mitigate this effect,
+maintaining a balance between efficiency and accuracy. In this document we provide details on the supported quantization schemes, how to quantize
+models with these schemes and a few examples of running such quantized models on supported backends.
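
To make the lower-bit integer idea concrete, the following is a minimal, generic sketch of symmetric per-tensor int8 quantization in PyTorch. It only illustrates the trade-off described in the introduction and is not torchchat's implementation; the tensor shape and the single per-tensor scale are arbitrary choices for the example.

```python
# Generic illustration of symmetric int8 quantization (not torchchat's code):
# store weights as int8 plus one float scale, then reconstruct at compute time.
import torch

weights = torch.randn(256, 256)                      # float32: 4 bytes per element

scale = weights.abs().max() / 127.0                  # single per-tensor scale
q_weights = torch.clamp((weights / scale).round(), -128, 127).to(torch.int8)

dequantized = q_weights.float() * scale              # reconstruction used for compute

print("float32 storage:", weights.numel() * 4, "bytes")
print("int8 storage:   ", q_weights.numel(), "bytes")
print("max abs error:  ", (weights - dequantized).abs().max().item())
```

The roughly 4x reduction in weight storage is what lowers memory and bandwidth requirements; the reconstruction error is what the schemes supported by torchchat are designed to keep small.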
 python3 generate.py llama3 --pte-path llama3.pte --prompt "Hello my name is"
 ```
 
-## Model precision (dtype precision setting)
-On top of quantizing models with integer quantization schemes mentioned above, models can be converted to lower bit floating point precision to reduce the memory bandwidth requirement and take advantage of higher density compute available. For example, many GPUs and some of the CPUs have good support for BFloat16 and Float16. This can be taken advantage of via `--dtype` arg as shown below.
-Unlike gpt-fast which uses bfloat16 as default, torchchat uses the dtype "fast16" as the default. Torchchat will pick the appropriate 16-bit floating point type available and offering the best performance (for execution with Executorch, macOS/ARM and Linux/x86 platforms). For macOS, support depends on the OS version, with versions starting with 14.0 supporting bfloat16 as support, and float16 for earlier OS version based on system support for these data types.
+## Quantization Profiles
 
-Support for FP16 and BF16 is limited in many embedded processors and -dtype fp32 may be required in some environments. Additional ExecuTorch support for 16-bit floating point types may be added in the future based on hardware support.
+Four [sample profiles](https://github.com/pytorch/torchchat/tree/main/config/data) are included with the torchchat distribution: `cuda.json`, `desktop.json`, `mobile.json`, `pi5.json`
+with profiles optimizing for execution on cuda, desktop, mobile and
+raspberry Pi devices.
 
 ## Adding additional quantization schemes
 
 We invite contributors to submit established quantization schemes, with accuracy and performance results demonstrating soundness.
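
As a sketch of how the sample profiles fit in, the snippet below reads one of the listed files and prints the scheme names it configures. The path `config/data/mobile.json` comes from the profile list above; the assumption that a profile is a JSON object keyed by scheme name, and the invocation shown in the comments, are illustrative rather than authoritative.

```python
# Sketch: inspect a sample quantization profile shipped with torchchat.
# Assumption: a profile is a JSON object mapping scheme names to options;
# the exact contents of config/data/mobile.json are not reproduced here.
import json
from pathlib import Path

profile = json.loads(Path("config/data/mobile.json").read_text())
print("schemes configured by the profile:", list(profile))

# Assumed invocation (check torchchat's README for the exact flags):
#   python3 torchchat.py export llama3 --quantize config/data/mobile.json --output-pte-path llama3.pte
```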