
Commit d78710f

README: Add a model customization guide (#962)
* Add an initial index on customization guide for models
* Adding README pointers + help fix
* Updating Format and content of customization page
1 parent f3ffa0e commit d78710f

File tree: README.md, cli.py, docs/model_customization.md

3 files changed: +74 −3 lines changed

README.md

Lines changed: 10 additions & 1 deletion
```diff
@@ -155,6 +155,13 @@ More information about these commands can be found by adding the `--help` option
 
 ## Running via PyTorch / Python
 
+The simplest way to run a model in PyTorch is via [eager execution](https://pytorch.org/blog/optimizing-production-pytorch-performance-with-graph-transformations/).
+This is the default execution mode for both PyTorch and torchchat. It performs inference
+without creating export artifacts or using a separate runner.
+
+The model used for inference can also be configured and tailored to specific needs
+(compilation, quantization, etc.). See the [customization guide](docs/model_customization.md) for the options supported by torchchat.
+
 > [!TIP]
 > For more information about these commands, please refer to the `--help` menu.
 
```
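Eager execution here means the model's Python `forward` runs operator by operator as ordinary Python code, with no ahead-of-time export step and no separate runner binary. A minimal sketch of the idea (the tiny module below is hypothetical, not torchchat's model class):

```python
import torch
import torch.nn as nn

# Stand-in module; torchchat would load a real transformer checkpoint instead.
class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(16, 16)

    def forward(self, x):
        return torch.relu(self.linear(x))

model = TinyModel().eval()

# Eager execution: just call the module. Each op dispatches as the Python
# code runs -- no exported artifact (.so / .pte) and no separate runner.
with torch.no_grad():
    out = model(torch.randn(1, 16))
print(out.shape)
```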
```diff
@@ -258,6 +265,8 @@ python3 torchchat.py export llama3 --output-dso-path exportedModels/llama3.so
 > If your machine has cuda add this flag for performance
 `--quantize config/data/cuda.json` when exporting.
 
+For more details on quantization and what settings to use for your use
+case visit our [customization guide](docs/model_customization.md).
 
 ### Run in a Python Enviroment
 
@@ -320,7 +329,7 @@ llama3 model to reduce model size and improve performance for
 on-device use cases.
 
 For more details on quantization and what settings to use for your use
-case visit our [Quantization documentation](docs/quantization.md).
+case visit our [customization guide](docs/model_customization.md).
 
 ### Deploy and run on Desktop
 
```
cli.py

Lines changed: 3 additions & 2 deletions
```diff
@@ -122,16 +122,17 @@ def _add_model_specification_args(parser) -> None:
 
 # Add CLI Args related to model configuration (compilation, quant, etc)
 def _add_model_config_args(parser, verb: str) -> None:
+    is_not_export = verb != "export"
     model_config_parser = parser.add_argument_group("Model Configuration", "Specify model configurations")
     model_config_parser.add_argument(
         "--compile",
         action="store_true",
-        help="Whether to compile the model with torch.compile",
+        help="Whether to compile the model with torch.compile" if is_not_export else argparse.SUPPRESS,
     )
     model_config_parser.add_argument(
         "--compile-prefill",
         action="store_true",
-        help="Whether to compile the prefill. Improves prefill perf, but has higher compile times.",
+        help="Whether to compile the prefill. Improves prefill perf, but has higher compile times." if is_not_export else argparse.SUPPRESS,
     )
     model_config_parser.add_argument(
         "--dtype",
```

docs/model_customization.md

Lines changed: 61 additions & 0 deletions
```diff
@@ -0,0 +1,61 @@
+# Model Customization
+
+Out of the box, torchchat (and PyTorch) defaults to unquantized [eager execution](https://pytorch.org/blog/optimizing-production-pytorch-performance-with-graph-transformations/).
+
+This page covers the options torchchat provides for customizing model execution for inference:
+- Device
+- Compilation
+- Model Precision
+- Quantization
+
+
```
```diff
+## Device
+
+```
+python3 (chat | generate | browser | server | export | eval) --device [cpu | cuda | mps] ...
+```
+
+To leverage a specific accelerator, set the target device explicitly.
+
+By default, torchchat picks the fastest executor available on the system, chosen in this
+order: cuda, mps, and cpu.
+
+
```
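The cuda → mps → cpu fallback corresponds to standard PyTorch availability checks; a minimal sketch of that selection logic (an illustration, not torchchat's actual implementation):

```python
import torch

def fastest_available_device() -> str:
    # Prefer CUDA GPUs, then Apple-silicon MPS, then fall back to CPU.
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"

print(f"Running inference on {fastest_available_device()}")
```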
```diff
+## Compilation: JIT-compiled execution
+```
+python3 (chat | generate | browser | server | eval) [--compile] [--compile_prefill] ...
+```
+
+To improve performance, you can compile the model with `--compile`,
+trading a longer time to first token for a faster time per token.
+
+To improve performance further, at the cost of increased compile time, you can also compile the
+prefill with `--compile_prefill`.
+
+To learn more about compilation, check out: https://pytorch.org/get-started/pytorch-2.0/
+
+
```
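Under the hood, compilation wraps the model with `torch.compile`: the first call triggers kernel generation (the longer time to first token), and later calls reuse the compiled graph. A minimal sketch of the mechanism (hypothetical module, not torchchat's generation loop):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(64, 64), nn.GELU(), nn.Linear(64, 64)).eval()

# torch.compile returns a wrapped module; compilation happens lazily on first call.
compiled_model = torch.compile(model)

x = torch.randn(8, 64)
with torch.no_grad():
    _ = compiled_model(x)    # slow: kernels are compiled here
    out = compiled_model(x)  # fast: reuses the compiled kernels
```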
```diff
+## Model Precision
+
+```
+python3 (chat | generate | browser | server | export | eval) --dtype [fast | fast16 | bf16 | fp16 | fp32] ...
+```
+
+To reduce the memory bandwidth requirement and to take advantage of the higher-density compute available,
+the model can use lower-precision floating point representations.
+For example, many GPUs and some CPUs have good support for bfloat16 and float16.
+
+See the [precision guide](quantization.md#model-precision-dtype-precision-setting) for more details.
+
+
```
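Lowering precision amounts to casting the model's weights to a smaller floating point dtype; a minimal sketch of what a `bf16` setting corresponds to in plain PyTorch (hypothetical module, not torchchat's loader):

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 128).eval()

# Cast parameters and buffers to bfloat16, halving their memory footprint
# relative to float32; inputs must use a matching dtype.
model = model.to(dtype=torch.bfloat16)

x = torch.randn(4, 128, dtype=torch.bfloat16)
with torch.no_grad():
    out = model(x)
print(out.dtype)  # torch.bfloat16
```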
```diff
+## Quantization
+
+```
+python3 (chat | generate | browser | server | export | eval) [--quantize] <quant.json> ...
+```
+
+To further minimize memory requirements, accelerate inference, and
+decrease power consumption, the model can also be quantized.
+Torchchat leverages [torchao](https://github.com/pytorch/ao) for quantization.
+
+See the [quantization guide](quantization.md) for examples and more details.
```
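As a rough illustration of what weight-only quantization through torchao looks like at the Python level (a sketch assuming a recent torchao release; torchchat itself drives this through the quant.json config rather than direct API calls):

```python
import torch
import torch.nn as nn
# API names as in recent torchao releases (e.g. torchao>=0.4); check your installed version.
from torchao.quantization import quantize_, int8_weight_only

model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 256)).eval()

# Swap the Linear weights for int8 weight-only quantized versions in place,
# shrinking weight storage roughly 4x versus float32.
quantize_(model, int8_weight_only())

x = torch.randn(2, 256)
with torch.no_grad():
    out = model(x)
print(out.shape)
```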
