
Commit d78710f

README: Add a model customization guide (#962)
* Add an initial index on customization guide for models
* Adding README pointers + help fix
* Updating Format and content of customization page
1 parent f3ffa0e commit d78710f

File tree: README.md, cli.py, docs/model_customization.md

3 files changed: +74 −3 lines changed

README.md

Lines changed: 10 additions & 1 deletion
```diff
@@ -155,6 +155,13 @@ More information about these commands can be found by adding the `--help` option
 
 ## Running via PyTorch / Python
 
+The simplest way to run a model in PyTorch is via [eager execution](https://pytorch.org/blog/optimizing-production-pytorch-performance-with-graph-transformations/).
+This is the default execution mode for both PyTorch and torchchat. It performs inference
+without creating export artifacts or using a separate runner.
+
+The model used for inference can also be configured and tailored to specific needs
+(compilation, quantization, etc.). See the [customization guide](docs/model_customization.md) for the options supported by torchchat.
+
 > [!TIP]
 > For more information about these commands, please refer to the `--help` menu.
 
```
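Eager execution here means the model's Python `forward` runs operator by operator as ordinary Python code, with no ahead-of-time export step and no separate runner binary. A minimal sketch of the idea (the tiny module below is hypothetical, not torchchat's model class):

```python
import torch
import torch.nn as nn

# Stand-in module; torchchat would load a real transformer checkpoint instead.
class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(16, 16)

    def forward(self, x):
        return torch.relu(self.linear(x))

model = TinyModel().eval()

# Eager execution: just call the module. Each op dispatches as the Python
# code runs -- no exported artifact (.so / .pte) and no separate runner.
with torch.no_grad():
    out = model(torch.randn(1, 16))
print(out.shape)
```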
```diff
@@ -258,6 +265,8 @@ python3 torchchat.py export llama3 --output-dso-path exportedModels/llama3.so
 > If your machine has cuda add this flag for performance
 `--quantize config/data/cuda.json` when exporting.
 
+For more details on quantization and what settings to use for your use
+case visit our [customization guide](docs/model_customization.md).
 
 ### Run in a Python Enviroment
 
@@ -320,7 +329,7 @@ llama3 model to reduce model size and improve performance for
 on-device use cases.
 
 For more details on quantization and what settings to use for your use
-case visit our [Quantization documentation](docs/quantization.md).
+case visit our [customization guide](docs/model_customization.md).
 
 ### Deploy and run on Desktop
 
```
cli.py

Lines changed: 3 additions & 2 deletions
```diff
@@ -122,16 +122,17 @@ def _add_model_specification_args(parser) -> None:
 
 # Add CLI Args related to model configuration (compilation, quant, etc)
 def _add_model_config_args(parser, verb: str) -> None:
+    is_not_export = verb != "export"
     model_config_parser = parser.add_argument_group("Model Configuration", "Specify model configurations")
     model_config_parser.add_argument(
         "--compile",
         action="store_true",
-        help="Whether to compile the model with torch.compile",
+        help="Whether to compile the model with torch.compile" if is_not_export else argparse.SUPPRESS,
     )
     model_config_parser.add_argument(
         "--compile-prefill",
         action="store_true",
-        help="Whether to compile the prefill. Improves prefill perf, but has higher compile times.",
+        help="Whether to compile the prefill. Improves prefill perf, but has higher compile times." if is_not_export else argparse.SUPPRESS,
     )
     model_config_parser.add_argument(
         "--dtype",
```

docs/model_customization.md

Lines changed: 61 additions & 0 deletions
```diff
@@ -0,0 +1,61 @@
+# Model Customization
+
+Out of the box, torchchat (and PyTorch) defaults to unquantized [eager execution](https://pytorch.org/blog/optimizing-production-pytorch-performance-with-graph-transformations/).
+
+This page covers the options torchchat provides for customizing model execution for inference:
+- Device
+- Compilation
+- Model Precision
+- Quantization
+
+
```
```diff
+## Device
+
+```
+python3 (chat | generate | browser | server | export | eval) --device [cpu | cuda | mps] ...
+```
+
+To leverage a specific accelerator, set the target device explicitly.
+
+By default, torchchat picks the fastest executor available on the system, chosen in this
+order: cuda, mps, and cpu.
+
+
```
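The cuda → mps → cpu fallback corresponds to standard PyTorch availability checks; a minimal sketch of that selection logic (an illustration, not torchchat's actual implementation):

```python
import torch

def fastest_available_device() -> str:
    # Prefer CUDA GPUs, then Apple-silicon MPS, then fall back to CPU.
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"

print(f"Running inference on {fastest_available_device()}")
```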
```diff
+## Compilation: JIT-compiled execution
+```
+python3 (chat | generate | browser | server | eval) [--compile] [--compile_prefill] ...
+```
+
+To improve performance, you can compile the model with `--compile`,
+trading a longer time to first token for a faster time per token.
+
+To improve performance further, at the cost of increased compile time, you can also compile the
+prefill with `--compile_prefill`.
+
+To learn more about compilation, check out: https://pytorch.org/get-started/pytorch-2.0/
+
+
```
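Under the hood, compilation wraps the model with `torch.compile`: the first call triggers kernel generation (the longer time to first token), and later calls reuse the compiled graph. A minimal sketch of the mechanism (hypothetical module, not torchchat's generation loop):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(64, 64), nn.GELU(), nn.Linear(64, 64)).eval()

# torch.compile returns a wrapped module; compilation happens lazily on first call.
compiled_model = torch.compile(model)

x = torch.randn(8, 64)
with torch.no_grad():
    _ = compiled_model(x)    # slow: kernels are compiled here
    out = compiled_model(x)  # fast: reuses the compiled kernels
```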
```diff
+## Model Precision
+
+```
+python3 (chat | generate | browser | server | export | eval) --dtype [fast | fast16 | bf16 | fp16 | fp32] ...
+```
+
+To reduce the memory bandwidth requirement and to take advantage of the higher-density compute available,
+the model can use lower-precision floating point representations.
+For example, many GPUs and some CPUs have good support for bfloat16 and float16.
+
+See the [precision guide](quantization.md#model-precision-dtype-precision-setting) for more details.
+
+
```
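Lowering precision amounts to casting the model's weights to a smaller floating point dtype; a minimal sketch of what a `bf16` setting corresponds to in plain PyTorch (hypothetical module, not torchchat's loader):

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 128).eval()

# Cast parameters and buffers to bfloat16, halving their memory footprint
# relative to float32; inputs must use a matching dtype.
model = model.to(dtype=torch.bfloat16)

x = torch.randn(4, 128, dtype=torch.bfloat16)
with torch.no_grad():
    out = model(x)
print(out.dtype)  # torch.bfloat16
```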
```diff
+## Quantization
+
+```
+python3 (chat | generate | browser | server | export | eval) [--quantize] <quant.json> ...
+```
+
+To further minimize memory requirements, accelerate inference, and
+decrease power consumption, the model can also be quantized.
+Torchchat leverages [torchao](https://github.com/pytorch/ao) for quantization.
+
+See the [quantization guide](quantization.md) for examples and more details.
```
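As a rough illustration of what weight-only quantization through torchao looks like at the Python level (a sketch assuming a recent torchao release; torchchat itself drives this through the quant.json config rather than direct API calls):

```python
import torch
import torch.nn as nn
# API names as in recent torchao releases (e.g. torchao>=0.4); check your installed version.
from torchao.quantization import quantize_, int8_weight_only

model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 256)).eval()

# Swap the Linear weights for int8 weight-only quantized versions in place,
# shrinking weight storage roughly 4x versus float32.
quantize_(model, int8_weight_only())

x = torch.randn(2, 256)
with torch.no_grad():
    out = model(x)
print(out.shape)
```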
