
Commit 475787e

Update 2024-09-26-pytorch-native-architecture-optimization.md
1 parent 7be8993 commit 475787e

1 file changed: +2, -9 lines changed

1 file changed

+2
-9
lines changed

_posts/2024-09-26-pytorch-native-architecture-optimization.md

Lines changed: 2 additions & 9 deletions
@@ -10,19 +10,12 @@ We’re happy to officially launch torchao, a PyTorch native library that makes
 We benchmarked our techniques on popular GenAI models like LLama 3 and Diffusion models and saw minimal drops in accuracy. Unless otherwise noted the baselines are bf16 run on A100 80GB GPU.

 Our topline metrics for llama 3 are
-
-For inference
-
-* 97% speedup for Llama 3 8B using autoquant with int4 weight only quantization and hqq
-* 73% peak VRAM reduction for Llama 3.1 8B at 128K context length with a quantized KV cache
-
-For training
-
+* 97% speedup for Llama 3 8B inference using autoquant with int4 weight only quantization and hqq
+* 73% peak VRAM reduction for Llama 3.1 8B inference at 128K context length with a quantized KV cache
 * 50% speedup for Llama 3 70B pretraining using float8 training on H100
 * 30% peak VRAM reduction for Llama 3 8B using 4 bit quantized optimizers.

 Our topline metrics for diffusion model inference
-
 * 53% speedup using float8 dynamic quantization inference with float8 row-wise scaling on flux1.dev onH100
 * 50% reduction in model VRAM for CogVideoX using int8 dynamic quantization
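
For reference, the "autoquant with int4 weight only quantization" path named in the inference bullets maps to torchao's quantize_ and autoquant entry points. A minimal sketch under that assumption, using a toy bf16 model rather than Llama 3 (the group size is illustrative and the optional hqq variant is version-dependent; this is not the benchmark setup):

import torch
import torchao
from torchao.quantization import quantize_, int4_weight_only

# Toy stand-in for the benchmarked Llama 3 8B model.
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096, bias=False),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 4096, bias=False),
).to(device="cuda", dtype=torch.bfloat16)

# Int4 weight-only quantization; group_size=128 is an illustrative choice,
# and recent torchao releases also expose an hqq option on this config.
quantize_(model, int4_weight_only(group_size=128))

# Alternatively, let autoquant pick per-layer kernels and compile the result:
# model = torchao.autoquant(torch.compile(model, mode="max-autotune"))

with torch.inference_mode():
    out = model(torch.randn(1, 4096, device="cuda", dtype=torch.bfloat16))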
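
The pretraining bullets correspond to torchao's float8 training conversion and its low-bit optimizers. A rough sketch, assuming torchao 0.5-era module paths (the 4-bit optimizer lived under torchao.prototype at the time) and a GPU with float8 support such as H100:

import torch
from torchao.float8 import convert_to_float8_training
from torchao.prototype.low_bit_optim import AdamW4bit  # prototype API; location is version-dependent

model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096, bias=False),
    torch.nn.Linear(4096, 4096, bias=False),
).to(device="cuda", dtype=torch.bfloat16)

# Swap eligible nn.Linear layers to float8 dynamic-scaled matmuls for training.
convert_to_float8_training(model)

# 4-bit quantized optimizer states reduce peak optimizer VRAM.
optim = AdamW4bit(model.parameters(), lr=1e-4)

x = torch.randn(16, 4096, device="cuda", dtype=torch.bfloat16)
loss = model(x).float().pow(2).mean()
loss.backward()
optim.step()
optim.zero_grad()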
