Overview
SmoothQuant and AWQ are built on a similar intuition: scale up weights and scale down activations. Their benchmarks can therefore share much of the same logic within the vLLM ecosystem, using vllm/lm-eval and vllm/benchmarks/benchmark_latency.py.
Ideally, the following user guides and docs can be updated to focus on vLLM:
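The shared intuition can be sketched numerically: applying a per-channel scale to the weights and its inverse to the activations leaves the matmul output unchanged while migrating outlier magnitude out of the activations. This is a minimal illustration only; the scale rule below (`alpha = 0.5` on the per-channel activation max) is an assumption for the sketch, not torchao's actual implementation:

```python
import numpy as np

# SmoothQuant/AWQ-style smoothing: Y = X @ W == (X / s) @ (s[:, None] * W).
# Dividing activations by s and multiplying weights by s preserves the
# output exactly, but shifts quantization difficulty onto the weights.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))      # activations (tokens x in-channels)
x[:, 0] *= 50.0                      # inject an activation outlier channel
w = rng.standard_normal((8, 16))     # weights (in-channels x out-channels)

# Illustrative scale: per-channel activation max softened by alpha = 0.5
# (the real methods derive scales from calibration data).
s = np.abs(x).max(axis=0) ** 0.5

y_ref = x @ w
y_smooth = (x / s) @ (s[:, None] * w)

print(np.allclose(y_ref, y_smooth))          # prints True: same output
print(np.abs(x / s).max() < np.abs(x).max()) # prints True: outliers tamed
```

Because the transform is output-preserving, any accuracy gap measured afterwards (e.g. via lm-eval) comes from quantization itself, not from the smoothing step.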
- AWQ: torchao/prototype/awq/example.py & torchao/prototype/awq/README.md
- SmoothQuant: torchao/prototype/smoothquant/example.py & torchao/prototype/smoothquant/README.md
Related Issue/PR
Resources
- https://blog.squeezebits.com/vllm-vs-tensorrtllm-2-towards-optimal-batching-for-llm-serving-31349
- https://vllm-ascend.readthedocs.io/en/latest/developer_guide/evaluation/using_lm_eval.html
- https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_latency.py
- How to benchmark vLLM: a short tutorial (vllm-project/vllm#7181)