In this repository, we quantize different models and measure their inference duration, latency, and memory requirements.
For the CodeGen2 models from Salesforce, we used three variants: the 1B, 3.7B, and 7B models.
Each model was run with each of the following quantization data types (a loading sketch follows the list):
- FP32
- BF16
- INT8
- INT4
- NF4
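
Below is a minimal sketch of how each of these configurations can be requested with Hugging Face transformers and bitsandbytes. The prompt and generation length are illustrative choices, and the INT4/NF4 entries map onto bitsandbytes' FP4 and NF4 4-bit quantization types; this is an assumption about the setup, not the repository's exact benchmarking code.

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "Salesforce/codegen2-1B"  # also run with codegen2-3_7B and codegen2-7B

# One loading configuration per data type in the list above.
# bitsandbytes exposes 4-bit loading as FP4 ("INT4" here) and NF4 variants.
CONFIGS = {
    "fp32": {"torch_dtype": torch.float32},
    "bf16": {"torch_dtype": torch.bfloat16},
    "int8": {"quantization_config": BitsAndBytesConfig(load_in_8bit=True)},
    "int4": {"quantization_config": BitsAndBytesConfig(
        load_in_4bit=True, bnb_4bit_quant_type="fp4")},
    "nf4": {"quantization_config": BitsAndBytesConfig(
        load_in_4bit=True, bnb_4bit_quant_type="nf4")},
}

def benchmark(name: str, prompt: str = "def hello_world():") -> None:
    # CodeGen2 checkpoints ship custom model code, hence trust_remote_code.
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, device_map="auto", trust_remote_code=True, **CONFIGS[name])
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    torch.cuda.reset_peak_memory_stats()
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=64)  # illustrative generation length
    elapsed = time.perf_counter() - start

    peak_gib = torch.cuda.max_memory_allocated() / 2**30
    print(f"{name}: {elapsed:.2f} s, peak GPU memory {peak_gib:.2f} GiB")

for dtype in CONFIGS:
    benchmark(dtype)
```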
The results are as follows.
We also evaluated a 34B model.
Since quantizing a model of this size takes substantial time and compute, we used the AWQ- and GPTQ-quantized checkpoints published by TheBloke.
The results are shown below; the AWQ model performs slightly better than the GPTQ model.
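
As a rough sketch of how such pre-quantized checkpoints can be loaded and timed with Hugging Face transformers: the repo IDs below are illustrative placeholders, since the exact 34B checkpoints are not named here, and the prompt and generation length are arbitrary.

```python
import time

from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo IDs -- substitute the actual 34B checkpoints used.
# AWQ checkpoints require the autoawq package; GPTQ requires auto-gptq/optimum.
AWQ_ID = "TheBloke/CodeLlama-34B-AWQ"
GPTQ_ID = "TheBloke/CodeLlama-34B-GPTQ"

def time_generation(repo_id: str, prompt: str = "def quicksort(arr):") -> float:
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    # The quantization scheme is read from the checkpoint's config,
    # so no extra quantization arguments are needed here.
    model = AutoModelForCausalLM.from_pretrained(repo_id, device_map="auto")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=64)
    return time.perf_counter() - start

for repo in (AWQ_ID, GPTQ_ID):
    print(f"{repo}: {time_generation(repo):.2f} s")
```

Because the quantization scheme is stored in each checkpoint's config, transformers selects the AWQ or GPTQ kernels automatically once the corresponding backend package is installed.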