This repository provides instructions and prebuilt wheels for installing vLLM 0.11.0 (and 0.10.2) with Pascal GPU support (GTX 1060, 1070, 1080, and similar cards) using CUDA 12.6.

## Requirements
- Debian 12 (or compatible)
- NVIDIA GPU with Pascal architecture
- CUDA 12.6 and NVIDIA drivers
- Miniconda or Anaconda
- Python 3.12
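Before installing anything, it can help to confirm that the driver sees the card and that it really is a Pascal part (compute capability 6.x). A minimal check, assuming a driver recent enough to support the `compute_cap` query field:

```bash
# Pascal cards should report compute capability 6.0 or 6.1
nvidia-smi --query-gpu=name,compute_cap,driver_version --format=csv
```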
## Install CUDA 12.6

Follow the official guide:
👉 CUDA 12.6 Download Archive
Or use this helpful guide for Debian 12:
👉 How to Install CUDA on Debian 12
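After the toolkit is installed, a quick sanity check (assuming `nvcc` is on your `PATH`, e.g. via `/usr/local/cuda-12.6/bin`):

```bash
# nvcc should report CUDA 12.6; nvidia-smi shows the driver and its supported CUDA version
nvcc --version
nvidia-smi
```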
## Install Miniconda

Follow the official instructions:
👉 https://www.anaconda.com/docs/getting-started/miniconda/main
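If you prefer a quick command-line install, one common approach is to fetch Anaconda's standard latest-release installer (the URL below is Anaconda's generic download link, not something specific to this repo):

```bash
# Download and run the Miniconda installer, then re-open the shell
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p "$HOME/miniconda3"
"$HOME/miniconda3/bin/conda" init bash
```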
## Create the environment

```bash
conda create -n venv -c conda-forge git python=3.12
conda activate venv
```

## Install vLLM

Install one of the prebuilt wheels:

### vLLM 0.11.0

```bash
pip install https://github.com/ampir-nn/vllm-pascal/releases/download/wheels/vllm-0.11.0+pascal.cu126-cp312-cp312-linux_x86_64.whl
```

### vLLM 0.10.2

```bash
pip install https://github.com/ampir-nn/vllm-pascal/releases/download/wheels/vllm-0.10.2+pascal.cu126-cp312-cp312-linux_x86_64.whl
```

## Install the patched torch and triton wheels

```bash
pip uninstall torch triton -y
pip install https://github.com/ampir-nn/vllm-pascal/releases/download/wheels/triton-3.4.0-cp312-cp312-linux_x86_64.whl
pip install https://github.com/ampir-nn/vllm-pascal/releases/download/wheels/torch-2.8.0a0+gitba56102-cp312-cp312-linux_x86_64.whl
```

At the end of the torch/triton installation, pip will complain about dependency conflicts; this is expected and can be ignored.
## Install NCCL

```bash
sudo apt install libnccl2=2.28.3-1+cuda12.6 libnccl-dev=2.28.3-1+cuda12.6
```
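At this point it is worth checking that the patched torch build actually sees the GPU before moving on to serving. A small sanity check (nothing here is specific to this repo):

```bash
# Should print the torch version, CUDA version, True, and your GPU name
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available(), torch.cuda.get_device_name(0))"
# Should print the installed vLLM version
python -c "import vllm; print(vllm.__version__)"
```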
## Serving examples

### Qwen3-Coder-30B (GPTQ Int4), pipeline parallelism across 3 GPUs

```bash
export VLLM_ATTENTION_BACKEND=TRITON_ATTN
vllm serve jart25/Qwen3-Coder-30B-A3B-Instruct-Int4-gptq \
--tensor-parallel-size 1 \
--pipeline-parallel-size 3 \
--max-num-seqs 1 \
--max-model-len 4096 \
--dtype float16 \
--quantization gptq \
--gpu-memory-utilization 0.95 \
--swap-space 0 \
--cpu-offload-gb 0 \
--enable-expert-parallel \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder
```
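Once the server is up, you can hit the OpenAI-compatible endpoint it exposes (by default on port 8000). The prompt below is just an illustration:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "jart25/Qwen3-Coder-30B-A3B-Instruct-Int4-gptq",
        "messages": [{"role": "user", "content": "Write a Python one-liner that reverses a string."}],
        "max_tokens": 128
      }'
```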
### Qwen3-Coder-30B (GPTQ Int4), tensor parallelism across 2 GPUs

```bash
export VLLM_ATTENTION_BACKEND=TRITON_ATTN
vllm serve jart25/Qwen3-Coder-30B-A3B-Instruct-Int4-gptq \
--tensor-parallel-size 2 \
--max-num-seqs 1 \
--max-model-len 4096 \
--dtype float16 \
--quantization gptq \
--gpu-memory-utilization 0.95 \
--swap-space 0 \
--cpu-offload-gb 0 \
--enable-expert-parallel \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder
```
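To confirm which model name the server registered (useful when pointing clients at it later), you can query the models endpoint of vLLM's OpenAI-compatible API, again assuming the default port 8000:

```bash
# Lists the served model id(s)
curl http://localhost:8000/v1/models
```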
### Local GGUF model (Qwen3-14B Q5_K_M), tensor parallelism across 2 GPUs

```bash
export VLLM_ATTENTION_BACKEND=TRITON_ATTN
vllm serve ./Qwen3-14B-Q5_K_M.gguf \
--tensor-parallel-size 2 \
--max-num-seqs 1 \
--max-model-len 16384 \
--max-num-batched-tokens 16384 \
--dtype float16 \
--quantization gguf \
--gpu-memory-utilization 0.95 \
--swap-space 0 \
--cpu-offload-gb 0
```
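When serving a local GGUF file like this, clients normally address the model by the same path string that was passed to `vllm serve` (unless it is overridden with `--served-model-name`). A quick check against the default port 8000, with an illustrative prompt:

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "./Qwen3-14B-Q5_K_M.gguf",
        "prompt": "The capital of France is",
        "max_tokens": 16
      }'
```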
## Notes

- This setup is specific to Pascal GPUs and CUDA 12.6.
- Do not use it with newer GPUs (Turing/Ampere/Ada); use standard vLLM instead.
- Built for Python 3.12 and Debian 12.