Hi all,
Apologies if this is the wrong place.
My goal is to reach a token generation speed of 10+ tokens/second with a 30B-parameter model.
I've tried to follow the README instructions precisely in order to run llama.cpp with GPU acceleration, but I can't seem to get any reasonable generation speed. I'm currently at less than 1 token/minute.
My installation steps:
- Provisioned an A6000 machine from jarvislabs.ai. It has 48 GB VRAM, 32 GB RAM, and a 100 GB SSD. It comes preinstalled with the CUDA toolkit, python3, git, and anything else needed to get started, as far as I'm aware.
- Cloned the latest llama.cpp with git clone https://github.com/ggerganov/llama.cpp
- Ran make LLAMA_CUBLAS=1, since I have a CUDA-enabled NVIDIA graphics card (the full command sequence is consolidated in the sketch after this list)
- Downloaded a 30B Q4 GGML Vicuna model (it's called Wizard-Vicuna-30B-Uncensored.ggmlv3.q4_0.bin)
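For reference, a minimal sketch of those build steps as shell commands, assuming the preinstalled CUDA toolkit is already on the PATH:

```sh
# Clone the repository and build with cuBLAS support,
# which is what enables offloading layers to the GPU.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make LLAMA_CUBLAS=1
```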
My inference command
./main -m path/to/Wizard-Vicuna-30B-Uncensored.ggmlv3.q4_0.bin -n 50 -ngl 2000000 -p "Hey, can you please "
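As a point of comparison only (not something from the original run), a hypothetical variant of the same command with an explicit offload count and thread count; the load log below reports 60 layers, and a very large -ngl like the one above just ends up offloading everything anyway:

```sh
# Hypothetical variant for illustration: offload everything explicitly
# (61 = 60 repeating layers + the output layer) and pin the thread count.
# -t 8 is an assumed value, not taken from the original run.
./main -m path/to/Wizard-Vicuna-30B-Uncensored.ggmlv3.q4_0.bin \
       -n 50 -ngl 61 -t 8 -p "Hey, can you please "
```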
Expected behavior
Inference runs at a rate of at least 1 token/second (maybe even 10/second on this "beefy" machine?).
Actual behavior
Inference works, but token generation speed is about 1 token / minute.
llama.cpp claims that work is being offloaded to the GPU:
main: build = 607 (ffb06a3)
main: seed = 1685616701
llama.cpp: loading model from path/to/Wizard-Vicuna-30B-Uncensored.ggmlv3.q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 6656
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 52
llama_model_load_internal: n_layer = 60
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 17920
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size = 0.13 MB
llama_model_load_internal: mem required = 2532.67 MB (+ 3124.00 MB per state)
llama_model_load_internal: [cublas] offloading 60 layers to GPU
llama_model_load_internal: [cublas] offloading output layer to GPU
llama_model_load_internal: [cublas] total VRAM used: 17223 MB
....................................................................................................
llama_init_from_file: kv self size = 780.00 MB
system_info: n_threads = 32 / 64 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 50, n_keep = 0
CPU usage is 700% (according to top)
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5762 root 20 0 48.4g 21.2g 20.3g R 675.7 8.4 5:04.59 main
GPU is not being used (according to watch nvidia-smi)
Thu Jun 1 10:53:13 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05 Driver Version: 520.61.05 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA RTX A6000 Off | 00000000:B2:00.0 Off | Off |
| 30% 32C P2 67W / 300W | 18750MiB / 49140MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
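For completeness, the snapshot above came from watch nvidia-smi; a sketch of how GPU utilization can be polled while generation runs (a suggestion for reproducing the observation, not part of the original setup):

```sh
# Refresh the full nvidia-smi view every second while generation runs.
watch -n 1 nvidia-smi

# Or poll just the utilization and memory figures once per second.
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1
```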