
Token generation speed less than 1/minute on A6000 machine (48GB VRAM, 32GB RAM) #1665

Description

@Yuval-Peled

Hi all,
Apologies if this is the wrong place.

My goal is to reach a token generation speed of 10+ tokens/second with a 30B-parameter model.

I've tried to follow the README instructions precisely in order to run llama.cpp with GPU acceleration, but I can't get anywhere near a usable generation speed. I'm currently at less than 1 token/minute.

My installation steps (collected into a shell sketch after the list):

  1. Provisioned an A6000 machine from jarvislabs.ai. It has 48 GB VRAM, 32 GB RAM, and a 100 GB SSD, and comes preinstalled with the CUDA toolkit, python3, git, and (as far as I'm aware) everything needed to get started
  2. Cloned the latest llama.cpp with git clone https://github.com/ggerganov/llama.cpp
  3. Ran make LLAMA_CUBLAS=1, since I have a CUDA-enabled NVIDIA graphics card
  4. Downloaded a 30B Q4 GGML Vicuna model (it's called Wizard-Vicuna-30B-Uncensored.ggmlv3.q4_0.bin)
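
Roughly, as a shell sketch of the steps above (the model's download source and exact path are placeholders, not necessarily what I used):

# clone and build llama.cpp with cuBLAS (GPU) support
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make LLAMA_CUBLAS=1

# the quantized 30B GGML model was downloaded separately and placed at
# path/to/Wizard-Vicuna-30B-Uncensored.ggmlv3.q4_0.bin (placeholder path)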

My inference command

./main -m path/to/Wizard-Vicuna-30B-Uncensored.ggmlv3.q4_0.bin -n 50 -ngl 2000000 -p "Hey, can you please "
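
For clarity, here is the same command with my understanding of each flag (the comments are my reading of the options, not output from the tool):

# -m    path to the GGML model file
# -n    number of tokens to generate (50)
# -ngl  number of layers to offload to the GPU (set absurdly high so that all 60 layers are offloaded)
# -p    the prompt
./main -m path/to/Wizard-Vicuna-30B-Uncensored.ggmlv3.q4_0.bin -n 50 -ngl 2000000 -p "Hey, can you please "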

Expected behavior

Inference runs at 1+ tokens/second (maybe even 10/second on this "beefy" machine?)

Actual behavior

Inference works, but token generation speed is about 1 token / minute.

llama.cpp claims that work is being offloaded to the GPU

main: build = 607 (ffb06a3)
main: seed  = 1685616701
llama.cpp: loading model from path/to/Wizard-Vicuna-30B-Uncensored.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 6656
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 52
llama_model_load_internal: n_layer    = 60
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 17920
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size =    0.13 MB
llama_model_load_internal: mem required  = 2532.67 MB (+ 3124.00 MB per state)
llama_model_load_internal: [cublas] offloading 60 layers to GPU
llama_model_load_internal: [cublas] offloading output layer to GPU
llama_model_load_internal: [cublas] total VRAM used: 17223 MB
....................................................................................................
llama_init_from_file: kv self size  =  780.00 MB

system_info: n_threads = 32 / 64 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 50, n_keep = 0

CPU usage is about 700% (according to top)

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                       
 5762 root      20   0   48.4g  21.2g  20.3g R 675.7   8.4   5:04.59 main                                                          

The GPU is not being used (according to watch nvidia-smi)
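
The monitoring loop was roughly the following (the exact refresh interval is an assumption on my part):

watch -n 1 nvidia-smi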

Thu Jun  1 10:53:13 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A6000    Off  | 00000000:B2:00.0 Off |                  Off |
| 30%   32C    P2    67W / 300W |  18750MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
