
Token generation speed less than 1/minute on A6000 machine (48GB VRAM, 32GB RAM) #1665

Description

@Yuval-Peled

Hi all,
Apologies if this is the wrong place.

My goal is to reach a token generation speed of 10+ tokens/second with a 30B-parameter model.

I've tried to follow the README instructions precisely in order to run llama.cpp with GPU acceleration, but I can't get anywhere near a usable generation speed. I'm currently at less than 1 token/minute.

My installation steps (collected into a shell sketch after the list):

  1. Provisioned an A6000 machine from jarvislabs.ai. It has 48 GB VRAM, 32 GB RAM, and a 100 GB SSD, and comes preinstalled with the CUDA toolkit, python3, git, and (as far as I'm aware) everything needed to get started
  2. Cloned the latest llama.cpp with git clone https://github.com/ggerganov/llama.cpp
  3. Ran make LLAMA_CUBLAS=1, since I have a CUDA-enabled NVIDIA graphics card
  4. Downloaded a 30B Q4 GGML Vicuna model (it's called Wizard-Vicuna-30B-Uncensored.ggmlv3.q4_0.bin)
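
Roughly, as a shell sketch of the steps above (the model's download source and exact path are placeholders, not necessarily what I used):

# clone and build llama.cpp with cuBLAS (GPU) support
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make LLAMA_CUBLAS=1

# the quantized 30B GGML model was downloaded separately and placed at
# path/to/Wizard-Vicuna-30B-Uncensored.ggmlv3.q4_0.bin (placeholder path)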

My inference command

./main -m path/to/Wizard-Vicuna-30B-Uncensored.ggmlv3.q4_0.bin -n 50 -ngl 2000000 -p "Hey, can you please "
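
For clarity, here is the same command with my understanding of each flag (the comments are my reading of the options, not output from the tool):

# -m    path to the GGML model file
# -n    number of tokens to generate (50)
# -ngl  number of layers to offload to the GPU (set absurdly high so that all 60 layers are offloaded)
# -p    the prompt
./main -m path/to/Wizard-Vicuna-30B-Uncensored.ggmlv3.q4_0.bin -n 50 -ngl 2000000 -p "Hey, can you please "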

Expected behavior

Inference runs at 1+ tokens/second (maybe even 10/second on this "beefy" machine?)

Actual behavior

Inference works, but token generation speed is about 1 token / minute.

llama.cpp claims that work is being offloaded to the GPU

main: build = 607 (ffb06a3)
main: seed  = 1685616701
llama.cpp: loading model from path/to/Wizard-Vicuna-30B-Uncensored.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 6656
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 52
llama_model_load_internal: n_layer    = 60
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 17920
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size =    0.13 MB
llama_model_load_internal: mem required  = 2532.67 MB (+ 3124.00 MB per state)
llama_model_load_internal: [cublas] offloading 60 layers to GPU
llama_model_load_internal: [cublas] offloading output layer to GPU
llama_model_load_internal: [cublas] total VRAM used: 17223 MB
....................................................................................................
llama_init_from_file: kv self size  =  780.00 MB

system_info: n_threads = 32 / 64 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 50, n_keep = 0

CPU usage is about 700% (according to top)

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                       
 5762 root      20   0   48.4g  21.2g  20.3g R 675.7   8.4   5:04.59 main                                                          

The GPU is not being used (according to watch nvidia-smi)
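
The monitoring loop was roughly the following (the exact refresh interval is an assumption on my part):

watch -n 1 nvidia-smi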

Thu Jun  1 10:53:13 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A6000    Off  | 00000000:B2:00.0 Off |                  Off |
| 30%   32C    P2    67W / 300W |  18750MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
