
Performance differences between Ollama and GPUStack when running an embedding model #1384

@wyanghu

Description


I have configured both Ollama and GPUStack to run the bge-m3 model. The copy running on GPUStack was also downloaded from Ollama and is executed with vLLM or llama-box. However, when I call bge-m3 on GPUStack, GPU utilization stays below 20%, whereas the same calls against Ollama drive GPU utilization above 80%.
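For context, here is a minimal sketch of how such a comparison call can be issued, assuming both servers expose an OpenAI-compatible /v1 embeddings route; the GPUStack host, base path, API key, and model name below are placeholders to adjust for your deployment:

```python
# Send the same embedding batch to both backends and time the response.
# Hosts, ports, API key, and the GPUStack base path are placeholders (assumptions).
import time
import requests

BACKENDS = {
    "ollama":   "http://localhost:11434/v1/embeddings",            # Ollama's OpenAI-compatible endpoint
    "gpustack": "http://<gpustack-host>/v1-openai/embeddings",      # adjust to your GPUStack OpenAI-compatible base path
}
TEXTS = ["example sentence"] * 64  # one batch of inputs

for name, url in BACKENDS.items():
    start = time.time()
    resp = requests.post(
        url,
        headers={"Authorization": "Bearer <api-key>"},  # GPUStack requires an API key; Ollama ignores it
        json={"model": "bge-m3", "input": TEXTS},       # model name must match the deployed model
        timeout=120,
    )
    resp.raise_for_status()
    n = len(resp.json()["data"])
    print(f"{name}: {n} embeddings in {time.time() - start:.2f}s")
```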

I have confirmed that the same model is being called in both cases.
I also tried different embedding models and launch parameters such as quantization, max-num-batched-tokens, kv-cache-dtype, and max-num-seqs (roughly as sketched below), but none of them helped.
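For reference, a rough offline sketch of where those vLLM parameters are applied when launching the model; the values are illustrative only, and the model ID, task flag, and availability of quantized kernels are assumptions that depend on the vLLM version in use:

```python
# Rough sketch of the vLLM launch parameters mentioned above (illustrative values only).
from vllm import LLM

llm = LLM(
    model="BAAI/bge-m3",            # assumed Hugging Face ID; GPUStack may point at a local path instead
    task="embed",                   # run as an embedding (pooling) model, if supported by this vLLM version
    max_num_seqs=256,               # upper bound on sequences scheduled per step
    max_num_batched_tokens=32768,   # token budget per scheduler step
    kv_cache_dtype="auto",
    # quantization="fp8",           # only if a matching quantized checkpoint/kernel is available
)

outputs = llm.embed(["what is bge-m3?"])
print(len(outputs), "embedding(s) returned")
```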

Has anyone encountered this issue before? Could you share your thoughts and solutions with me? I would really appreciate it.
