
Performance differences between Ollama and GPUStack when running an embedding model #1384

@wyanghu

Description


I have configured both Ollama and GPUStack to run the bge-m3 model. The copy running on GPUStack was also downloaded from Ollama and is executed with vLLM or llama-box. However, when I call bge-m3 on GPUStack, GPU utilization stays below 20%, whereas the same calls against Ollama drive GPU utilization above 80%.
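For context, here is a minimal sketch of how such a comparison call can be issued, assuming both servers expose an OpenAI-compatible /v1 embeddings route; the GPUStack host, base path, API key, and model name below are placeholders to adjust for your deployment:

```python
# Send the same embedding batch to both backends and time the response.
# Hosts, ports, API key, and the GPUStack base path are placeholders (assumptions).
import time
import requests

BACKENDS = {
    "ollama":   "http://localhost:11434/v1/embeddings",            # Ollama's OpenAI-compatible endpoint
    "gpustack": "http://<gpustack-host>/v1-openai/embeddings",      # adjust to your GPUStack OpenAI-compatible base path
}
TEXTS = ["example sentence"] * 64  # one batch of inputs

for name, url in BACKENDS.items():
    start = time.time()
    resp = requests.post(
        url,
        headers={"Authorization": "Bearer <api-key>"},  # GPUStack requires an API key; Ollama ignores it
        json={"model": "bge-m3", "input": TEXTS},       # model name must match the deployed model
        timeout=120,
    )
    resp.raise_for_status()
    n = len(resp.json()["data"])
    print(f"{name}: {n} embeddings in {time.time() - start:.2f}s")
```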

I have confirmed that the same model is being called in both cases.
I also tried different embedding models and launch parameters such as quantization, max-num-batched-tokens, kv-cache-dtype, and max-num-seqs (roughly as sketched below), but none of them helped.
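For reference, a rough offline sketch of where those vLLM parameters are applied when launching the model; the values are illustrative only, and the model ID, task flag, and availability of quantized kernels are assumptions that depend on the vLLM version in use:

```python
# Rough sketch of the vLLM launch parameters mentioned above (illustrative values only).
from vllm import LLM

llm = LLM(
    model="BAAI/bge-m3",            # assumed Hugging Face ID; GPUStack may point at a local path instead
    task="embed",                   # run as an embedding (pooling) model, if supported by this vLLM version
    max_num_seqs=256,               # upper bound on sequences scheduled per step
    max_num_batched_tokens=32768,   # token budget per scheduler step
    kv_cache_dtype="auto",
    # quantization="fp8",           # only if a matching quantized checkpoint/kernel is available
)

outputs = llm.embed(["what is bge-m3?"])
print(len(outputs), "embedding(s) returned")
```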

Has anyone encountered this issue before? Could you share your thoughts and solutions with me? I would really appreciate it.
