Add support for CogVLM model #15002


Open
wants to merge 16 commits into master

Conversation

Tianyue-Zhao

This addresses the requests for CogVLM in #4387 and #4350.
CogVLM is a fairly popular model, and it now fits in cleanly after the recent additions to libmtmd.
I've converted a GGUF here: Link to GGUF files

Sample command and output:

build/bin/llama-mtmd-cli -m ../cogvlm-chat-hf/cogvlm-13B-chat-v1.1-F16.gguf --mmproj ../cogvlm-chat-hf/mmproj-cogvlm-chat-hf --image ./community.png --chat-template vicuna -p "Describe the picture"

load_hparams: model size:         8448.53 MiB
load_hparams: metadata size:      0.36 MiB
alloc_compute_meta:        CPU compute buffer size =   142.02 MiB
main: loading model: ../cogvlm-chat-hf/cogvlm-13B-chat-v1.1-F16.gguf
encoding image slice...
image slice encoded in 16135 ms
decoding image batch 1/1, n_tokens_batch = 1227
image decoded (batch 1/1) in 54065 ms

1. The image showcases a futuristic urban landscape with a mix of architectural styles. The buildings are multi-storied and have a combination of traditional and modern elements. There's a prominent tree in the foreground, suggesting a blend of nature and urban development. The scene appears to be bustling with activity, with various signs and billboards, indicating commercial or residential zones.


llama_perf_context_print:        load time =  108969.65 ms
llama_perf_context_print: prompt eval time =   85229.27 ms /  1241 tokens (   68.68 ms per token,    14.56 tokens per second)
llama_perf_context_print:        eval time =   19843.15 ms /    83 runs   (  239.07 ms per token,     4.18 tokens per second)
llama_perf_context_print:       total time =  126951.23 ms /  1324 tokens
llama_perf_context_print:    graphs reused =          0

@github-actions github-actions bot added the examples and python (python script changes) labels on Aug 1, 2025
@Tianyue-Zhao Tianyue-Zhao marked this pull request as ready for review August 1, 2025 02:15
@Tianyue-Zhao
Author

I think I've fixed the typecheck and format-check workflows that were failing before; can someone approve the workflows to run again?
Also, is there a way to run these GitHub workflows locally or without needing approval from a reviewer?
It would be good to run these CI/CD checks myself before posting the PR.

@CISC
Collaborator

CISC commented Aug 2, 2025

Also, is there a way to run these GitHub workflows locally or without needing approval from a reviewer? It would be good to run these CI/CD checks myself before posting the PR.

You can run flake8, pyright and editorconfig locally (or via IDE plugins); the build tests can be run manually with ctest.

Collaborator

@CISC CISC left a comment

This is not a complete review as I don't know enough about mtmd, just commenting...

Comment on lines +7934 to +8315

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.hparams['num_attention_heads'] = self.hparams['num_heads']

    def set_gguf_parameters(self):
Collaborator

Suggested change
-    def __init__(self, *args, **kwargs):
-        super().__init__(*args, **kwargs)
-        self.hparams['num_attention_heads'] = self.hparams['num_heads']
-
-    def set_gguf_parameters(self):
+    def set_gguf_parameters(self):

Add num_heads to the list here instead:

self.gguf_writer.add_vision_head_count(self.find_vparam(["num_attention_heads"]))

Author

So I didn't end up doing this, because even after I add "num_heads" to the list, self.find_vparam(["num_attention_heads"]) still fails because it can't find the key.
I think this workaround would need to stay... unless there's some other way?

Collaborator

In case it wasn't clear, I meant modifying the list like this: self.find_vparam(["num_attention_heads", "num_heads"])

Then it should just work and you don't need to change hparams.

Comment on lines 7950 to 8326
if "query_key_value" in name:
# Split tensor into three along first axis
q, k, v = data_torch.split(data_torch.shape[0] // 3, dim=0)
return [
(self.map_tensor_name(name.replace("query_key_value", "query")), q),
(self.map_tensor_name(name.replace("query_key_value", "key")), k),
(self.map_tensor_name(name.replace("query_key_value", "value")), v),
]

return [(self.map_tensor_name(name), data_torch)]
Collaborator

Suggested change
-        if "query_key_value" in name:
-            # Split tensor into three along first axis
-            q, k, v = data_torch.split(data_torch.shape[0] // 3, dim=0)
-            return [
-                (self.map_tensor_name(name.replace("query_key_value", "query")), q),
-                (self.map_tensor_name(name.replace("query_key_value", "key")), k),
-                (self.map_tensor_name(name.replace("query_key_value", "value")), v),
-            ]
-
-        return [(self.map_tensor_name(name), data_torch)]
+        return [(self.map_tensor_name(name), data_torch)]

Create Q/K/V views at build time instead (check other (non-mm) models for examples).
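
For reference, a minimal sketch of what the build-time split can look like in the graph code (names like model.layers[il].wqkv are illustrative here, and the offsets assume Q, K and V are packed back to back along dim 0 with equal head counts, i.e. no GQA):

// one matmul against the fused QKV weight, kept as a single tensor in the GGUF
ggml_tensor * qkv = ggml_mul_mat(ctx0, model.layers[il].wqkv, cur);

// strided 3D views into the fused result instead of separate Q/K/V weight tensors
ggml_tensor * Qcur = ggml_view_3d(ctx0, qkv, n_embd_head, n_head, n_tokens,
        n_embd_head * ggml_element_size(qkv), qkv->nb[1], 0);
ggml_tensor * Kcur = ggml_view_3d(ctx0, qkv, n_embd_head, n_head_kv, n_tokens,
        n_embd_head * ggml_element_size(qkv), qkv->nb[1], n_embd * ggml_element_size(qkv));
ggml_tensor * Vcur = ggml_view_3d(ctx0, qkv, n_embd_head, n_head_kv, n_tokens,
        n_embd_head * ggml_element_size(qkv), qkv->nb[1], 2 * n_embd * ggml_element_size(qkv));

The converter then maps query_key_value to a single fused tensor instead of splitting it into three.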

Author

Good point, I've changed it to do the split in llama-model.cpp instead.

Comment on lines 7966 to 8333
    def set_gguf_parameters(self):
        super().set_gguf_parameters()

    def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
Collaborator

Suggested change
-    def set_gguf_parameters(self):
-        super().set_gguf_parameters()
-
-    def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
+    def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:

Author

Removed.

Comment on lines 7976 to 8340
if "query_key_value.weight" in name:
# Slice tensor into three along first axis
q, k, v = data_torch.split(data_torch.shape[0] // 3, dim=0)
return [
(self.map_tensor_name(name.replace("query_key_value", "query")), q),
(self.map_tensor_name(name.replace("query_key_value", "key")), k),
(self.map_tensor_name(name.replace("query_key_value", "value")), v),
]

return [(self.map_tensor_name(name), data_torch)]
Collaborator

Suggested change
-        if "query_key_value.weight" in name:
-            # Slice tensor into three along first axis
-            q, k, v = data_torch.split(data_torch.shape[0] // 3, dim=0)
-            return [
-                (self.map_tensor_name(name.replace("query_key_value", "query")), q),
-                (self.map_tensor_name(name.replace("query_key_value", "key")), k),
-                (self.map_tensor_name(name.replace("query_key_value", "value")), v),
-            ]
-
-        return [(self.map_tensor_name(name), data_torch)]
+        return [(self.map_tensor_name(name), data_torch)]

Author

Same as the comment above: both CLIP and the main model now split the tensors in the graph instead.

Comment on lines 17659 to 17660
Qcur = ggml_rope(ctx0, Qcur, inp_pos, n_embd_head, GGML_ROPE_TYPE_NEOX);
Kcur = ggml_rope(ctx0, Kcur, inp_pos, n_embd_head, GGML_ROPE_TYPE_NEOX);
Collaborator

Update llama_model_rope_type instead and use rope_type.
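
Roughly, that means two small changes (a sketch only; LLM_ARCH_COGVLM stands in for whatever arch enum this PR introduces, and it assumes CogVLM keeps NEOX-style RoPE):

// in llama_model_rope_type(): map the new arch to its RoPE flavour once
    case LLM_ARCH_COGVLM:
        return LLAMA_ROPE_TYPE_NEOX;

// in the graph build: use the per-model rope_type instead of hard-coding NEOX
Qcur = ggml_rope_ext(ctx0, Qcur, inp_pos, nullptr,
        n_rot, rope_type, n_ctx_orig, freq_base, freq_scale,
        ext_factor, attn_factor, beta_fast, beta_slow);
Kcur = ggml_rope_ext(ctx0, Kcur, inp_pos, nullptr,
        n_rot, rope_type, n_ctx_orig, freq_base, freq_scale,
        ext_factor, attn_factor, beta_fast, beta_slow);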

Author

Good point, I've changed it to use rope_type instead.

@Tianyue-Zhao
Author

Also, is there a way to run these GitHub workflows locally or without needing approval from a reviewer? It would be good to run these CI/CD checks myself before posting the PR.

You can run flake8, pyright and editorconfig locally (or via IDE plugins); the build tests can be run manually with ctest.

Thanks for the info! That's something I've been wondering about for a while.

Comment on lines +18126 to +18135
// split qkv into Q, K, V along the first dimension
ggml_tensor * Qcur = ggml_cont(ctx0, ggml_view_2d(ctx0, qkv, n_embd, n_tokens,
        qkv->nb[1], 0));
ggml_tensor * Kcur = ggml_cont(ctx0, ggml_view_2d(ctx0, qkv, n_embd, n_tokens,
        qkv->nb[1], n_embd * ggml_element_size(qkv)));
ggml_tensor * Vcur = ggml_cont(ctx0, ggml_view_2d(ctx0, qkv, n_embd, n_tokens,
        qkv->nb[1], 2 * n_embd * ggml_element_size(qkv)));

Qcur = ggml_reshape_3d(ctx0, Qcur, n_embd_head, n_head, n_tokens);
Kcur = ggml_reshape_3d(ctx0, Kcur, n_embd_head, n_head_kv, n_tokens);
Collaborator

Suggested change
-// split qkv into Q, K, V along the first dimension
-ggml_tensor * Qcur = ggml_cont(ctx0, ggml_view_2d(ctx0, qkv, n_embd, n_tokens,
-        qkv->nb[1], 0));
-ggml_tensor * Kcur = ggml_cont(ctx0, ggml_view_2d(ctx0, qkv, n_embd, n_tokens,
-        qkv->nb[1], n_embd * ggml_element_size(qkv)));
-ggml_tensor * Vcur = ggml_cont(ctx0, ggml_view_2d(ctx0, qkv, n_embd, n_tokens,
-        qkv->nb[1], 2 * n_embd * ggml_element_size(qkv)));
-
-Qcur = ggml_reshape_3d(ctx0, Qcur, n_embd_head, n_head, n_tokens);
-Kcur = ggml_reshape_3d(ctx0, Kcur, n_embd_head, n_head_kv, n_tokens);
+// split qkv into Q, K, V along the first dimension
+ggml_tensor * Qcur = ggml_view_3d(ctx0, qkv, n_embd_head, n_head, n_tokens, n_embd_head * sizeof(float),
+        qkv->nb[1], 0);
+ggml_tensor * Kcur = ggml_view_3d(ctx0, qkv, n_embd_head, n_head_kv, n_tokens, n_embd_head * sizeof(float),
+        qkv->nb[1], n_embd * ggml_element_size(qkv));
+ggml_tensor * Vcur = ggml_cont(ctx0, ggml_view_2d(ctx0, qkv, n_embd, n_tokens,
+        qkv->nb[1], 2 * n_embd * ggml_element_size(qkv)));

Collaborator

All backends can handle non-contiguous RoPE, so you don't need to ggml_cont Q/K here, and as a bonus you can directly create 3D views.

Comment on lines +1678 to +1682
// Apply silu
gate = ggml_silu_inplace(ctx0, gate);

// Multiply together
cur = ggml_mul(ctx0, gate, h_to_4h);
Collaborator

Suggested change
-// Apply silu
-gate = ggml_silu_inplace(ctx0, gate);
-
-// Multiply together
-cur = ggml_mul(ctx0, gate, h_to_4h);
+// Apply swiglu
+cur = ggml_swiglu_split(ctx0, gate, h_to_4h);
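
For reference, ggml_swiglu_split takes the gate and up projections as two separate tensors and computes the SiLU(gate) * up product as a single fused op, so (with the argument order above) it should be numerically equivalent to the silu + mul pair it replaces, with one less node in the graph.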
