Add support for CogVLM model #15002
base: master
Conversation
(force-pushed from 42113d1 to de22157)
I think I've fixed the typecheck and format check workflows that were failing before, can someone approve the workflows to run again?
You can run …
This is not a complete review as I don't know enough about mtmd, just commenting...
def __init__(self, *args, **kwargs):
    super().__init__(*args, **kwargs)
    self.hparams['num_attention_heads'] = self.hparams['num_heads']

def set_gguf_parameters(self):
Suggested change (remove the __init__ override and the hparams workaround, keeping only):

def set_gguf_parameters(self):
Add num_heads to the list here instead:

llama.cpp/convert_hf_to_gguf.py, line 1258 (at de22157):
self.gguf_writer.add_vision_head_count(self.find_vparam(["num_attention_heads"]))
So I didn't end up doing this, because after I add "num_heads" to the list, self.find_vparam(["num_attention_heads"]) will still fail due to not finding the key.
I think this workaround would need to stay... unless there's some other way?
In case it wasn't clear, I meant modifying the list like this: self.find_vparam(["num_attention_heads", "num_heads"])
Then it should just work and you don't need to change hparams.
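(For context, and if I understand the helper correctly: find_vparam returns the value for the first of the candidate keys present in the vision hparams, so extending the candidate list covers checkpoints that only ship num_heads, without rewriting hparams in __init__.)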
convert_hf_to_gguf.py (outdated)
if "query_key_value" in name: | ||
# Split tensor into three along first axis | ||
q, k, v = data_torch.split(data_torch.shape[0] // 3, dim=0) | ||
return [ | ||
(self.map_tensor_name(name.replace("query_key_value", "query")), q), | ||
(self.map_tensor_name(name.replace("query_key_value", "key")), k), | ||
(self.map_tensor_name(name.replace("query_key_value", "value")), v), | ||
] | ||
|
||
return [(self.map_tensor_name(name), data_torch)] |
if "query_key_value" in name: | |
# Split tensor into three along first axis | |
q, k, v = data_torch.split(data_torch.shape[0] // 3, dim=0) | |
return [ | |
(self.map_tensor_name(name.replace("query_key_value", "query")), q), | |
(self.map_tensor_name(name.replace("query_key_value", "key")), k), | |
(self.map_tensor_name(name.replace("query_key_value", "value")), v), | |
] | |
return [(self.map_tensor_name(name), data_torch)] | |
return [(self.map_tensor_name(name), data_torch)] |
Create Q/K/V views at build time instead (check other (non-mm) models for examples).
Good point, I've changed it to do the split in llama-model.cpp instead.
convert_hf_to_gguf.py (outdated)
def set_gguf_parameters(self):
    super().set_gguf_parameters()

def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
Suggested change (remove the redundant set_gguf_parameters override, keeping only):

def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
Removed.
convert_hf_to_gguf.py (outdated)
if "query_key_value.weight" in name: | ||
# Slice tensor into three along first axis | ||
q, k, v = data_torch.split(data_torch.shape[0] // 3, dim=0) | ||
return [ | ||
(self.map_tensor_name(name.replace("query_key_value", "query")), q), | ||
(self.map_tensor_name(name.replace("query_key_value", "key")), k), | ||
(self.map_tensor_name(name.replace("query_key_value", "value")), v), | ||
] | ||
|
||
return [(self.map_tensor_name(name), data_torch)] |
if "query_key_value.weight" in name: | |
# Slice tensor into three along first axis | |
q, k, v = data_torch.split(data_torch.shape[0] // 3, dim=0) | |
return [ | |
(self.map_tensor_name(name.replace("query_key_value", "query")), q), | |
(self.map_tensor_name(name.replace("query_key_value", "key")), k), | |
(self.map_tensor_name(name.replace("query_key_value", "value")), v), | |
] | |
return [(self.map_tensor_name(name), data_torch)] | |
return [(self.map_tensor_name(name), data_torch)] |
Same as the above comment: both CLIP and the main model now split the tensors in the graph instead.
src/llama-model.cpp (outdated)
Qcur = ggml_rope(ctx0, Qcur, inp_pos, n_embd_head, GGML_ROPE_TYPE_NEOX);
Kcur = ggml_rope(ctx0, Kcur, inp_pos, n_embd_head, GGML_ROPE_TYPE_NEOX);
Update llama_model_rope_type instead and use rope_type.
Good point, I've changed it to use rope_type instead.
Thanks for the info! That's something I've been wondering about for a while.
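For anyone following along, a minimal sketch of the rope_type route, assuming LLM_ARCH_COGVLM is the arch enum this PR introduces (the arch-to-rope mapping lives in llama_model_rope_type() in src/llama-model.cpp):

// In llama_model_rope_type(): declare NEOX-style RoPE for the new arch,
// so the graph code receives it through rope_type.
case LLM_ARCH_COGVLM:
    return LLAMA_ROPE_TYPE_NEOX;

The graph builder can then call ggml_rope_ext with rope_type instead of hard-coding GGML_ROPE_TYPE_NEOX, e.g. (parameter names as used by the other build functions):

Qcur = ggml_rope_ext(ctx0, Qcur, inp_pos, nullptr,
        n_rot, rope_type, n_ctx_orig, freq_base, freq_scale,
        ext_factor, attn_factor, beta_fast, beta_slow);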
(force-pushed from de22157 to a571d9a)
// split qkv into Q, K, V along the first dimension
ggml_tensor * Qcur = ggml_cont(ctx0, ggml_view_2d(ctx0, qkv, n_embd, n_tokens,
        qkv->nb[1], 0));
ggml_tensor * Kcur = ggml_cont(ctx0, ggml_view_2d(ctx0, qkv, n_embd, n_tokens,
        qkv->nb[1], n_embd * ggml_element_size(qkv)));
ggml_tensor * Vcur = ggml_cont(ctx0, ggml_view_2d(ctx0, qkv, n_embd, n_tokens,
        qkv->nb[1], 2 * n_embd * ggml_element_size(qkv)));

Qcur = ggml_reshape_3d(ctx0, Qcur, n_embd_head, n_head, n_tokens);
Kcur = ggml_reshape_3d(ctx0, Kcur, n_embd_head, n_head_kv, n_tokens);
Suggested change:

// split qkv into Q, K, V along the first dimension
ggml_tensor * Qcur = ggml_view_3d(ctx0, qkv, n_embd_head, n_head, n_tokens, n_embd_head * sizeof(float),
        qkv->nb[1], 0);
ggml_tensor * Kcur = ggml_view_3d(ctx0, qkv, n_embd_head, n_head_kv, n_tokens, n_embd_head * sizeof(float),
        qkv->nb[1], n_embd * ggml_element_size(qkv));
ggml_tensor * Vcur = ggml_cont(ctx0, ggml_view_2d(ctx0, qkv, n_embd, n_tokens,
        qkv->nb[1], 2 * n_embd * ggml_element_size(qkv)));
All backends can handle non-contiguous RoPE, so you don't need to ggml_cont Q/K here, and as a bonus you can directly create 3D views.
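To spell out the view arithmetic in the suggestion (my reading, using the names above):

// qkv has shape [3*n_embd, n_tokens]; within each token's row the values are
// laid out [Q | K | V], so all three views keep the token stride qkv->nb[1]
// and differ only in the byte offset along dim 0:
//   Q -> offset 0
//   K -> offset n_embd * ggml_element_size(qkv)
//   V -> offset 2 * n_embd * ggml_element_size(qkv)
// The 3D Q/K views additionally split dim 0 into [n_embd_head, n_head] with
// a head stride of n_embd_head * sizeof(float); RoPE can consume these
// non-contiguous views directly, so no ggml_cont copy is needed.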
// Apply silu
gate = ggml_silu_inplace(ctx0, gate);

// Multiply together
cur = ggml_mul(ctx0, gate, h_to_4h);
Suggested change:

// Apply swiglu
cur = ggml_swiglu_split(ctx0, gate, h_to_4h);
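(If I'm reading ggml's fused GLU ops correctly, ggml_swiglu_split(ctx0, a, b) computes silu(a) * b in a single graph node, so this is a drop-in replacement for the separate silu + mul above with the same result.)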
This addresses the requests for CogVLM in #4387 and #4350.
CogVLM is a pretty popular model that now slots in cleanly after the recent additions to libmtmd.
I've converted a GGUF here: Link to GGUF files
Sample command and output: