Multimodality support #790
Conversation
Signed-off-by: Dennis Keck <[email protected]>
llama-cpp-2/src/mtmd.rs
Outdated
        Self {
            use_gpu: false,
            print_timings: true,
            n_threads: 4,
            media_marker: CString::new(mtmd_default_marker()).unwrap_or_default(),
        }
    }
this seems like it should rely on mtmd_context_params_default
Good point -> d025465
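For illustration, a Default impl that defers to the C-side defaults might look roughly like this (a sketch only; the bindgen names such as mtmd_context_params_default and the exact field set are assumptions, not taken from this PR):

    use std::ffi::CStr;

    impl Default for MtmdContextParams {
        fn default() -> Self {
            // Assumed binding name; take the defaults from llama.cpp instead of
            // hard-coding them on the Rust side.
            let c_params = unsafe { llama_cpp_sys_2::mtmd_context_params_default() };
            Self {
                use_gpu: c_params.use_gpu,
                print_timings: c_params.print_timings,
                n_threads: c_params.n_threads,
                // The default media marker is exposed as a C string pointer.
                media_marker: unsafe { CStr::from_ptr(c_params.media_marker) }.to_owned(),
            }
        }
    }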
llama-cpp-2/src/mtmd.rs
Outdated
        context.use_gpu = params.use_gpu;
        context.print_timings = params.print_timings;
        context.n_threads = params.n_threads;
        context.media_marker = params.media_marker.as_ptr();
Destructure params so that you get a compiler error here if new fields are added, ensuring we update this impl.
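For illustration, the kind of destructuring being asked for might look like this (a sketch, assuming the four fields shown in the diff above and not the actual change in the PR):

    // Destructure so that adding a field to MtmdContextParams without updating
    // this conversion becomes a compile error instead of a silent omission.
    let MtmdContextParams {
        use_gpu,
        print_timings,
        n_threads,
        media_marker,
    } = &params;
    context.use_gpu = *use_gpu;
    context.print_timings = *print_timings;
    context.n_threads = *n_threads;
    context.media_marker = media_marker.as_ptr();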
llama-cpp-2/src/mtmd.rs
Outdated
        if context.is_null() {
            return Err(MtmdInitError::NullResult);
        }
redundant check.
Removed in d025465
llama-cpp-2/src/mtmd.rs
Outdated
        bitmaps: &[&MtmdBitmap],
    ) -> Result<MtmdInputChunks, MtmdTokenizeError> {
        let chunks = MtmdInputChunks::new();
        let text_cstring = CString::new(text.text).unwrap_or_default();
Silently eating the error here seems like it could be improved on. I would return an Err if MtmdInputText can contain invalid C strings, or panic if containing valid C strings is an invariant of MtmdInputText (and document the invariant if that isn't done yet).
Thanks, that's a good point. I added an explicit MtmdTokenizeError::CStringError now: d025465
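A sketch of what propagating the error instead of swallowing it could look like (the variant name CStringError is taken from the comment above; the thiserror usage and exact shape of the enum are assumptions, not the actual commit):

    use std::ffi::{CString, NulError};

    #[derive(Debug, thiserror::Error)]
    pub enum MtmdTokenizeError {
        /// The input text contained an interior NUL byte and could not be
        /// converted into a C string.
        #[error("failed to convert input text to a CString: {0}")]
        CStringError(#[from] NulError),
        // ... other variants elided
    }

    fn to_cstring(text: &str) -> Result<CString, MtmdTokenizeError> {
        // Propagate interior-NUL errors instead of substituting an empty string.
        Ok(CString::new(text)?)
    }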
llama-cpp-2/src/mtmd.rs
Outdated
unsafe impl Send for MtmdContext {}
unsafe impl Sync for MtmdContext {}
These impls need some sort of comment on why this is safe.
Good catch, I actually meant MtmdContextParams, and that is already Send + Sync. As far as I can see, mtmd_context is not thread safe. Removed in d025465
llama-cpp-2/src/mtmd.rs
Outdated
        unsafe { CStr::from_ptr(ptr) }
            .to_string_lossy()
            .into_owned()
            .into()
nit: into_owned + into makes this chain somewhat opaque about what it does. A named variable or explicit types would be nice.
Cleaned it up in d025465
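For illustration, a more explicit spelling of the chain quoted above (a sketch; the target of the final .into() is assumed here to be Option<String>, which is not confirmed by the PR):

    // SAFETY: `ptr` is assumed to be a valid, NUL-terminated string owned by
    // llama.cpp for at least the duration of this call.
    let c_str: &std::ffi::CStr = unsafe { std::ffi::CStr::from_ptr(ptr) };
    // Lossily convert to an owned Rust String, replacing invalid UTF-8.
    let text: String = c_str.to_string_lossy().into_owned();
    Some(text)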
        if chunk_ptr.is_null() {
            None
        } else {
            // Note: We don't own this chunk, it's owned by the chunks collection
should this return a reference then?
MtmdInputChunk is currently just a container for a llama.cpp mtmd_input_chunk pointer. We do not keep track of the chunks separately but just hold a pointer to the mtmd_input_chunks. Therefore we get the pointer to the llama.cpp chunk here and wrap it in a MtmdInputChunk. The owned field is only there because there is also a way to create an input chunk manually and then encode that one; in that case the underlying mtmd_input_chunk should be dropped when the container is dropped.
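A rough sketch of the ownership scheme described above (the struct layout and the free call mtmd_input_chunk_free are assumptions about the underlying C API, not code from this PR):

    use std::ptr::NonNull;

    pub struct MtmdInputChunk {
        chunk: NonNull<llama_cpp_sys_2::mtmd_input_chunk>,
        /// True only for chunks created manually by the user. Chunks handed out
        /// by MtmdInputChunks stay owned by that collection and must not be
        /// freed here.
        owned: bool,
    }

    impl Drop for MtmdInputChunk {
        fn drop(&mut self) {
            if self.owned {
                // Hypothetical free call; only manually created chunks are freed.
                unsafe { llama_cpp_sys_2::mtmd_input_chunk_free(self.chunk.as_ptr()) };
            }
        }
    }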
This looks good code-wise. Thanks for the really well documented and thought out PR.
I left some comments that should be addressed, but nothing major.
My main concern is that maintaining this will prevent updates to the main llama.h bindings as I expect this to break often as llama.cpp moves forward.
If ALL this code (bindings, wrappers, the addition to our stub header file, etc.) could be behind the feature flag, so that I can update the library and break multi-modal support as long as I do not use the feature, that would be great.
I will merge this provided that breakage in mtmd does not block updates to the submodule.
Again, thanks for the PR - this is great.
Hi, not sure if that helps, but I tried running this with Qwen2.5-Omni.
To get the files:
wget https://huggingface.co/Mungert/Qwen2.5-Omni-3B-GGUF/resolve/main/Qwen2.5-Omni-3B-q4_k_m.gguf
wget https://huggingface.co/Mungert/Qwen2.5-Omni-3B-GGUF/resolve/main/Qwen2.5-Omni-3B-q8_0.mmproj
Thanks for reporting, @haixuanTao, I have indeed only tried it with Gemma models. I will take a look.
No worries. I think that text-only runs on the llama-cpp-python version.
* Remove unsafe Send, Sync
* Cleanup error handling
* Use default mtmd_context directly
Signed-off-by: Dennis Keck <[email protected]>
The issue was that the default context length was too short and I did not pass the arg also to the … Let me know if there are any other issues, @haixuanTao
@MarcusDunn, thanks for the thorough review. I think I addressed all your comments in d025465 and f149f11. Please let me know if there are still mtmd dependencies that I might have missed. I assume that the …
No worries about the include. I'll take a look this weekend.
    /// Text input chunk
    Text = llama_cpp_sys_2::MTMD_INPUT_CHUNK_TYPE_TEXT as isize,
    /// Image input chunk
    Image = llama_cpp_sys_2::MTMD_INPUT_CHUNK_TYPE_IMAGE as isize,
    /// Audio input chunk
    Audio = llama_cpp_sys_2::MTMD_INPUT_CHUNK_TYPE_AUDIO as isize,
}
is the cast required?
When the enum uses a u32 representation it's no longer required: #819
Thanks, good point
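For illustration, the cast-free form referenced in #819 might look roughly like this (a sketch, assuming the sys constants are plain u32 values; not necessarily the exact code in #819):

    /// Input chunk types, using the same representation as the C enum so the
    /// discriminants can reuse the sys constants without casting.
    #[repr(u32)]
    #[derive(Debug, Clone, Copy, PartialEq, Eq)]
    pub enum MtmdInputChunkType {
        /// Text input chunk
        Text = llama_cpp_sys_2::MTMD_INPUT_CHUNK_TYPE_TEXT,
        /// Image input chunk
        Image = llama_cpp_sys_2::MTMD_INPUT_CHUNK_TYPE_IMAGE,
        /// Audio input chunk
        Audio = llama_cpp_sys_2::MTMD_INPUT_CHUNK_TYPE_AUDIO,
    }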
            llama_cpp_sys_2::MTMD_INPUT_CHUNK_TYPE_TEXT => MtmdInputChunkType::Text,
            llama_cpp_sys_2::MTMD_INPUT_CHUNK_TYPE_IMAGE => MtmdInputChunkType::Image,
            llama_cpp_sys_2::MTMD_INPUT_CHUNK_TYPE_AUDIO => MtmdInputChunkType::Audio,
            _ => panic!("Unknown MTMD input chunk type"),
Include the chunk_type in the panic message.
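For illustration, something along these lines (a sketch, assuming chunk_type is the matched integer value; not necessarily what #819 does):

    _ => panic!("Unknown MTMD input chunk type: {chunk_type}"),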
Fixed in #819
    /// Get audio bitrate in Hz (e.g., 16000 for Whisper).
    /// Returns -1 if audio is not supported.
    #[must_use]
    pub fn get_audio_bitrate(&self) -> i32 {
        unsafe { llama_cpp_sys_2::mtmd_get_audio_bitrate(self.context.as_ptr()) }
    }
I would prefer if we returned an Option<u32 / i32> for this and dealt with the -1 case, to keep it a bit more rusty.
Fixed in #819
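For illustration, the more Rust-flavoured signature being suggested could look like this (a sketch only; the actual change lives in #819):

    /// Get the audio sample rate in Hz (e.g. 16000 for Whisper), or `None`
    /// if the loaded projector does not support audio input.
    #[must_use]
    pub fn audio_bitrate(&self) -> Option<u32> {
        let rate = unsafe { llama_cpp_sys_2::mtmd_get_audio_bitrate(self.context.as_ptr()) };
        // The C API returns -1 when audio is not supported; map that sentinel
        // (and any other negative value) to None.
        u32::try_from(rate).ok()
    }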
LGTM. Have the non-gemma models been tested? (I will merge regardless as we can always fix that later).
Feel free to address the comments, I'll merge a couple days from now to see if others want to give this branch some testing before I merge.
Can confirm the model mentioned above is now working wonders! Thanks a lot!
More of an optimisation thing, but are we using the maximum number of batches for encoding/decoding images and audio?
self.n_past = chunks.eval_chunks(&self.mtmd_ctx, context, 0, 0, 1, true)?; // batch size of image decoding
It seems that this could be higher than 1, but I'm not an expert on this. Related: ggml-org/llama.cpp#14527; if you search through the issue, he's actually decoding at 64 n_tokens_batch.
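For illustration, if the fifth argument of eval_chunks is indeed the decode batch size (an assumption based on the comment above and the linked issue), a larger batch would simply be:

    // Decode image/audio chunks 64 tokens per batch instead of 1
    // (hypothetical value, following the linked llama.cpp issue).
    self.n_past = chunks.eval_chunks(&self.mtmd_ctx, context, 0, 0, 64, true)?;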
FYI, I tried to pass image+audio at the same time and it didn't work, as there was only one marker within the prompt :D Not an urgent bug.
I am able to batch by 10 using qwen-omni, reducing the decoding time from 120 ms -> 50 ms, but I truly have no idea why. Most of the time when giving an image:
image slice encoded in 326 ms
decoding image batch 1/3, n_tokens_batch = 20
image decoded (batch 1/3) in 52 ms
decoding image batch 2/3, n_tokens_batch = 20
image decoded (batch 2/3) in 53 ms
decoding image batch 3/3, n_tokens_batch = 20
image decoded (batch 3/3) in 53 ms
get_logits_ith: invalid logits id 0, reason: batch.logits[0] != true
This stage took a massive amount of time on Metal. Don't know if it works correctly on CUDA.
It's super fast on CUDA. But it's a llama.cpp thing, apparently.
Related: ggml-org/llama.cpp#15426
Thanks. I'm experiencing exactly the same issue and have tried many things; I thought it was my fault. I've been working on it since yesterday. I tried running it on different threads, but no matter what I did, my optimization only improved it by 2-3 seconds at most. Also, I was doing all the image/video processing on the GPU side (using wgpu).
Have you tried lowering the resolution of the image?
I did that for my tests/debugs (like 64x32, etc.). It only gives a 1-2 second improvement and doesn't solve the problem. I also have a Mac M3 16GB; a single output takes 1 min on my side.
OK, I have an M4 Pro and it's about 30 s, more or less. On a cloud 5090 with CUDA, the encoding is 120 ms.
Thanks, @haixuanTao, for spotting this. I fixed the issue in #819 and added a new CLI arg for it. It should now work with arbitrary batch sizes. Let me know if you still see the invalid logits error. Unfortunately I can't help with the Metal performance problems as I don't own a compatible device. It sounds like an issue in llama.cpp, though.
Add bindings for the libmtmd multimodality support of llama-cpp. This allows using models such as Gemma3 with image or audio input. The bindings are only created with a new mtmd feature. To explain the usage, I added a simple example loosely based on the mtmd-cli.cpp.
To run the mtmd example, you first need to download the model gguf file and the multimodal projection file; e.g. for Gemma3 you may use: …
To then run the example on CPU, provide an image file my_image.jpg and run: …
Closes #744