Multimodality improvements #819
Conversation
Signed-off-by: Dennis Keck <[email protected]>
Force-pushed from 744750f to 40f398e
I mean I have tried multiple batch sizes, but the actual batch size that works for image or audio media varies, and some values make the predictions fail. 64 typically does not work for me; 10 works, but only on images with Qwen2.5-Omni. I wonder if it's something on our end or something different.
I feel like other people just put everything into one batch, but I couldn't make that work :S No expert here, like probably a lot of us :D
  self.context.as_ptr(),
  chunks.chunks.as_ptr(),
- &input_text,
+ &raw const input_text,
til
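For anyone else seeing &raw const for the first time: it creates a raw pointer directly, without first taking a shared reference (stable since Rust 1.82). A minimal, self-contained sketch, unrelated to the actual crate code:

```rust
fn main() {
    let input_text = String::from("hello");
    // `&raw const` yields a *const String directly; the older spelling
    // `&input_text as *const String` first creates a &String and then casts it.
    let ptr: *const String = &raw const input_text;
    unsafe { println!("{}", *ptr) };
}
```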
Do you see an error or are the results just wrong? I tried Gemma-3-4b with this branch and had positive results with a batch size of 1, 16 and 256. The only thing I noticed is that the batch size needs to be a divisor of the number of visual tokens per image; for Gemma that's 256. If one does not pick a divisor, decoding fails with an error.
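A possible workaround on the caller side, as a minimal sketch (not part of this PR; 256 is the Gemma-3 figure from the comment above): clamp the requested batch size to the largest divisor of the per-image token count.

```rust
/// Hypothetical helper: largest batch size <= `requested` that divides
/// `tokens_per_image`, so image chunks always decode in whole batches.
fn effective_batch_size(requested: usize, tokens_per_image: usize) -> usize {
    (1..=requested.min(tokens_per_image))
        .rev()
        .find(|bs| tokens_per_image % bs == 0)
        .unwrap_or(1)
}

fn main() {
    assert_eq!(effective_batch_size(64, 256), 64);  // 64 divides 256, keep it
    assert_eq!(effective_batch_size(100, 256), 64); // round down to a divisor
    assert_eq!(effective_batch_size(10, 256), 8);   // 10 does not divide 256
}
```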
llama-cpp-2/src/mtmd.rs
Outdated
  #[repr(u32)]
  pub enum MtmdInputChunkType {
      /// Text input chunk
-     Text = llama_cpp_sys_2::MTMD_INPUT_CHUNK_TYPE_TEXT as isize,
+     Text = llama_cpp_sys_2::MTMD_INPUT_CHUNK_TYPE_TEXT,
      /// Image input chunk
-     Image = llama_cpp_sys_2::MTMD_INPUT_CHUNK_TYPE_IMAGE as isize,
+     Image = llama_cpp_sys_2::MTMD_INPUT_CHUNK_TYPE_IMAGE,
      /// Audio input chunk
-     Audio = llama_cpp_sys_2::MTMD_INPUT_CHUNK_TYPE_AUDIO as isize,
+     Audio = llama_cpp_sys_2::MTMD_INPUT_CHUNK_TYPE_AUDIO,
Just saw the GitHub Actions run - this does not work on Windows due to different bindgen behavior.
I'll take another look tomorrow or revert to as isize.
You can get around this with as _, but if there's a corresponding bindgen type, I would prefer that.
Thanks, TIL. Added as _ in 94a83e9.
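For reference, the resolved discriminants presumably end up looking roughly like this (reconstructed from the diff and this thread, not copied from commit 94a83e9):

```rust
#[repr(u32)]
pub enum MtmdInputChunkType {
    // `as _` infers the cast target from the #[repr(u32)] discriminant type,
    // so this compiles whether bindgen emitted the constants as u32 or as i32
    // (as it apparently does on Windows).
    Text = llama_cpp_sys_2::MTMD_INPUT_CHUNK_TYPE_TEXT as _,
    Image = llama_cpp_sys_2::MTMD_INPUT_CHUNK_TYPE_IMAGE as _,
    Audio = llama_cpp_sys_2::MTMD_INPUT_CHUNK_TYPE_AUDIO as _,
}
```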
Yeah, I get the exact same thing, although it seems that llama.cpp should be able to handle a token count that is not divisible by the batch size. I also haven't figured out a way to find out how many tokens an image produces, so I can only guess the divisor... Also, did you try audio? On Qwen-Omni the audio and image divisors are not the same. But I don't think it should be an issue for this PR per se.
examples/mtmd/src/mtmd.rs
Outdated
#[arg(short = 't', long = "threads", value_name = "N", default_value = "4")]
pub n_threads: i32,
/// Number of tokens to process in a batch during eval chunks
#[arg(long = "batch-size", value_name = "b", default_value = "64")]
Can we keep the default batch size at 1?
Setting the default to 1 in 94a83e9
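So the argument definition presumably becomes something like the following (a reconstruction, with the field name guessed; only the default value changes relative to the diff above):

```rust
/// Number of tokens to process in a batch during eval chunks
#[arg(long = "batch-size", value_name = "b", default_value = "1")]
pub batch_size: i32,
```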
Signed-off-by: Dennis Keck <[email protected]>
Force-pushed from 9620c34 to 94a83e9
I just tried audio & images (both together and separately) with a CUDA device:

RUST_BACKTRACE=1 cargo run --features cuda --example mtmd -- \
  -m ~/Downloads/Qwen2.5-Omni-7B-Q4_K_M.gguf \
  --mmproj ~/Downloads/mmproj-Qwen2.5-Omni-7B-f16.gguf \
  --image ~/Downloads/duck.jpg \
  --audio ~/Downloads/cuckoo.mp3 \
  --prompt "What is in the picture? What is that sound? The image: <start_of_image>The sound: <start_of_image>" \
  --no-mmproj-offload --marker "<start_of_image>" --threads 18 --batch-size 16

Result:
It does not work on your machine?
@fellhorn Would you like this to be merged as is? I have no issues on my end, but it seems like there's still active discussion.
Basically it can either follow the prompt or the audio. So if the audio is something like "how many ducks are there in the image?", it should work. But it seems that the audio overwrites the prompt.
I also tried a couple of prompts where audio and image were not related. It still worked fairly well and described them separately from each other. What I noticed though: Qwen2.5-Omni seems rather weak at answering complex questions about the audio, most likely because the vision and audio encoders only provide a compressed view to the model. @haixuanTao Have you tried your prompts with the Qwen demo on the HF website or with vLLM? It would be interesting to see whether it's an issue with the model or one with llama.cpp or the Rust bindings.
@MarcusDunn I think we can go ahead with the PR and merge it. The discussion is not about the enhancements here. If some new changes come out of it, I'll create a separate PR.
So far I have only tried simple audio translation and image grounding. As it's not an instruct model, it's not going to be very good at conversation, I think. In general it behaves as expected, except for the batch size, which I feel might be more constrained than what other people are using.
Addresses a few issues in the mtmd example and wrappers that remained from #790: