
Conversation

@fellhorn (Contributor) commented Sep 1, 2025

Fixes a few issues in the mtmd example and wrappers that remained from #790:

  • fixes the batch size param (thanks @haixuanTao for spotting this)
  • addresses remaining nits from @MarcusDunn
  • clippy cleanup

@fellhorn mentioned this pull request Sep 1, 2025
@fellhorn force-pushed the dennis/mtmd-improvements branch from 744750f to 40f398e on September 1, 2025 at 22:35
@haixuanTao

I mean, I have tried multiple batch sizes, but it seems that the actual batch size that works for image or audio media can sometimes make the predictions fail. 64 typically does not work for me; 10 works, but only on images with Qwen2.5-Omni.

I wonder if it's something on our end or something else.

@haixuanTao

I feel like other people just put everything into one batch, but I couldn't make it work :S

I'm no expert, like probably a lot of us :D

```diff
 self.context.as_ptr(),
 chunks.chunks.as_ptr(),
-&input_text,
+&raw const input_text,
```
Contributor

til
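For context: `&raw const place` (stabilized in Rust 1.82) builds a `*const T` directly from a place expression, without materializing an intermediate `&T` reference and the validity assumptions that come with one. A minimal standalone sketch:

```rust
fn main() {
    let input_text = 42u32;
    // `&input_text as *const u32` would create a `&u32` reference first;
    // `&raw const` produces the raw pointer directly.
    let p: *const u32 = &raw const input_text;
    // Safe to read here: `input_text` is alive and initialized.
    unsafe { assert_eq!(*p, 42) };
}
```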

@fellhorn (Contributor, Author) commented Sep 2, 2025

> I mean, I have tried multiple batch sizes, but it seems that the actual batch size that works for image or audio media can sometimes make the predictions fail. 64 typically does not work for me; 10 works, but only on images with Qwen2.5-Omni.
>
> I wonder if it's something on our end or something else.

Do you see an error, or are the results just wrong? I tried Gemma-3-4b with this branch and had positive results with a batch size of 1, 16, and 256. The only thing I noticed is that the batch size needs to be a divisor of the number of visual tokens per image; for Gemma that's 256.

If one does not pick a divisor, decoding fails with:

```
init: the tokens of sequence 0 in the input batch have inconsistent sequence positions:
 - the last position stored in the memory module of the context (i.e. the KV cache) for sequence 0 is X = 260
 - the tokens for sequence 0 in the input batch have a starting position of Y = 267
 it is required that the sequence positions remain consecutive: Y = X + 1
decode: failed to initialize batch
llama_decode: failed to decode, ret = -1
failed to decode text
failed to eval chunk 2
Error: EvalFailure(-1)
```
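The log suggests that when the batch size does not divide an image's token count, the KV cache ends up at a position that doesn't line up with the next batch's starting position. A minimal sketch of a guard (a hypothetical helper, not part of this PR or the crate):

```rust
/// Hypothetical helper: largest batch size <= `requested` that evenly
/// divides `tokens_per_image` (256 for Gemma-3), so evaluating an image
/// chunk never leaves the KV cache at a non-consecutive position.
fn clamp_batch_size(requested: i32, tokens_per_image: i32) -> i32 {
    (1..=requested.min(tokens_per_image))
        .rev()
        .find(|b| tokens_per_image % b == 0)
        .unwrap_or(1)
}

fn main() {
    assert_eq!(clamp_batch_size(64, 256), 64); // 64 divides 256
    assert_eq!(clamp_batch_size(100, 256), 64); // largest divisor <= 100
    assert_eq!(clamp_batch_size(10, 256), 8); // 10 does not divide 256
}
```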

Comment on lines 32 to 39
```diff
 #[repr(u32)]
 pub enum MtmdInputChunkType {
     /// Text input chunk
-    Text = llama_cpp_sys_2::MTMD_INPUT_CHUNK_TYPE_TEXT as isize,
+    Text = llama_cpp_sys_2::MTMD_INPUT_CHUNK_TYPE_TEXT,
     /// Image input chunk
-    Image = llama_cpp_sys_2::MTMD_INPUT_CHUNK_TYPE_IMAGE as isize,
+    Image = llama_cpp_sys_2::MTMD_INPUT_CHUNK_TYPE_IMAGE,
     /// Audio input chunk
-    Audio = llama_cpp_sys_2::MTMD_INPUT_CHUNK_TYPE_AUDIO as isize,
+    Audio = llama_cpp_sys_2::MTMD_INPUT_CHUNK_TYPE_AUDIO,
```
Contributor Author

Just saw the GitHub Actions run: this does not work on Windows due to different bindgen behavior.

I'll take another look tomorrow or revert to `as isize`.

Contributor

You can get around this with `as _`. If there's a corresponding bindgen type, I would prefer that.

Contributor Author

Thx, TIL. Added `as _` in 94a83e9.
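For reference, a minimal standalone sketch of the portability issue, with stand-in constants (the real ones come from bindgen, which may emit a different integer type for C enums per platform):

```rust
// Stand-ins for the generated constants; on one platform bindgen may type
// these as u32, on another as i32, so a hard-coded `as isize` or bare
// assignment can clash with the enum's declared repr.
mod sys_stub {
    pub const MTMD_INPUT_CHUNK_TYPE_TEXT: u32 = 0;
    pub const MTMD_INPUT_CHUNK_TYPE_IMAGE: u32 = 1;
    pub const MTMD_INPUT_CHUNK_TYPE_AUDIO: u32 = 2;
}

#[allow(dead_code)]
#[repr(u32)]
pub enum MtmdInputChunkType {
    // `as _` lets the compiler infer the discriminant type (u32 here)
    // regardless of the integer type bindgen chose for the constant.
    Text = sys_stub::MTMD_INPUT_CHUNK_TYPE_TEXT as _,
    Image = sys_stub::MTMD_INPUT_CHUNK_TYPE_IMAGE as _,
    Audio = sys_stub::MTMD_INPUT_CHUNK_TYPE_AUDIO as _,
}

fn main() {
    assert_eq!(MtmdInputChunkType::Image as u32, 1);
}
```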

@haixuanTao

>> I mean, I have tried multiple batch sizes, but it seems that the actual batch size that works for image or audio media can sometimes make the predictions fail. 64 typically does not work for me; 10 works, but only on images with Qwen2.5-Omni.
>>
>> I wonder if it's something on our end or something else.
>
> Do you see an error, or are the results just wrong? I tried Gemma-3-4b with this branch and had positive results with a batch size of 1, 16, and 256. The only thing I noticed is that the batch size needs to be a divisor of the number of visual tokens per image; for Gemma that's 256.
>
> If one does not pick a divisor, decoding fails with:
>
> ```
> init: the tokens of sequence 0 in the input batch have inconsistent sequence positions:
>  - the last position stored in the memory module of the context (i.e. the KV cache) for sequence 0 is X = 260
>  - the tokens for sequence 0 in the input batch have a starting position of Y = 267
>  it is required that the sequence positions remain consecutive: Y = X + 1
> decode: failed to initialize batch
> llama_decode: failed to decode, ret = -1
> failed to decode text
> failed to eval chunk 2
> Error: EvalFailure(-1)
> ```

Yeah, I get the exact same thing, although it seems that llama.cpp should be able to handle a token count that isn't divisible by the batch size.

I also haven't figured out a way to know what the image token count is, which leaves me guessing the divisor...

Also, did you try audio? Because on Qwen-Omni, the audio and image divisors are not the same.

But I don't think it should be an issue for this PR per se.
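On knowing the image token count: llama.cpp's mtmd.h exposes a per-chunk token count (mtmd_input_chunk_get_n_tokens), so in principle the number can be read from the tokenized chunks rather than guessed. A self-contained sketch with stand-in types (the names here are illustrative, not the crate's actual API):

```rust
// Stand-ins for the wrapper types; real bindings would read the count via
// llama.cpp's mtmd_input_chunk_get_n_tokens.
#[derive(PartialEq)]
enum ChunkType { Text, Image }

struct Chunk { chunk_type: ChunkType, n_tokens: usize }

/// Token counts of all image chunks; a safe batch size must divide each.
fn image_token_counts(chunks: &[Chunk]) -> Vec<usize> {
    chunks.iter()
        .filter(|c| c.chunk_type == ChunkType::Image)
        .map(|c| c.n_tokens)
        .collect()
}

fn main() {
    let chunks = vec![
        Chunk { chunk_type: ChunkType::Text, n_tokens: 12 },
        Chunk { chunk_type: ChunkType::Image, n_tokens: 256 },
    ];
    assert_eq!(image_token_counts(&chunks), vec![256]);
}
```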

```rust
#[arg(short = 't', long = "threads", value_name = "N", default_value = "4")]
pub n_threads: i32,
/// Number of tokens to process in a batch during eval chunks
#[arg(long = "batch-size", value_name = "b", default_value = "64")]
```


Can we keep the default batch size at 1?

Contributor Author

Setting the default to 1 in 94a83e9

@fellhorn force-pushed the dennis/mtmd-improvements branch from 9620c34 to 94a83e9 on September 4, 2025 at 08:43
@fellhorn (Contributor, Author) commented Sep 4, 2025

> Also, did you try audio? Because on Qwen-Omni, the audio and image divisors are not the same.

I just tried audio & images (both together and separately) with a CUDA device:

```
RUST_BACKTRACE=1 cargo run --features cuda --example mtmd -- -m ~/Downloads/Qwen2.5-Omni-7B-Q4_K_M.gguf --mmproj ~/Downloads/mmproj-Qwen2.5-Omni-7B-f16.gguf --image ~/Downloads/duck.jpg --audio ~/Downloads/cuckoo.mp3 --prompt "What is in the picture? What is that sound? The image: <start_of_image>The sound: <start_of_image>" --no-mmproj-offload --marker "<start_of_image>" --threads 18 --batch-size 16
```

Result:

```
...
The picture shows a white duck with a yellow beak swimming in dark water. The duck appears to be floating calmly on the surface. The sound in the background is a series of bird calls, which could be from various bird species in a natural setting.
```

Does it not work on your machine?

@MarcusDunn (Contributor)

@fellhorn Would you like this to be merged as-is? I have no issues on my end, but it seems like there's still active discussion.

@haixuanTao

>> Also, did you try audio? Because on Qwen-Omni, the audio and image divisors are not the same.
>
> I just tried audio & images (both together and separately) with a CUDA device:
>
> ```
> RUST_BACKTRACE=1 cargo run --features cuda --example mtmd -- -m ~/Downloads/Qwen2.5-Omni-7B-Q4_K_M.gguf --mmproj ~/Downloads/mmproj-Qwen2.5-Omni-7B-f16.gguf --image ~/Downloads/duck.jpg --audio ~/Downloads/cuckoo.mp3 --prompt "What is in the picture? What is that sound? The image: <start_of_image>The sound: <start_of_image>" --no-mmproj-offload --marker "<start_of_image>" --threads 18 --batch-size 16
> ```
>
> Result:
>
> ```
> ...
> The picture shows a white duck with a yellow beak swimming in dark water. The duck appears to be floating calmly on the surface. The sound in the background is a series of bird calls, which could be from various bird species in a natural setting.
> ```
>
> Does it not work on your machine?

Basically, it can either follow the prompt or the audio. So if the audio were something like "how many ducks are there in the image", it should work.

But it seems that the audio overwrites the prompt.

@fellhorn (Contributor, Author) commented Sep 4, 2025

> Basically, it can either follow the prompt or the audio. So if the audio were something like "how many ducks are there in the image", it should work.
>
> But it seems that the audio overwrites the prompt.

I also tried a couple of prompts where audio and image were not related. It still worked fairly well and described them separately from each other. What I noticed, though: Qwen2.5-Omni seems rather weak at answering complex questions about the audio, most likely because the vision and audio encoders only provide a compressed view to the model.

@haixuanTao Have you tried your prompts with the Qwen demo on the HF website or with vLLM? It would be interesting to see whether it's an issue with the model, with llama.cpp, or with the Rust bindings.

@fellhorn (Contributor, Author) commented Sep 4, 2025

> @fellhorn Would you like this to be merged as-is? I have no issues on my end, but it seems like there's still active discussion.

@MarcusDunn I think we can go ahead and merge the PR. The discussion is not about the enhancements here; if new changes come out of it, I'll create a separate PR.

@MarcusDunn merged commit 4063f55 into utilityai:main on Sep 4, 2025 (3 of 5 checks passed)
@haixuanTao

>> Basically, it can either follow the prompt or the audio. So if the audio were something like "how many ducks are there in the image", it should work.
>>
>> But it seems that the audio overwrites the prompt.
>
> I also tried a couple of prompts where audio and image were not related. It still worked fairly well and described them separately from each other. What I noticed, though: Qwen2.5-Omni seems rather weak at answering complex questions about the audio, most likely because the vision and audio encoders only provide a compressed view to the model.
>
> @haixuanTao Have you tried your prompts with the Qwen demo on the HF website or with vLLM? It would be interesting to see whether it's an issue with the model, with llama.cpp, or with the Rust bindings.

So far I have only tried simple audio translation and image grounding. As it's not an instruct model, I don't think it's going to be very good at conversation.

I mean, in general it behaves as expected, except for the batch size, which I feel might be more of an issue here than it is for other people.
