
Conversation

@fellhorn (Contributor) commented Sep 1, 2025

Fixes a few issues in the mtmd example and wrappers that remained from #790:

  • fixes the batch size param (thanks @haixuanTao for spotting this)
  • addresses remaining nits from @MarcusDunn
  • clippy cleanup

@fellhorn mentioned this pull request Sep 1, 2025
@fellhorn force-pushed the dennis/mtmd-improvements branch from 744750f to 40f398e on September 1, 2025 at 22:35
@haixuanTao

I mean, I have tried multiple batch sizes, but it seems that the actual batch size that works for image or audio media can sometimes make the predictions fail. 64 typically does not work for me; 10 works, but only on images with Qwen2.5-Omni.

I wonder if it's something on our end or something else.

@haixuanTao

I feel like other people just put everything into one batch, but I couldn't make it work :S

I'm no expert, like probably a lot of us :D

```diff
 self.context.as_ptr(),
 chunks.chunks.as_ptr(),
-&input_text,
+&raw const input_text,
```
Contributor

til
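For context: `&raw const place` (stabilized in Rust 1.82) builds a `*const T` directly from a place expression, without materializing an intermediate `&T` reference and the validity assumptions that come with one. A minimal standalone sketch:

```rust
fn main() {
    let input_text = 42u32;
    // `&input_text as *const u32` would create a `&u32` reference first;
    // `&raw const` produces the raw pointer directly.
    let p: *const u32 = &raw const input_text;
    // Safe to read here: `input_text` is alive and initialized.
    unsafe { assert_eq!(*p, 42) };
}
```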

@fellhorn (Contributor, Author) commented Sep 2, 2025

> I mean, I have tried multiple batch sizes, but it seems that the actual batch size that works for image or audio media can sometimes make the predictions fail. 64 typically does not work for me; 10 works, but only on images with Qwen2.5-Omni.
>
> I wonder if it's something on our end or something else.

Do you see an error, or are the results just wrong? I tried Gemma-3-4b with this branch and had positive results with a batch size of 1, 16, and 256. The only thing I noticed is that the batch size needs to be a divisor of the number of visual tokens per image; for Gemma that's 256.

If one does not pick a divisor, decoding fails with:

```
init: the tokens of sequence 0 in the input batch have inconsistent sequence positions:
 - the last position stored in the memory module of the context (i.e. the KV cache) for sequence 0 is X = 260
 - the tokens for sequence 0 in the input batch have a starting position of Y = 267
 it is required that the sequence positions remain consecutive: Y = X + 1
decode: failed to initialize batch
llama_decode: failed to decode, ret = -1
failed to decode text
failed to eval chunk 2
Error: EvalFailure(-1)
```
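The log suggests that when the batch size does not divide an image's token count, the KV cache ends up at a position that doesn't line up with the next batch's starting position. A minimal sketch of a guard (a hypothetical helper, not part of this PR or the crate):

```rust
/// Hypothetical helper: largest batch size <= `requested` that evenly
/// divides `tokens_per_image` (256 for Gemma-3), so evaluating an image
/// chunk never leaves the KV cache at a non-consecutive position.
fn clamp_batch_size(requested: i32, tokens_per_image: i32) -> i32 {
    (1..=requested.min(tokens_per_image))
        .rev()
        .find(|b| tokens_per_image % b == 0)
        .unwrap_or(1)
}

fn main() {
    assert_eq!(clamp_batch_size(64, 256), 64); // 64 divides 256
    assert_eq!(clamp_batch_size(100, 256), 64); // largest divisor <= 100
    assert_eq!(clamp_batch_size(10, 256), 8); // 10 does not divide 256
}
```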

Comment on lines 32 to 39
```diff
 #[repr(u32)]
 pub enum MtmdInputChunkType {
     /// Text input chunk
-    Text = llama_cpp_sys_2::MTMD_INPUT_CHUNK_TYPE_TEXT as isize,
+    Text = llama_cpp_sys_2::MTMD_INPUT_CHUNK_TYPE_TEXT,
     /// Image input chunk
-    Image = llama_cpp_sys_2::MTMD_INPUT_CHUNK_TYPE_IMAGE as isize,
+    Image = llama_cpp_sys_2::MTMD_INPUT_CHUNK_TYPE_IMAGE,
     /// Audio input chunk
-    Audio = llama_cpp_sys_2::MTMD_INPUT_CHUNK_TYPE_AUDIO as isize,
+    Audio = llama_cpp_sys_2::MTMD_INPUT_CHUNK_TYPE_AUDIO,
```
Contributor Author

Just saw the GitHub Actions run: this does not work on Windows due to different bindgen behavior.

I'll take another look tomorrow or revert to `as isize`.

Contributor

You can get around this with `as _`. If there's a corresponding bindgen type, I would prefer that.

Contributor Author

Thx, TIL. Added `as _` in 94a83e9.
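For reference, a minimal standalone sketch of the portability issue, with stand-in constants (the real ones come from bindgen, which may emit a different integer type for C enums per platform):

```rust
// Stand-ins for the generated constants; on one platform bindgen may type
// these as u32, on another as i32, so a hard-coded `as isize` or bare
// assignment can clash with the enum's declared repr.
mod sys_stub {
    pub const MTMD_INPUT_CHUNK_TYPE_TEXT: u32 = 0;
    pub const MTMD_INPUT_CHUNK_TYPE_IMAGE: u32 = 1;
    pub const MTMD_INPUT_CHUNK_TYPE_AUDIO: u32 = 2;
}

#[allow(dead_code)]
#[repr(u32)]
pub enum MtmdInputChunkType {
    // `as _` lets the compiler infer the discriminant type (u32 here)
    // regardless of the integer type bindgen chose for the constant.
    Text = sys_stub::MTMD_INPUT_CHUNK_TYPE_TEXT as _,
    Image = sys_stub::MTMD_INPUT_CHUNK_TYPE_IMAGE as _,
    Audio = sys_stub::MTMD_INPUT_CHUNK_TYPE_AUDIO as _,
}

fn main() {
    assert_eq!(MtmdInputChunkType::Image as u32, 1);
}
```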

@haixuanTao

>> I mean, I have tried multiple batch sizes, but it seems that the actual batch size that works for image or audio media can sometimes make the predictions fail. 64 typically does not work for me; 10 works, but only on images with Qwen2.5-Omni.
>>
>> I wonder if it's something on our end or something else.
>
> Do you see an error, or are the results just wrong? I tried Gemma-3-4b with this branch and had positive results with a batch size of 1, 16, and 256. The only thing I noticed is that the batch size needs to be a divisor of the number of visual tokens per image; for Gemma that's 256.
>
> If one does not pick a divisor, decoding fails with:
>
> ```
> init: the tokens of sequence 0 in the input batch have inconsistent sequence positions:
>  - the last position stored in the memory module of the context (i.e. the KV cache) for sequence 0 is X = 260
>  - the tokens for sequence 0 in the input batch have a starting position of Y = 267
>  it is required that the sequence positions remain consecutive: Y = X + 1
> decode: failed to initialize batch
> llama_decode: failed to decode, ret = -1
> failed to decode text
> failed to eval chunk 2
> Error: EvalFailure(-1)
> ```

Yeah, I get the exact same thing, although it seems that llama.cpp should be able to handle a token count that isn't divisible by the batch size.

I also haven't figured out a way to know what the image token count is, which leaves me guessing the divisor...

Also, did you try audio? Because on Qwen-Omni, the audio and image divisors are not the same.

But I don't think it should be an issue for this PR per se.
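On knowing the image token count: llama.cpp's mtmd.h exposes a per-chunk token count (mtmd_input_chunk_get_n_tokens), so in principle the number can be read from the tokenized chunks rather than guessed. A self-contained sketch with stand-in types (the names here are illustrative, not the crate's actual API):

```rust
// Stand-ins for the wrapper types; real bindings would read the count via
// llama.cpp's mtmd_input_chunk_get_n_tokens.
#[derive(PartialEq)]
enum ChunkType { Text, Image }

struct Chunk { chunk_type: ChunkType, n_tokens: usize }

/// Token counts of all image chunks; a safe batch size must divide each.
fn image_token_counts(chunks: &[Chunk]) -> Vec<usize> {
    chunks.iter()
        .filter(|c| c.chunk_type == ChunkType::Image)
        .map(|c| c.n_tokens)
        .collect()
}

fn main() {
    let chunks = vec![
        Chunk { chunk_type: ChunkType::Text, n_tokens: 12 },
        Chunk { chunk_type: ChunkType::Image, n_tokens: 256 },
    ];
    assert_eq!(image_token_counts(&chunks), vec![256]);
}
```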

```rust
#[arg(short = 't', long = "threads", value_name = "N", default_value = "4")]
pub n_threads: i32,
/// Number of tokens to process in a batch during eval chunks
#[arg(long = "batch-size", value_name = "b", default_value = "64")]
```


Can we keep the default batch size at 1?

Contributor Author

Setting the default to 1 in 94a83e9

@fellhorn force-pushed the dennis/mtmd-improvements branch from 9620c34 to 94a83e9 on September 4, 2025 at 08:43
@fellhorn (Contributor, Author) commented Sep 4, 2025

> Also, did you try audio? Because on Qwen-Omni, the audio and image divisors are not the same.

I just tried audio & images (both together and separately) with a CUDA device:

```
RUST_BACKTRACE=1 cargo run --features cuda --example mtmd -- -m ~/Downloads/Qwen2.5-Omni-7B-Q4_K_M.gguf --mmproj ~/Downloads/mmproj-Qwen2.5-Omni-7B-f16.gguf --image ~/Downloads/duck.jpg --audio ~/Downloads/cuckoo.mp3 --prompt "What is in the picture? What is that sound? The image: <start_of_image>The sound: <start_of_image>" --no-mmproj-offload --marker "<start_of_image>" --threads 18 --batch-size 16
```

Result:

```
...
The picture shows a white duck with a yellow beak swimming in dark water. The duck appears to be floating calmly on the surface. The sound in the background is a series of bird calls, which could be from various bird species in a natural setting.
```

Does it not work on your machine?

@MarcusDunn (Contributor)

@fellhorn Would you like this to be merged as-is? I have no issues on my end, but it seems like there's still active discussion.

@haixuanTao

>> Also, did you try audio? Because on Qwen-Omni, the audio and image divisors are not the same.
>
> I just tried audio & images (both together and separately) with a CUDA device:
>
> ```
> RUST_BACKTRACE=1 cargo run --features cuda --example mtmd -- -m ~/Downloads/Qwen2.5-Omni-7B-Q4_K_M.gguf --mmproj ~/Downloads/mmproj-Qwen2.5-Omni-7B-f16.gguf --image ~/Downloads/duck.jpg --audio ~/Downloads/cuckoo.mp3 --prompt "What is in the picture? What is that sound? The image: <start_of_image>The sound: <start_of_image>" --no-mmproj-offload --marker "<start_of_image>" --threads 18 --batch-size 16
> ```
>
> Result:
>
> ```
> ...
> The picture shows a white duck with a yellow beak swimming in dark water. The duck appears to be floating calmly on the surface. The sound in the background is a series of bird calls, which could be from various bird species in a natural setting.
> ```
>
> Does it not work on your machine?

Basically, it can either follow the prompt or the audio. So if the audio were something like "how many ducks are there in the image", it should work.

But it seems that the audio overwrites the prompt.

@fellhorn (Contributor, Author) commented Sep 4, 2025

> Basically, it can either follow the prompt or the audio. So if the audio were something like "how many ducks are there in the image", it should work.
>
> But it seems that the audio overwrites the prompt.

I also tried a couple of prompts where audio and image were not related. It still worked fairly well and described them separately from each other. What I noticed, though: Qwen2.5-Omni seems rather weak at answering complex questions about the audio, most likely because the vision and audio encoders only provide a compressed view to the model.

@haixuanTao Have you tried your prompts with the Qwen demo on the HF website or with vLLM? It would be interesting to see whether it's an issue with the model, with llama.cpp, or with the Rust bindings.

@fellhorn (Contributor, Author) commented Sep 4, 2025

> @fellhorn Would you like this to be merged as-is? I have no issues on my end, but it seems like there's still active discussion.

@MarcusDunn I think we can go ahead and merge the PR. The discussion is not about the enhancements here; if new changes come out of it, I'll create a separate PR.

@MarcusDunn merged commit 4063f55 into utilityai:main on Sep 4, 2025 (3 of 5 checks passed)
@haixuanTao

>> Basically, it can either follow the prompt or the audio. So if the audio were something like "how many ducks are there in the image", it should work.
>>
>> But it seems that the audio overwrites the prompt.
>
> I also tried a couple of prompts where audio and image were not related. It still worked fairly well and described them separately from each other. What I noticed, though: Qwen2.5-Omni seems rather weak at answering complex questions about the audio, most likely because the vision and audio encoders only provide a compressed view to the model.
>
> @haixuanTao Have you tried your prompts with the Qwen demo on the HF website or with vLLM? It would be interesting to see whether it's an issue with the model, with llama.cpp, or with the Rust bindings.

So far I have only tried simple audio translation and image grounding. As it's not an instruct model, I don't think it's going to be very good at conversation.

I mean, in general it behaves as expected, except for the batch size, which I feel might be more of an issue here than it is for other people.
