
Conversation

fellhorn (Contributor) commented Aug 1, 2025

Add bindings for the libmtmd multimodality support of llama.cpp. This allows using models such as Gemma3 with image or audio input. The bindings are only built when the new mtmd feature is enabled. To illustrate the usage, I added a simple example loosely based on mtmd-cli.cpp.

To run the mtmd example, you first need to download the model GGUF file and the multimodal projection file; for Gemma3 you may use:

wget https://huggingface.co/unsloth/gemma-3-4b-it-GGUF/resolve/main/gemma-3-4b-it-Q4_K_M.gguf \
https://huggingface.co/unsloth/gemma-3-4b-it-GGUF/resolve/main/mmproj-F16.gguf

To then run the example on CPU, provide an image file my_image.jpg and run:

cargo run --release --example mtmd -- \
  --model ./gemma-3-4b-it-Q4_K_M.gguf \
  --mmproj ./mmproj-F16.gguf \
  --image my_image.jpg \
  --prompt "What is in the picture?" \
  --no-gpu \
  --no-mmproj-offload \
  --marker "<start_of_image>"

Closes #744

fellhorn added 12 commits May 26, 2025 19:27
Comment on lines 80 to 86
    Self {
        use_gpu: false,
        print_timings: true,
        n_threads: 4,
        media_marker: CString::new(mtmd_default_marker()).unwrap_or_default(),
    }
}
Contributor:

this seems like it should rely on mtmd_context_params_default

Contributor Author:

Good point -> d025465
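
To illustrate, the change roughly amounts to pulling the defaults from llama.cpp instead of hard-coding them (a sketch; the field handling is assumed and the actual commit may differ):

impl Default for MtmdContextParams {
    fn default() -> Self {
        // Ask llama.cpp for its defaults instead of duplicating them here.
        let c_params = unsafe { llama_cpp_sys_2::mtmd_context_params_default() };
        Self {
            use_gpu: c_params.use_gpu,
            print_timings: c_params.print_timings,
            n_threads: c_params.n_threads,
            // The default marker is a static C string; copy it into an owned CString.
            media_marker: unsafe { std::ffi::CStr::from_ptr(c_params.media_marker) }.to_owned(),
        }
    }
}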

Comment on lines 93 to 96
context.use_gpu = params.use_gpu;
context.print_timings = params.print_timings;
context.n_threads = params.n_threads;
context.media_marker = params.media_marker.as_ptr();
Contributor:

destructure params so you get a compiler error here if new fields are added so we ensure we update this impl.
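
Roughly, something like this (field names taken from the snippet above; the real struct may have more fields, and it assumes params is still borrowable here):

// Destructuring without `..` turns any newly added MtmdContextParams field
// into a compile error until it is handled here as well.
let MtmdContextParams {
    use_gpu,
    print_timings,
    n_threads,
    media_marker,
} = &params;
context.use_gpu = *use_gpu;
context.print_timings = *print_timings;
context.n_threads = *n_threads;
context.media_marker = media_marker.as_ptr();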


Comment on lines 168 to 170
if context.is_null() {
    return Err(MtmdInitError::NullResult);
}
Contributor:

redundant check.

Contributor Author:

Removed in d025465

    bitmaps: &[&MtmdBitmap],
) -> Result<MtmdInputChunks, MtmdTokenizeError> {
    let chunks = MtmdInputChunks::new();
    let text_cstring = CString::new(text.text).unwrap_or_default();
Contributor:

silently eating the error here seems like it could be improved on. I would return an Err if MtmdInputText can contain non-valid C strings, or panic if containing valid C strings is an invariant of MtmdInputText (and document that invariant if not already done)

Contributor Author:

Thanks, that's a good point. I added an explicit MtmdTokenizeError::CStringError in d025465.
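
The propagation boils down to this sketch (assuming the new variant wraps std::ffi::NulError):

let text_cstring = CString::new(text.text)
    // Surface an interior NUL byte to the caller instead of silently
    // falling back to an empty string.
    .map_err(MtmdTokenizeError::CStringError)?;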

Comment on lines 307 to 308
unsafe impl Send for MtmdContext {}
unsafe impl Sync for MtmdContext {}
Contributor:

These impls need some sort of comment on why this is safe.

Contributor Author:

Good catch, I actually meant MtmdContextParams, and that is already Send + Sync. As far as I can see, mtmd_context is not thread safe. Removed in d025465

Comment on lines 507 to 510
unsafe { CStr::from_ptr(ptr) }
    .to_string_lossy()
    .into_owned()
    .into()
Contributor:

nit: into_owned + into makes it somewhat opaque what this chain does. A named variable or explicit types would be nice.

Contributor Author:

Cleaned it up in d025465
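
For illustration, a more explicit spelling of that chain (the final into() target depends on the function's return type):

let c_str = unsafe { CStr::from_ptr(ptr) };
// Lossy conversion: invalid UTF-8 is replaced rather than causing an error.
let text: String = c_str.to_string_lossy().into_owned();
text.into()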

if chunk_ptr.is_null() {
    None
} else {
    // Note: We don't own this chunk, it's owned by the chunks collection
Contributor:

should this return a reference then?

Contributor Author:

MtmdInputChunk is currently just a container for a llama.cpp mtmd_input_chunk pointer. We do not separately keep track of the chunks; we just have a pointer to the mtmd_input_chunks.

Therefore we get the pointer to the llama.cpp chunk here and wrap it in an MtmdInputChunk. The owned field is only there because there is also a way to create an input chunk manually and then encode that one; in that case the underlying mtmd_input_chunk should be dropped when the container is dropped.
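
In other words, the wrapper roughly follows this pattern (a sketch with assumed field names; it presumes the sys crate exposes mtmd_input_chunk_free):

use std::ptr::NonNull;

pub struct MtmdInputChunk {
    chunk: NonNull<llama_cpp_sys_2::mtmd_input_chunk>,
    /// True only for chunks created manually; chunks handed out by an
    /// MtmdInputChunks collection remain owned by that collection.
    owned: bool,
}

impl Drop for MtmdInputChunk {
    fn drop(&mut self) {
        if self.owned {
            // Free only what we actually own.
            unsafe { llama_cpp_sys_2::mtmd_input_chunk_free(self.chunk.as_ptr()) };
        }
    }
}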

MarcusDunn (Contributor) left a comment:

This looks good code-wise. Thanks for the really well documented and thought out PR.

I left some comments that should be addressed, but nothing major.

My main concern is that maintaining this will prevent updates to the main llama.h bindings as I expect this to break often as llama.cpp moves forward.

If ALL this code (bindings, wrappers, the addition to our stub header file, etc.) could be behind the feature flag, such that I can update the library and break multi-modal support provided I do not use the feature, that would be great.

I will merge this provided that breakage in mtmd does not block updates to the submodule.

Again, thanks for the PR - this is great.

haixuanTao commented Aug 13, 2025

Hi, not sure if this helps, but I tried running this with qwen-omni and it does not seem to work, unfortunately:

cargo run --features metal --release --example mtmd -- \
  --model /Users/xaviertao/Downloads/Qwen2.5-Omni-3B-q4_k_m.gguf \
  --mmproj /Users/xaviertao/Downloads/Qwen2.5-Omni-3B-f16.mmproj \
  --image /Users/xaviertao/Downloads/h-1.jpg \
  --prompt "What is in the picture?" \
  --no-mmproj-offload \
  --marker "<start_of_image>"
decoding image batch 497/1369, n_tokens_batch = 1
image decoded (batch 497/1369) in 11 ms
decoding image batch 498/1369, n_tokens_batch = 1
decode: failed to find a memory slot for batch of size 1
failed to decode image
set_causal_attn: value = 1
failed to decode image
failed to eval chunk 1
ggml_metal_free: deallocating

To get the files:

wget https://huggingface.co/Mungert/Qwen2.5-Omni-3B-GGUF/resolve/main/Qwen2.5-Omni-3B-q4_k_m.gguf
wget https://huggingface.co/Mungert/Qwen2.5-Omni-3B-GGUF/resolve/main/Qwen2.5-Omni-3B-q8_0.mmproj

fellhorn (Contributor Author):
Hi, not sure if this helps, but I tried running this with qwen-omni and it does not seem to work, unfortunately: […]

Thanks for reporting, @haixuanTao; I have indeed only tried it with Gemma models. I will take a look.
I will also try to find some time in the next few days to clean up the PR and get it ready.

haixuanTao:
No worries. I think that text only runs on llama-cpp-python version.

* Remove unsafe Send, Sync
* Cleanup error handling
* Use default mtmd_context directly

Signed-off-by: Dennis Keck <[email protected]>
fellhorn (Contributor Author) commented Aug 14, 2025

No worries. I think that text only runs on llama-cpp-python version.

The issue was that the default context length was too short, and I did not pass the argument on to the LlamaContext as well.
It should be fixed with 62f1511.

Let me know if there are any other issues, @haixuanTao.
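
For anyone hitting the same memory-slot error, the fix boils down to forwarding the context length to the LlamaContext as well, roughly like this (a sketch using the crate's LlamaContextParams; n_ctx, model and backend are assumed locals, and the exact wiring in 62f1511 may differ):

use std::num::NonZeroU32;

// The image/audio chunks can need far more cells than the small default
// context, so the user-supplied context length must reach the llama context.
let ctx_params = LlamaContextParams::default().with_n_ctx(NonZeroU32::new(n_ctx));
let context = model.new_context(&backend, ctx_params)?;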

fellhorn (Contributor Author):
@MarcusDunn, thanks for the thorough review. I think I addressed all your comments in d025465 and f149f11.
In particular, your comment about feature-gating everything related to mtmd should be addressed with f149f11:
I added another wrapper_mtmd.h, which is only used when building with the mtmd feature.

Please let me know if there are still mtmd dependencies that I might have missed. I assume that the include in the Cargo.toml should be fine?
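
The gating boils down to two pieces, sketched here (the CARGO_FEATURE_MTMD check is the standard Cargo mechanism for build scripts; the surrounding structure is assumed):

// llama-cpp-sys-2/build.rs (sketch): only hand the mtmd stub header to
// bindgen when the mtmd feature is enabled, so a llama.cpp submodule bump
// that breaks mtmd cannot break the default build.
let mut builder = bindgen::Builder::default().header("wrapper.h");
if std::env::var("CARGO_FEATURE_MTMD").is_ok() {
    builder = builder.header("wrapper_mtmd.h");
}

// llama-cpp-2/src/lib.rs (sketch): the safe wrappers are gated the same way.
#[cfg(feature = "mtmd")]
pub mod mtmd;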

fellhorn requested a review from MarcusDunn on August 14, 2025 at 13:14.
MarcusDunn (Contributor):
No worries about the include. I'll take a look this weekend.

Comment on lines +32 to +38
/// Text input chunk
Text = llama_cpp_sys_2::MTMD_INPUT_CHUNK_TYPE_TEXT as isize,
/// Image input chunk
Image = llama_cpp_sys_2::MTMD_INPUT_CHUNK_TYPE_IMAGE as isize,
/// Audio input chunk
Audio = llama_cpp_sys_2::MTMD_INPUT_CHUNK_TYPE_AUDIO as isize,
}
Contributor:

is the cast required?

Contributor Author:

When the enum uses a u32 representation it's no longer required: #819

Thanks, good point
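
For illustration, the cast-free variant looks roughly like this (assuming bindgen emits the constants as u32; if they come out as another integer type, the repr changes accordingly):

#[repr(u32)]
pub enum MtmdInputChunkType {
    /// Text input chunk
    Text = llama_cpp_sys_2::MTMD_INPUT_CHUNK_TYPE_TEXT,
    /// Image input chunk
    Image = llama_cpp_sys_2::MTMD_INPUT_CHUNK_TYPE_IMAGE,
    /// Audio input chunk
    Audio = llama_cpp_sys_2::MTMD_INPUT_CHUNK_TYPE_AUDIO,
}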

llama_cpp_sys_2::MTMD_INPUT_CHUNK_TYPE_TEXT => MtmdInputChunkType::Text,
llama_cpp_sys_2::MTMD_INPUT_CHUNK_TYPE_IMAGE => MtmdInputChunkType::Image,
llama_cpp_sys_2::MTMD_INPUT_CHUNK_TYPE_AUDIO => MtmdInputChunkType::Audio,
_ => panic!("Unknown MTMD input chunk type"),
Contributor:

include the chunk_type in the panic message.

Contributor Author:

Fixed in #819
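
i.e. roughly the following (assuming the matched value is bound to a local named chunk_type):

_ => panic!("Unknown MTMD input chunk type: {chunk_type}"),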

Comment on lines +213 to +218
/// Get audio bitrate in Hz (e.g., 16000 for Whisper).
/// Returns -1 if audio is not supported.
#[must_use]
pub fn get_audio_bitrate(&self) -> i32 {
    unsafe { llama_cpp_sys_2::mtmd_get_audio_bitrate(self.context.as_ptr()) }
}
Contributor:

I would prefer if we returned an Option<u32> (or Option<i32>) for this and dealt with the -1 case, to keep it a bit more Rusty.

Contributor Author:

Fixed in #819
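
A sketch of the Option-returning shape suggested above, mapping the -1 sentinel to None (the actual change in #819 may differ):

/// Get the audio bitrate in Hz (e.g., 16000 for Whisper), or `None` if the
/// loaded projector does not support audio.
#[must_use]
pub fn get_audio_bitrate(&self) -> Option<u32> {
    let bitrate = unsafe { llama_cpp_sys_2::mtmd_get_audio_bitrate(self.context.as_ptr()) };
    // llama.cpp signals "audio not supported" with a negative value.
    u32::try_from(bitrate).ok()
}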

MarcusDunn (Contributor) left a comment:

LGTM. Have the non-Gemma models been tested? (I will merge regardless, as we can always fix that later.)

Feel free to address the comments; I'll merge a couple of days from now, so others can give this branch some testing before I do.

haixuanTao:
Can confirm the model mentioned above is now working wonders!

Thanks a lot!

haixuanTao commented Aug 15, 2025

More of an optimisation thing, but are we using the maximum number of batches for encoding/decoding images and audio?

        self.n_past = chunks.eval_chunks(&self.mtmd_ctx, context, 0, 0, 1, true)?; // batch size of image decoding

It seems that this could be higher than 1, but I'm not an expert on this.

Related: ggml-org/llama.cpp#14527; if you search through the issue, he is actually decoding with n_tokens_batch = 64.
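
Concretely, the tweak under discussion is just making that hard-coded 1 configurable (argument order taken from the snippet above; the name n_batch is illustrative):

// n_batch controls how many embedding tokens are decoded per llama_decode
// call while evaluating image/audio chunks; 1 is safe but slow.
let n_batch = 64;
self.n_past = chunks.eval_chunks(&self.mtmd_ctx, context, 0, 0, n_batch, true)?;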

haixuanTao:
FYI, I tried to pass image + audio at the same time, and it didn't work, as there was only one marker within the prompt :D

Not an urgent bug

haixuanTao:
More of an optimisation thing, but are we using the maximum number of batches for encoding/decoding images and audio? […]

I am able to batch by 10 using qwen-omni, reducing the decoding time from 120 ms to 50 ms, but I truly have no idea why, most of the time, giving n_batch an arbitrary batch size results in the following error:

image slice encoded in 326 ms
decoding image batch 1/3, n_tokens_batch = 20
image decoded (batch 1/3) in 52 ms
decoding image batch 2/3, n_tokens_batch = 20
image decoded (batch 2/3) in 53 ms
decoding image batch 3/3, n_tokens_batch = 20
image decoded (batch 3/3) in 53 ms
get_logits_ith: invalid logits id 0, reason: batch.logits[0] != true

MarcusDunn merged commit 96b6bcc into utilityai:main on Aug 18, 2025; 3 of 5 checks passed.
altunenes:
encoding image slice...

This stage took a massive amount of time on Metal. I don't know if it works correctly on CUDA.

haixuanTao:
It's super fast on CUDA. But it's a llama.cpp thing, apparently.

haixuanTao:
Related: ggml-org/llama.cpp#15426

altunenes:
Related: ggml-org/llama.cpp#15426

Thanks, I'm experiencing exactly the same issue. I tried many things and thought it was my fault; I've been working on it since yesterday. I tried running it on different threads, but no matter what I did, my optimization only improved things by 2-3 seconds at most. I was also doing all the image/video processing on the GPU side (using wgpu).
I have not tried Vulkan, though.

haixuanTao:
Have you tried lowering the resolution of the image?

altunenes:
Have you tried lowering the resolution of the image?

I did that for my tests/debugging (like 64x32, etc.). It only gives a 1-2 second improvement and doesn't solve the problem. I also have a Mac M3 with 16 GB; a single output takes 1 min on my side.
ggml-org/llama.cpp#15426 (comment)

haixuanTao:
OK, I have an M4 Pro and it's about 30 seconds, more or less. On a cloud 5090 with CUDA, the encoding is 120 ms.

fellhorn (Contributor Author) commented Sep 1, 2025

More of an optimisation thing, but are we using the maximum number of batches for encoding/decoding images and audio?
It seems that this could be higher than 1, but not an expert on this
Related: ggml-org/llama.cpp#14527 if you search through the issue he's actually decoding at 64 n_tokens_batch

I am able to batch by 10 using qwen-omni, reducing the decoding time from 120ms-> 50ms, but I truly have no idea why most of time when giving a n_batch a random batch size it gives me the following error:

Thanks, @haixuanTao, for spotting this. I fixed the issue in #819 and added a new CLI arg for it. It should now work with arbitrary batch sizes. Let me know if you still see the invalid logits error.

Unfortunately, I can't help with the Metal performance problems, as I don't own a compatible device. It sounds like an issue in llama.cpp, though.

This pull request closes: Multimodal support (libmtmd)