
Conversation

fellhorn (Contributor) commented Aug 1, 2025

Add bindings for the libmtmd multimodality support of llama.cpp. This allows using models such as Gemma3 with image or audio input. The bindings are only built when the new mtmd feature is enabled. To illustrate the usage, I added a simple example loosely based on mtmd-cli.cpp.

To run the mtmd example, you first need to download the model GGUF file and the multimodal projection file; for Gemma3 you may use:

wget https://huggingface.co/unsloth/gemma-3-4b-it-GGUF/resolve/main/gemma-3-4b-it-Q4_K_M.gguf \
https://huggingface.co/unsloth/gemma-3-4b-it-GGUF/resolve/main/mmproj-F16.gguf

To then run the example on CPU, provide an image file my_image.jpg and run:

cargo run --release --example mtmd -- \
  --model ./gemma-3-4b-it-Q4_K_M.gguf \
  --mmproj ./mmproj-F16.gguf \
  --image my_image.jpg \
  --prompt "What is in the picture?" \
  --no-gpu \
  --no-mmproj-offload \
  --marker "<start_of_image>"

Closes #744

fellhorn added 12 commits May 26, 2025 19:27
Comment on lines 80 to 86
    Self {
        use_gpu: false,
        print_timings: true,
        n_threads: 4,
        media_marker: CString::new(mtmd_default_marker()).unwrap_or_default(),
    }
}
Contributor:

this seems like it should rely on mtmd_context_params_default

Contributor Author:

Good point -> d025465
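
To illustrate, the change roughly amounts to pulling the defaults from llama.cpp instead of hard-coding them (a sketch; the field handling is assumed and the actual commit may differ):

impl Default for MtmdContextParams {
    fn default() -> Self {
        // Ask llama.cpp for its defaults instead of duplicating them here.
        let c_params = unsafe { llama_cpp_sys_2::mtmd_context_params_default() };
        Self {
            use_gpu: c_params.use_gpu,
            print_timings: c_params.print_timings,
            n_threads: c_params.n_threads,
            // The default marker is a static C string; copy it into an owned CString.
            media_marker: unsafe { std::ffi::CStr::from_ptr(c_params.media_marker) }.to_owned(),
        }
    }
}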

Comment on lines 93 to 96
context.use_gpu = params.use_gpu;
context.print_timings = params.print_timings;
context.n_threads = params.n_threads;
context.media_marker = params.media_marker.as_ptr();
Contributor:

destructure params so you get a compiler error here if new fields are added so we ensure we update this impl.
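
Roughly, something like this (field names taken from the snippet above; the real struct may have more fields, and it assumes params is still borrowable here):

// Destructuring without `..` turns any newly added MtmdContextParams field
// into a compile error until it is handled here as well.
let MtmdContextParams {
    use_gpu,
    print_timings,
    n_threads,
    media_marker,
} = &params;
context.use_gpu = *use_gpu;
context.print_timings = *print_timings;
context.n_threads = *n_threads;
context.media_marker = media_marker.as_ptr();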


Comment on lines 168 to 170
if context.is_null() {
    return Err(MtmdInitError::NullResult);
}
Contributor:

redundant check.

Contributor Author:

Removed in d025465

    bitmaps: &[&MtmdBitmap],
) -> Result<MtmdInputChunks, MtmdTokenizeError> {
    let chunks = MtmdInputChunks::new();
    let text_cstring = CString::new(text.text).unwrap_or_default();
Contributor:

silently eating the error here seems like it could be improved on. I would return an Err if MtmdInputText can contain non-valid C strings, or panic if containing valid C strings is an invariant of MtmdInputText (and document that invariant if not already done)

Contributor Author:

Thanks, that's a good point. I added an explicit MtmdTokenizeError::CStringError in d025465.
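
The propagation boils down to this sketch (assuming the new variant wraps std::ffi::NulError):

let text_cstring = CString::new(text.text)
    // Surface an interior NUL byte to the caller instead of silently
    // falling back to an empty string.
    .map_err(MtmdTokenizeError::CStringError)?;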

Comment on lines 307 to 308
unsafe impl Send for MtmdContext {}
unsafe impl Sync for MtmdContext {}
Contributor:

These impls need some sort of comment on why this is safe.

Contributor Author:

Good catch, I actually meant MtmdContextParams, and that is already Send + Sync. As far as I can see, mtmd_context is not thread safe. Removed in d025465

Comment on lines 507 to 510
unsafe { CStr::from_ptr(ptr) }
    .to_string_lossy()
    .into_owned()
    .into()
Contributor:

nit: into_owned + into makes it somewhat opaque what this chain does. A named variable or explicit types would be nice.

Contributor Author:

Cleaned it up in d025465
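
For illustration, a more explicit spelling of that chain (the final into() target depends on the function's return type):

let c_str = unsafe { CStr::from_ptr(ptr) };
// Lossy conversion: invalid UTF-8 is replaced rather than causing an error.
let text: String = c_str.to_string_lossy().into_owned();
text.into()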

if chunk_ptr.is_null() {
    None
} else {
    // Note: We don't own this chunk, it's owned by the chunks collection
Contributor:

should this return a reference then?

Contributor Author:

MtmdInputChunk is currently just a container for a llama.cpp mtmd_input_chunk pointer. We do not separately keep track of the chunks; we just have a pointer to the mtmd_input_chunks.

Therefore we get the pointer to the llama.cpp chunk here and wrap it in an MtmdInputChunk. The owned field is only there because there is also a way to create an input chunk manually and then encode that one; in that case the underlying mtmd_input_chunk should be dropped when the container is dropped.
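
In other words, the wrapper roughly follows this pattern (a sketch with assumed field names; it presumes the sys crate exposes mtmd_input_chunk_free):

use std::ptr::NonNull;

pub struct MtmdInputChunk {
    chunk: NonNull<llama_cpp_sys_2::mtmd_input_chunk>,
    /// True only for chunks created manually; chunks handed out by an
    /// MtmdInputChunks collection remain owned by that collection.
    owned: bool,
}

impl Drop for MtmdInputChunk {
    fn drop(&mut self) {
        if self.owned {
            // Free only what we actually own.
            unsafe { llama_cpp_sys_2::mtmd_input_chunk_free(self.chunk.as_ptr()) };
        }
    }
}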

MarcusDunn (Contributor) left a comment:

This looks good code-wise. Thanks for the really well documented and thought out PR.

I left some comments that should be addressed, but nothing major.

My main concern is that maintaining this will prevent updates to the main llama.h bindings as I expect this to break often as llama.cpp moves forward.

If ALL this code (bindings, wrappers, the addition to our stub header file, etc.) could be behind the feature flag, such that I can update the library and break multi-modal support provided I do not use the feature, that would be great.

I will merge this provided that breakage in mtmd does not block updates to the submodule.

Again, thanks for the PR - this is great.

haixuanTao commented Aug 13, 2025

Hi, not sure if this helps, but I tried running this with qwen-omni and it does not seem to work, unfortunately:

cargo run --features metal --release --example mtmd -- \
  --model /Users/xaviertao/Downloads/Qwen2.5-Omni-3B-q4_k_m.gguf \
  --mmproj /Users/xaviertao/Downloads/Qwen2.5-Omni-3B-f16.mmproj \
  --image /Users/xaviertao/Downloads/h-1.jpg \
  --prompt "What is in the picture?" \
  --no-mmproj-offload \
  --marker "<start_of_image>"
decoding image batch 497/1369, n_tokens_batch = 1
image decoded (batch 497/1369) in 11 ms
decoding image batch 498/1369, n_tokens_batch = 1
decode: failed to find a memory slot for batch of size 1
failed to decode image
set_causal_attn: value = 1
failed to decode image
failed to eval chunk 1
ggml_metal_free: deallocating

To get the files:

wget https://huggingface.co/Mungert/Qwen2.5-Omni-3B-GGUF/resolve/main/Qwen2.5-Omni-3B-q4_k_m.gguf
wget https://huggingface.co/Mungert/Qwen2.5-Omni-3B-GGUF/resolve/main/Qwen2.5-Omni-3B-q8_0.mmproj

fellhorn (Contributor Author):
Hi, not sure if this helps, but I tried running this with qwen-omni and it does not seem to work, unfortunately: […]

Thanks for reporting, @haixuanTao; I have indeed only tried it with Gemma models. I will take a look.
I will also try to find some time in the next few days to clean up the PR and get it ready.

haixuanTao:
No worries. I think that text only runs on llama-cpp-python version.

* Remove unsafe Send, Sync
* Cleanup error handling
* Use default mtmd_context directly

Signed-off-by: Dennis Keck <[email protected]>
fellhorn (Contributor Author) commented Aug 14, 2025

No worries. I think that text only runs on llama-cpp-python version.

The issue was that the default context length was too short, and I did not pass the argument on to the LlamaContext as well.
It should be fixed with 62f1511.

Let me know if there are any other issues, @haixuanTao.
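
For anyone hitting the same memory-slot error, the fix boils down to forwarding the context length to the LlamaContext as well, roughly like this (a sketch using the crate's LlamaContextParams; n_ctx, model and backend are assumed locals, and the exact wiring in 62f1511 may differ):

use std::num::NonZeroU32;

// The image/audio chunks can need far more cells than the small default
// context, so the user-supplied context length must reach the llama context.
let ctx_params = LlamaContextParams::default().with_n_ctx(NonZeroU32::new(n_ctx));
let context = model.new_context(&backend, ctx_params)?;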

fellhorn (Contributor Author):
@MarcusDunn, thanks for the thorough review. I think I addressed all your comments in d025465 and f149f11.
In particular, your comment about feature-gating everything related to mtmd should be addressed with f149f11:
I added another wrapper_mtmd.h, which is only used when building with the mtmd feature.

Please let me know if there are still mtmd dependencies that I might have missed. I assume that the include in the Cargo.toml should be fine?
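
The gating boils down to two pieces, sketched here (the CARGO_FEATURE_MTMD check is the standard Cargo mechanism for build scripts; the surrounding structure is assumed):

// llama-cpp-sys-2/build.rs (sketch): only hand the mtmd stub header to
// bindgen when the mtmd feature is enabled, so a llama.cpp submodule bump
// that breaks mtmd cannot break the default build.
let mut builder = bindgen::Builder::default().header("wrapper.h");
if std::env::var("CARGO_FEATURE_MTMD").is_ok() {
    builder = builder.header("wrapper_mtmd.h");
}

// llama-cpp-2/src/lib.rs (sketch): the safe wrappers are gated the same way.
#[cfg(feature = "mtmd")]
pub mod mtmd;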

fellhorn requested a review from MarcusDunn on August 14, 2025 at 13:14.
MarcusDunn (Contributor):
No worries about the include. I'll take a look this weekend.

Comment on lines +32 to +38
/// Text input chunk
Text = llama_cpp_sys_2::MTMD_INPUT_CHUNK_TYPE_TEXT as isize,
/// Image input chunk
Image = llama_cpp_sys_2::MTMD_INPUT_CHUNK_TYPE_IMAGE as isize,
/// Audio input chunk
Audio = llama_cpp_sys_2::MTMD_INPUT_CHUNK_TYPE_AUDIO as isize,
}
Contributor:

is the cast required?

Contributor Author:

When the enum uses a u32 representation it's no longer required: #819

Thanks, good point
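
For illustration, the cast-free variant looks roughly like this (assuming bindgen emits the constants as u32; if they come out as another integer type, the repr changes accordingly):

#[repr(u32)]
pub enum MtmdInputChunkType {
    /// Text input chunk
    Text = llama_cpp_sys_2::MTMD_INPUT_CHUNK_TYPE_TEXT,
    /// Image input chunk
    Image = llama_cpp_sys_2::MTMD_INPUT_CHUNK_TYPE_IMAGE,
    /// Audio input chunk
    Audio = llama_cpp_sys_2::MTMD_INPUT_CHUNK_TYPE_AUDIO,
}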

llama_cpp_sys_2::MTMD_INPUT_CHUNK_TYPE_TEXT => MtmdInputChunkType::Text,
llama_cpp_sys_2::MTMD_INPUT_CHUNK_TYPE_IMAGE => MtmdInputChunkType::Image,
llama_cpp_sys_2::MTMD_INPUT_CHUNK_TYPE_AUDIO => MtmdInputChunkType::Audio,
_ => panic!("Unknown MTMD input chunk type"),
Contributor:

include the chunk_type in the panic message.

Contributor Author:

Fixed in #819
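
i.e. roughly the following (assuming the matched value is bound to a local named chunk_type):

_ => panic!("Unknown MTMD input chunk type: {chunk_type}"),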

Comment on lines +213 to +218
/// Get audio bitrate in Hz (e.g., 16000 for Whisper).
/// Returns -1 if audio is not supported.
#[must_use]
pub fn get_audio_bitrate(&self) -> i32 {
    unsafe { llama_cpp_sys_2::mtmd_get_audio_bitrate(self.context.as_ptr()) }
}
Contributor:

I would prefer if we returned an Option<u32> (or Option<i32>) for this and dealt with the -1 case, to keep it a bit more Rusty.

Contributor Author:

Fixed in #819
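
A sketch of the Option-returning shape suggested above, mapping the -1 sentinel to None (the actual change in #819 may differ):

/// Get the audio bitrate in Hz (e.g., 16000 for Whisper), or `None` if the
/// loaded projector does not support audio.
#[must_use]
pub fn get_audio_bitrate(&self) -> Option<u32> {
    let bitrate = unsafe { llama_cpp_sys_2::mtmd_get_audio_bitrate(self.context.as_ptr()) };
    // llama.cpp signals "audio not supported" with a negative value.
    u32::try_from(bitrate).ok()
}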

MarcusDunn (Contributor) left a comment:

LGTM. Have the non-Gemma models been tested? (I will merge regardless, as we can always fix that later.)

Feel free to address the comments; I'll merge a couple of days from now, so others can give this branch some testing before I do.

haixuanTao:
Can confirm the model mentioned above is now working wonders!

Thanks a lot!

haixuanTao commented Aug 15, 2025

More of an optimisation thing, but are we using the maximum number of batches for encoding/decoding images and audio?

        self.n_past = chunks.eval_chunks(&self.mtmd_ctx, context, 0, 0, 1, true)?; // batch size of image decoding

It seems that this could be higher than 1, but I'm not an expert on this.

Related: ggml-org/llama.cpp#14527; if you search through the issue, he is actually decoding with n_tokens_batch = 64.
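
Concretely, the tweak under discussion is just making that hard-coded 1 configurable (argument order taken from the snippet above; the name n_batch is illustrative):

// n_batch controls how many embedding tokens are decoded per llama_decode
// call while evaluating image/audio chunks; 1 is safe but slow.
let n_batch = 64;
self.n_past = chunks.eval_chunks(&self.mtmd_ctx, context, 0, 0, n_batch, true)?;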

haixuanTao:
FYI, I tried to pass image + audio at the same time, and it didn't work, as there was only one marker within the prompt :D

Not an urgent bug

haixuanTao:
More of an optimisation thing, but are we using the maximum number of batches for encoding/decoding images and audio? […]

I am able to batch by 10 using qwen-omni, reducing the decoding time from 120 ms to 50 ms, but I truly have no idea why, most of the time, giving n_batch an arbitrary batch size results in the following error:

image slice encoded in 326 ms
decoding image batch 1/3, n_tokens_batch = 20
image decoded (batch 1/3) in 52 ms
decoding image batch 2/3, n_tokens_batch = 20
image decoded (batch 2/3) in 53 ms
decoding image batch 3/3, n_tokens_batch = 20
image decoded (batch 3/3) in 53 ms
get_logits_ith: invalid logits id 0, reason: batch.logits[0] != true

MarcusDunn merged commit 96b6bcc into utilityai:main on Aug 18, 2025; 3 of 5 checks passed.
altunenes:
encoding image slice...

This stage took a massive amount of time on Metal. I don't know if it works correctly on CUDA.

haixuanTao:
It's super fast on CUDA. But it's a llama.cpp thing, apparently.

haixuanTao:
Related: ggml-org/llama.cpp#15426

altunenes:
Related: ggml-org/llama.cpp#15426

Thanks, I'm experiencing exactly the same issue. I tried many things and thought it was my fault; I've been working on it since yesterday. I tried running it on different threads, but no matter what I did, my optimization only improved things by 2-3 seconds at most. I was also doing all the image/video processing on the GPU side (using wgpu).
I have not tried Vulkan, though.

haixuanTao:
Have you tried lowering the resolution of the image?

altunenes:
Have you tried lowering the resolution of the image?

I did that for my tests/debugging (like 64x32, etc.). It only gives a 1-2 second improvement and doesn't solve the problem. I also have a Mac M3 with 16 GB; a single output takes 1 min on my side.
ggml-org/llama.cpp#15426 (comment)

haixuanTao:
OK, I have an M4 Pro and it's about 30 seconds, more or less. On a cloud 5090 with CUDA, the encoding is 120 ms.

fellhorn (Contributor Author) commented Sep 1, 2025

More of an optimisation thing, but are we using the maximum number of batches for encoding/decoding images and audio?
It seems that this could be higher than 1, but not an expert on this
Related: ggml-org/llama.cpp#14527 if you search through the issue he's actually decoding at 64 n_tokens_batch

I am able to batch by 10 using qwen-omni, reducing the decoding time from 120ms-> 50ms, but I truly have no idea why most of time when giving a n_batch a random batch size it gives me the following error:

Thanks, @haixuanTao, for spotting this. I fixed the issue in #819 and added a new CLI arg for it. It should now work with arbitrary batch sizes. Let me know if you still see the invalid logits error.

Unfortunately, I can't help with the Metal performance problems, as I don't own a compatible device. It sounds like an issue in llama.cpp, though.

This pull request closes: Multimodal support (libmtmd)