Merged
26 commits
7a9f639
Test dummy image tags in chat templates
abetlen Jan 31, 2024
b7338a0
Merge branch 'main' into generic-vlm-chat-format
abetlen Apr 27, 2024
b78ed72
Format and improve types for llava_cpp.py
abetlen Apr 27, 2024
a3c3b5d
Add from_pretrained support to llava chat format.
abetlen Apr 27, 2024
d7b28f7
Refactor llava chat format to use a jinja2
abetlen Apr 27, 2024
3cef09c
Revert chat format test
abetlen Apr 27, 2024
2fd41f9
Add moondream support (wip)
abetlen Apr 27, 2024
7df9483
Update moondream chat format
abetlen Apr 27, 2024
1705893
Update moondream chat format
abetlen Apr 27, 2024
fd55c29
Update moondream prompt
abetlen Apr 27, 2024
94fe4bc
Add function calling support
abetlen Apr 27, 2024
0e182be
Cache last image embed
abetlen Apr 28, 2024
20e0967
Add Llava1.6 support
abetlen Apr 28, 2024
8324ee0
Add nanollava support
abetlen Apr 28, 2024
8f09d42
Add obisidian support
abetlen Apr 28, 2024
22c55cd
Merge branch 'main' into generic-vlm-chat-format
abetlen Apr 28, 2024
c89c6de
Merge branch 'main' into generic-vlm-chat-format
abetlen Apr 30, 2024
dd47dda
Remove unnecessary import
abetlen Apr 30, 2024
0b891f4
Re-order multimodal chat formats
abetlen Apr 30, 2024
0e15835
Logits all no longer required for multi-modal models
abetlen Apr 30, 2024
fc5d01c
Update README.md
abetlen Apr 30, 2024
f03326c
Update docs
abetlen Apr 30, 2024
efd99f1
Update README
abetlen Apr 30, 2024
6e4ad72
Fix typo
abetlen Apr 30, 2024
f70326f
Update README
abetlen Apr 30, 2024
64008aa
Fix typo
abetlen Apr 30, 2024
41 changes: 36 additions & 5 deletions README.md
@@ -490,14 +490,15 @@ Due to discrepancies between llama.cpp and HuggingFace's tokenizers, it is requi

### Multi-modal Models

`llama-cpp-python` supports the llava1.5 family of multi-modal models which allow the language model to
read information from both text and images.
`llama-cpp-python` supports multi-modal models such as llava1.5, which allow the language model to read information from both text and images.

You'll first need to download one of the available multi-modal models in GGUF format:

- [llava-v1.5-7b](https://huggingface.co/mys/ggml_llava-v1.5-7b)
- [llava-v1.5-13b](https://huggingface.co/mys/ggml_llava-v1.5-13b)
- [bakllava-1-7b](https://huggingface.co/mys/ggml_bakllava-1)
- [llava-v1.6-34b](https://huggingface.co/cjpais/llava-v1.6-34B-gguf)
- [moondream2](https://huggingface.co/vikhyatk/moondream2)
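
If you prefer to fetch the files ahead of time rather than at load time, a minimal sketch using `huggingface_hub` is shown below; the exact filenames vary by repository, so the ones used here are assumptions to verify against the repo's file listing:

```python
from huggingface_hub import hf_hub_download

# Filenames below are examples only; check the repository's file list for the exact names.
model_path = hf_hub_download(
    repo_id="mys/ggml_llava-v1.5-7b",
    filename="ggml-model-q4_k.gguf",   # assumed quantized language model file
)
clip_model_path = hf_hub_download(
    repo_id="mys/ggml_llava-v1.5-7b",
    filename="mmproj-model-f16.gguf",  # assumed CLIP/projection (mmproj) file
)
print(model_path, clip_model_path)
```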

Then you'll need to use a custom chat handler to load the clip model and process the chat messages and images.

@@ -509,22 +510,52 @@ Then you'll need to use a custom chat handler to load the clip model and process
model_path="./path/to/llava/llama-model.gguf",
chat_handler=chat_handler,
n_ctx=2048, # n_ctx should be increased to accommodate the image embedding
logits_all=True, # needed to make llava work
)
>>> llm.create_chat_completion(
messages = [
{"role": "system", "content": "You are an assistant who perfectly describes images."},
{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": "https://.../image.png"}},
{"type" : "text", "text": "Describe this image in detail please."}
{"type" : "text", "text": "What's in this image?"},
{"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg" } }
]
}
]
)
```
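
The hunk above shows only the changed tail of the snippet; for context, a complete minimal sketch of the llava-1.5 setup (file paths are placeholders) would look roughly like this:

```python
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

# The mmproj (CLIP/projection) file ships alongside the GGUF language model.
chat_handler = Llava15ChatHandler(clip_model_path="path/to/llava/mmproj.bin")

llm = Llama(
    model_path="./path/to/llava/llama-model.gguf",
    chat_handler=chat_handler,
    n_ctx=2048,  # n_ctx should be increased to accommodate the image embedding
)

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are an assistant who perfectly describes images."},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"}},
            ],
        },
    ]
)
print(response["choices"][0]["message"]["content"])
```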

You can also pull the model from the Hugging Face Hub using the `from_pretrained` method.

```python
>>> from llama_cpp import Llama
>>> from llama_cpp.llama_chat_format import MoondreamChatHandler
>>> chat_handler = MoondreamChatHandler.from_pretrained(
repo_id="vikhyatk/moondream2",
filename="*mmproj*",
)
>>> llm = Llama.from_pretrained(
repo_id="vikhyatk/moondream2"
filename="*text-model*",
chat_handler=chat_handler,
n_ctx=2048, # n_ctx should be increased to accommodate the image embedding
)
>>> llm.create_chat_completion(
messages = [
{
"role": "user",
"content": [
{"type" : "text", "text": "What's in this image?"},
{"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg" } }

]
}
]
)
```

**Note**: Multi-modal models also support tool calling and JSON mode.
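
As a rough illustration of JSON mode, here is a sketch that reuses the `llm` and chat handler configured above and requests JSON output via `response_format`:

```python
result = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You describe images as JSON."},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "List the main objects in this image."},
                {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"}},
            ],
        },
    ],
    # Constrain the output to valid JSON; a "schema" key can be added to constrain the structure further.
    response_format={"type": "json_object"},
)
print(result["choices"][0]["message"]["content"])
```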

<details>
<summary>Loading a Local Image</summary>
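
One common approach, sketched below, is to read the file and pass it as a base64 data URI in the `image_url` field; the helper name and file path are placeholders:

```python
import base64

def image_to_base64_data_uri(file_path):
    # Read the image bytes and wrap them in a data URI the chat handler accepts.
    with open(file_path, "rb") as img_file:
        base64_data = base64.b64encode(img_file.read()).decode("utf-8")
    return f"data:image/png;base64,{base64_data}"

data_uri = image_to_base64_data_uri("path/to/local/image.png")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in detail please."},
            {"type": "image_url", "image_url": {"url": data_uri}},
        ],
    }
]
```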

2 changes: 2 additions & 0 deletions docs/server.md
@@ -98,6 +98,8 @@ You'll first need to download one of the available multi-modal models in GGUF fo
- [llava-v1.5-7b](https://huggingface.co/mys/ggml_llava-v1.5-7b)
- [llava-v1.5-13b](https://huggingface.co/mys/ggml_llava-v1.5-13b)
- [bakllava-1-7b](https://huggingface.co/mys/ggml_bakllava-1)
- [llava-v1.6-34b](https://huggingface.co/cjpais/llava-v1.6-34B-gguf)
- [moondream2](https://huggingface.co/vikhyatk/moondream2)

Then, when you run the server, you'll also need to specify the path to the clip model used for image embedding and the `llava-1-5` chat_format
