docs: Add multimodal documentation vllm, sglang, and trtllm backends #4510
base: main
Signed-off-by: Indrajit Bhosale <[email protected]>
rmccorm4
left a comment
Can you update https://github.com/ai-dynamo/dynamo/blob/main/docs/multimodal/multimodal_intro.md at the bottom with links to each of the backend specific docs as a central location?
Signed-off-by: Indrajit Bhosale <[email protected]>
rmccorm4
left a comment
re: https://github.com/ai-dynamo/dynamo/actions/runs/19580969130/job/56078496698?pr=4510
#15 5.473 checking consistency... /workspace/dynamo/docs/backends/sglang/multimodal_sglang_guide.md: WARNING: document isn't included in any toctree [toc.not_included]
#15 5.474 /workspace/dynamo/docs/backends/trtllm/multimodal_trtllm_guide.md: WARNING: document isn't included in any toctree [toc.not_included]
#15 5.474 /workspace/dynamo/docs/backends/vllm/multimodal_vllm_guide.md: WARNING: document isn't included in any toctree [toc.not_included]
Needs these files added to docs/hidden_toctree.rst
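A minimal sketch of how the three warned files could be checked against a toctree before pushing. The helper names `toctree_entries` and `missing_docs` are hypothetical, and the sample toctree text is illustrative only: the real layout of `docs/hidden_toctree.rst` is not shown in this thread, and the sketch assumes entries are written relative to `docs/`.

```python
from pathlib import PurePosixPath


def toctree_entries(rst_text: str) -> set[str]:
    """Collect document entries listed under `.. toctree::` directives."""
    entries = set()
    in_toctree = False
    for line in rst_text.splitlines():
        stripped = line.strip()
        if stripped.startswith(".. toctree::"):
            in_toctree = True
            continue
        if not in_toctree:
            continue
        if not stripped:
            continue  # blank lines are allowed inside the directive body
        if not line.startswith((" ", "\t")):
            in_toctree = False  # a dedented line ends the directive body
            continue
        if stripped.startswith(":"):
            continue  # directive option such as :hidden:
        # "Title <path>" entries: keep only the path part
        if stripped.endswith(">") and "<" in stripped:
            stripped = stripped[stripped.rindex("<") + 1 : -1]
        entries.add(stripped)
    return entries


def missing_docs(rst_text: str, docs: list[str]) -> list[str]:
    """Return the docs that no toctree entry references (extension ignored)."""
    listed = {str(PurePosixPath(e).with_suffix("")) for e in toctree_entries(rst_text)}
    return [d for d in docs if str(PurePosixPath(d).with_suffix("")) not in listed]


# Illustrative toctree text (an assumption, not the repository's actual file).
sample_rst = """.. toctree::
   :hidden:

   backends/vllm/multimodal_vllm_guide.md
   TRT-LLM multimodal <backends/trtllm/multimodal_trtllm_guide>

Other content
"""

new_docs = [
    "backends/sglang/multimodal_sglang_guide.md",
    "backends/trtllm/multimodal_trtllm_guide.md",
    "backends/vllm/multimodal_vllm_guide.md",
]
print(missing_docs(sample_rst, new_docs))
```

With the sample text, only the SGLang guide is reported, since it is the one path the illustrative toctree does not list.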
@krishung5 can you help review the docs here? Main point is to clearly document what is supported in each backend today with relation to multimodality, and highlight at least 1 key example or model for each.
DISAGGREGATED (E->P->D):
Client → Frontend → Processor → Encoder [NIXL] → Prefill [bootstrap] → Decode → Response
• 4 components • Vision encoder + KV sharing • Bootstrap coordination
What does the KV sharing here mean? Does it just mean PD disagg?
krishung5
left a comment
SIMPLE AGGREGATED (agg.sh):
Client → Frontend (Rust) → Worker [image load, encode, P+D] → Response
• 2 components • --modality multimodal • Easiest setup
I think `--modality multimodal` could be a bit confusing here, since users might not be familiar with the launch scripts. I understand we want to keep the bullet points short, but maybe we can do something like this?
SIMPLE AGGREGATED (agg.sh):
Client → Frontend (Rust) → Worker [image load, encode, P+D] → Response
• 2 components • worker flag `--modality multimodal` • Easiest setup
### Launch Script

Example: `examples/backends/trtllm/launch/agg.sh`
Maybe add actual link here?
| **Frontend → Prefill** | Request with image URL or embedding path | No |
| **Encode → Prefill (Precomputed Embeddings)** | NIXL metadata (pre-computed embeddings) | Yes (Embeddings tensor) |
| **Encode → Prefill (Image URL) (WIP)** | Disaggregated params with multimodal handles | No (Handles via params) |
| **Prefill → Decode** | Disaggregated params | Yes/No (KV cache - UCX or NIXL) |
Qq -
Yes/No (KV cache - UCX or NIXL)
Does this mean
Yes(KV cache transfer using NIXL)
No(KV cache transfer using UCX)
SIMPLE AGGREGATED (agg_multimodal.sh):
Client → Frontend (Rust) → Worker [image load, encode, P+D] → Response
Do we want to highlight the rust processor?
- Client → Frontend (Rust) → Worker [image load, encode, P+D] → Response
+ Client → Frontend (Rust processor) → Worker [image load, encode, P+D] → Response
| **Data URL** | `data:image/jpeg;base64,/9j/4AAQ...` | Base64-encoded inline data | ✅ |

## Aggregated Mode (PD)
Could we clarify here that it's the EPD and not the simple aggregated one? Do we want to add a section for the simple aggregated workflow? One confusion I frequently hear from people is EPD vs. simple/traditional. It would be very helpful if we could align on the wording.
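For the Data URL row quoted in the diff context above, a short sketch of how such a value could be built from raw image bytes and decoded again. The helper names `to_data_url` and `from_data_url` are hypothetical and not part of any backend; only the standard `data:<mime>;base64,<payload>` shape is assumed.

```python
import base64


def to_data_url(image_bytes: bytes, mime_type: str = "image/jpeg") -> str:
    """Encode raw image bytes as a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def from_data_url(data_url: str) -> tuple[str, bytes]:
    """Split a base64 data URL into its MIME type and decoded bytes."""
    header, _, payload = data_url.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("not a base64 data URL")
    mime_type = header[len("data:"):-len(";base64")]
    return mime_type, base64.b64decode(payload)


# The first six bytes of a typical JPEG (JFIF) file, used here as stand-in data.
jpeg_prefix = b"\xff\xd8\xff\xe0\x00\x10"
url = to_data_url(jpeg_prefix)
print(url)  # data:image/jpeg;base64,/9j/4AAQ
print(from_data_url(url))
```

The `/9j/4AAQ` prefix in the table row is what these leading JPEG bytes encode to, which is why JPEG data URLs commonly start that way.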
### Launch Script

Example: `examples/backends/vllm/launch/disagg_multimodal_llama.sh`
same here, can add the actual link to the script.
Overview:
Add comprehensive multimodal guides for the vLLM, SGLang, and TRT-LLM backends, documenting architectures, deployment modes, input formats, and known limitations.
Details:
- docs/backends/vllm/multimodal_vllm_guide.md: Complete vLLM multimodal reference
- New: docs/backends/trtllm/multimodal_trtllm_guide.md: Complete TRT-LLM multimodal reference
- New: dynamo/docs/backends/sglang/multimodal_sglang_guide.md: Complete SGLang multimodal reference