
Conversation

@indrajit96 indrajit96 commented Nov 20, 2025

Overview:

Add comprehensive multimodal guides for the vLLM, SGLang, and TRT-LLM backends, documenting architectures, deployment modes, input formats, and known limitations.

Details:

  • New: docs/backends/vllm/multimodal_vllm_guide.md - Complete vLLM multimodal reference
  • New: docs/backends/trtllm/multimodal_trtllm_guide.md - Complete TRT-LLM multimodal reference
  • New: docs/backends/sglang/multimodal_sglang_guide.md - Complete SGLang multimodal reference

Signed-off-by: Indrajit Bhosale <[email protected]>

@rmccorm4 rmccorm4 left a comment


Can you update https://github.com/ai-dynamo/dynamo/blob/main/docs/multimodal/multimodal_intro.md at the bottom with links to each of the backend-specific docs, as a central location?

Signed-off-by: Indrajit Bhosale <[email protected]>
@rmccorm4 rmccorm4 changed the title 2/3 Done docs: Add multimodal documentation vllm, sglang, and trtllm backends Nov 21, 2025
@github-actions github-actions bot added the docs label Nov 21, 2025

@rmccorm4 rmccorm4 left a comment


re: https://github.com/ai-dynamo/dynamo/actions/runs/19580969130/job/56078496698?pr=4510

#15 5.473 checking consistency... /workspace/dynamo/docs/backends/sglang/multimodal_sglang_guide.md: WARNING: document isn't included in any toctree [toc.not_included]
#15 5.474 /workspace/dynamo/docs/backends/trtllm/multimodal_trtllm_guide.md: WARNING: document isn't included in any toctree [toc.not_included]
#15 5.474 /workspace/dynamo/docs/backends/vllm/multimodal_vllm_guide.md: WARNING: document isn't included in any toctree [toc.not_included]

Needs these files added to docs/hidden_toctree.rst
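
For reference, a minimal sketch of what such entries could look like, assuming `docs/hidden_toctree.rst` holds a hidden Sphinx `toctree` directive and document paths are relative to `docs/` (the exact layout and surrounding entries may differ):

```rst
.. toctree::
   :hidden:

   backends/vllm/multimodal_vllm_guide
   backends/trtllm/multimodal_trtllm_guide
   backends/sglang/multimodal_sglang_guide
```

Adding the documents to a hidden toctree silences the `toc.not_included` warning without surfacing them in the rendered navigation.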

@rmccorm4

@krishung5 can you help review the docs here? The main point is to clearly document what each backend supports today in relation to multimodality, and to highlight at least one key example or model for each.


DISAGGREGATED (E->P->D):
Client → Frontend → Processor → Encoder [NIXL] → Prefill [bootstrap] → Decode → Response
• 4 components • Vision encoder + KV sharing • Bootstrap coordination


What does the KV sharing here mean? Does it just mean PD disagg?


@krishung5 krishung5 left a comment


Thanks for putting up this doc, great work! One minor comment: we have multimodal docs for all three frameworks, so maybe we can link them in the guide here somehow? i.e. vLLM, TRT-LLM here and here, and SGLang

```
SIMPLE AGGREGATED (agg.sh):
Client → Frontend (Rust) → Worker [image load, encode, P+D] → Response
• 2 components • --modality multimodal • Easiest setup
```


I think `--modality multimodal` on its own could be a bit confusing, as users might not be familiar with the launch scripts. I understand we want to keep the bullet points short, but maybe we can do something like this?

SIMPLE AGGREGATED (agg.sh):
  Client → Frontend (Rust) → Worker [image load, encode, P+D] → Response
  • 2 components • worker flag `--modality multimodal` • Easiest setup


### Launch Script

Example: `examples/backends/trtllm/launch/agg.sh`

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add the actual link here?

| **Frontend → Prefill** | Request with image URL or embedding path | No |
| **Encode → Prefill (Precomputed Embeddings)** | NIXL metadata (pre-computed embeddings) | Yes (Embeddings tensor) |
| **Encode → Prefill (Image URL) (WIP)** | Disaggregated params with multimodal handles | No (Handles via params) |
| **Prefill → Decode** | Disaggregated params | Yes/No (KV cache - UCX or NIXL) |

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick question: for

Yes/No (KV cache - UCX or NIXL)

does this mean Yes (KV cache transfer using NIXL) and No (KV cache transfer using UCX)?


```
SIMPLE AGGREGATED (agg_multimodal.sh):
Client → Frontend (Rust) → Worker [image load, encode, P+D] → Response
```

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to highlight the Rust processor?

Suggested change
Client → Frontend (Rust) → Worker [image load, encode, P+D] → Response
Client → Frontend (Rust processor) → Worker [image load, encode, P+D] → Response

| **Data URL** | `data:image/jpeg;base64,/9j/4AAQ...` | Base64-encoded inline data | ✅ |
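
To illustrate the Data URL row above, here is a minimal Python sketch that base64-encodes image bytes into a `data:` URL and embeds it in an OpenAI-style multimodal chat request body. The model name, the placeholder bytes, and the request shape are illustrative assumptions, not taken from the guide:

```python
import base64
import json

def image_to_data_url(image_bytes: bytes, mime: str = "image/jpeg") -> str:
    """Encode raw image bytes as an inline base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"

# Placeholder bytes; a real request would read an actual image file.
data_url = image_to_data_url(b"\xff\xd8\xff\xe0fake-jpeg-bytes")

# OpenAI-style multimodal chat message (model name is hypothetical).
request_body = json.dumps({
    "model": "Qwen/Qwen2.5-VL-7B-Instruct",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": data_url}},
        ],
    }],
})

print(data_url[:22])  # data:image/jpeg;base64
```

The same `content` list accepts an `http(s)` image URL in place of the data URL, which is how the other rows of the input-format table would be exercised.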


## Aggregated Mode (PD)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we clarify here that it's EPD, not the simple aggregated one? Do we want to add a section for the simple aggregated workflow? One confusion I frequently hear from people is EPD vs. simple/traditional aggregation; aligning on the wording would be very helpful.


### Launch Script

Example: `examples/backends/vllm/launch/disagg_multimodal_llama.sh`

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here; can we add the actual link to the script?
