
Integrated VLM benchmark code for Eagle2 #3698


Open
wants to merge 5 commits into base: main

Conversation


@chohk88 chohk88 commented Jul 21, 2025

Description

The previous pull request (#3652) was closed due to rebase difficulties with the main branch. This new PR resubmits the same changes for the VLM benchmark framework, now cleanly rebased on the latest main branch, and incorporates all feedback from the original review.

  1. Integrated VLM benchmark framework
    • Currently supports Eagle2 and Qwen 2.5-VL
    • Planned support: PaliGemma and other models
  2. Added a custom token-generation function for multi-modal (MM) models

Type of change

Please delete options that are not relevant and/or add your own.

  • New feature (non-breaking change which adds functionality)

Checklist:

  • My code follows the style guidelines of this project (You can use the linters)
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas and hacks
  • I have made corresponding changes to the documentation
  • I have added tests to verify my fix or my feature
  • New and existing unit tests pass locally with my changes
  • I have added the relevant labels to my PR so that relevant reviewers are notified

@chohk88 chohk88 requested review from peri044 and zewenli98 July 21, 2025 16:27
@chohk88 chohk88 self-assigned this Jul 21, 2025
@chohk88 chohk88 added the component: conversion and component: dynamo labels Jul 21, 2025
@meta-cla meta-cla bot added the cla signed label Jul 21, 2025
@github-actions github-actions bot removed the component: conversion and component: dynamo labels Jul 21, 2025

peri044 commented Aug 6, 2025

Qwen model: the command I used:
python run_vlm.py

Error:

File "/work/TensorRT/tools/llm/run_vlm.py", line 448, in <module>
    inputs = load_inputs(args, processor, DEVICE)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/work/TensorRT/tools/llm/run_vlm.py", line 188, in load_inputs
    from qwen_vl_utils import process_vision_info
ModuleNotFoundError: No module named 'qwen_vl_utils'
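
This looks like a missing optional dependency. As a sketch (assuming the helper ships as the qwen-vl-utils package on PyPI), the import could be guarded so the failure is more actionable:

# Hedged sketch: guard the optional dependency with an actionable error message.
try:
    from qwen_vl_utils import process_vision_info
except ImportError as exc:  # only needed for the Qwen 2.5-VL path
    raise ImportError(
        "The Qwen 2.5-VL benchmark requires the optional 'qwen-vl-utils' "
        "package (e.g. pip install qwen-vl-utils)."
    ) from exc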


peri044 commented Aug 6, 2025

When I tried the Eagle2 model, it shows:

Traceback (most recent call last):
  File "/work/TensorRT/tools/llm/run_vlm.py", line 443, in <module>
    model, processor, emb_layer = load_model(args.model, DEVICE, dtype)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/work/TensorRT/tools/llm/run_vlm.py", line 141, in load_model
    return _load_eagle2(device, torch_dtype)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/work/TensorRT/tools/llm/run_vlm.py", line 101, in _load_eagle2
    AutoModel.from_pretrained(
  File "/root/.pyenv/versions/3.11.13/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
    return model_class.from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.13/lib/python3.11/site-packages/transformers/modeling_utils.py", line 279, in _wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.13/lib/python3.11/site-packages/transformers/modeling_utils.py", line 4336, in from_pretrained
    config = cls._autoset_attn_implementation(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.13/lib/python3.11/site-packages/transformers/modeling_utils.py", line 2109, in _autoset_attn_implementation
    cls._check_and_enable_flash_attn_2(
  File "/root/.pyenv/versions/3.11.13/lib/python3.11/site-packages/transformers/modeling_utils.py", line 2252, in _check_and_enable_flash_attn_2
    raise ImportError(f"{preface} the package flash_attn seems to be not installed. {install_message}")
ImportError: FlashAttention2 has been toggled on, but it cannot be used due to the following error: the package flash_attn seems to be not installed. Please refer to the documentation of https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-2 to install Flash Attention 2.

@peri044 peri044 left a comment

Please update docs and add these models to the list of supported models.

# This patch is global for the script's execution context.
import transformers.models.qwen2.modeling_qwen2 as mq

mq.ALL_ATTENTION_FUNCTIONS["flash_attention_2"] = mq.ALL_ATTENTION_FUNCTIONS["sdpa"]

Did you try this instead? Do you think the following will work?

model.config._attn_implementation = "sdpa"
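
As a point of comparison, a minimal sketch of requesting SDPA at load time (assuming the Eagle2 checkpoint accepts the standard transformers attn_implementation override), which would avoid the global patch:

# Hedged sketch: ask for SDPA when loading instead of patching the attention registry.
model = AutoModel.from_pretrained(
    model_id,                    # hypothetical variable holding the Eagle2 checkpoint name
    torch_dtype=torch_dtype,     # hypothetical dtype from the surrounding loader
    attn_implementation="sdpa",  # sidesteps the flash_attn requirement
)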

Comment on lines +157 to +158
url = "https://www.ilankelman.org/stopsigns/australia.jpg"
image = Image.open(requests.get(url, stream=True).raw)

This can be the default, but can you also add an image_path argument so a user can provide a path to an image on their local system?
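
A minimal sketch of what that could look like (flag name and wiring are illustrative):

# Hedged sketch: optional --image_path flag that falls back to the default sample URL.
parser.add_argument(
    "--image_path",
    type=str,
    default=None,
    help="Path to a local image; the default sample URL is used when unset.",
)

# Later, where the image is loaded (e.g. in load_inputs):
if args.image_path:
    image = Image.open(args.image_path)
else:
    url = "https://www.ilankelman.org/stopsigns/australia.jpg"
    image = Image.open(requests.get(url, stream=True).raw)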

]

# --- Model-specific vision processing ---
if "qwen" in args.model.lower():

Minor comment: consider matching the model name exactly here, since there can be multiple variants with similar naming (e.g. qwen2, qwen3, etc.).
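
For illustration, a sketch of exact matching (the identifier set is a placeholder):

# Hedged sketch: dispatch on the exact model identifier instead of a substring.
QWEN_2_5_VL_MODELS = {"Qwen/Qwen2.5-VL-3B-Instruct"}  # placeholder set of supported IDs

# --- Model-specific vision processing ---
if args.model in QWEN_2_5_VL_MODELS:
    ...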

max_seq_len = input_embeds.shape[1] + args.num_tokens

seq_len = torch.export.Dim("seq", min=1, max=max_seq_len)
position_ids = torch.arange(input_embeds.shape[1]).unsqueeze(0).to(DEVICE)

Let's make the device an argument as well.
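
A sketch of how that could be threaded through (assumes DEVICE is currently a module-level constant and that parser/args come from the script's existing argument parser):

# Hedged sketch: expose the device as a CLI flag (flag name is illustrative).
parser.add_argument(
    "--device",
    type=str,
    default="cuda:0",
    help="Device used for model loading, export, and benchmarking.",
)
DEVICE = torch.device(args.device)
position_ids = torch.arange(input_embeds.shape[1]).unsqueeze(0).to(DEVICE)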

Comment on lines +267 to +270
disable_tf32=True,
use_python_runtime=True,
debug=args.debug,
offload_module_to_cpu=True,

Could you please make the other arguments (disable_tf32, use_python_runtime, offload_module_to_cpu) configurable as well?
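
One possible shape for this, as a sketch (defaults mirror the current hard-coded values; flag names are illustrative):

# Hedged sketch: boolean CLI flags with automatic --no-* counterparts (Python 3.9+).
import argparse

parser.add_argument("--disable_tf32", action=argparse.BooleanOptionalAction, default=True)
parser.add_argument("--use_python_runtime", action=argparse.BooleanOptionalAction, default=True)
parser.add_argument("--offload_module_to_cpu", action=argparse.BooleanOptionalAction, default=True)

# Then forward them to the compile call:
#   disable_tf32=args.disable_tf32,
#   use_python_runtime=args.use_python_runtime,
#   offload_module_to_cpu=args.offload_module_to_cpu,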

Comment on lines +712 to +732
image_embeds = None
if pixel_values is not None:
    image_embeds = model.visual(pixel_values, image_grid_thw)

# 2. Create initial sequence embeddings
seq_tokens = input_ids.clone()
seq_embeds = emb_layer(seq_tokens)

# 3. Insert image embeddings at image token positions
if image_embeds is not None:
    mask = seq_tokens == model.config.image_token_id
    num_image_tokens = mask.sum().item()
    if num_image_tokens != image_embeds.shape[0]:
        raise ValueError(
            f"Number of image tokens ({num_image_tokens}) does not match number of image embeddings ({image_embeds.shape[0]})."
        )
    mask_expanded = mask.unsqueeze(-1).expand_as(seq_embeds)
    seq_embeds = seq_embeds.masked_scatter(
        mask_expanded, image_embeds.to(seq_embeds.dtype)
    )


Please add a similar section for Qwen2 describing which parts of the graph are optimized and which are not.

hidden_states, kv_cache = outputs_and_kv[0], outputs_and_kv[1:]

# Use logit_pos to get the correct logit based on whether we padded or not.
logits = model.lm_head(hidden_states[:, -1, :])

Do we not optimize lm_head?

Comment on lines +863 to +869
def generate_mm_paligemma(
    model,
    pixel_values: torch.Tensor | None,
    input_ids: torch.Tensor,
    max_output_seq_length: int,
    eos_token_id: int,
    emb_layer: torch.nn.Embedding,

Can you add a docstring to this function? Also mention in the docstring that PaliGemma is currently under development if you want to keep this function here.
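
A possible starting point, as a sketch (wording is only a suggestion; the signature is taken from the diff):

def generate_mm_paligemma(
    model,
    pixel_values: torch.Tensor | None,
    input_ids: torch.Tensor,
    max_output_seq_length: int,
    eos_token_id: int,
    emb_layer: torch.nn.Embedding,
    device: str = "cuda:0",
) -> torch.LongTensor:
    """Custom token generation for PaliGemma multi-modal inputs.

    Note: PaliGemma support is currently under development; this code path is
    experimental and not yet part of the supported model list.
    """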

    emb_layer: torch.nn.Embedding,
    device: str = "cuda:0",
) -> torch.LongTensor:
    vit_embeds = None

Similar comment as above: can you add a docstring to this function? Also mention in the docstring that PaliGemma is currently under development if you want to keep this function here.



@torch.inference_mode()
def generate_mm_qwen2_5_vl_with_timing(

Can we reuse the code from generate_mm_qwen2_5_vl here?
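
For illustration, one way the timing variant could wrap the plain path instead of duplicating it (sketch only; assumes both functions share the same signature):

# Hedged sketch: reuse the plain generation path and add only CUDA-event timing.
@torch.inference_mode()
def generate_mm_qwen2_5_vl_with_timing(*args, **kwargs):
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    output = generate_mm_qwen2_5_vl(*args, **kwargs)
    end.record()
    torch.cuda.synchronize()
    return output, start.elapsed_time(end)  # elapsed time in milliseconds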
