integrated vlm code for benchmark for Eagle2 #3698
base: main
Conversation
Qwen model: the command I used fails with the following error:
File "/work/TensorRT/tools/llm/run_vlm.py", line 448, in <module>
inputs = load_inputs(args, processor, DEVICE)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/work/TensorRT/tools/llm/run_vlm.py", line 188, in load_inputs
from qwen_vl_utils import process_vision_info
ModuleNotFoundError: No module named 'qwen_vl_utils'
When I tried the Eagle2 model, it shows:
Traceback (most recent call last):
File "/work/TensorRT/tools/llm/run_vlm.py", line 443, in <module>
model, processor, emb_layer = load_model(args.model, DEVICE, dtype)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/work/TensorRT/tools/llm/run_vlm.py", line 141, in load_model
return _load_eagle2(device, torch_dtype)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/work/TensorRT/tools/llm/run_vlm.py", line 101, in _load_eagle2
AutoModel.from_pretrained(
File "/root/.pyenv/versions/3.11.13/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
return model_class.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.13/lib/python3.11/site-packages/transformers/modeling_utils.py", line 279, in _wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.13/lib/python3.11/site-packages/transformers/modeling_utils.py", line 4336, in from_pretrained
config = cls._autoset_attn_implementation(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.13/lib/python3.11/site-packages/transformers/modeling_utils.py", line 2109, in _autoset_attn_implementation
cls._check_and_enable_flash_attn_2(
File "/root/.pyenv/versions/3.11.13/lib/python3.11/site-packages/transformers/modeling_utils.py", line 2252, in _check_and_enable_flash_attn_2
raise ImportError(f"{preface} the package flash_attn seems to be not installed. {install_message}")
ImportError: FlashAttention2 has been toggled on, but it cannot be used due to the following error: the package flash_attn seems to be not installed. Please refer to the documentation of https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-2 to install Flash Attention 2.
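A minimal sketch of how the optional import could be guarded so the failure above is actionable; the pip package name qwen-vl-utils is an assumption inferred from the module name in the traceback:

try:
    from qwen_vl_utils import process_vision_info
except ImportError as exc:  # dependency missing in the benchmark environment
    raise ImportError(
        "Qwen2.5-VL benchmarking requires the optional 'qwen_vl_utils' package; "
        "install it with `pip install qwen-vl-utils` (assumed PyPI name)."
    ) from exc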
Please update docs and add these models to the list of supported models.
# This patch is global for the script's execution context.
import transformers.models.qwen2.modeling_qwen2 as mq

mq.ALL_ATTENTION_FUNCTIONS["flash_attention_2"] = mq.ALL_ATTENTION_FUNCTIONS["sdpa"]
Did you try this instead? Do you think the following will work?
model.config._attn_implementation = "sdpa"
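For illustration, a hedged sketch of the suggestion above: request SDPA when loading instead of patching the attention registry afterwards (the checkpoint id below is only a placeholder):

from transformers import AutoModel

model = AutoModel.from_pretrained(
    "nvidia/Eagle2-2B",          # placeholder checkpoint id
    torch_dtype="auto",
    attn_implementation="sdpa",  # avoid the flash_attention_2 / flash_attn import path
    trust_remote_code=True,
)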
url = "https://www.ilankelman.org/stopsigns/australia.jpg"
image = Image.open(requests.get(url, stream=True).raw)
This can be the default, but can you also add an argument image_path where a user can provide a path to an image on their local system?
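A hypothetical sketch of such an argument, keeping the URL as the fallback (the argument name image_path is the reviewer's suggestion; the wiring is an assumption):

import argparse

import requests
from PIL import Image

parser = argparse.ArgumentParser()
parser.add_argument(
    "--image_path",
    type=str,
    default=None,
    help="Optional path to a local image; defaults to the sample URL below.",
)
args = parser.parse_args()

if args.image_path is not None:
    image = Image.open(args.image_path)
else:
    url = "https://www.ilankelman.org/stopsigns/australia.jpg"
    image = Image.open(requests.get(url, stream=True).raw)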
]

# --- Model-specific vision processing ---
if "qwen" in args.model.lower():
Minor comment: consider matching the model name exactly here, since there can be multiple variants with similar naming (e.g. qwen2, qwen3, etc.).
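One possible shape for that, sketched under the assumption that the script dispatches on args.model (the branch names below are illustrative):

def select_vision_branch(model_name: str) -> str:
    """Map an exact, normalized model name to its vision-processing branch."""
    normalized = model_name.lower()
    if normalized in ("qwen2-vl", "qwen2.5-vl"):
        return "qwen_vl"
    if normalized == "eagle2":
        return "eagle2"
    raise ValueError(f"Unsupported model: {model_name}")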
max_seq_len = input_embeds.shape[1] + args.num_tokens

seq_len = torch.export.Dim("seq", min=1, max=max_seq_len)
position_ids = torch.arange(input_embeds.shape[1]).unsqueeze(0).to(DEVICE)
let's make the device an argument as well.
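A hypothetical sketch, assuming the script's argparse setup, of replacing the hard-coded DEVICE constant with a CLI argument:

import argparse

import torch

parser = argparse.ArgumentParser()
parser.add_argument(
    "--device",
    type=str,
    default="cuda:0",
    help="Device used for inputs, embeddings, and generation (e.g. cuda:0, cpu).",
)
args = parser.parse_args()
DEVICE = torch.device(args.device)
# Downstream tensors would then be created with .to(DEVICE) as in the snippet above.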
disable_tf32=True,
use_python_runtime=True,
debug=args.debug,
offload_module_to_cpu=True,
Could you please make the other arguments (disable_tf32, use_python_runtime, offload_module_to_cpu) configurable as well?
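A sketch of how these could be exposed, assuming the script's existing parser; BooleanOptionalAction gives --flag/--no-flag pairs so the current defaults stay True:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--disable_tf32", action=argparse.BooleanOptionalAction, default=True)
parser.add_argument("--use_python_runtime", action=argparse.BooleanOptionalAction, default=True)
parser.add_argument("--offload_module_to_cpu", action=argparse.BooleanOptionalAction, default=True)
args = parser.parse_args()

# Collected here and passed into the existing compile call alongside debug=args.debug.
compile_kwargs = dict(
    disable_tf32=args.disable_tf32,
    use_python_runtime=args.use_python_runtime,
    offload_module_to_cpu=args.offload_module_to_cpu,
)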
image_embeds = None
if pixel_values is not None:
    image_embeds = model.visual(pixel_values, image_grid_thw)

# 2. Create initial sequence embeddings
seq_tokens = input_ids.clone()
seq_embeds = emb_layer(seq_tokens)

# 3. Insert image embeddings at image token positions
if image_embeds is not None:
    mask = seq_tokens == model.config.image_token_id
    num_image_tokens = mask.sum().item()
    if num_image_tokens != image_embeds.shape[0]:
        raise ValueError(
            f"Number of image tokens ({num_image_tokens}) does not match number of image embeddings ({image_embeds.shape[0]})."
        )
    mask_expanded = mask.unsqueeze(-1).expand_as(seq_embeds)
    seq_embeds = seq_embeds.masked_scatter(
        mask_expanded, image_embeds.to(seq_embeds.dtype)
    )
Please add a similar section for Qwen2 describing which parts of the graph are optimized and which are not.
hidden_states, kv_cache = outputs_and_kv[0], outputs_and_kv[1:]

# Use logit_pos to get the correct logit based on whether we padded or not.
logits = model.lm_head(hidden_states[:, -1, :])
Do we not optimize lm_head?
def generate_mm_paligemma(
    model,
    pixel_values: torch.Tensor | None,
    input_ids: torch.Tensor,
    max_output_seq_length: int,
    eos_token_id: int,
    emb_layer: torch.nn.Embedding,
Can you add a docstring to this function? Also mention in the docstring that PaliGemma is currently under development if you want to keep this function here.
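An illustrative docstring sketch for the function above; the wording is an assumption, not the author's text:

def generate_mm_paligemma(model, pixel_values, input_ids, max_output_seq_length,
                          eos_token_id, emb_layer, device="cuda:0"):
    """Greedy multimodal generation loop for PaliGemma.

    Note: PaliGemma support is currently under development; this path is
    experimental and not yet part of the benchmarked model set.
    """
    ...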
    emb_layer: torch.nn.Embedding,
    device: str = "cuda:0",
) -> torch.LongTensor:
    vit_embeds = None
Similar comment as above: can you add a docstring to this function? Also mention in the docstring that PaliGemma is currently under development if you want to keep this function here.
@torch.inference_mode()
def generate_mm_qwen2_5_vl_with_timing(
Can we reuse the code from generate_mm_qwen2_5_vl here?
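A hedged sketch of that reuse, assuming generate_mm_qwen2_5_vl exists in run_vlm.py with the same call signature; timing is measured with CUDA events around the untimed path:

import torch

@torch.inference_mode()
def generate_mm_qwen2_5_vl_with_timing(*args, **kwargs):
    """Wrap the untimed generation path with CUDA-event timing."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    start.record()
    output_ids = generate_mm_qwen2_5_vl(*args, **kwargs)  # reuse, don't duplicate the decode loop
    end.record()
    torch.cuda.synchronize()

    return output_ids, start.elapsed_time(end)  # latency in milliseconds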
Description
Closing the previous pull request (#3652) due to rebase difficulties with the main branch. This new PR resubmits the same changes for the VLM benchmark framework—now cleanly rebased on the latest main branch—and incorporates all feedback from the original review.
Type of change
Please delete options that are not relevant and/or add your own.
Checklist: