Fix breaking change in AWQ FusedModules due to Attention Refactor #41909
base: main
Conversation
SunMarc left a comment
Thanks, really appreciate it! Can you add some tests with small models?
Certainly! I will add some tests after implementing the fix.
I modified modeling_llama because autoawq uses the use_cache flag to determine whether to use the KV cache. Therefore, use_cache must be passed down to the decoder layer. I'll add some tests this weekend~
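To make the data flow concrete, here is a minimal, self-contained sketch of the propagation pattern (toy classes, not the actual transformers modules): use_cache has to travel model -> decoder layer -> attention so that a monkey-patched fused attention can inspect it.

```python
import torch
from torch import nn

class ToyAttention(nn.Module):
    def forward(self, hidden_states, use_cache=False, **kwargs):
        # A fused AWQ attention would branch on use_cache here; the stock
        # attention can simply ignore it.
        return hidden_states

class ToyDecoderLayer(nn.Module):
    def __init__(self):
        super().__init__()
        self.self_attn = ToyAttention()

    def forward(self, hidden_states, use_cache=False, **kwargs):
        # The fix amounts to forwarding use_cache here instead of dropping it.
        return self.self_attn(hidden_states, use_cache=use_cache, **kwargs)

layer = ToyDecoderLayer()
out = layer(torch.randn(1, 4, 8), use_cache=True)
print(out.shape)  # torch.Size([1, 4, 8])
```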
Sorry for the late update. I've added a test using a small model.
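For reference, here is a hedged sketch of the kind of regression test this adds (the checkpoint name, prompt, and expected behaviour are placeholders, not the exact values used in the PR): load a small AWQ model with fused modules enabled and check that generate() still produces coherent text.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, AwqConfig

model_id = "<small-awq-llama-checkpoint>"  # placeholder, not the checkpoint used in the test
quant_config = AwqConfig(bits=4, do_fuse=True, fuse_max_seq_len=512)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=quant_config, device_map="cuda"
)

inputs = tokenizer("Hello my name is", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=16, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
# Before the fix, the fused attention lost track of the cache position during
# generate() and the continuation came out garbled; after the fix it should be
# readable text.
```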
SunMarc left a comment
Thanks for this and nice tests! Just a minor comment.
position_embeddings=position_embeddings,
position_ids=position_ids,
past_key_values=past_key_values,
use_cache=use_cache,
I don't mind passing this, but I didn't find where it is used in the decoder layer -> attention layer.
A very good question.
On one hand, when using model.generate, use_cache is set to True, which enables the model to utilize past_key_values. At this point, the logic in autoawq checks whether the forward call originates from generate by inspecting use_cache, and accordingly adjusts the starting position of its precomputed RoPE embeddings. If use_cache is not passed down to the decoder layer and subsequently to the attention module, autoawq cannot determine whether it is inside a generate call. Consequently, it assumes the forward pass is always a regular one (i.e., without any cache), keeping the starting position fixed at 0, which leads to garbled output during inference.
On the other hand, similar to the implementations in Qwen2 and Qwen3, use_cache is indeed passed to the decoder layer and then forwarded to the attention module—but it is not actually used within the attention module itself.
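To illustrate the point above, here is a rough, self-contained sketch (hypothetical names, not the actual autoawq source) of that behaviour: the fused attention keeps a running position into its precomputed RoPE table and only advances it when use_cache signals that it is inside generate().

```python
import torch

class FusedAttentionSketch:
    def __init__(self, max_seq_len=64, head_dim=8):
        # Precomputed RoPE angle table, as a fused implementation might cache it.
        inv_freq = 1.0 / (10000 ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
        t = torch.arange(max_seq_len, dtype=torch.float32)
        self.rope_angles = torch.outer(t, inv_freq)
        self.start_pos = 0  # running position of the module's own KV cache

    def rope_slice(self, seq_len, use_cache):
        # Without use_cache the module cannot tell it is inside generate(), so it
        # would always start at 0 and the output would degrade after the first
        # decoding step.
        start = self.start_pos if use_cache else 0
        angles = self.rope_angles[start : start + seq_len]
        if use_cache:
            self.start_pos += seq_len
        return angles

attn = FusedAttentionSketch()
print(attn.rope_slice(seq_len=4, use_cache=True).shape)  # first generate() step: positions 0-3
print(attn.rope_slice(seq_len=1, use_cache=True))        # next step continues at position 4
```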
Hmmmm I see, thanks for the extensive explanation!
In fact, I believe that as
SunMarc left a comment
Thanks a lot!
Thanks for your concerns! We don't think we will maintain a fork of AutoAWQ, as this is quite a huge task tbh with our current staff. I think the best solution for all quantization methods, not only AutoAWQ, is to put the critical modeling code inside transformers (e.g. Linear4bit, QuantAttentionFused) if they want to make sure that nothing breaks, avoid monkey patching as much as possible, and move the kernels to the kernels-community repo. A good example is the FP8-finegrained method.
Thank you for your reply! I had indeed overlooked the human effort required to maintain such a library. Next, I'll open a PR to migrate, simplify, and integrate the key code from AutoAWQ. I plan to first draft a well-thought-out design proposal and include it in the new PR description so we can discuss it together. Thanks so much!
I've created a new PR #42256, which includes an analysis of AutoAWQ and its kernels. Is there anything else I need to do before this PR can be merged?
you need to do make
[For maintainers] Suggested jobs to run (before merge):
run-slow: apertus, arcee, aria, bitnet, cohere, csm, deepseek_v2, deepseek_v3, diffllama, emu3, ernie4_5, glm, glm4, glm4_moe, helium, hunyuan_v1_dense
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
MekkCyber left a comment
Thanks for fixing this!
What does this PR do?
Fixes #41910
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@SunMarc @MekkCyber