Fix breaking change in AWQ FusedModules due to Attention Refactor #41909
base: main
Conversation
SunMarc left a comment
Thanks, really appreciate it! Can you add some tests with small models?
Certainly! I will add some tests after implementing the fix.
I modified modeling_llama because autoawq uses the use_cache flag to determine whether to use the KV cache. Therefore, use_cache must be passed down to the decoder layer. I'll add some tests this weekend~
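To make the data flow concrete, here is a minimal, self-contained sketch of the propagation pattern (toy classes, not the actual transformers modules): use_cache has to travel model -> decoder layer -> attention so that a monkey-patched fused attention can inspect it.

```python
import torch
from torch import nn

class ToyAttention(nn.Module):
    def forward(self, hidden_states, use_cache=False, **kwargs):
        # A fused AWQ attention would branch on use_cache here; the stock
        # attention can simply ignore it.
        return hidden_states

class ToyDecoderLayer(nn.Module):
    def __init__(self):
        super().__init__()
        self.self_attn = ToyAttention()

    def forward(self, hidden_states, use_cache=False, **kwargs):
        # The fix amounts to forwarding use_cache here instead of dropping it.
        return self.self_attn(hidden_states, use_cache=use_cache, **kwargs)

layer = ToyDecoderLayer()
out = layer(torch.randn(1, 4, 8), use_cache=True)
print(out.shape)  # torch.Size([1, 4, 8])
```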
Sorry for the late update. I've added a test using a small model.
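For reference, here is a hedged sketch of the kind of regression test this adds (the checkpoint name, prompt, and expected behaviour are placeholders, not the exact values used in the PR): load a small AWQ model with fused modules enabled and check that generate() still produces coherent text.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, AwqConfig

model_id = "<small-awq-llama-checkpoint>"  # placeholder, not the checkpoint used in the test
quant_config = AwqConfig(bits=4, do_fuse=True, fuse_max_seq_len=512)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=quant_config, device_map="cuda"
)

inputs = tokenizer("Hello my name is", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=16, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
# Before the fix, the fused attention lost track of the cache position during
# generate() and the continuation came out garbled; after the fix it should be
# readable text.
```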
SunMarc left a comment
Thanks for this and nice tests! Just a minor comment.
position_embeddings=position_embeddings,
position_ids=position_ids,
past_key_values=past_key_values,
use_cache=use_cache,
I don't mind passing this, but I didn't find where it is used in the decoder layer -> attention layer.
A very good question.
On one hand, when using model.generate, use_cache is set to True, which enables the model to utilize past_key_values. At this point, the logic in autoawq checks whether the forward call originates from generate by inspecting use_cache, and accordingly adjusts the starting position of its precomputed RoPE embeddings. If use_cache is not passed down to the decoder layer and subsequently to the attention module, autoawq cannot determine whether it is inside a generate call. Consequently, it assumes the forward pass is always a regular one (i.e., without any cache), keeping the starting position fixed at 0, which leads to garbled output during inference.
On the other hand, similar to the implementations in Qwen2 and Qwen3, use_cache is indeed passed to the decoder layer and then forwarded to the attention module—but it is not actually used within the attention module itself.
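To illustrate the point above, here is a rough, self-contained sketch (hypothetical names, not the actual autoawq source) of that behaviour: the fused attention keeps a running position into its precomputed RoPE table and only advances it when use_cache signals that it is inside generate().

```python
import torch

class FusedAttentionSketch:
    def __init__(self, max_seq_len=64, head_dim=8):
        # Precomputed RoPE angle table, as a fused implementation might cache it.
        inv_freq = 1.0 / (10000 ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
        t = torch.arange(max_seq_len, dtype=torch.float32)
        self.rope_angles = torch.outer(t, inv_freq)
        self.start_pos = 0  # running position of the module's own KV cache

    def rope_slice(self, seq_len, use_cache):
        # Without use_cache the module cannot tell it is inside generate(), so it
        # would always start at 0 and the output would degrade after the first
        # decoding step.
        start = self.start_pos if use_cache else 0
        angles = self.rope_angles[start : start + seq_len]
        if use_cache:
            self.start_pos += seq_len
        return angles

attn = FusedAttentionSketch()
print(attn.rope_slice(seq_len=4, use_cache=True).shape)  # first generate() step: positions 0-3
print(attn.rope_slice(seq_len=1, use_cache=True))        # next step continues at position 4
```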
Hmmmm I see, thanks for the extensive explanation!
In fact, I believe that as
SunMarc left a comment
Thanks a lot!
Thanks for your concerns! We don't think we will maintain a fork of AutoAWQ, as this is quite a huge task tbh with our current staff. I think the best solution for all quantization methods, not only AutoAWQ, is to put the critical modeling code inside transformers (e.g. Linear4bit, QuantAttentionFused) if they want to make sure that nothing breaks, avoid monkey patching as much as possible, and move the kernels to the kernels-community repo. A good example is the FP8-finegrained method.
Thank you for your reply! I had indeed overlooked the human effort required to maintain such a library. Next, I'll open a PR to migrate, simplify, and integrate the key code from AutoAWQ. I plan to first draft a well-thought-out design proposal and include it in the new PR description so we can discuss it together. Thanks so much!
I've created a new PR #42256, which includes an analysis of AutoAWQ and its kernels. Is there anything else I need to do before this PR can be merged?
you need to do make
[For maintainers] Suggested jobs to run (before merge):
run-slow: apertus, arcee, aria, bitnet, cohere, csm, deepseek_v2, deepseek_v3, diffllama, emu3, ernie4_5, glm, glm4, glm4_moe, helium, hunyuan_v1_dense
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
MekkCyber left a comment
Thanks for fixing this!
What does this PR do?
Fixes #41910
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@SunMarc @MekkCyber