
Conversation

fanqiNO1 commented Oct 28, 2025

What does this PR do?

Fixes #41910

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@SunMarc @MekkCyber

SunMarc (Member) left a comment

Thanks, really appreciate it! Can you add some tests with small models?

fanqiNO1 (Author) commented Nov 5, 2025

Certainly! I will add some tests after implementing AWQRoPE to support models like LLaMA3~

fanqiNO1 (Author) commented Nov 7, 2025

I modified modeling_llama because autoawq uses the use_cache flag to decide whether its KV cache (tracked via start_pos) should be used.

https://github.com/casper-hansen/AutoAWQ/blob/88e4c76b20755db275574e6a03c83c84ba3bece5/awq/modules/fused/attn.py#L231

Therefore, use_cache must be passed down to the decoder layer.
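
To illustrate the shape of the change (a toy sketch only, not the actual diff; the class and attribute names below are made up for illustration):

```python
import torch
from torch import nn


class ToyAttention(nn.Module):
    # Stand-in for a patched/fused attention module (as AutoAWQ installs) that
    # inspects `use_cache` to tell whether it is being called from generate().
    def forward(self, hidden_states, use_cache=False, **kwargs):
        if not use_cache:
            # AutoAWQ-style modules reset their internal start position in this case.
            pass
        return hidden_states


class ToyDecoderLayer(nn.Module):
    def __init__(self):
        super().__init__()
        self.self_attn = ToyAttention()

    def forward(self, hidden_states, use_cache=False, **kwargs):
        # The point of the fix: forward `use_cache` to the attention module
        # instead of dropping it at the decoder-layer boundary.
        return self.self_attn(hidden_states, use_cache=use_cache, **kwargs)


layer = ToyDecoderLayer()
out = layer(torch.randn(1, 4, 8), use_cache=True)
```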

I’ll add some tests this weekend~

fanqiNO1 (Author) commented

Sorry for the late update.

I've added a test using hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 to verify the correctness of the AWQRoPE implementation, and I think this PR is ready for review~
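
For reference, a rough sketch of what such a fused-modules test can look like (the model id is the one mentioned above; the exact test added in the PR may differ, and the config values here are illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, AwqConfig

model_id = "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4"
# do_fuse=True enables AutoAWQ's fused modules (including the fused attention + RoPE path)
quantization_config = AwqConfig(do_fuse=True, fuse_max_seq_len=512)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="cuda:0",
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=16, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
# With use_cache forwarded correctly, the continuation is coherent instead of garbled.
```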

SunMarc (Member) left a comment

Thanks for this and nice tests! Just a minor comment.

position_embeddings=position_embeddings,
position_ids=position_ids,
past_key_values=past_key_values,
use_cache=use_cache,
Member

I don't mind passing this, but I didn't find where it is used in the decoder layer -> attention layer.

Author

A very good question.

On one hand, when using model.generate, use_cache is set to True, which enables the model to utilize past_key_values. At this point, the logic in autoawq checks whether the forward call originates from generate by inspecting use_cache, and accordingly adjusts the starting position of its precomputed RoPE embeddings. If use_cache is not passed down to the decoder layer and subsequently to the attention module, autoawq cannot determine whether it is inside a generate call. Consequently, it assumes the forward pass is always a regular one (i.e., without any cache), keeping the starting position fixed at 0, which leads to garbled output during inference.

autoawq:

https://github.com/casper-hansen/AutoAWQ/blob/88e4c76b20755db275574e6a03c83c84ba3bece5/awq/modules/fused/attn.py#L218-L241
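
Paraphrasing the linked check (a simplified sketch, not the verbatim source; the attribute names below are approximate):

```python
# Simplified paraphrase of AutoAWQ's fused-attention behaviour (names approximate, not verbatim).
def fused_attention_forward(self, hidden_states, **kwargs):
    # AutoAWQ peeks at use_cache to detect whether the call comes from generate()
    hf_is_generating = bool(kwargs.get("use_cache", False))

    if not hf_is_generating:
        # Treated as a plain forward pass: the RoPE/KV-cache start position is reset.
        self.start_pos = 0

    # If use_cache never reaches this module, start_pos therefore stays at 0 on every
    # decoding step and the rotary embeddings are applied at the wrong positions,
    # which is what produces the garbled output described above.
    ...
```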

On the other hand, similar to the implementations in Qwen2 and Qwen3, use_cache is indeed passed to the decoder layer and then forwarded to the attention module—but it is not actually used within the attention module itself.

Member

Hmmmm I see, thanks for the extensive explanation!

fanqiNO1 (Author) commented

In fact, I believe that as Transformers continue to evolve—with an increasing number of models and ever-larger model sizes—the demand for AWQ will inevitably grow. Since AutoAWQ has already been archived, compatibility issues are likely to become more frequent. It might be a good solution for HuggingFace to fork and maintain an AutoAWQ repository. Should such a day come, I would be very willing to help you maintain it.

SunMarc (Member) left a comment

Thanks a lot!

SunMarc (Member) commented Nov 14, 2025

> In fact, I believe that as Transformers continue to evolve—with an increasing number of models and ever-larger model sizes—the demand for AWQ will inevitably grow. Since AutoAWQ has already been archived, compatibility issues are likely to become more frequent. It might be a good solution for HuggingFace to fork and maintain an AutoAWQ repository. Should such a day come, I would be very willing to help you maintain it.

Thanks for your concerns! We don't think we will maintain a fork of AutoAWQ as this is quite a huge task tbh with our current staff. I think the best solution for all quantization methods, and not only AutoAWQ, is to put the critical modeling code inside transformers (e.g. Linear4bit, QuantAttentionFused) if they want to make sure that nothing breaks, avoid monkey patching as much as possible, and move the kernels to the kernels-community repo. A good example is the FP8-finegrained method.
If you are willing to upstream + simplify the AutoAWQ modeling code into transformers, like the QuantAttentionFused class, happy to review the PRs!

fanqiNO1 (Author) commented Nov 15, 2025

> > In fact, I believe that as Transformers continue to evolve—with an increasing number of models and ever-larger model sizes—the demand for AWQ will inevitably grow. Since AutoAWQ has already been archived, compatibility issues are likely to become more frequent. It might be a good solution for HuggingFace to fork and maintain an AutoAWQ repository. Should such a day come, I would be very willing to help you maintain it.
>
> Thanks for your concerns! We don't think we will maintain a fork of AutoAWQ as this is quite a huge task tbh with our current staff. I think the best solution for all quantization methods, and not only AutoAWQ, is to put the critical modeling code inside transformers (e.g. Linear4bit, QuantAttentionFused) if they want to make sure that nothing breaks, avoid monkey patching as much as possible, and move the kernels to the kernels-community repo. A good example is the FP8-finegrained method.

Thank you for your reply!

I indeed overlooked the human effort required to maintain such a library. Next, I'll open a PR to migrate, simplify, and integrate the key code from AutoAWQ into transformers as much as possible, and also move the relevant kernel implementations from AutoAWQ-kernels into kernels-community.

I plan to first draft a well-thought-out design proposal and include it in the new PR description so we can discuss it together. Thanks so much!

fanqiNO1 (Author) commented

I've created a new PR #42256, which includes an analysis of AutoAWQ and its kernels.

Is there anything else I need to do before this PR can be merged?

SunMarc (Member) commented Nov 18, 2025

You need to run make fix-copies to propagate the changes you made in the Llama modeling file.

github-actions (Contributor) commented

[For maintainers] Suggested jobs to run (before merge)

run-slow: apertus, arcee, aria, bitnet, cohere, csm, deepseek_v2, deepseek_v3, diffllama, emu3, ernie4_5, glm, glm4, glm4_moe, helium, hunyuan_v1_dense

SunMarc requested a review from MekkCyber on November 19, 2025 at 13:05
HuggingFaceDocBuilderDev commented

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

MekkCyber (Contributor) left a comment

Thanks for fixing this!



Development

Successfully merging this pull request may close this issue: Breaking change about AWQ Fused modules due to Attention Refactor (#41910)
