
Conversation


@yzy1996 yzy1996 commented Nov 16, 2025

What does this PR do?

This PR introduces inference support for the openpangu_moe model. By integrating the inference implementation directly into the Transformers library, we aim to ensure seamless compatibility and enable continuous updates alongside future open-source releases.

I welcome any feedback on potential improvements or additional changes that would enhance this implementation. Please feel free to share your suggestions!

Our open-source model is available at: https://ai.gitcode.com/ascend-tribe/openPangu-Ultra-MoE-718B-V1.1
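
A minimal usage sketch with the standard Auto classes (the checkpoint id below is a placeholder until converted weights are published on the Hugging Face Hub; the prompt and generation arguments are illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo id; substitute the actual Hub repository once published.
model_id = "ascend-tribe/openPangu-Ultra-MoE-718B-V1.1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

inputs = tokenizer("Give me a short introduction to large language models.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```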

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@github-actions (Contributor)

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, openpangu_moe

@vasqu vasqu (Contributor) left a comment


Some initial review:

  • I think we can use modular to a higher extent - MoE, attention (DeepSeek V3), RoPE (Llama), ...
  • It would be nice if we could adopt more of this via modular.
  • At the moment, a lot of legacy utilities are used, like _prepare_4d_causal_attention_mask or legacy caches, which we should avoid.

^ those are the biggest points IMO. Will this also be released on https://huggingface.co/ ?
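
To make the modular point concrete, here is a rough sketch of what a modular_openpangu_moe.py could reuse if the components match (the parent classes and the name pairing are assumptions to verify against the actual architecture):

```python
# modular_openpangu_moe.py (illustrative sketch only; parent classes are assumptions)
from transformers.models.deepseek_v3.modeling_deepseek_v3 import (
    DeepseekV3Attention,
    DeepseekV3MoE,
)
from transformers.models.llama.modeling_llama import LlamaRMSNorm, LlamaRotaryEmbedding


class OpenPanguMoERMSNorm(LlamaRMSNorm):
    pass


class OpenPanguMoERotaryEmbedding(LlamaRotaryEmbedding):
    pass


class OpenPanguMoEAttention(DeepseekV3Attention):
    pass


class OpenPanguMoEBlock(DeepseekV3MoE):
    pass
```

The modular converter then generates the full modeling file from this, so only the pieces that genuinely differ need custom code.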

Comment on lines +25 to +41
The OpenPanguMoE model was proposed in [<INSERT PAPER NAME HERE>](<INSERT PAPER LINK HERE>) by <INSERT AUTHORS HERE>.
<INSERT SHORT SUMMARY HERE>

The abstract from the paper is the following:

<INSERT PAPER ABSTRACT HERE>

Tips:

<INSERT TIPS ABOUT MODEL HERE>

This model was contributed by [INSERT YOUR HF USERNAME HERE](https://huggingface.co/<INSERT YOUR HF USERNAME HERE>).
The original code can be found [here](<INSERT LINK TO GITHUB REPO HERE>).

## Usage examples

<INSERT SOME NICE EXAMPLES HERE>

Don't forget the docs :D


No associated tokenizer? Would go to tokenization_auto
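
If the checkpoint ships a standard tokenizer, it is a one-line entry in TOKENIZER_MAPPING_NAMES in tokenization_auto.py. A sketch, assuming a tokenizers-based fast tokenizer (the actual classes depend on what the checkpoint provides):

```python
# src/transformers/models/auto/tokenization_auto.py, inside TOKENIZER_MAPPING_NAMES
# (tokenizer classes below are an assumption)
("openpangu_moe", (None, "PreTrainedTokenizerFast" if is_tokenizers_available() else None)),
```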

Comment on lines +1 to +2
# coding=utf-8
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All rights reserved.

Please use the full licence, elsewhere too

Suggested change (replace the bare two-line header with the full Apache-2.0 license header):
# coding=utf-8
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from ...configuration_utils import PreTrainedConfig

class OpenPanguMoEConfig(PreTrainedConfig):


Docstrings! You can also use modular for this if you find models similar enough
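
A minimal skeleton of the expected docstring shape, with placeholder arguments and defaults (the real names and values come from this PR's config):

```python
class OpenPanguMoEConfig(PreTrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`OpenPanguMoEModel`].
    It is used to instantiate an OpenPangu MoE model according to the specified arguments,
    defining the model architecture.

    Args:
        vocab_size (`int`, *optional*, defaults to 32000):
            Vocabulary size of the model. (Placeholder default.)
        hidden_size (`int`, *optional*, defaults to 4096):
            Dimension of the hidden representations. (Placeholder default.)
    """

    model_type = "openpangu_moe"
```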

Comment on lines +26 to +30
attention_kv_lora_dim=512,
attention_q_lora_dim=1536,
attention_qk_rope_dim=64,
attention_v_dim=128,
attention_qk_dim=128,

MLA from deepseek v3? Let's align names with things that exist elsewhere
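
For concreteness, a likely mapping onto the existing DeepseekV3Config names (the left-hand names exist in Transformers; pairing them with the PR's arguments is my assumption to double-check):

```python
# Proposed renames, values copied from the lines above:
kv_lora_rank=512,        # was attention_kv_lora_dim
q_lora_rank=1536,        # was attention_q_lora_dim
qk_rope_head_dim=64,     # was attention_qk_rope_dim
v_head_dim=128,          # was attention_v_dim
qk_nope_head_dim=128,    # was attention_qk_dim (assuming this is the no-RoPE q/k dim)
```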

hidden_states = self.post_mlp_layernorm(hidden_states)
hidden_states = residual + hidden_states

return (hidden_states, present_key_value)

Suggested change
return (hidden_states, present_key_value)
return hidden_states

The cache is modified in place, so returning it from the layer is not necessary.
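
For reference, the in-place pattern used across current models looks roughly like this inside the attention forward (a sketch; variable names are assumed to match the PR):

```python
# past_key_values is a Cache object; update() stores the new key/value states for
# this layer in place and returns the full cached states, so nothing needs to be
# handed back up through the decoder layer.
if past_key_values is not None:
    key_states, value_states = past_key_values.update(
        key_states, value_states, self.layer_idx, {"cache_position": cache_position}
    )
```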

attn_output = attn_output.transpose(1, 2).contiguous().view(bsz, q_len, -1)
attn_output = self.o_proj(attn_output)

return attn_output, past_key_value

Suggested change
return attn_output, past_key_value
return attn_output, attn_weights

Same as the comment before + we have a feature for outputting the weights
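
Roughly what the tail of the attention forward looks like in current models (a sketch following the Llama pattern; exact shapes and kwargs need to be adapted to the MLA code here):

```python
# The attention interface (eager/sdpa/flash) returns both the output and the weights,
# so the weights can be surfaced for output_attentions.
attention_interface = eager_attention_forward  # defined in the modeling file
if self.config._attn_implementation != "eager":
    attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]

attn_output, attn_weights = attention_interface(
    self,
    query_states,
    key_states,
    value_states,
    attention_mask,
    dropout=0.0 if not self.training else self.attention_dropout,
    scaling=self.scaling,
    **kwargs,
)
attn_output = attn_output.reshape(bsz, q_len, -1).contiguous()
attn_output = self.o_proj(attn_output)
return attn_output, attn_weights
```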

Comment on lines +543 to +544
if use_cache and use_legacy_cache:
present_key_value = present_key_value.to_legacy_cache()

No more legacy cache
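
i.e. the model just instantiates a Cache object when none is passed; a sketch of the current pattern:

```python
from transformers.cache_utils import DynamicCache

# Modern cache handling: create a DynamicCache once and let the layers update it
# in place; no to_legacy_cache()/from_legacy_cache() round trips.
if use_cache and past_key_values is None:
    past_key_values = DynamicCache()
```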

if module.padding_idx is not None:
module.weight.data[module.padding_idx].zero_()

class OpenPanguMoEModel(OpenPanguMoEPreTrainedModel):

This looks fairly standard, so we should be able to inherit from other models here, e.g. Llama.
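
A sketch of what that could look like in the modular file, assuming the model body really does match Llama's standard decoder loop:

```python
# modular_openpangu_moe.py (sketch; assumes the Llama-style model loop fits)
from transformers.models.llama.modeling_llama import LlamaModel, LlamaPreTrainedModel


class OpenPanguMoEPreTrainedModel(LlamaPreTrainedModel):
    pass


class OpenPanguMoEModel(LlamaModel):
    pass
```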

Comment on lines +525 to +530
attention_mask = _prepare_4d_causal_attention_mask(
attention_mask,
(batch_size, seq_length),
hidden_states,
past_key_values_length,
)

We have a new API for this, create_causal_mask
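
A sketch of the newer call (argument names follow how it is used in current modeling files; double-check against e.g. modeling_llama.py):

```python
from transformers.masking_utils import create_causal_mask

# Builds the 4D causal mask from the 2D attention_mask, the cache state, and the
# configured attention implementation.
causal_mask = create_causal_mask(
    config=self.config,
    input_embeds=hidden_states,
    attention_mask=attention_mask,
    cache_position=cache_position,
    past_key_values=past_key_values,
)
```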
