openPangu-Ultra-MoE-718B
English | 中文
1. Introduction
The openPangu-Ultra-MoE-718B is a large-scale mixture-of-experts (MoE) language model trained from scratch on Ascend NPUs, with 718B total parameters and 39B parameters activated per token. It was trained on approximately 19 trillion tokens and can switch between fast-thinking and slow-thinking modes.
2. Model Architecture
The architecture of the openPangu-Ultra-MoE-718B adopts the mainstream Multi-head Latent Attention (MLA), Multi-Token Prediction (MTP), and high MoE sparsity, and introduces several distinctive designs:
Depth-Scaled Sandwich-Norm and TinyInit: These techniques adjust the layer normalization structure and parameter initialization for improved training stability.
EP-Group load balancing loss: This technique optimizes the load balancing loss, achieving better expert specialization.
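To make the Sandwich-Norm idea concrete, here is a minimal toy sketch in numpy: normalization is applied both to a sublayer's input and to its output before the residual addition. The parameter-free RMSNorm and the `alpha` scale are illustrative stand-ins only; the actual depth-scaled formulation and initialization are defined by the model's implementation, not by this sketch.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # Parameter-free RMSNorm, for illustration only
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def sandwich_sublayer(x, sublayer, alpha=1.0):
    # Sandwich-Norm: normalize both the sublayer input and its output
    # before the residual addition. `alpha` is a hypothetical stand-in
    # for a depth-dependent scale (the "depth-scaled" part).
    return x + alpha * rms_norm(sublayer(rms_norm(x)))

x = np.random.randn(2, 8)
y = sandwich_sublayer(x, lambda h: 2.0 * h)
```

Keeping the residual path free of normalization while bounding both sublayer input and output is what gives sandwich-style schemes their training-stability benefit.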
3. Results
| Benchmark | Metric | Slow-thinking |
|---|---|---|
| **General** | | |
| C-Eval | Acc | 91.06 |
| CLUEWSC | Acc | 94.67 |
| MMLU-Pro | Exact Match | 82.40 |
| ArenaHard_v0.1 | w/o Style Control | 96.80 |
| GPQA-Diamond | Avg@4 | 76.77 |
| SuperGPQA | Acc | 61.67 |
| IF-Eval | Prompt Strict | 80.59 |
| SysBench | Constraint Satisfaction Rate | 91.43 |
| **Math** | | |
| CNMO 2024 | Avg@32 | 80.73 |
| AIME25 | Avg@16 | 75.21 |
| AIME24 | Avg@16 | 80.21 |
| MATH-500 | Avg@1 | 97.40 |
| **Coding** | | |
| LiveCodeBench | Avg@3 (01/25~05/25) | 61.14 |
| MBPP+ | Avg@2 | 81.48 |
Note: The system prompt was left empty during evaluation.
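The Avg@k metrics in the table average accuracy over k sampled generations per problem. A minimal sketch of that aggregation (illustrative only, not the official evaluation harness):

```python
def avg_at_k(correct):
    # `correct` is a list of problems, each a list of k 0/1 scores,
    # one per sampled generation. Avg@k is the mean accuracy over
    # all problems and all k samples, reported as a percentage.
    per_problem = [sum(scores) / len(scores) for scores in correct]
    return 100.0 * sum(per_problem) / len(per_problem)

# Two problems, 4 samples each: 3/4 and 2/4 correct
score = avg_at_k([[1, 1, 1, 0], [1, 0, 1, 0]])  # 62.5
```

Averaging over multiple samples reduces the variance of single-generation scores on small benchmarks such as AIME and CNMO.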
4. Deployment
4.1 Environment
Hardware Requirements
Atlas 800T A2 (64 GB, ≥ 32 NPUs). Please refer to [Atlas 800T A2] for the driver and firmware installation packages.
System Requirements & Dependencies
Method 1: Install the following supporting software in a bare-metal environment.
- System: Linux (openEuler ≥ 24.03 recommended)
- CANN==8.1.RC1, please refer to [CANN Install] for installation
- python==3.10
- torch==2.1.0
- torch-npu==2.1.0.post12
- transformers>=4.48.2
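The pinned versions above can be sanity-checked from Python; the sketch below reads installed versions via `importlib.metadata` and compares only the numeric parts (so `2.1.0.post12` is treated as `2.1.0`). The minimum-version pairs mirror the list above.

```python
from importlib.metadata import PackageNotFoundError, version

def numeric_version(v):
    # "4.48.2" -> (4, 48, 2); non-numeric suffixes like "post12" are ignored
    return tuple(int(p) for p in v.split(".") if p.isdigit())

def check(pkg, minimum):
    try:
        installed = version(pkg)
    except PackageNotFoundError:
        return f"{pkg}: not installed"
    ok = numeric_version(installed) >= numeric_version(minimum)
    return f"{pkg}: {installed} ({'ok' if ok else 'too old'})"

for pkg, minimum in [("torch", "2.1.0"), ("torch-npu", "2.1.0"),
                     ("transformers", "4.48.2")]:
    print(check(pkg, minimum))
```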
Method 2: Start a container from a Docker image.
Refer to the [Docker User Guide].
The software environment above has been verified; newer versions should work in principle but are untested. For any questions, please submit an issue.
4.2 Integrity Check
Please verify the integrity of the downloaded content as follows. The expected hash values are stored in the checklist.chk file.
```bash
#!/usr/bin/env bash
ARCH=$(uname -m)
MODEL_PATH="${TARGET_FOLDER}/${MODEL_FOLDER_PATH}"
cd "$MODEL_PATH" || exit 1
if [ "$ARCH" = "arm64" ]; then
    # macOS on arm64 ships shasum rather than sha256sum
    shasum -a 256 -c checklist.chk
else
    sha256sum -c checklist.chk
fi
```
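On systems where neither `sha256sum` nor `shasum` is available, the same check can be done in Python. This sketch assumes checklist.chk uses the standard `<sha256>  <filename>` line format produced by `sha256sum`; adjust the parsing if the real file differs.

```python
import hashlib
import os

def sha256_of(path, chunk_size=1 << 20):
    # Stream the file in chunks so large weight shards are not
    # loaded into memory at once
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()

def verify_checklist(chk_path):
    # Each line: "<sha256>  <filename>", paths relative to the checklist
    base = os.path.dirname(os.path.abspath(chk_path))
    failures = []
    with open(chk_path) as f:
        for line in f:
            if not line.strip():
                continue
            expected, name = line.split(maxsplit=1)
            actual = sha256_of(os.path.join(base, name.strip()))
            if actual != expected.lower():
                failures.append(name.strip())
    return failures  # empty list means every file verified
```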
4.3 Model Weights Conversion
This inference example of the openPangu-Ultra-MoE-718B adopts a tensor-parallel strategy with fused operators on Ascend NPUs, which requires pre-sharding the model weights. The following shards the weights for 32-NPU parallel inference and saves the split weights in the model/ directory.
```bash
cd inference
bash split_weight.sh
```
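Conceptually, pre-sharding splits each weight matrix along one dimension into one shard per NPU. The numpy sketch below illustrates that idea only; it is not the actual logic of split_weight.sh, and the shapes are hypothetical.

```python
import numpy as np

def shard_weight(w, tp_size, dim):
    # Tensor parallelism: split a weight evenly along `dim`,
    # producing one shard per rank; the dimension must divide evenly.
    assert w.shape[dim] % tp_size == 0, "weight not divisible by TP size"
    return np.split(w, tp_size, axis=dim)

# A hidden -> 4*hidden projection split column-wise across 32 ranks
w = np.zeros((1024, 4096), dtype=np.float32)
shards = shard_weight(w, tp_size=32, dim=1)
```

Each rank then loads only its own shard, which is why the split must be done once, offline, before launching inference.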
4.4 Inference Examples
The following provides a simple bfloat16 inference example of the openPangu-Ultra-MoE-718B deployed on a 4-node, 32-NPU Atlas 800T A2 cluster, with node IP0 serving as the master node:
```bash
cd inference
# Master node IP0: ${NNODES} ${NODE_RANK} ${NPROC_PER_NODE} ${MASTER_ADDR} ${PROMPT}
bash generate.sh 4 0 8 IP0 "3*7=?"
# Worker node IP1
bash generate.sh 4 1 8 IP0 "3*7=?"
# Worker node IP2
bash generate.sh 4 2 8 IP0 "3*7=?"
# Worker node IP3
bash generate.sh 4 3 8 IP0 "3*7=?"
```
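The four launch commands run one launcher per node, and the global world size and ranks follow from the arguments. A sketch of that mapping, assuming the usual NNODES / NODE_RANK / NPROC_PER_NODE convention:

```python
def node_ranks(nnodes, node_rank, nproc_per_node):
    # Each node hosts nproc_per_node processes; a process's global
    # rank is node_rank * nproc_per_node + local_rank
    world_size = nnodes * nproc_per_node
    ranks = [node_rank * nproc_per_node + r for r in range(nproc_per_node)]
    return world_size, ranks

# 4 nodes x 8 NPUs = 32-way parallel inference; node 2 holds ranks 16..23
world_size, ranks = node_ranks(nnodes=4, node_rank=2, nproc_per_node=8)
```

This is why every node receives the same NNODES, NPROC_PER_NODE, and master address, while only NODE_RANK differs per command.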
The model operates in slow-thinking mode by default. To switch to fast-thinking mode, append the /no_think flag to the end of the user input, as demonstrated in the fast_thinking_template in the generate.py example.
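A minimal sketch of that switch (the /no_think flag name comes from this README; the actual prompt assembly lives in fast_thinking_template in generate.py, and the helper below is hypothetical):

```python
def build_user_input(text, fast_thinking=False):
    # Fast-thinking mode is requested by appending /no_think to the
    # end of the user input; otherwise slow thinking is the default.
    return f"{text} /no_think" if fast_thinking else text

prompt = build_user_input("3*7=?", fast_thinking=True)
```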
4.5 Using Inference Framework
vLLM-Ascend: please refer to [vllm_ascend_for_openpangu_ultra_moe_718b_EN].
5. Model License
Unless otherwise noted, the openPangu-Ultra-MoE-718B model is licensed under the terms and conditions of OPENPANGU MODEL LICENSE AGREEMENT VERSION 1.0, which is intended to be used permissively and enable the further development of artificial intelligence technologies. Please refer to the LICENSE file located in the root directory of the model repository for details.
6. Disclaimer
Due to the technical limitations inherent in the technology on which the openPangu-Ultra-MoE-718B (“Model”) relies and the fact that the artificial intelligence generated content is automatically produced by Model, Huawei cannot make any guarantees regarding the following matters:
- The output of this Model is automatically generated via AI algorithms; it cannot be ruled out that some of the information may be flawed, unreasonable, or cause discomfort, and the generated content does not represent Huawei's attitude or standpoint;
- There is no guarantee that this Model is 100% accurate, reliable, functional, timely, secure, safe, error-free, uninterrupted, continuously stable, or free of any faults;
- The output of this Model does not constitute any advice or decision for you, and there is no guarantee of the authenticity, completeness, accuracy, timeliness, legality, functionality, or practicality of the generated content. The generated content cannot replace professionals in medical, legal, and other fields in answering your questions. The generated content is for your reference only and does not represent any attitude, standpoint, or position of Huawei. You need to make independent judgments based on your actual situation, and Huawei does not assume any responsibilities.
7. Contact
If you have any questions, please raise an issue or contact us at [email protected].