Uni-MoE 2.0-Base

Uni-MoE 2.0 is a fully open-source omnimodal model that substantially advances the capabilities of Lychee's Uni-MoE series in language-centric multimodal understanding, reasoning, and generation.

Uni-MoE 2.0-Base is the variant of the Uni-MoE 2.0 series that supports all-modality understanding only; it does not include the audio and image generation modules.


If you find our work useful, please give us a like and follow us for timely updates.

Getting Started

1. Clone this repository and navigate to the Uni-MoE-2 folder

git clone https://github.com/HITsz-TMG/Uni-MoE.git
cd Uni-MoE-2

2. Set up environment

Create the conda environment and install the required packages:

conda create -n uni_moe_2 python=3.11
conda activate uni_moe_2
pip install torch==2.5.1 torchaudio==2.5.1 torchvision==0.20.1
pip install -r requirements.txt
pip install flash-attn==2.6.0.post1 --no-build-isolation
pip install "clip @ git+https://github.com/openai/CLIP.git@dcba3cb2e2827b402d2701e7e1c7d9fed8a20ef1"
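
Optionally, run a quick sanity check (a minimal sketch, assuming the versions pinned above) to confirm that PyTorch sees the GPU and FlashAttention imports cleanly:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python -c "import flash_attn; print(flash_attn.__version__)"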

Example Usage

We provide a simple example of how to use this repo. For detailed usage, please refer to the cookbook.

import torch

from uni_moe.model.processing_qwen2_vl import Qwen2VLProcessor
from uni_moe.model.modeling_qwen_grin_moe import GrinQwen2VLForConditionalGeneration
from uni_moe.qwen_vl_utils import process_mm_info
from uni_moe.model import deepspeed_moe_inference_utils  # DeepSpeed MoE inference utilities

# Load the processor and the bfloat16 model weights from Hugging Face.
processor = Qwen2VLProcessor.from_pretrained("HIT-TMG/Uni-MoE-2.0-Base")
model = GrinQwen2VLForConditionalGeneration.from_pretrained(
    "HIT-TMG/Uni-MoE-2.0-Base", torch_dtype=torch.bfloat16
).cuda()

# The processor reads multimodal preprocessing settings from the model config.
processor.data_args = model.config

# Build a chat message that pairs a spoken question with an image.
messages = [{
    "role": "user",
    "content": [
            {"type": "text", "text": "<audio>\n<image>\nAnswer the question in the audio."},
            {"type": "audio", "audio": "examples/assets/audio/quick_start.mp3"},
            {"type": "image", "image": "examples/assets/image/quick_start.jpg"}
        ]
}]

# Render the chat template, then expand the modality placeholders
# into the model's special pad tokens.
texts = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
texts = (
    texts.replace("<image>", "<|vision_start|><|image_pad|><|vision_end|>")
    .replace("<audio>", "<|audio_start|><|audio_pad|><|audio_end|>")
    .replace("<video>", "<|vision_start|><|video_pad|><|vision_end|>")
)
image_inputs, video_inputs, audio_inputs = process_mm_info(messages)

# Tokenize the text and preprocess all modalities into model-ready tensors.
inputs = processor(
    text=texts,
    images=image_inputs,
    videos=video_inputs,
    audios=audio_inputs,
    padding=True,
    return_tensors="pt",
)
inputs["input_ids"] = inputs["input_ids"].unsqueeze(0)  # add a batch dimension for generation

inputs = inputs.to(device=model.device)

# Generate, then decode only the newly generated tokens.
output_ids = model.generate(
    **inputs,
    use_cache=True,
    pad_token_id=processor.tokenizer.eos_token_id,
    max_new_tokens=4096,
    temperature=1.0,
    do_sample=True
)

text = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[-1]:], skip_special_tokens=True
)[0]
print(text)
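
The same pipeline also handles video: the placeholder expansion above already maps <video> to the model's video pad tokens, and process_mm_info returns video inputs alongside images and audio. Below is a minimal sketch reusing the processor and model loaded above; the video path is hypothetical, so replace it with a real file.

# A sketch of video understanding with the objects loaded above.
# The path below is hypothetical; point it at an actual video file.
messages = [{
    "role": "user",
    "content": [
            {"type": "text", "text": "<video>\nDescribe what happens in the video."},
            {"type": "video", "video": "examples/assets/video/quick_start.mp4"}
        ]
}]

texts = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
texts = texts.replace("<video>", "<|vision_start|><|video_pad|><|vision_end|>")
image_inputs, video_inputs, audio_inputs = process_mm_info(messages)

inputs = processor(
    text=texts,
    images=image_inputs,
    videos=video_inputs,
    audios=audio_inputs,
    padding=True,
    return_tensors="pt",
)
inputs["input_ids"] = inputs["input_ids"].unsqueeze(0)
inputs = inputs.to(device=model.device)

output_ids = model.generate(
    **inputs,
    use_cache=True,
    pad_token_id=processor.tokenizer.eos_token_id,
    max_new_tokens=512,
    temperature=1.0,
    do_sample=True
)
print(processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[-1]:], skip_special_tokens=True
)[0])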