# MGM-Omni-TTS-4B

## Introduction
MGM-Omni is an omni-chatbot that processes text, image, video, and speech inputs and generates both text and speech responses. It supports long-form speech understanding and generation, as well as zero-shot voice cloning in both Chinese and English. MGM-Omni-TTS-4B is the SpeechLM component of MGM-Omni responsible for speech generation; for the MLLM component, please refer to MGM-Omni.
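As a quick orientation, the sketch below shows one plausible way to pull the checkpoint from the Hub with the standard transformers API. The full speech-generation pipeline (speech tokenizer, chunked decoding, vocoder, voice cloning) is provided by the official MGM-Omni codebase, so treat this as a hedged illustration rather than the official interface.

```python
# Hedged sketch: loading the SpeechLM checkpoint with transformers.
# The real speech-generation pipeline lives in the official MGM-Omni repo;
# this only shows that the underlying LM loads like any causal LM.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "wcy1122/MGM-Omni-TTS-4B"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",      # pick fp16/bf16 automatically when available
    device_map="auto",       # place layers on available GPUs/CPU
    trust_remote_code=True,  # the repo may ship custom model code
)
```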
## Main Properties
- Omni-modality support: MGM-Omni accepts audio, video, image, and text inputs, understands long contexts, and can generate both text and speech outputs, making it a versatile multi-modal AI assistant.
- Long-form Speech Understanding: Unlike most existing open-source multi-modal models, which typically fail on inputs longer than 15 minutes, MGM-Omni handles hour-long speech inputs while delivering strong overall and fine-grained understanding.
- Long-form Speech Generation: With extensive training data and chunk-based decoding (see the sketch after this list), MGM-Omni can generate over 10 minutes of smooth, natural speech for continuous storytelling.
- Streaming Generation: Thanks to parallel decoding of speech tokens, MGM-Omni supports efficient, smooth streaming audio generation, making it suitable for live conversations.
- Zero-shot Voice Cloning: Trained on extensive and diverse audio data, MGM-Omni can clone a voice from a short reference recording of around 10 seconds.
- Fully Open-source: All code, models, and training data will be released.
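To make the chunk-based long-form generation idea concrete, here is a minimal, hypothetical sketch: it splits long input text into sentence-level chunks, synthesizes each chunk with a user-supplied TTS callable, and concatenates the waveforms. The chunking rule and the `synthesize_chunk` callable are illustrative assumptions, not MGM-Omni's actual implementation (which, among other things, conditions each chunk on previously generated speech).

```python
# Conceptual sketch of chunk-based decoding for long-form TTS.
# `synthesize_chunk` is a hypothetical callable (text -> 1-D numpy waveform).
import re
from typing import Callable, List

import numpy as np


def split_into_chunks(text: str, max_chars: int = 300) -> List[str]:
    """Greedily pack sentences into chunks of at most `max_chars` characters."""
    # Split after common English/Chinese sentence terminators; drop empties.
    sentences = [s for s in re.split(r"(?<=[.!?。！？])\s*", text.strip()) if s]
    chunks, current = [], ""
    for sent in sentences:
        if current and len(current) + len(sent) + 1 > max_chars:
            chunks.append(current)
            current = sent
        else:
            current = f"{current} {sent}".strip()
    if current:
        chunks.append(current)
    return chunks


def long_form_tts(
    text: str,
    synthesize_chunk: Callable[[str], np.ndarray],
    max_chars: int = 300,
) -> np.ndarray:
    """Synthesize each chunk independently and concatenate the waveforms."""
    waveforms = [synthesize_chunk(c) for c in split_into_chunks(text, max_chars)]
    return np.concatenate(waveforms) if waveforms else np.zeros(0, dtype=np.float32)
```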
 
## Evaluation

### Speech and Audio Understanding
| Model | Date | LS-clean↓ | LS-other↓ | CM-EN↓ | CM-ZH↓ | AISHELL↓ | 
|---|---|---|---|---|---|---|
| Mini-Omni2 | 2024-11 | 4.7 | 9.4 | - | - | - | 
| Lyra | 2024-12 | 2.0 | 4.0 | - | - | - | 
| VITA-1.5 | 2025-01 | 3.4 | 7.5 | - | - | 2.2 | 
| Qwen2.5-Omni | 2025-03 | 1.6 | 3.5 | 7.6 | 5.2 | - | 
| Ola | 2025-06 | 1.9 | 4.3 | - | - | - | 
| MGM-Omni-7B | 2025-08 | 1.7 | 3.6 | 8.8 | 4.5 | 1.9 | 
| MGM-Omni-32B | 2025-08 | 1.5 | 3.2 | 8.0 | 4.0 | 1.8 | 
This table reports word error rate (WER) and character error rate (CER) on speech recognition benchmarks; lower is better. LS refers to LibriSpeech and CM refers to Common Voice.
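For reference, WER and CER can be computed with the open-source jiwer package, as sketched below. This is a generic illustration of the metrics, not the evaluation script used for the table; real ASR evaluations usually apply text normalization (case, punctuation) before scoring, which is omitted here.

```python
# Generic WER/CER computation with jiwer (pip install jiwer).
import jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

wer = jiwer.wer(reference, hypothesis)  # word error rate (English sets)
cer = jiwer.cer(reference, hypothesis)  # character error rate (Chinese sets)

print(f"WER: {wer:.3f}  CER: {cer:.3f}")
```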
| Model | Date | Speech↑ | Sound↑ | Music↑ | Mix↑ | Average↑ | 
|---|---|---|---|---|---|---|
| LLaMA-Omni | 2024-08 | 5.2 | 5.3 | 4.3 | 4.0 | 4.7 | 
| Mini-Omni2 | 2024-11 | 3.6 | 3.5 | 2.6 | 3.1 | 3.2 | 
| IXC2.5-OmniLive | 2024-12 | 1.6 | 1.8 | 1.7 | 1.6 | 1.7 | 
| VITA-1.5 | 2025-01 | 4.8 | 5.5 | 4.9 | 2.9 | 4.5 | 
| Qwen2.5-Omni | 2025-03 | 6.8 | 5.7 | 4.8 | 5.4 | 5.7 | 
| Ola | 2025-06 | 7.3 | 6.4 | 5.9 | 6.0 | 6.4 | 
| MGM-Omni-7B | 2025-08 | 7.3 | 6.5 | 6.3 | 6.1 | 6.5 | 
| MGM-Omni-32B | 2025-08 | 7.1 | 6.5 | 6.2 | 6.2 | 6.5 | 
This table presents evaluation results on AIR-Bench Chat across the speech, sound, music, and mixed-audio subsets; higher is better.
### Speech Generation
| Model | Date | Model Size | CER (ZH)↓ | SS (ZH)↑ | WER (EN)↓ | SS (EN)↑ | 
|---|---|---|---|---|---|---|
| CosyVoice2 | 2024-12 | 0.5B | 1.45 | 0.748 | 2.57 | 0.652 | 
| Qwen2.5-Omni-3B | 2025-03 | 0.5B | 1.58 | 0.744 | 2.51 | 0.635 | 
| Qwen2.5-Omni-7B | 2025-03 | 2B | 1.42 | 0.754 | 2.33 | 0.641 | 
| MOSS-TTSD-v0 | 2025-06 | 2B | 2.18 | 0.594 | 2.46 | 0.476 | 
| HiggsAudio-v2 | 2025-07 | 6B | 1.66 | 0.743 | 2.44 | 0.677 | 
| MGM-Omni | 2025-08 | 0.6B | 1.49 | 0.749 | 2.54 | 0.670 | 
| MGM-Omni | 2025-08 | 2B | 1.38 | 0.753 | 2.28 | 0.682 | 
| MGM-Omni | 2025-08 | 4B | 1.34 | 0.756 | 2.22 | 0.684 | 
This table presents zero-shot speech generation results on seed-tts-eval: CER and SS (ZH) are measured on the Chinese test set, WER and SS (EN) on the English test set, where SS denotes speaker similarity. For Qwen2.5-Omni, model size refers to the size of the talker.
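The SS (speaker similarity) columns are typically computed as the cosine similarity between speaker embeddings of the reference prompt and the generated audio. The sketch below illustrates that computation; `extract_speaker_embedding` is a hypothetical stand-in for the pretrained speaker-verification model prescribed by the seed-tts-eval protocol, which is not reproduced here.

```python
# Illustration of the speaker-similarity (SS) metric: cosine similarity
# between speaker embeddings of the prompt and the generated audio.
# `extract_speaker_embedding` is a hypothetical embedding extractor.
from typing import Callable

import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def speaker_similarity(
    prompt_wav: np.ndarray,
    generated_wav: np.ndarray,
    extract_speaker_embedding: Callable[[np.ndarray], np.ndarray],
) -> float:
    """Score how closely the generated voice matches the prompt voice."""
    emb_prompt = extract_speaker_embedding(prompt_wav)
    emb_generated = extract_speaker_embedding(generated_wav)
    return cosine_similarity(emb_prompt, emb_generated)
```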
## Citation
If you find this repo useful for your research, we would appreciate it if you could cite our work:
@misc{wang2025mgmomni,
  title={MGM-Omni: An Open-source Omni Chatbot},
  author={Wang, Chengyao and Zhong, Zhisheng and Peng, Bohao and Yang, Senqiao and Liu, Yuqi and Yu, Bei and Jia, Jiaya},
  year={2025},
  howpublished={\url{https://mgm-omni.notion.site}},
  note={Notion Blog}
}
## Model tree for wcy1122/MGM-Omni-TTS-4B

Base model: Qwen/Qwen3-4B-Instruct-2507