# MGM-Omni-TTS-4B

## Introduction
MGM-Omni is an omni-chatbot that processes text, image, video, and speech inputs and generates both text and speech responses. It supports long-form speech understanding and generation, as well as zero-shot voice cloning in both Chinese and English. MGM-Omni-TTS-4B is the SpeechLM component of MGM-Omni responsible for speech generation; for the MLLM component, please refer to MGM-Omni.
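As a quick orientation, the sketch below shows one plausible way to pull the checkpoint from the Hub with the standard transformers API. The full speech-generation pipeline (speech tokenizer, chunked decoding, vocoder, voice cloning) is provided by the official MGM-Omni codebase, so treat this as a hedged illustration rather than the official interface.

```python
# Hedged sketch: loading the SpeechLM checkpoint with transformers.
# The real speech-generation pipeline lives in the official MGM-Omni repo;
# this only shows that the underlying LM loads like any causal LM.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "wcy1122/MGM-Omni-TTS-4B"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",      # pick fp16/bf16 automatically when available
    device_map="auto",       # place layers on available GPUs/CPU
    trust_remote_code=True,  # the repo may ship custom model code
)
```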
## Main Properties
- Omni-modality support: MGM-Omni accepts audio, video, image, and text inputs, understands long contexts, and can generate both text and speech outputs, making it a versatile multi-modal AI assistant.
- Long-form Speech Understanding: Unlike most existing open-source multi-modal models, which typically fail on inputs longer than 15 minutes, MGM-Omni handles hour-long speech inputs while delivering strong overall and fine-grained understanding.
- Long-form Speech Generation: With extensive training data and chunk-based decoding (see the sketch after this list), MGM-Omni can generate over 10 minutes of smooth, natural speech for continuous storytelling.
- Streaming Generation: Thanks to parallel decoding of speech tokens, MGM-Omni supports efficient, smooth streaming audio generation, making it suitable for live conversations.
- Zero-shot Voice Cloning: Trained on extensive and diverse audio data, MGM-Omni can clone a voice from a short reference recording of around 10 seconds.
- Fully Open-source: All code, models, and training data will be released.
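To make the chunk-based long-form generation idea concrete, here is a minimal, hypothetical sketch: it splits long input text into sentence-level chunks, synthesizes each chunk with a user-supplied TTS callable, and concatenates the waveforms. The chunking rule and the `synthesize_chunk` callable are illustrative assumptions, not MGM-Omni's actual implementation (which, among other things, conditions each chunk on previously generated speech).

```python
# Conceptual sketch of chunk-based decoding for long-form TTS.
# `synthesize_chunk` is a hypothetical callable (text -> 1-D numpy waveform).
import re
from typing import Callable, List

import numpy as np


def split_into_chunks(text: str, max_chars: int = 300) -> List[str]:
    """Greedily pack sentences into chunks of at most `max_chars` characters."""
    # Split after common English/Chinese sentence terminators; drop empties.
    sentences = [s for s in re.split(r"(?<=[.!?。！？])\s*", text.strip()) if s]
    chunks, current = [], ""
    for sent in sentences:
        if current and len(current) + len(sent) + 1 > max_chars:
            chunks.append(current)
            current = sent
        else:
            current = f"{current} {sent}".strip()
    if current:
        chunks.append(current)
    return chunks


def long_form_tts(
    text: str,
    synthesize_chunk: Callable[[str], np.ndarray],
    max_chars: int = 300,
) -> np.ndarray:
    """Synthesize each chunk independently and concatenate the waveforms."""
    waveforms = [synthesize_chunk(c) for c in split_into_chunks(text, max_chars)]
    return np.concatenate(waveforms) if waveforms else np.zeros(0, dtype=np.float32)
```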
 
## Evaluation

### Speech and Audio Understanding
| Model | Date | LS-clean↓ | LS-other↓ | CM-EN↓ | CM-ZH↓ | AISHELL↓ | 
|---|---|---|---|---|---|---|
| Mini-Omni2 | 2024-11 | 4.7 | 9.4 | - | - | - | 
| Lyra | 2024-12 | 2.0 | 4.0 | - | - | - | 
| VITA-1.5 | 2025-01 | 3.4 | 7.5 | - | - | 2.2 | 
| Qwen2.5-Omni | 2025-03 | 1.6 | 3.5 | 7.6 | 5.2 | - | 
| Ola | 2025-06 | 1.9 | 4.3 | - | - | - | 
| MGM-Omni-7B | 2025-08 | 1.7 | 3.6 | 8.8 | 4.5 | 1.9 | 
| MGM-Omni-32B | 2025-08 | 1.5 | 3.2 | 8.0 | 4.0 | 1.8 | 
This table reports word error rate (WER) and character error rate (CER) on speech recognition benchmarks; lower is better. LS refers to LibriSpeech and CM refers to Common Voice.
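For reference, WER and CER can be computed with the open-source jiwer package, as sketched below. This is a generic illustration of the metrics, not the evaluation script used for the table; real ASR evaluations usually apply text normalization (case, punctuation) before scoring, which is omitted here.

```python
# Generic WER/CER computation with jiwer (pip install jiwer).
import jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

wer = jiwer.wer(reference, hypothesis)  # word error rate (English sets)
cer = jiwer.cer(reference, hypothesis)  # character error rate (Chinese sets)

print(f"WER: {wer:.3f}  CER: {cer:.3f}")
```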
| Model | Date | Speech↑ | Sound↑ | Music↑ | Mix↑ | Average↑ | 
|---|---|---|---|---|---|---|
| LLaMA-Omni | 2024-08 | 5.2 | 5.3 | 4.3 | 4.0 | 4.7 | 
| Mini-Omni2 | 2024-11 | 3.6 | 3.5 | 2.6 | 3.1 | 3.2 | 
| IXC2.5-OmniLive | 2024-12 | 1.6 | 1.8 | 1.7 | 1.6 | 1.7 | 
| VITA-1.5 | 2025-01 | 4.8 | 5.5 | 4.9 | 2.9 | 4.5 | 
| Qwen2.5-Omni | 2025-03 | 6.8 | 5.7 | 4.8 | 5.4 | 5.7 | 
| Ola | 2025-06 | 7.3 | 6.4 | 5.9 | 6.0 | 6.4 | 
| MGM-Omni-7B | 2025-08 | 7.3 | 6.5 | 6.3 | 6.1 | 6.5 | 
| MGM-Omni-32B | 2025-08 | 7.1 | 6.5 | 6.2 | 6.2 | 6.5 | 
This table presents evaluation results on AIR-Bench Chat across the speech, sound, music, and mixed-audio subsets; higher is better.
### Speech Generation
| Model | Date | Model Size | CER (ZH)↓ | SS (ZH)↑ | WER (EN)↓ | SS (EN)↑ | 
|---|---|---|---|---|---|---|
| CosyVoice2 | 2024-12 | 0.5B | 1.45 | 0.748 | 2.57 | 0.652 | 
| Qwen2.5-Omni-3B | 2025-03 | 0.5B | 1.58 | 0.744 | 2.51 | 0.635 | 
| Qwen2.5-Omni-7B | 2025-03 | 2B | 1.42 | 0.754 | 2.33 | 0.641 | 
| MOSS-TTSD-v0 | 2025-06 | 2B | 2.18 | 0.594 | 2.46 | 0.476 | 
| HiggsAudio-v2 | 2025-07 | 6B | 1.66 | 0.743 | 2.44 | 0.677 | 
| MGM-Omni | 2025-08 | 0.6B | 1.49 | 0.749 | 2.54 | 0.670 | 
| MGM-Omni | 2025-08 | 2B | 1.38 | 0.753 | 2.28 | 0.682 | 
| MGM-Omni | 2025-08 | 4B | 1.34 | 0.756 | 2.22 | 0.684 | 
This table presents zero-shot speech generation results on seed-tts-eval: CER and SS (ZH) are measured on the Chinese test set, WER and SS (EN) on the English test set, where SS denotes speaker similarity. For Qwen2.5-Omni, model size refers to the size of the talker.
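The SS (speaker similarity) columns are typically computed as the cosine similarity between speaker embeddings of the reference prompt and the generated audio. The sketch below illustrates that computation; `extract_speaker_embedding` is a hypothetical stand-in for the pretrained speaker-verification model prescribed by the seed-tts-eval protocol, which is not reproduced here.

```python
# Illustration of the speaker-similarity (SS) metric: cosine similarity
# between speaker embeddings of the prompt and the generated audio.
# `extract_speaker_embedding` is a hypothetical embedding extractor.
from typing import Callable

import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def speaker_similarity(
    prompt_wav: np.ndarray,
    generated_wav: np.ndarray,
    extract_speaker_embedding: Callable[[np.ndarray], np.ndarray],
) -> float:
    """Score how closely the generated voice matches the prompt voice."""
    emb_prompt = extract_speaker_embedding(prompt_wav)
    emb_generated = extract_speaker_embedding(generated_wav)
    return cosine_similarity(emb_prompt, emb_generated)
```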
## Citation
If you find this repo useful for your research, we would appreciate it if you could cite our work:
@misc{wang2025mgmomni,
  title={MGM-Omni: An Open-source Omni Chatbot},
  author={Wang, Chengyao and Zhong, Zhisheng and Peng, Bohao and Yang, Senqiao and Liu, Yuqi and Yu, Bei and Jia, Jiaya},
  year={2025},
  howpublished={\url{https://mgm-omni.notion.site}},
  note={Notion Blog}
}
## Model tree for wcy1122/MGM-Omni-TTS-4B

Base model: Qwen/Qwen3-4B-Instruct-2507