SpeechJudge: Towards Human-Level Judgment for Speech Naturalness
Paper: arXiv:2511.07931
Aligning large generative models with human feedback is a critical challenge. In speech synthesis, this challenge is particularly pronounced due to the lack of a large-scale human preference dataset, which hinders the development of models that truly align with human perception. To address this, we introduce SpeechJudge, a suite comprising a dataset, a benchmark, and a reward model, all centered on naturalness, one of the most fundamental subjective metrics for speech synthesis.
git clone https://github.com/AmphionTeam/SpeechJudge.git
cd SpeechJudge
pip install transformers==4.52.3
pip install accelerate==1.10.0
pip install qwen-omni-utils==0.0.8
The main entry point is infer/main_grm.py. Here's a basic example:
from infer.main_grm import load_model, compare_wavs
# Load the model
model_path = "pretrained/SpeechJudge-GRM"
model, processor = load_model(model_path)
# The two speech samples to compare (and their corresponding text)
target_text = "Your target text here"
wav_path_a = "path/to/audio_a.wav"
wav_path_b = "path/to/audio_b.wav"
# Compare the two audio outputs
rating, result = compare_wavs(processor, model, target_text, wav_path_a, wav_path_b)
print(f"Output A score: {rating['output_a']}")
print(f"Output B score: {rating['output_b']}")
print(f"\nDetailed Analysis:\n{result}")
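To act on the comparison programmatically, the two scores can be reduced to a single preference. The helper below is a minimal sketch and not part of the repository: `pick_preferred` and `tie_margin` are hypothetical names, and it assumes that `rating['output_a']` and `rating['output_b']` are numeric and that a higher score means more natural speech.

```python
def pick_preferred(rating, tie_margin=0.0):
    """Return 'output_a', 'output_b', or 'tie' for a rating dict.

    Assumption: `rating` has numeric 'output_a' and 'output_b' entries,
    and a higher score means the speech sounds more natural.
    """
    score_a = float(rating["output_a"])
    score_b = float(rating["output_b"])
    # Treat scores within `tie_margin` of each other as indistinguishable.
    if abs(score_a - score_b) <= tie_margin:
        return "tie"
    return "output_a" if score_a > score_b else "output_b"
```

With the `rating` returned by `compare_wavs` above, `pick_preferred(rating)` would give the key of the higher-scored output, or `'tie'` when the scores are equal.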
The repository includes example audio files in infer/examples/. To run the provided example:
cd infer
python main_grm.py
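When more than two candidates exist for the same text, the pairwise comparison can be repeated over every pair and the candidates ranked by number of wins. The following is an illustrative sketch under stated assumptions, not a feature of the repository: `rank_by_pairwise_wins` is a hypothetical helper, `compare_fn` is any callable that returns a rating dict shaped like the one from `compare_wavs`, and higher scores are assumed to be better.

```python
from itertools import combinations


def rank_by_pairwise_wins(wav_paths, target_text, compare_fn):
    """Rank audio files by how many pairwise comparisons each one wins.

    `compare_fn(target_text, path_a, path_b)` must return a dict with
    numeric 'output_a' and 'output_b' scores (assumed: higher is better).
    """
    wins = {path: 0 for path in wav_paths}
    # Round-robin: every unordered pair is compared exactly once.
    for path_a, path_b in combinations(wav_paths, 2):
        rating = compare_fn(target_text, path_a, path_b)
        score_a = float(rating["output_a"])
        score_b = float(rating["output_b"])
        if score_a > score_b:
            wins[path_a] += 1
        elif score_b > score_a:
            wins[path_b] += 1
        # Equal scores count as a win for neither side.
    # sorted() is stable, so candidates with equal wins keep input order.
    return sorted(wav_paths, key=lambda path: wins[path], reverse=True)
```

To plug in the real model, `compare_fn` could wrap the call shown earlier, for example `lambda text, a, b: compare_wavs(processor, model, text, a, b)[0]`. Note that N candidates require N(N-1)/2 model calls, so this is practical only for small N.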
If you use SpeechJudge in your research, please cite our paper:
@article{zhang2025speechjudge,
title={SpeechJudge: Towards Human-Level Judgment for Speech Naturalness},
author={Zhang, Xueyao and Wang, Chaoren and Liao, Huan and Li, Ziniu and Wang, Yuancheng and Wang, Li and Jia, Dongya and Chen, Yuanzhe and Li, Xiulin and Chen, Zhuo and Wu, Zhizheng},
journal={arXiv preprint arXiv:2511.07931},
year={2025}
}
Base model
Qwen/Qwen2.5-Omni-7B