This is a ResNet-152 speaker recognition model trained on the VoxBlink2 dataset, which contains 111,284 speakers.
The model is specifically adapted for telephone speech: the original data was downsampled to 8 kHz, and the GSM codec was applied to 50% of the data to simulate low-bandwidth conditions.
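As a rough illustration of that adaptation step, the sketch below downsamples a recording to 8 kHz and passes it through a GSM codec round-trip with 50% probability. It assumes `torchaudio` and the `sox` command-line tool are available; file names and the probability are illustrative, not the exact pipeline used for training.

```python
# Illustrative telephone-channel simulation: 8 kHz downsampling plus an
# optional GSM 06.10 codec round-trip (applied to ~50% of utterances).
# This is a sketch of the idea, not the actual data-preparation script.
import random
import subprocess
from pathlib import Path

import torchaudio
import torchaudio.transforms as T


def to_telephone(in_wav: Path, out_wav: Path, gsm_prob: float = 0.5) -> None:
    waveform, sr = torchaudio.load(str(in_wav))
    waveform = waveform.mean(dim=0, keepdim=True)  # force mono
    if sr != 8000:
        waveform = T.Resample(orig_freq=sr, new_freq=8000)(waveform)
    torchaudio.save(str(out_wav), waveform, 8000)

    # GSM codec round-trip via sox (requires 8 kHz mono input).
    if random.random() < gsm_prob:
        gsm_tmp = out_wav.with_suffix(".gsm")
        subprocess.run(["sox", str(out_wav), str(gsm_tmp)], check=True)
        subprocess.run(["sox", str(gsm_tmp), str(out_wav)], check=True)
        gsm_tmp.unlink()
```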
The backbone was trained using the WeSpeaker toolkit, following their standard VoxCeleb recipe.
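A minimal usage sketch, assuming the `wespeaker` pip package's Python interface (`load_model_local`, `extract_embedding`, `compute_similarity`); the model directory name is hypothetical and the exact API may differ across toolkit versions, so check the WeSpeaker documentation.

```python
import wespeaker

# Hypothetical local directory holding this checkpoint and its config.
model = wespeaker.load_model_local("resnet152_voxblink2_8k")

emb = model.extract_embedding("enroll_8k.wav")                     # speaker embedding
score = model.compute_similarity("enroll_8k.wav", "test_8k.wav")   # cosine similarity
print(emb.shape, score)
```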
## Results on SRE-24
| Partition | EER (%) | min Cprimary |
|---|---|---|
| Development | 9.31 | 0.522 |
| Evaluation | 7.59 | 0.562 |
## Results on VoxCeleb1
| Trial list | EER (%) |
|---|---|
| VoxCeleb1-O | 2.42 |
| VoxCeleb1-E | 2.15 |
| VoxCeleb1-H | 4.32 |
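For reference, the EER values above correspond to the operating point at which the miss rate equals the false-alarm rate over the trial scores. The snippet below is an illustrative NumPy implementation of that computation, not the official scoring tool used to produce these numbers.

```python
import numpy as np


def compute_eer(target_scores: np.ndarray, nontarget_scores: np.ndarray) -> float:
    """Approximate the equal error rate from target/non-target trial scores."""
    scores = np.concatenate([target_scores, nontarget_scores])
    labels = np.concatenate([np.ones_like(target_scores), np.zeros_like(nontarget_scores)])
    order = np.argsort(scores)          # sweep the decision threshold upwards
    labels = labels[order]
    # Miss rate: targets scored below the threshold; false-alarm rate: non-targets above it.
    miss = np.cumsum(labels) / target_scores.size
    fa = 1.0 - np.cumsum(1 - labels) / nontarget_scores.size
    idx = np.argmin(np.abs(miss - fa))  # point where the two error rates cross
    return float((miss[idx] + fa[idx]) / 2)
```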
## Citation
If you use this model in your research, please cite the following paper:
```bibtex
@inproceedings{barahona25_interspeech,
  title     = {{Analysis of ABC Frontend Audio Systems for the NIST-SRE24}},
  author    = {Sara Barahona and Anna Silnova and Ladislav Mošner and Junyi Peng and Oldřich Plchot and Johan Rohdin and Lin Zhang and Jiangyu Han and Petr Palka and Federico Landini and Lukáš Burget and Themos Stafylakis and Sandro Cumani and Dominik Boboš and Miroslav Hlavaček and Martin Kodovsky and Tomaš Pavliček},
  year      = {2025},
  booktitle = {{Interspeech 2025}},
  pages     = {5763--5767},
  doi       = {10.21437/Interspeech.2025-2737},
  issn      = {2958-1796},
}
```