Multi-Speaker MMS-TTS Hausa Model (2 Speakers)
This model is a structurally converted version of facebook/mms-tts-hau adapted for 2-speaker text-to-speech synthesis.
⚠️ IMPORTANT: This model requires fine-tuning before use. Speaker embeddings are randomly initialized.
Model Description
- Architecture: VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech)
- Language: Hausa (ISO 639-3: hau)
- Number of Speakers: 2
- Speaker Embedding Dimension: 256
- Base Model: facebook/mms-tts-hau
- Status: ⚠️ Requires fine-tuning
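The multi-speaker settings listed above can be confirmed directly from the model configuration. This is a minimal sketch using the standard Transformers VitsConfig class; the expected values are the ones stated in this card:
from transformers import VitsConfig

config = VitsConfig.from_pretrained("suleiman2003/mms-tts-hau-2speaker")
print(config.num_speakers)            # expected: 2
print(config.speaker_embedding_size)  # expected: 256
print(config.sampling_rate)           # expected: 16000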
Architecture Details
VITS Components
The VITS architecture consists of:
Text Encoder (Preserved from original)
- Transformer-based text encoder
- Converts text to linguistic features
- Language-specific, not speaker-dependent
Posterior Encoder (Reinitialized)
- Encodes speaker characteristics from reference audio
- Learns speaker-specific acoustic features
- Must be trained on your speaker data
Flow-based Module (Reinitialized)
- Normalizing flows for feature transformation
- Learns speaker-dependent acoustic mappings
- Consists of coupling layers
Decoder/Generator (Reinitialized)
- Generates final waveform
- Speaker-specific synthesis
- Based on HiFi-GAN vocoder
Speaker Embedding Layer (Newly Added)
- 2 speaker embeddings
- 256-dimensional vectors
- Randomly initialized
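To see these components on the loaded checkpoint, the top-level submodules can be listed directly. This is a minimal sketch that only assumes the standard Transformers VitsModel class and prints whatever child modules the checkpoint actually contains:
from transformers import VitsModel

model = VitsModel.from_pretrained("suleiman2003/mms-tts-hau-2speaker")
# List top-level components (text encoder, flow, decoder, posterior encoder,
# duration predictor, speaker embedding) with their parameter counts
for name, module in model.named_children():
    n_params = sum(p.numel() for p in module.parameters())
    print(f"{name}: {n_params / 1e6:.1f}M parameters")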
What Was Changed
Preserved:
- Text encoder (language understanding)
- Model tokenizer
- Base architecture
Reinitialized:
- Posterior encoder (speaker characteristics)
- Flow-based transformations
- Decoder/generator
- Duration predictor
Added:
- Speaker embedding layer (2x256)
Fine-tuning Required
This model will not work without fine-tuning. You need:
Dataset: Multi-speaker Hausa TTS dataset with:
- Audio files (16kHz recommended)
- Transcriptions in Hausa
- Speaker labels (0 and 1)
- At least 50-150 utterances per speaker (a quick way to check this is sketched below)
Fine-tuning: Use the finetune-hf-vits repository
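Before launching training, it is worth confirming that every speaker actually meets the utterance counts above. This is a minimal sketch assuming your data is loadable as a Hugging Face dataset with a speaker_id column; "your-dataset" is a placeholder for your own dataset name:
from collections import Counter
from datasets import load_dataset

# "your-dataset" is a placeholder for your own multi-speaker Hausa dataset
dataset = load_dataset("your-dataset", split="train")
counts = Counter(dataset["speaker_id"])
for speaker, n in sorted(counts.items()):
    print(f"speaker {speaker}: {n} utterances")  # aim for at least ~50-150 each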
Fine-tuning Steps
# Clone fine-tuning repository
git clone https://github.com/ylacombe/finetune-hf-vits.git
cd finetune-hf-vits
# Install requirements
pip install -r requirements.txt
# Build monotonic alignment search
cd monotonic_align
python setup.py build_ext --inplace
cd ..
# Run fine-tuning (adjust parameters as needed)
accelerate launch run_vits_finetuning.py \
--model_name_or_path "suleiman2003/mms-tts-hau-2speaker" \
--dataset_name "your-dataset" \
--output_dir "./output" \
--num_train_epochs 100 \
--learning_rate 2e-4 \
--warmup_ratio 0.0 \
--per_device_train_batch_size 8 \
--speaker_id_column_name "speaker_id"
Usage (After Fine-tuning)
from transformers import VitsModel, VitsTokenizer, set_seed
import torch
import scipy.io.wavfile
# Load fine-tuned model
model = VitsModel.from_pretrained("suleiman2003/mms-tts-hau-2speaker")
tokenizer = VitsTokenizer.from_pretrained("suleiman2003/mms-tts-hau-2speaker")
# Prepare input
text = "Sannu, yaya kake?"  # "Hello, how are you?" in Hausa
inputs = tokenizer(text, return_tensors="pt")
# Generate speech for speaker 0
set_seed(42)
with torch.no_grad():
    outputs = model(**inputs, speaker_id=0)
# Save audio
audio = outputs.waveform[0].cpu().numpy()
scipy.io.wavfile.write("output_speaker0.wav", rate=16000, data=audio)
# Generate speech for speaker 1
with torch.no_grad():
    outputs = model(**inputs, speaker_id=1)
audio = outputs.waveform[0].cpu().numpy()
scipy.io.wavfile.write("output_speaker1.wav", rate=16000, data=audio)
Dataset Preparation
Your dataset should follow this structure:
# Example dataset format
{
"audio": [<audio_path_1>, <audio_path_2>, ...],
"text": ["Yaya kake?", "Ina kwana?", ...],
"speaker_id": [0, 1, 0, 1, ...] # Speaker labels
}
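One way to package data in this shape is with the datasets library, which also handles audio decoding and resampling to 16 kHz. This is a minimal sketch with placeholder file paths, texts, and repository name; replace them with your own data:
from datasets import Dataset, Audio

# Placeholder paths and transcriptions; replace with your own recordings
data = {
    "audio": ["clips/speaker0_001.wav", "clips/speaker1_001.wav"],
    "text": ["Yaya kake?", "Ina kwana?"],
    "speaker_id": [0, 1],
}
dataset = Dataset.from_dict(data).cast_column("audio", Audio(sampling_rate=16000))
# Optionally push to the Hub so it can be passed to --dataset_name during fine-tuning
# dataset.push_to_hub("your-username/hausa-2speaker-tts")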
Recommended Datasets
- Create your own Hausa multi-speaker dataset
- Use existing Hausa corpora and add speaker annotations
- Minimum 100 utterances per speaker (more is better)
Technical Specifications
- Sampling Rate: 16,000 Hz
- Model Parameters: ~83M
- Framework: PyTorch + Transformers
- Training Method: GAN-based (Generator + Discriminator)
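The parameter count above can be verified after loading. A minimal sketch; the exact total may differ slightly from ~83M because of the added speaker embedding and conditioning layers:
from transformers import VitsModel

model = VitsModel.from_pretrained("suleiman2003/mms-tts-hau-2speaker")
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total_params / 1e6:.1f}M")  # roughly 83M per the spec above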
Limitations
- Requires fine-tuning: Cannot generate speech without training
- Speaker embeddings: Currently random, need to learn from data
- Language-specific: Optimized for Hausa only
- License: Inherits CC-BY-NC-4.0 (non-commercial) from base model
Citation
If you use this model, please cite:
@article{pratap2023mms,
title={Scaling Speech Technology to 1,000+ Languages},
author={Pratap, Vineel and Tjandra, Andros and Shi, Bowen and others},
journal={arXiv preprint arXiv:2305.13516},
year={2023}
}
@inproceedings{kim2021vits,
title={Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech},
author={Kim, Jaehyeon and Kong, Jungil and Son, Juhee},
booktitle={International Conference on Machine Learning},
pages={5530--5540},
year={2021},
organization={PMLR}
}
Resources
- Base Model: facebook/mms-tts-hau
- Fine-tuning Guide: finetune-hf-vits
- VITS Paper: arXiv:2106.06103
- MMS Paper: arXiv:2305.13516
License
This model inherits the CC-BY-NC-4.0 license from the base MMS model, restricting commercial use.