Multi-Speaker MMS-TTS Hausa Model (2 Speakers)

This model is a structurally converted version of facebook/mms-tts-hau adapted for 2-speaker text-to-speech synthesis.

โš ๏ธ IMPORTANT: This model requires fine-tuning before use. Speaker embeddings are randomly initialized.

Model Description

  • Architecture: VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech)
  • Language: Hausa (ISO 639-3: hau)
  • Number of Speakers: 2
  • Speaker Embedding Dimension: 256
  • Base Model: facebook/mms-tts-hau
  • Status: ⚠️ Requires Fine-tuning

Architecture Details

VITS Components

The VITS architecture consists of:

  1. Text Encoder (Preserved from original)

    • Transformer-based text encoder
    • Converts text to linguistic features
    • Language-specific, not speaker-dependent
  2. Posterior Encoder (Reinitialized)

    • Encodes speaker characteristics from reference audio
    • Learns speaker-specific acoustic features
    • Must be trained on your speaker data
  3. Flow-based Module (Reinitialized)

    • Normalizing flows for feature transformation
    • Learns speaker-dependent acoustic mappings
    • Consists of coupling layers
  4. Decoder/Generator (Reinitialized)

    • Generates final waveform
    • Speaker-specific synthesis
    • Based on HiFi-GAN vocoder
  5. Speaker Embedding Layer (Newly Added)

    • 2 speaker embeddings
    • 256-dimensional vectors
    • Randomly initialized
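
The multi-speaker wiring above can be inspected directly from the checkpoint. A minimal sketch, assuming the VitsConfig and VitsModel classes from transformers (the embed_speaker attribute name follows recent transformers releases):

from transformers import VitsConfig, VitsModel

# Confirm the multi-speaker fields added during conversion
config = VitsConfig.from_pretrained("suleiman2003/mms-tts-hau-2speaker")
print(config.num_speakers)            # 2
print(config.speaker_embedding_size)  # 256

# The speaker embedding table: Embedding(2, 256), random until fine-tuned
model = VitsModel.from_pretrained("suleiman2003/mms-tts-hau-2speaker")
print(model.embed_speaker)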

What Was Changed

✓ Preserved:
  - Text encoder (language understanding)
  - Model tokenizer
  - Base architecture

✗ Reinitialized:
  - Posterior encoder (speaker characteristics)
  - Flow-based transformations
  - Decoder/generator
  - Duration predictor

+ Added:
  - Speaker embedding layer (2x256)
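
A rough sketch of how a conversion like this can be done with transformers. This illustrates the general approach only; as noted above, several modules were fully reinitialized for this checkpoint, which would take extra re-initialization steps not shown here:

from transformers import VitsConfig, VitsModel

# Widen the single-speaker base config to 2 speakers. Weights present in
# the base checkpoint load as usual; modules that only exist in the
# multi-speaker variant (speaker embeddings, speaker-conditioning layers)
# come up randomly initialized.
config = VitsConfig.from_pretrained(
    "facebook/mms-tts-hau",
    num_speakers=2,
    speaker_embedding_size=256,
)
model = VitsModel.from_pretrained(
    "facebook/mms-tts-hau",
    config=config,
    ignore_mismatched_sizes=True,  # tolerate speaker-conditioned shape changes
)
model.save_pretrained("./mms-tts-hau-2speaker")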

Fine-tuning Required

This model will not work without fine-tuning. You need:

  1. Dataset: Multi-speaker Hausa TTS dataset with:

    • Audio files (16kHz recommended)
    • Transcriptions in Hausa
    • Speaker labels (0 and 1)
    • A practical minimum of 50-150 utterances per speaker
  2. Fine-tuning: Use the finetune-hf-vits repository

Fine-tuning Steps

# Clone fine-tuning repository
git clone https://github.com/ylacombe/finetune-hf-vits.git
cd finetune-hf-vits

# Install requirements
pip install -r requirements.txt

# Build monotonic alignment search
cd monotonic_align
python setup.py build_ext --inplace
cd ..

# Run fine-tuning (adjust parameters as needed)
accelerate launch run_vits_finetuning.py \
  --model_name_or_path "suleiman2003/mms-tts-hau-2speaker" \
  --dataset_name "your-dataset" \
  --output_dir "./output" \
  --num_train_epochs 100 \
  --learning_rate 2e-4 \
  --warmup_ratio 0.0 \
  --per_device_train_batch_size 8 \
  --speaker_id_column_name "speaker_id"

Usage (After Fine-tuning)

from transformers import VitsModel, VitsTokenizer, set_seed
import torch
import scipy.io.wavfile  # import the submodule explicitly for wavfile.write

# Load fine-tuned model
model = VitsModel.from_pretrained("suleiman2003/mms-tts-hau-2speaker")
tokenizer = VitsTokenizer.from_pretrained("suleiman2003/mms-tts-hau-2speaker")

# Prepare input
text = "Sannu, yaya kake?"  # "Hello, how are you?" in Hausa
inputs = tokenizer(text, return_tensors="pt")

# Generate speech for speaker 0
set_seed(42)
with torch.no_grad():
    outputs = model(**inputs, speaker_id=0)

# Save audio (MMS-TTS outputs 16 kHz; read the rate from the config)
audio = outputs.waveform[0].cpu().numpy()
scipy.io.wavfile.write("output_speaker0.wav", rate=model.config.sampling_rate, data=audio)

# Generate speech for speaker 1
with torch.no_grad():
    outputs = model(**inputs, speaker_id=1)
audio = outputs.waveform[0].cpu().numpy()
scipy.io.wavfile.write("output_speaker1.wav", rate=model.config.sampling_rate, data=audio)

Dataset Preparation

Your dataset should follow this structure:

# Example dataset format
{
    "audio": [<audio_path_1>, <audio_path_2>, ...],
    "text": ["Yaya kake?", "Ina kwana?", ...],
    "speaker_id": [0, 1, 0, 1, ...]  # Speaker labels
}
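
To pass this to run_vits_finetuning.py via --dataset_name, the files can be packaged as a Hugging Face dataset. A minimal sketch; the clip paths and repo name are hypothetical placeholders:

from datasets import Audio, Dataset

# Hypothetical local clips; replace with your own recordings
data = {
    "audio": ["clips/spk0_001.wav", "clips/spk1_001.wav"],
    "text": ["Yaya kake?", "Ina kwana?"],
    "speaker_id": [0, 1],
}

ds = Dataset.from_dict(data).cast_column("audio", Audio(sampling_rate=16000))
ds.push_to_hub("your-username/hausa-tts-2speaker")  # then use as --dataset_name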

Recommended Datasets

  • Create your own Hausa multi-speaker dataset
  • Use existing Hausa corpora and add speaker annotations
  • Minimum 100 utterances per speaker (more is better)

Technical Specifications

  • Sampling Rate: 16,000 Hz
  • Model Parameters: ~40M (39.6M in the released safetensors checkpoint)
  • Framework: PyTorch + Transformers
  • Training Method: GAN-based (Generator + Discriminator)

Limitations

  • Requires fine-tuning: Cannot generate speech without training
  • Speaker embeddings: Currently random, need to learn from data
  • Language-specific: Optimized for Hausa only
  • License: Inherits CC-BY-NC-4.0 (non-commercial) from base model

Citation

If you use this model, please cite:

@article{pratap2023mms,
  title={Scaling Speech Technology to 1,000+ Languages},
  author={Pratap, Vineel and Tjandra, Andros and Shi, Bowen and others},
  journal={arXiv preprint arXiv:2305.13516},
  year={2023}
}

@inproceedings{kim2021vits,
  title={Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech},
  author={Kim, Jaehyeon and Kong, Jungil and Son, Juhee},
  booktitle={International Conference on Machine Learning},
  pages={5530--5540},
  year={2021},
  organization={PMLR}
}

Resources

  • Base model: https://huggingface.co/facebook/mms-tts-hau
  • Fine-tuning repository: https://github.com/ylacombe/finetune-hf-vits
  • MMS paper: https://arxiv.org/abs/2305.13516

License

This model inherits the CC-BY-NC-4.0 license from the base MMS model, restricting commercial use.
