# SALAMA-TTS: Swahili Text-to-Speech Model
- **Developer:** AI4NNOV
- **Version:** v1.0
- **License:** Apache 2.0
- **Model Type:** Text-to-Speech (TTS)
- **Base Model:** facebook/mms-tts-swh (fine-tuned)
## Overview
SALAMA-TTS is the speech synthesis module of the SALAMA Framework, a complete end-to-end Speech-to-Speech AI system for African languages.
It generates natural, high-quality Swahili speech from text and integrates seamlessly with SALAMA-LLM and SALAMA-STT for conversational voice assistants.
The model is based on Meta's MMS (Massively Multilingual Speech) TTS architecture using the VITS framework, fine-tuned for natural prosody, tone, and rhythm in Swahili.
## Model Architecture
SALAMA-TTS is built on the VITS architecture, combining the strengths of variational autoencoders (VAE) and GANs for realistic and expressive speech synthesis.
| Parameter | Value |
|---|---|
| Base Model | facebook/mms-tts-swh |
| Fine-Tuning | 8-bit quantized, LoRA fine-tuning |
| Optimizer | AdamW |
| Learning Rate | 2e-5 |
| Epochs | 20 |
| Sampling Rate | 16 kHz |
| Frameworks | Transformers + Datasets + PyTorch |
| Language | Swahili (sw) |
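As a quick sanity check, the base checkpoint can be driven directly through the Transformers VITS classes. This is a minimal PyTorch sketch; `facebook/mms-tts-swh` is used as a stand-in, and the fine-tuned SALAMA weights are assumed to expose the same interface:

```python
# Minimal VITS inference with the base MMS-TTS Swahili checkpoint (PyTorch).
import torch
import soundfile as sf
from transformers import VitsModel, AutoTokenizer

model = VitsModel.from_pretrained("facebook/mms-tts-swh")
tokenizer = AutoTokenizer.from_pretrained("facebook/mms-tts-swh")

inputs = tokenizer("Habari ya leo?", return_tensors="pt")
with torch.no_grad():
    # VitsModel returns the waveform directly: float32 samples in [-1, 1].
    waveform = model(**inputs).waveform.squeeze().numpy()

sf.write("sanity_check.wav", waveform, model.config.sampling_rate)  # 16 kHz for MMS
```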
## Dataset
| Dataset | Description | Purpose |
|---|---|---|
| common_voice_17_0 | Swahili voice dataset by Mozilla | Base training |
| Custom Swahili speech corpus | Locally recorded sentences and dialogues | Fine-tuning naturalness |
| Common Voice Swahili (test split) | Held-out Common Voice test data | Evaluation |
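For the Common Voice portion, the split can be pulled with the `datasets` library. A sketch, assuming the gated `mozilla-foundation/common_voice_17_0` Hub repo (requires accepting its terms and authenticating); the custom corpus is not publicly released:

```python
# Load the Swahili ("sw") subset of Common Voice 17.0 and resample to 16 kHz.
from datasets import load_dataset, Audio

cv_sw = load_dataset("mozilla-foundation/common_voice_17_0", "sw", split="train")
cv_sw = cv_sw.cast_column("audio", Audio(sampling_rate=16_000))  # match the model's rate

print(cv_sw[0]["sentence"])  # transcript paired with the first audio clip
```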
## Model Capabilities
- Converts Swahili text to natural-sounding speech
- Handles both formal and conversational tones
- High clarity and prosody for long-form speech
- Seamless integration with SALAMA-LLM responses
- Output format: 16-bit PCM WAV
## Evaluation Metrics
| Metric | Score | Description |
|---|---|---|
| MOS (Mean Opinion Score) | 4.05 / 5.0 | Human-rated naturalness |
| WER (Generated → STT) | 0.21 | Evaluated by re-transcribing synthesized audio |
The MOS was evaluated by 12 native Swahili speakers across clarity, tone, and pronunciation.
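The WER row can be reproduced as a round-trip check with any Swahili STT system and the `jiwer` package. In this sketch, `synthesize()` and `transcribe()` are hypothetical stand-ins for the SALAMA-TTS and SALAMA-STT inference calls:

```python
# Round-trip WER sketch: synthesize each sentence, re-transcribe it, score with jiwer.
# `synthesize(text) -> wav_path` and `transcribe(wav_path) -> str` are placeholders
# for SALAMA-TTS and SALAMA-STT inference; wire in the real calls from this card.
from jiwer import wer

references = ["Karibu kwenye mfumo wa SALAMA.", "Habari ya leo?"]
hypotheses = [transcribe(synthesize(text)) for text in references]

print(f"Round-trip WER: {wer(references, hypotheses):.2f}")
```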
## Usage (Python Example)
```python
# Requirements:
# pip install onnxruntime soundfile transformers numpy
# If you want GPU inference: pip install onnxruntime-gpu (and ensure the CUDA toolkit is available)
import os
import numpy as np
import onnxruntime
from transformers import AutoTokenizer
import soundfile as sf
TTS_ONNX_MODEL_PATH = "swahili_tts.onnx" # path to your .onnx file
TTS_TOKENIZER_ID = "facebook/mms-tts-swh" # or whichever tokenizer you used
OUTPUT_SAMPLE_RATE = 16000
OUT_DIR = "tts_outputs"
os.makedirs(OUT_DIR, exist_ok=True)
def create_onnx_session(onnx_path: str) -> onnxruntime.InferenceSession:
    """Create an ONNX Runtime session using GPU if available, otherwise CPU."""
    # Check which providers this onnxruntime build actually supports; requesting an
    # unavailable provider can silently fall back to CPU, so detect it explicitly.
    if "CUDAExecutionProvider" in onnxruntime.get_available_providers():
        providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
        print("Using CUDAExecutionProvider for ONNX Runtime.")
    else:
        providers = ["CPUExecutionProvider"]
        print("CUDA not available - using CPUExecutionProvider for ONNX Runtime.")
    return onnxruntime.InferenceSession(onnx_path, providers=providers)
def generate_speech_from_onnx(text: str,
                              onnx_session: onnxruntime.InferenceSession,
                              tokenizer: AutoTokenizer,
                              out_path: str | None = None) -> str:
    """
    Synthesize speech from text using an ONNX TTS model.
    Returns the path to a WAV file (16 kHz, int16).
    """
    if not text:
        raise ValueError("Empty text provided.")

    # Tokenize to NumPy inputs (match what the ONNX model expects).
    # NOTE: many TTS tokenizers return {"input_ids": np.array(...)} - adapt if yours differs.
    inputs = tokenizer(text, return_tensors="np", padding=True)

    # Identify the ONNX input name (assume the first input) and build the feed dict.
    input_name = onnx_session.get_inputs()[0].name
    ort_inputs = {input_name: inputs["input_ids"].astype(np.int64)}

    # Run ONNX inference. In many single-file TTS exports the first output is the
    # raw waveform; flatten in case it is multi-dimensional.
    ort_outs = onnx_session.run(None, ort_inputs)
    audio_waveform = ort_outs[0].flatten()

    # If the waveform is float in [-1, 1], clip and convert to int16;
    # otherwise cast to int16 as a safeguard.
    if np.issubdtype(audio_waveform.dtype, np.floating):
        audio_clip = np.clip(audio_waveform, -1.0, 1.0)
        audio_int16 = (audio_clip * 32767.0).astype(np.int16)
    else:
        audio_int16 = audio_waveform.astype(np.int16)

    # Compose the output filename and save as 16 kHz, 16-bit PCM WAV.
    if out_path is None:
        out_path = os.path.join(OUT_DIR, f"salama_tts_{abs(hash(text)) & 0xFFFF_FFFF}.wav")
    sf.write(out_path, audio_int16, samplerate=OUTPUT_SAMPLE_RATE, subtype="PCM_16")
    return out_path
if __name__ == "__main__":
    # Example usage
    sess = create_onnx_session(TTS_ONNX_MODEL_PATH)
    tokenizer = AutoTokenizer.from_pretrained(TTS_TOKENIZER_ID)
    example_text = "Karibu kwenye mfumo wa SALAMA unaozalisha sauti asilia ya Kiswahili."
    out_wav = generate_speech_from_onnx(example_text, sess, tokenizer)
    print("Saved synthesized audio to:", out_wav)
```

Example Output:

Audio plays: “Karibu kwenye mfumo wa SALAMA unaozalisha sauti asilia ya Kiswahili.”
## Key Features
- Natural Swahili speech generation
- Adapted for African tonal variations
- High clarity and rhythm
- Fast inference with FP16 precision (see the conversion sketch after this list)
- Compatible with SALAMA-STT and SALAMA-LLM
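How the FP16 export was produced is not documented here; one common route is the `onnx` and `onnxconverter-common` packages. A minimal conversion sketch (an assumption, not the confirmed SALAMA toolchain; re-check audio quality after converting):

```python
# Sketch: convert the exported FP32 ONNX graph to FP16.
# Assumes `pip install onnx onnxconverter-common`; not the confirmed SALAMA export path.
import onnx
from onnxconverter_common import float16

model_fp32 = onnx.load("swahili_tts.onnx")
# keep_io_types=True leaves model inputs/outputs in FP32 so calling code is unchanged.
model_fp16 = float16.convert_float_to_float16(model_fp32, keep_io_types=True)
onnx.save(model_fp16, "swahili_tts_fp16.onnx")
```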