language:
- vi
license: apache-2.0
tags:
- text-to-speech
- tts
- vietnamese
- audio
- speech-synthesis
- neutts-air
- qwen2.5
datasets:
- custom
metrics:
- wer
library_name: transformers
pipeline_tag: text-to-speech
NeuTTS-Air Vietnamese TTS
Vietnamese Text-to-Speech model fine-tuned from NeuTTS-Air on 2.6M+ Vietnamese audio samples.
Model Description
NeuTTS-Air Vietnamese is a Text-to-Speech (TTS) model for Vietnamese, fine-tuned from the NeuTTS-Air base model on a large dataset of 2.6M+ Vietnamese audio samples.
- Base Model: neuphonic/neutts-air (Qwen2.5 0.5B - 552M parameters)
- Language: Vietnamese (vi)
- Task: Text-to-Speech (TTS)
- Training Data: 2.6M+ Vietnamese audio samples
- Audio Codec: NeuCodec
- Sample Rate: 24kHz
- License: Apache 2.0
Features
✅ High Quality Vietnamese TTS - Natural Vietnamese speech synthesis
✅ Large-scale Training - Trained on 2.6M+ samples
✅ Voice Cloning - Clone voice from reference audio
✅ Text Normalization - Automatic Vietnamese text normalization with ViNorm
✅ Fast Inference - About 0.5s per sentence on an RTX 3090
✅ Easy to Use - Simple API and Gradio UI
Quick Start
Installation
pip install torch transformers neucodec phonemizer librosa soundfile vinorm
Install espeak-ng:
# Ubuntu/Debian
sudo apt-get install espeak-ng
# macOS
brew install espeak-ng
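Before loading the phonemizer, it can help to confirm that the install commands above actually put espeak-ng on the system. The following is a minimal sketch using only the Python standard library; the helper name `find_first_executable` is hypothetical. It checks for the command-line binary on `PATH`, which is only a proxy, since phonemizer may locate the espeak-ng shared library by other means.

```python
import shutil

def find_first_executable(candidates):
    """Return the path of the first candidate found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return path
    return None

# Look for the espeak-ng binary first, then the legacy espeak name
espeak_path = find_first_executable(["espeak-ng", "espeak"])
if espeak_path is None:
    print("espeak-ng not found on PATH; install it with the commands above")
else:
    print(f"Found espeak at {espeak_path}")
```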
Usage
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from neucodec import NeuCodec
from phonemizer.backend import EspeakBackend
from vinorm import TTSnorm
import soundfile as sf
import numpy as np
# Load model
model_id = "dinhthuan/neutts-air-vi"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to("cuda")
model.eval()
# Load codec
codec = NeuCodec.from_pretrained("neuphonic/neucodec").to("cuda")
codec.eval()
# Initialize phonemizer
phonemizer = EspeakBackend(language='vi', preserve_punctuation=True, with_stress=True)
# Normalize and phonemize text
text = "Xin chào, đây là mô hình text to speech tiếng Việt"
text_normalized = TTSnorm(text, punc=False, unknown=True, lower=False, rule=False)
phones = phonemizer.phonemize([text_normalized])[0]
# Encode reference audio (for voice cloning)
from librosa import load as librosa_load
ref_audio_path = "reference.wav"
ref_text = "Đây là văn bản tham chiếu"
ref_text_normalized = TTSnorm(ref_text, punc=False, unknown=True, lower=False, rule=False)
ref_phones = phonemizer.phonemize([ref_text_normalized])[0]
wav, _ = librosa_load(ref_audio_path, sr=16000, mono=True)
wav_tensor = torch.from_numpy(wav).float().unsqueeze(0).unsqueeze(0)
with torch.no_grad():
    ref_codes = codec.encode_code(audio_or_path=wav_tensor).squeeze(0).squeeze(0).cpu()
# Generate speech
codes_str = "".join([f"<|speech_{i}|>" for i in ref_codes.tolist()])
combined_phones = ref_phones + " " + phones
chat = f"""user: Convert the text to speech:<|TEXT_PROMPT_START|>{combined_phones}<|TEXT_PROMPT_END|>\nassistant:<|SPEECH_GENERATION_START|>{codes_str}"""
input_ids = tokenizer.encode(chat, return_tensors="pt").to("cuda")
speech_end_id = tokenizer.convert_tokens_to_ids("<|SPEECH_GENERATION_END|>")
with torch.no_grad():
    output = model.generate(
        input_ids,
        max_new_tokens=2048,
        do_sample=True,  # sampling must be enabled for temperature and top_k to take effect
        temperature=1.0,
        top_k=50,
        eos_token_id=speech_end_id,
        pad_token_id=tokenizer.eos_token_id,
    )
# Decode to audio
output_text = tokenizer.decode(output[0], skip_special_tokens=False)
# Extract speech codes and decode with codec...
# (See full implementation in repository)
# Save audio (`audio` is the waveform produced by the codec decoding step elided above)
sf.write("output.wav", audio, 24000)
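The elided step in the usage code turns the generated text back into codec indices. Below is a minimal sketch of that parsing step, assuming the same `<|speech_N|>` token format used to build `codes_str`. The helper names `codes_to_tokens` and `tokens_to_codes` are hypothetical and not taken from the repository, and passing the indices to the codec's decoder is left to the repository's implementation.

```python
import re

# Matches tokens such as <|speech_123|> and captures the integer index
SPEECH_TOKEN = re.compile(r"<\|speech_(\d+)\|>")

def codes_to_tokens(codes):
    """Render integer codec indices as <|speech_N|> tokens, as done for codes_str."""
    return "".join(f"<|speech_{int(c)}|>" for c in codes)

def tokens_to_codes(text):
    """Collect every <|speech_N|> index in order of appearance; other tokens are ignored."""
    return [int(m) for m in SPEECH_TOKEN.findall(text)]

# Toy stand-ins for the prompt (ending with reference codes) and the model's continuation
prompt = "assistant:<|SPEECH_GENERATION_START|>" + codes_to_tokens([12, 7])
generated = codes_to_tokens([3, 65535]) + "<|SPEECH_GENERATION_END|>"

# Parse the continuation separately so the reference codes are not decoded again
new_codes = tokens_to_codes(generated)
print(new_codes)  # [3, 65535]
```

Because `model.generate` returns the prompt tokens followed by the continuation for decoder-only models, one option is to decode only `output[0][input_ids.shape[-1]:]` before parsing, so that the reference codes are excluded from the synthesized audio.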
Using the Inference Script
For easier usage, use the provided inference script:
# Clone repository
git clone https://github.com/iamdinhthuan/neutts-air-fintune
cd neutts-air-fintune
# Install dependencies
pip install -r requirements.txt
# Run inference
python infer_vietnamese.py \
--text "Xin chào Việt Nam" \
--ref_audio "reference.wav" \
--ref_text "Transcript of the reference audio" \
--output "output.wav" \
--checkpoint "path/to/checkpoint"
Gradio UI
python gradio_app.py
Then open http://localhost:7860 in your browser.
Training Details
Training Data
- Dataset Size: 2.6M+ Vietnamese audio samples
- Audio Format: WAV, 16kHz, mono
- Text: Vietnamese with diacritics
- Train/Val Split: 99.5% / 0.5%
Training Configuration
- Base Model: neuphonic/neutts-air (Qwen2.5 0.5B)
- Epochs: 3
- Batch Size: 4 per device
- Gradient Accumulation: 2 steps (effective batch size: 8)
- Learning Rate: 4e-5
- Optimizer: AdamW (fused)
- Precision: BFloat16
- Hardware: NVIDIA RTX 3090 (24GB)
- Training Time: ~2.5-3 days
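The split and batch figures above imply a rough step count. This is a back-of-the-envelope sketch, assuming exactly 2,600,000 samples (the card only says 2.6M+), a single GPU, and a final partial batch that is kept rather than dropped. The variable names are illustrative.

```python
import math

total_samples = 2_600_000   # assumption: the card states "2.6M+"
val_fraction = 0.005        # 0.5% validation split
per_device_batch = 4
grad_accum_steps = 2
num_devices = 1             # assumption: one RTX 3090
epochs = 3

val_samples = round(total_samples * val_fraction)
train_samples = total_samples - val_samples

# Effective batch size = per-device batch x accumulation steps x devices
effective_batch = per_device_batch * grad_accum_steps * num_devices

# One optimizer step consumes one effective batch
steps_per_epoch = math.ceil(train_samples / effective_batch)
total_steps = steps_per_epoch * epochs

print(val_samples, train_samples, effective_batch, steps_per_epoch, total_steps)
# 13000 2587000 8 323375 970125
```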
Optimizations
- ✅ Pre-encoded Dataset - 6x faster training
- ✅ TF32 Precision - 20% speedup on Ampere GPUs
- ✅ Fused AdamW - 10% faster optimizer
- ✅ Dataloader Optimizations - Pin memory, prefetch
- ✅ Increased Batch Size - Better GPU utilization
Total Speedup: 10-12x faster than baseline (30 days → 2.5-3 days)
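As a rough consistency check on the figures above, the three quantified optimizations can be multiplied together and compared with the end-to-end speedup. This sketch assumes the gains compose multiplicatively and apply to total training time, which the card does not state. Under that assumption, the remainder is what the unquantified dataloader and batch-size changes would need to contribute.

```python
# Quantified gains from the list above (assumed to compose multiplicatively)
pre_encoded = 6.0    # pre-encoded dataset
tf32 = 1.20          # TF32 precision
fused_adamw = 1.10   # fused AdamW

quantified = pre_encoded * tf32 * fused_adamw  # about 7.92x

baseline_days = 30.0
for days in (3.0, 2.5):
    total = baseline_days / days
    remainder = total / quantified
    print(f"{days} days: total {total:.1f}x, remainder {remainder:.2f}x")
# 3.0 days: total 10.0x, remainder 1.26x
# 2.5 days: total 12.0x, remainder 1.52x
```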
Performance
Audio Quality
- Sample Rate: 24kHz
- Natural Prosody: Yes
- Voice Cloning: Supported
- Text Normalization: Automatic (numbers, dates, abbreviations)
Inference Speed
- GPU (RTX 3090): ~0.5s per sentence
- CPU: ~3-5s per sentence
Limitations
- Requires reference audio for voice cloning
- Best results with clear, high-quality reference audio (3-10 seconds)
- May struggle with very long sentences (>100 words)
- Requires Vietnamese text with proper diacritics for best quality
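The limit on long sentences is related to the token budget in the usage example (`max_new_tokens=2048`). A small sketch of the arithmetic follows, assuming the codec emits 50 tokens per second of audio. That frame rate is an assumption to verify against the NeuCodec documentation, and the function names are illustrative.

```python
TOKENS_PER_SECOND = 50  # assumption: NeuCodec frame rate; verify against the codec's docs

def max_audio_seconds(max_new_tokens, tokens_per_second=TOKENS_PER_SECOND):
    """Upper bound on generated audio length for a given token budget."""
    return max_new_tokens / tokens_per_second

def reference_tokens(ref_seconds, tokens_per_second=TOKENS_PER_SECOND):
    """Approximate number of speech tokens a reference clip adds to the prompt."""
    return round(ref_seconds * tokens_per_second)

print(max_audio_seconds(2048))                     # 40.96
print(reference_tokens(3), reference_tokens(10))   # 150 500
```

Under that assumption, a single generation call cannot produce more than about 41 seconds of audio, so splitting long input into sentences before synthesis is one way to stay within the budget.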
Ethical Considerations
⚠️ Voice Cloning Ethics:
- Only use reference audio with proper consent
- Do not use for impersonation or fraud
- Respect privacy and intellectual property rights
⚠️ Potential Misuse:
- Deepfake audio generation
- Unauthorized voice cloning
- Misinformation campaigns
Recommended Use:
- Accessibility tools (text-to-speech for visually impaired users)
- Educational content
- Virtual assistants
- Audiobook narration (with consent)
- Language learning applications
Citation
If you use this model, please cite:
@misc{neutts-air-vietnamese,
author = {Thuan Dinh Nguyen},
title = {NeuTTS-Air Vietnamese TTS},
year = {2025},
publisher = {HuggingFace},
howpublished = {\url{https://huggingface.co/dinhthuan/neutts-air-vi}},
}
@misc{neutts-air,
author = {Neuphonic},
title = {NeuTTS-Air: Scalable TTS with Qwen2.5},
year = {2024},
publisher = {HuggingFace},
howpublished = {\url{https://huggingface.co/neuphonic/neutts-air}},
}
Acknowledgments
- Base Model: Neuphonic for NeuTTS-Air
- Backbone: Qwen Team for Qwen2.5
- Codec: Neuphonic for NeuCodec
- Phonemizer: espeak-ng
- Text Normalization: ViNorm
Repository
Full training and inference code: https://github.com/iamdinhthuan/neutts-air-fintune
License
Apache 2.0 - See LICENSE for details.
Contact
For questions or issues, please open an issue on GitHub.
Model Card Authors: Thuan Dinh Nguyen
Last Updated: 2025-01-01