---
language:
- vi
license: apache-2.0
tags:
- text-to-speech
- tts
- vietnamese
- audio
- speech-synthesis
- neutts-air
- qwen2.5
datasets:
- custom
metrics:
- wer
library_name: transformers
pipeline_tag: text-to-speech
---
# NeuTTS-Air Vietnamese TTS
A Vietnamese text-to-speech model fine-tuned from [NeuTTS-Air](https://huggingface.co/neuphonic/neutts-air) on 2.6M+ Vietnamese audio samples.
## Model Description
**NeuTTS-Air Vietnamese** is a Text-to-Speech (TTS) model for Vietnamese, fine-tuned from the NeuTTS-Air base model on a large dataset of 2.6M+ Vietnamese audio samples.
- **Base Model:** [neuphonic/neutts-air](https://huggingface.co/neuphonic/neutts-air) (Qwen2.5 0.5B - 552M parameters)
- **Language:** Vietnamese (vi)
- **Task:** Text-to-Speech (TTS)
- **Training Data:** 2.6M+ Vietnamese audio samples
- **Audio Codec:** [NeuCodec](https://huggingface.co/neuphonic/neucodec)
- **Sample Rate:** 24kHz
- **License:** Apache 2.0
## Features
- **High Quality Vietnamese TTS** - Natural Vietnamese speech synthesis
- **Large-scale Training** - Trained on 2.6M+ samples
- **Voice Cloning** - Clone a voice from a short reference audio clip and its transcript
- **Text Normalization** - Automatic Vietnamese text normalization with ViNorm
- **Fast Inference** - About 0.5s per sentence on an RTX 3090
- **Easy to Use** - Simple API and Gradio UI
## Quick Start
### Installation
```bash
pip install torch transformers neucodec phonemizer librosa soundfile vinorm
```
**Install espeak-ng:**
```bash
# Ubuntu/Debian
sudo apt-get install espeak-ng
# macOS
brew install espeak-ng
```
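Because the phonemizer depends on the espeak-ng binary, it can help to confirm that the binary is on `PATH` before loading the model. The following is a minimal sketch using only the standard library; the helper name `find_espeak` is hypothetical, and falling back to a binary named `espeak` is an assumption about the system.

```python
import shutil
from typing import Optional, Sequence


def find_espeak(candidates: Sequence[str] = ("espeak-ng", "espeak")) -> Optional[str]:
    """Return the path of the first candidate executable found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return path
    return None


if __name__ == "__main__":
    path = find_espeak()
    if path is None:
        print("espeak-ng not found; install it with apt-get or brew first")
    else:
        print(f"Using espeak binary at {path}")
```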
### Usage
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from neucodec import NeuCodec
from phonemizer.backend import EspeakBackend
from vinorm import TTSnorm
import soundfile as sf
import numpy as np
# Load model
model_id = "dinhthuan/neutts-air-vi"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to("cuda")
model.eval()
# Load codec
codec = NeuCodec.from_pretrained("neuphonic/neucodec").to("cuda")
codec.eval()
# Initialize phonemizer
phonemizer = EspeakBackend(language='vi', preserve_punctuation=True, with_stress=True)
# Normalize and phonemize text
text = "Xin chào, đây là mô hình text to speech tiếng Việt"
text_normalized = TTSnorm(text, punc=False, unknown=True, lower=False, rule=False)
phones = phonemizer.phonemize([text_normalized])[0]
# Encode reference audio (for voice cloning)
from librosa import load as librosa_load
ref_audio_path = "reference.wav"
ref_text = "Đây là văn bản tham chiếu"
ref_text_normalized = TTSnorm(ref_text, punc=False, unknown=True, lower=False, rule=False)
ref_phones = phonemizer.phonemize([ref_text_normalized])[0]
wav, _ = librosa_load(ref_audio_path, sr=16000, mono=True)
wav_tensor = torch.from_numpy(wav).float().unsqueeze(0).unsqueeze(0)
with torch.no_grad():
    ref_codes = codec.encode_code(audio_or_path=wav_tensor).squeeze(0).squeeze(0).cpu()
# Generate speech
codes_str = "".join([f"<|speech_{i}|>" for i in ref_codes.tolist()])
combined_phones = ref_phones + " " + phones
chat = f"""user: Convert the text to speech:<|TEXT_PROMPT_START|>{combined_phones}<|TEXT_PROMPT_END|>\nassistant:<|SPEECH_GENERATION_START|>{codes_str}"""
input_ids = tokenizer.encode(chat, return_tensors="pt").to("cuda")
speech_end_id = tokenizer.convert_tokens_to_ids("<|SPEECH_GENERATION_END|>")
with torch.no_grad():
    output = model.generate(
        input_ids,
        max_new_tokens=2048,
        do_sample=True,  # temperature and top_k only take effect when sampling is enabled
        temperature=1.0,
        top_k=50,
        eos_token_id=speech_end_id,
        pad_token_id=tokenizer.eos_token_id,
    )
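# Decode to audio
import re
# Keep only the newly generated tokens so the reference codes in the prompt are not decoded again
new_tokens = output[0, input_ids.shape[-1]:]
output_text = tokenizer.decode(new_tokens, skip_special_tokens=False)
# Extract speech codes and decode with the codec.
# This mirrors the upstream NeuTTS-Air decoding approach and is a sketch;
# see the full implementation in the repository.
speech_ids = [int(n) for n in re.findall(r"<\|speech_(\d+)\|>", output_text)]
if not speech_ids:
    raise RuntimeError("No speech tokens were generated")
codes = torch.tensor(speech_ids, dtype=torch.long)[None, None, :].to("cuda")
with torch.no_grad():
    audio = codec.decode_code(codes).cpu().numpy()[0, 0, :]
# Save audio
sf.write("output.wav", audio, 24000)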
```
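This model card recommends a clear reference clip of 3-10 seconds, so a quick length check before encoding can catch unsuitable files early. The sketch below uses the standard-library `wave` module, which only reads uncompressed PCM WAV files; the helper names and the default bounds are illustrative assumptions based on that recommendation.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Duration of a PCM WAV file in seconds (frame count / frame rate)."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / float(wf.getframerate())


def reference_duration_ok(path: str, min_s: float = 3.0, max_s: float = 10.0) -> bool:
    """True if the clip length falls inside the recommended range."""
    return min_s <= wav_duration_seconds(path) <= max_s
```

For example, `reference_duration_ok("reference.wav")` could be called before the encoding step, falling back to a different clip when it returns `False`.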
### Using the Inference Script
For easier usage, use the provided inference script:
```bash
# Clone repository
git clone https://github.com/iamdinhthuan/neutts-air-fintune
cd neutts-air-fintune
# Install dependencies
pip install -r requirements.txt
# Run inference
python infer_vietnamese.py \
  --text "Xin chào Việt Nam" \
  --ref_audio "reference.wav" \
  --ref_text "Transcript of the reference audio" \
  --output "output.wav" \
  --checkpoint "path/to/checkpoint"
```
### Gradio UI
```bash
python gradio_app.py
```
Then open http://localhost:7860 in your browser.
## Training Details
### Training Data
- **Dataset Size:** 2.6M+ Vietnamese audio samples
- **Audio Format:** WAV, 16kHz, mono
- **Text:** Vietnamese with diacritics
- **Train/Val Split:** 99.5% / 0.5%
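With 2.6M samples, a 99.5% / 0.5% split leaves roughly 13,000 clips for validation. The following is a minimal sketch of a seeded split; the function name, the seed, and the shuffle-then-slice approach are assumptions for illustration, not the repository's actual procedure.

```python
import random
from typing import Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def train_val_split(
    items: Iterable[T], val_fraction: float = 0.005, seed: int = 42
) -> Tuple[List[T], List[T]]:
    """Shuffle a copy of items deterministically and split off a validation set."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Keep at least one validation item for non-empty inputs
    n_val = max(1, round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


if __name__ == "__main__":
    train, val = train_val_split(range(200_000))
    print(len(train), len(val))
```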
### Training Configuration
- **Base Model:** neuphonic/neutts-air (Qwen2.5 0.5B)
- **Epochs:** 3
- **Batch Size:** 4 per device
- **Gradient Accumulation:** 2 steps (effective batch size: 8)
- **Learning Rate:** 4e-5
- **Optimizer:** AdamW (fused)
- **Precision:** BFloat16
- **Hardware:** NVIDIA RTX 3090 (24GB)
- **Training Time:** ~2.5-3 days
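The effective batch size and the resulting number of optimizer steps follow from the figures above. This is a worked sketch, assuming a single GPU (as listed under Hardware), exactly 2.6M samples, and the 99.5% training split; the real step count depends on the exact dataset size and on whether the last partial batch is dropped.

```python
import math

per_device_batch = 4
grad_accum_steps = 2
num_gpus = 1               # assumption: a single RTX 3090
num_samples = 2_600_000    # assumption: lower bound of "2.6M+"
train_fraction = 0.995
epochs = 3

# Samples consumed per optimizer step
effective_batch = per_device_batch * grad_accum_steps * num_gpus

train_samples = round(num_samples * train_fraction)
steps_per_epoch = math.ceil(train_samples / effective_batch)
total_steps = steps_per_epoch * epochs

print(f"effective batch size: {effective_batch}")
print(f"optimizer steps per epoch: {steps_per_epoch}")
print(f"optimizer steps in total: {total_steps}")
```

Under these assumptions the run performs a little over 323,000 optimizer steps per epoch.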
### Optimizations
- **Pre-encoded Dataset** - 6x faster training
- **TF32 Precision** - 20% speedup on Ampere GPUs
- **Fused AdamW** - 10% faster optimizer
- **Dataloader Optimizations** - Pin memory, prefetch
- **Increased Batch Size** - Better GPU utilization
**Total Speedup:** 10-12x faster than baseline (30 days → 2.5-3 days)
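The total figure can be checked against the wall-clock times, and the itemized gains can be combined for comparison. This is a rough sketch that assumes the individual speedups multiply independently, which the figures above do not state.

```python
baseline_days = 30.0
optimized_days = (2.5, 3.0)

# Overall speedup from wall-clock times: baseline / optimized
overall = sorted(baseline_days / d for d in optimized_days)

# Itemized gains quoted in the list above, expressed as multipliers
itemized = {"pre-encoded dataset": 6.0, "tf32": 1.2, "fused adamw": 1.1}
combined = 1.0
for factor in itemized.values():
    combined *= factor

# Under the independence assumption, the remainder would come from the
# items without a quoted figure (dataloader settings, larger batch size)
residual = [total / combined for total in overall]

print(f"overall: {overall[0]:.0f}x to {overall[1]:.0f}x")
print(f"itemized product: {combined:.2f}x")
print(f"residual: {residual[0]:.2f}x to {residual[1]:.2f}x")
```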
## Performance
### Audio Quality
- **Sample Rate:** 24kHz
- **Natural Prosody:** Yes
- **Voice Cloning:** Supported
- **Text Normalization:** Automatic (numbers, dates, abbreviations)
### Inference Speed
- **GPU (RTX 3090):** ~0.5s per sentence
- **CPU:** ~3-5s per sentence
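Per-sentence latency depends on sentence length, so a real-time factor (synthesis time divided by audio duration) is often easier to compare across machines. The following is a minimal timing sketch; `synthesize` is a hypothetical callable standing in for the generation and decoding steps, and it is assumed to return a one-dimensional sequence of samples at 24kHz.

```python
import time
from typing import Callable, Sequence, Tuple


def real_time_factor(
    synthesize: Callable[[str], Sequence[float]],
    text: str,
    sample_rate: int = 24_000,
) -> Tuple[float, float]:
    """Return (elapsed_seconds, rtf) for one synthesis call.

    An rtf below 1 means audio is produced faster than it plays back.
    """
    start = time.perf_counter()
    audio = synthesize(text)
    elapsed = time.perf_counter() - start
    audio_seconds = len(audio) / sample_rate
    if audio_seconds == 0:
        raise ValueError("synthesize returned no samples")
    return elapsed, elapsed / audio_seconds
```

When timing on a GPU, `synthesize` should return CPU data or synchronize the device before returning; otherwise asynchronous kernel launches can make the elapsed time look shorter than it is.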
## Limitations
- Requires reference audio for voice cloning
- Best results with clear, high-quality reference audio (3-10 seconds)
- May struggle with very long sentences (>100 words)
- Requires Vietnamese text with proper diacritics for best quality
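One way to work around the long-sentence limitation is to split the input into shorter chunks at sentence boundaries and synthesize each chunk separately. The following is a minimal sketch; the helper `chunk_text`, the 100-word default, and the punctuation-based splitting rule are assumptions for illustration and are not part of the repository.

```python
import re
from typing import List


def chunk_text(text: str, max_words: int = 100) -> List[str]:
    """Greedily pack sentences into chunks of at most max_words words.

    A single sentence longer than max_words is split on word boundaries.
    """
    # Split after sentence-ending punctuation followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?…])\s+", text.strip()) if s]
    chunks: List[str] = []
    current: List[str] = []
    for sentence in sentences:
        words = sentence.split()
        # Hard-split any sentence that is too long on its own
        while len(words) > max_words:
            if current:
                chunks.append(" ".join(current))
                current = []
            chunks.append(" ".join(words[:max_words]))
            words = words[max_words:]
        # Start a new chunk when this sentence would overflow the current one
        if len(current) + len(words) > max_words:
            chunks.append(" ".join(current))
            current = []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each chunk could then be synthesized with the same reference audio and the resulting waveforms concatenated, at the cost of possible prosody breaks at chunk boundaries.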
## Ethical Considerations
⚠️ **Voice Cloning Ethics:**
- Only use reference audio with proper consent
- Do not use for impersonation or fraud
- Respect privacy and intellectual property rights
⚠️ **Potential Misuse:**
- Deepfake audio generation
- Unauthorized voice cloning
- Misinformation campaigns
**Recommended Use:**
- Accessibility tools (text-to-speech for visually impaired)
- Educational content
- Virtual assistants
- Audiobook narration (with consent)
- Language learning applications
## Citation
If you use this model, please cite:
```bibtex
@misc{neutts-air-vietnamese,
author = {Thuan Dinh Nguyen},
title = {NeuTTS-Air Vietnamese TTS},
year = {2025},
publisher = {HuggingFace},
howpublished = {\url{https://huggingface.co/dinhthuan/neutts-air-vi}},
}
@misc{neutts-air,
author = {Neuphonic},
title = {NeuTTS-Air: Scalable TTS with Qwen2.5},
year = {2024},
publisher = {HuggingFace},
howpublished = {\url{https://huggingface.co/neuphonic/neutts-air}},
}
```
## Acknowledgments
- **Base Model:** [Neuphonic](https://github.com/neuphonic) for NeuTTS-Air
- **Backbone:** [Qwen Team](https://github.com/QwenLM) for Qwen2.5
- **Codec:** [Neuphonic](https://github.com/neuphonic) for NeuCodec
- **Phonemizer:** [espeak-ng](https://github.com/espeak-ng/espeak-ng)
- **Text Normalization:** [ViNorm](https://github.com/v-nhandt21/ViNorm)
## Repository
Full training and inference code: [https://github.com/iamdinhthuan/neutts-air-fintune](https://github.com/iamdinhthuan/neutts-air-fintune)
## License
Apache 2.0 - See [LICENSE](LICENSE) for details.
## Contact
For questions or issues, please open an issue on [GitHub](https://github.com/iamdinhthuan/neutts-air-fintune/issues).
---
**Model Card Authors:** Thuan Dinh Nguyen
**Last Updated:** 2025-01-01