---
language:
- vi
license: apache-2.0
tags:
- text-to-speech
- tts
- vietnamese
- audio
- speech-synthesis
- neutts-air
- qwen2.5
datasets:
- custom
metrics:
- wer
library_name: transformers
pipeline_tag: text-to-speech
---

# NeuTTS-Air Vietnamese TTS

Vietnamese Text-to-Speech model fine-tuned from [NeuTTS-Air](https://huggingface.co/neuphonic/neutts-air) on 2.6M+ Vietnamese audio samples.

## Model Description

**NeuTTS-Air Vietnamese** is a Text-to-Speech (TTS) model for Vietnamese, fine-tuned from the NeuTTS-Air base model on a large dataset of 2.6M+ Vietnamese audio samples.

- **Base Model:** [neuphonic/neutts-air](https://huggingface.co/neuphonic/neutts-air) (Qwen2.5 0.5B - 552M parameters)
- **Language:** Vietnamese (vi)
- **Task:** Text-to-Speech (TTS)
- **Training Data:** 2.6M+ Vietnamese audio samples
- **Audio Codec:** [NeuCodec](https://huggingface.co/neuphonic/neucodec)
- **Sample Rate:** 24kHz
- **License:** Apache 2.0

## Features

- ✅ **High Quality Vietnamese TTS** - Natural Vietnamese speech synthesis
- ✅ **Large-scale Training** - Trained on 2.6M+ samples
- ✅ **Voice Cloning** - Clone voice from reference audio
- ✅ **Text Normalization** - Automatic Vietnamese text normalization with ViNorm
- ✅ **Fast Inference** - Optimized for production use
- ✅ **Easy to Use** - Simple API and Gradio UI

## Quick Start

### Installation

```bash
pip install torch transformers neucodec phonemizer librosa soundfile vinorm
```

**Install espeak-ng:**

```bash
# Ubuntu/Debian
sudo apt-get install espeak-ng

# macOS
brew install espeak-ng
```

### Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from neucodec import NeuCodec
from phonemizer.backend import EspeakBackend
from vinorm import TTSnorm
from librosa import load as librosa_load
import soundfile as sf
import numpy as np

# Load model
model_id = "dinhthuan/neutts-air-vi"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to("cuda")
model.eval()

# Load codec
codec = NeuCodec.from_pretrained("neuphonic/neucodec").to("cuda")
codec.eval()

# Initialize phonemizer
phonemizer = EspeakBackend(language='vi', preserve_punctuation=True, with_stress=True)

# Normalize and phonemize text
text = "Xin chào, đây là mô hình text to speech tiếng Việt"
text_normalized = TTSnorm(text, punc=False, unknown=True, lower=False, rule=False)
phones = phonemizer.phonemize([text_normalized])[0]

# Encode reference audio (for voice cloning)
ref_audio_path = "reference.wav"
ref_text = "Đây là văn bản tham chiếu"  # must be the transcript of reference.wav
ref_text_normalized = TTSnorm(ref_text, punc=False, unknown=True, lower=False, rule=False)
ref_phones = phonemizer.phonemize([ref_text_normalized])[0]

# The codec encoder expects 16 kHz mono input shaped [batch, channel, samples]
wav, _ = librosa_load(ref_audio_path, sr=16000, mono=True)
wav_tensor = torch.from_numpy(wav).float().unsqueeze(0).unsqueeze(0)
with torch.no_grad():
    ref_codes = codec.encode_code(audio_or_path=wav_tensor).squeeze(0).squeeze(0).cpu()

# Generate speech: the prompt holds the reference phonemes, the target phonemes,
# and the reference speech codes that the model continues from
codes_str = "".join([f"<|speech_{i}|>" for i in ref_codes.tolist()])
combined_phones = ref_phones + " " + phones
chat = f"""user: Convert the text to speech:<|TEXT_PROMPT_START|>{combined_phones}<|TEXT_PROMPT_END|>\nassistant:<|SPEECH_GENERATION_START|>{codes_str}"""

input_ids = tokenizer.encode(chat, return_tensors="pt").to("cuda")
speech_end_id = tokenizer.convert_tokens_to_ids("<|SPEECH_GENERATION_END|>")

with torch.no_grad():
    output = model.generate(
        input_ids,
        max_new_tokens=2048,
        do_sample=True,  # sampling must be enabled for temperature/top_k to take effect
        temperature=1.0,
        top_k=50,
        eos_token_id=speech_end_id,
        pad_token_id=tokenizer.eos_token_id,
    )

# Decode to audio
# output[0] contains the prompt tokens followed by the newly generated tokens
output_text = tokenizer.decode(output[0], skip_special_tokens=False)
# Extract speech codes and decode with codec...
# (See full implementation in repository)

# Save audio
# `audio` is the 1-D waveform produced by the decoding step above (not shown here)
sf.write("output.wav", audio, 24000)
```
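
The decoding step elided above can be sketched as follows. This is a minimal illustration and not the repository's implementation: the helper name `extract_speech_codes` and its `num_ref_codes` parameter are hypothetical, and the commented-out `codec.decode_code` call is an assumption based on how the upstream NeuTTS-Air reference code uses NeuCodec. Because the decoded text includes the prompt, the reference codes appear first and are skipped.

```python
import re

# Matches tokens of the form <|speech_123|> and captures the integer ID
SPEECH_TOKEN = re.compile(r"<\|speech_(\d+)\|>")


def extract_speech_codes(output_text: str, num_ref_codes: int = 0) -> list[int]:
    """Return the codec IDs found in the text, skipping the first
    `num_ref_codes` IDs (the reference codes that were part of the prompt)."""
    codes = [int(match) for match in SPEECH_TOKEN.findall(output_text)]
    return codes[num_ref_codes:]


# Toy string standing in for `output_text`; one reference code, two generated codes
demo = (
    "<|SPEECH_GENERATION_START|><|speech_12|><|speech_7|><|speech_301|>"
    "<|SPEECH_GENERATION_END|>"
)
generated = extract_speech_codes(demo, num_ref_codes=1)
print(generated)  # [7, 301]

# Assumed decode step (needs `torch` and the `codec` from the example above; not run here):
# codes = torch.tensor(generated, dtype=torch.long)[None, None, :].to("cuda")
# with torch.no_grad():
#     audio = codec.decode_code(codes).cpu().numpy()[0, 0, :]
```

In the Usage example, `num_ref_codes` would be `len(ref_codes)`, so that only the newly generated speech is decoded and written to `output.wav`.
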

### Using the Inference Script

For easier usage, use the provided inference script:

```bash
# Clone repository
git clone https://github.com/iamdinhthuan/neutts-air-fintune
cd neutts-air-fintune

# Install dependencies
pip install -r requirements.txt

# Run inference
python infer_vietnamese.py \
    --text "Xin chào Việt Nam" \
    --ref_audio "reference.wav" \
    --ref_text "Text của reference audio" \
    --output "output.wav" \
    --checkpoint "path/to/checkpoint"
```

### Gradio UI

```bash
python gradio_app.py
```

Then open http://localhost:7860 in your browser.

## Training Details

### Training Data

- **Dataset Size:** 2.6M+ Vietnamese audio samples
- **Audio Format:** WAV, 16kHz, mono
- **Text:** Vietnamese with diacritics
- **Train/Val Split:** 99.5% / 0.5%

### Training Configuration

- **Base Model:** neuphonic/neutts-air (Qwen2.5 0.5B)
- **Epochs:** 3
- **Batch Size:** 4 per device
- **Gradient Accumulation:** 2 steps (effective batch size: 8)
- **Learning Rate:** 4e-5
- **Optimizer:** AdamW (fused)
- **Precision:** BFloat16
- **Hardware:** NVIDIA RTX 3090 (24GB)
- **Training Time:** ~2.5-3 days

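
To make the configuration concrete, the sketch below works out the effective batch size and the approximate number of optimizer steps. It assumes a single GPU and a dataset of exactly 2.6M samples (the card only says "2.6M+"), so the step counts are rough lower bounds; the function name `training_steps` is hypothetical.

```python
import math


def training_steps(num_samples, train_fraction, per_device_batch,
                   grad_accum, epochs, num_devices=1):
    """Return (effective batch size, optimizer steps per epoch, total steps)."""
    effective_batch = per_device_batch * grad_accum * num_devices
    train_samples = round(num_samples * train_fraction)
    steps_per_epoch = math.ceil(train_samples / effective_batch)
    return effective_batch, steps_per_epoch, steps_per_epoch * epochs


# Values from the lists above; 2_600_000 and num_devices=1 are assumptions
effective, per_epoch, total = training_steps(
    num_samples=2_600_000,
    train_fraction=0.995,
    per_device_batch=4,
    grad_accum=2,
    epochs=3,
)
print(effective, per_epoch, total)  # 8 323375 970125
```

Under these assumptions, three epochs correspond to roughly 970k optimizer steps.
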

### Optimizations

- ✅ **Pre-encoded Dataset** - 6x faster training
- ✅ **TF32 Precision** - 20% speedup on Ampere GPUs
- ✅ **Fused AdamW** - 10% faster optimizer
- ✅ **Dataloader Optimizations** - Pin memory, prefetch
- ✅ **Increased Batch Size** - Better GPU utilization

**Total Speedup:** 10-12x faster than baseline (30 days → 2.5-3 days)

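
The reported range follows directly from the training times, as the short check below shows. The second half is only an illustration: it assumes the three quantified factors compose multiplicatively and reads "20% speedup" and "10% faster" as 1.2x and 1.1x, which the card does not state.

```python
def speedup_range(baseline_days, optimized_days):
    """Return (min, max) speedup for a baseline and an optimized (low, high) duration."""
    fastest, slowest = optimized_days
    return baseline_days / slowest, baseline_days / fastest


low, high = speedup_range(30, (2.5, 3.0))
print(low, high)  # 10.0 12.0

# Assumption: pre-encoding (6x), TF32 (1.2x), and fused AdamW (1.1x) multiply
quantified = 6 * 1.2 * 1.1
print(round(quantified, 2))  # 7.92
```

Under that assumption, the remaining gap between about 7.9x and the reported 10-12x would come from the unquantified dataloader and batch-size changes.
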

## Performance

### Audio Quality

- **Sample Rate:** 24kHz
- **Natural Prosody:** Yes
- **Voice Cloning:** Supported
- **Text Normalization:** Automatic (numbers, dates, abbreviations)

### Inference Speed

- **GPU (RTX 3090):** ~0.5s per sentence
- **CPU:** ~3-5s per sentence

## Limitations

- Requires reference audio for voice cloning
- Best results with clear, high-quality reference audio (3-10 seconds)
- May struggle with very long sentences (>100 words)
- Requires Vietnamese text with proper diacritics for best quality

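
One way to work around the long-sentence limitation is to split the input into shorter chunks before synthesis. The sketch below is a hypothetical helper, not part of the repository: `split_long_text` and its `max_words` default are assumptions chosen to match the 100-word figure above.

```python
import re


def split_long_text(text: str, max_words: int = 100) -> list[str]:
    """Split text into chunks of at most `max_words` words,
    breaking at sentence boundaries where possible."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?…])\s+", text) if s.strip()]
    chunks, current = [], []
    for sentence in sentences:
        words = sentence.split()
        # A single sentence above the limit is hard-wrapped by word count
        while len(words) > max_words:
            if current:
                chunks.append(" ".join(current))
                current = []
            chunks.append(" ".join(words[:max_words]))
            words = words[max_words:]
        # Start a new chunk when adding this sentence would exceed the limit
        if len(current) + len(words) > max_words:
            chunks.append(" ".join(current))
            current = []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "Một hai ba. Bốn năm sáu bảy. Tám chín."
print(split_long_text(sample, max_words=5))
# ['Một hai ba.', 'Bốn năm sáu bảy.', 'Tám chín.']
```

Each chunk can then be synthesized separately with the same reference audio and the resulting waveforms concatenated, at the cost of possible prosody breaks at chunk boundaries.
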

## Ethical Considerations

⚠️ **Voice Cloning Ethics:**
- Only use reference audio with proper consent
- Do not use for impersonation or fraud
- Respect privacy and intellectual property rights

⚠️ **Potential Misuse:**
- Deepfake audio generation
- Unauthorized voice cloning
- Misinformation campaigns

**Recommended Use:**
- Accessibility tools (text-to-speech for visually impaired)
- Educational content
- Virtual assistants
- Audiobook narration (with consent)
- Language learning applications

## Citation

If you use this model, please cite:

```bibtex
@misc{neutts-air-vietnamese,
  author = {Thuan Dinh Nguyen},
  title = {NeuTTS-Air Vietnamese TTS},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/YOUR_USERNAME/neutts-air-vietnamese}},
}

@misc{neutts-air,
  author = {Neuphonic},
  title = {NeuTTS-Air: Scalable TTS with Qwen2.5},
  year = {2024},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/neuphonic/neutts-air}},
}
```

## Acknowledgments

- **Base Model:** [Neuphonic](https://github.com/neuphonic) for NeuTTS-Air
- **Backbone:** [Qwen Team](https://github.com/QwenLM) for Qwen2.5
- **Codec:** [Neuphonic](https://github.com/neuphonic) for NeuCodec
- **Phonemizer:** [espeak-ng](https://github.com/espeak-ng/espeak-ng)
- **Text Normalization:** [ViNorm](https://github.com/v-nhandt21/ViNorm)

## Repository

Full training and inference code: [https://github.com/iamdinhthuan/neutts-air-fintune](https://github.com/iamdinhthuan/neutts-air-fintune)

## License

Apache 2.0 - See [LICENSE](LICENSE) for details.

## Contact

For questions or issues, please open an issue on [GitHub](https://github.com/iamdinhthuan/neutts-air-fintune/issues).

---

**Model Card Authors:** Your Name

**Last Updated:** 2025-01-01