---
language:
- vi
license: apache-2.0
tags:
- text-to-speech
- tts
- vietnamese
- audio
- speech-synthesis
- neutts-air
- qwen2.5
datasets:
- custom
metrics:
- wer
library_name: transformers
pipeline_tag: text-to-speech
---

# NeuTTS-Air Vietnamese TTS

Vietnamese text-to-speech model fine-tuned from [NeuTTS-Air](https://huggingface.co/neuphonic/neutts-air) on 2.6M+ Vietnamese audio samples.

## Model Description

**NeuTTS-Air Vietnamese** is a text-to-speech (TTS) model for Vietnamese, fine-tuned from the NeuTTS-Air base model on a large dataset of 2.6M+ Vietnamese audio samples.

- **Base Model:** [neuphonic/neutts-air](https://huggingface.co/neuphonic/neutts-air) (Qwen2.5 0.5B - 552M parameters)
- **Language:** Vietnamese (vi)
- **Task:** Text-to-Speech (TTS)
- **Training Data:** 2.6M+ Vietnamese audio samples
- **Audio Codec:** [NeuCodec](https://huggingface.co/neuphonic/neucodec)
- **Sample Rate:** 24kHz
- **License:** Apache 2.0

## Features

✅ **High Quality Vietnamese TTS** - Natural Vietnamese speech synthesis  
✅ **Large-scale Training** - Trained on 2.6M+ samples  
✅ **Voice Cloning** - Clone voice from reference audio  
✅ **Text Normalization** - Automatic Vietnamese text normalization with ViNorm  
✅ **Fast Inference** - Optimized for production use  
✅ **Easy to Use** - Simple API and Gradio UI  

## Quick Start

### Installation

```bash
pip install torch transformers neucodec phonemizer librosa soundfile vinorm
```

**Install espeak-ng:**

```bash
# Ubuntu/Debian
sudo apt-get install espeak-ng

# macOS
brew install espeak-ng
```
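
The `phonemizer` package depends on a system-level espeak-ng installation, so a missing install usually only surfaces when `EspeakBackend` is constructed. A minimal pre-flight check, assuming the executable is on `PATH` under the name `espeak-ng` (the helper name `find_espeak_ng` is illustrative and not part of any library):

```python
import shutil
from typing import Optional


def find_espeak_ng() -> Optional[str]:
    """Return the path of the espeak-ng executable, or None if it is not on PATH."""
    return shutil.which("espeak-ng")


path = find_espeak_ng()
if path is None:
    print("espeak-ng not found; install it before creating the phonemizer")
else:
    print(f"espeak-ng found at {path}")
```

Finding the executable is only a proxy: it does not prove that the Vietnamese voice is available or that `phonemizer` can load the shared library.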

### Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from neucodec import NeuCodec
from phonemizer.backend import EspeakBackend
from vinorm import TTSnorm
import soundfile as sf
import numpy as np

# Load model
model_id = "dinhthuan/neutts-air-vi"  
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to("cuda")
model.eval()

# Load codec
codec = NeuCodec.from_pretrained("neuphonic/neucodec").to("cuda")
codec.eval()

# Initialize phonemizer
phonemizer = EspeakBackend(language='vi', preserve_punctuation=True, with_stress=True)

# Normalize and phonemize text
text = "Xin chào, đây là mô hình text to speech tiếng Việt"
text_normalized = TTSnorm(text, punc=False, unknown=True, lower=False, rule=False)
phones = phonemizer.phonemize([text_normalized])[0]

# Encode reference audio (for voice cloning)
from librosa import load as librosa_load
ref_audio_path = "reference.wav"
ref_text = "Đây là văn bản tham chiếu"
ref_text_normalized = TTSnorm(ref_text, punc=False, unknown=True, lower=False, rule=False)
ref_phones = phonemizer.phonemize([ref_text_normalized])[0]

wav, _ = librosa_load(ref_audio_path, sr=16000, mono=True)
wav_tensor = torch.from_numpy(wav).float().unsqueeze(0).unsqueeze(0)
with torch.no_grad():
    ref_codes = codec.encode_code(audio_or_path=wav_tensor).squeeze(0).squeeze(0).cpu()

# Generate speech
codes_str = "".join([f"<|speech_{i}|>" for i in ref_codes.tolist()])
combined_phones = ref_phones + " " + phones
chat = f"""user: Convert the text to speech:<|TEXT_PROMPT_START|>{combined_phones}<|TEXT_PROMPT_END|>\nassistant:<|SPEECH_GENERATION_START|>{codes_str}"""

input_ids = tokenizer.encode(chat, return_tensors="pt").to("cuda")
speech_end_id = tokenizer.convert_tokens_to_ids("<|SPEECH_GENERATION_END|>")

with torch.no_grad():
    output = model.generate(
        input_ids,
        max_new_tokens=2048,
        do_sample=True,  # temperature and top_k only take effect when sampling
        temperature=1.0,
        top_k=50,
        eos_token_id=speech_end_id,
        pad_token_id=tokenizer.eos_token_id,
    )

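# Decode to audio: keep only the newly generated tokens, not the prompt
import re
new_tokens = output[0, input_ids.shape[-1]:]
output_text = tokenizer.decode(new_tokens, skip_special_tokens=False)

# Extract the speech code ids and decode them with the codec.
# This mirrors the upstream NeuTTS-Air decode step; see the repository
# for the full implementation.
speech_ids = [int(n) for n in re.findall(r"<\|speech_(\d+)\|>", output_text)]
codes = torch.tensor(speech_ids, dtype=torch.long)[None, None, :].to("cuda")
with torch.no_grad():
    audio = codec.decode_code(codes).cpu().numpy()[0, 0, :]

# Save audio (NeuCodec output is 24 kHz)
sf.write("output.wav", audio, 24000)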
```
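
The prompt layout and the `<|speech_N|>` token format used above can be isolated into pure functions, which makes them testable without a GPU or model weights. The following is a sketch; the helper names (`codes_to_tokens`, `tokens_to_codes`, `build_prompt`) are hypothetical and not part of the repository:

```python
import re
from typing import List

# Matches one speech token such as <|speech_123|> and captures the code id.
SPEECH_TOKEN = re.compile(r"<\|speech_(\d+)\|>")


def codes_to_tokens(codes: List[int]) -> str:
    """Serialize codec ids into the speech-token string the model consumes."""
    return "".join(f"<|speech_{c}|>" for c in codes)


def tokens_to_codes(text: str) -> List[int]:
    """Recover codec ids from decoded model output, ignoring all other tokens."""
    return [int(m) for m in SPEECH_TOKEN.findall(text)]


def build_prompt(ref_phones: str, phones: str, ref_codes: List[int]) -> str:
    """Build the prompt: reference phonemes + target phonemes, then reference codes."""
    combined = f"{ref_phones} {phones}"
    return (
        "user: Convert the text to speech:"
        f"<|TEXT_PROMPT_START|>{combined}<|TEXT_PROMPT_END|>"
        "\nassistant:<|SPEECH_GENERATION_START|>"
        f"{codes_to_tokens(ref_codes)}"
    )


prompt = build_prompt("ref phonemes", "target phonemes", [12, 7, 4051])
print(prompt)
```

Because the reference codes sit at the end of the prompt, the model continues the speech-token sequence in the reference voice, which is why only tokens generated after the prompt should be decoded to audio.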

### Using the Inference Script

For easier usage, use the provided inference script:

```bash
# Clone repository
git clone https://github.com/iamdinhthuan/neutts-air-fintune
cd neutts-air-fintune

# Install dependencies
pip install -r requirements.txt

# Run inference
python infer_vietnamese.py \
    --text "Xin chào Việt Nam" \
    --ref_audio "reference.wav" \
    --ref_text "Transcript of the reference audio" \
    --output "output.wav" \
    --checkpoint "path/to/checkpoint"
```
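
To synthesize several sentences with the same reference voice, the command above can be driven from Python. This is a sketch that only uses the flags shown above; the function names and the `output_000.wav` naming scheme are assumptions for illustration:

```python
import subprocess
import sys
from typing import List


def build_infer_command(text: str, ref_audio: str, ref_text: str,
                        output: str, checkpoint: str) -> List[str]:
    """Assemble one infer_vietnamese.py invocation as an argument list."""
    return [
        sys.executable, "infer_vietnamese.py",
        "--text", text,
        "--ref_audio", ref_audio,
        "--ref_text", ref_text,
        "--output", output,
        "--checkpoint", checkpoint,
    ]


def synthesize_batch(sentences: List[str], ref_audio: str, ref_text: str,
                     checkpoint: str, dry_run: bool = True) -> List[List[str]]:
    """Build (and, unless dry_run is set, execute) one command per sentence."""
    commands = []
    for i, sentence in enumerate(sentences):
        cmd = build_infer_command(
            sentence, ref_audio, ref_text, f"output_{i:03d}.wav", checkpoint
        )
        commands.append(cmd)
        if not dry_run:
            subprocess.run(cmd, check=True)  # raises if the script fails
    return commands


for cmd in synthesize_batch(["Xin chào Việt Nam", "Hẹn gặp lại"],
                            "reference.wav", "Đây là văn bản tham chiếu",
                            "path/to/checkpoint"):
    print(" ".join(cmd))
```

Passing an argument list instead of a shell string avoids quoting problems with Vietnamese text. Each invocation presumably reloads the model, so for large batches it is likely faster to load the model once in Python, as in the Usage section.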

### Gradio UI

```bash
python gradio_app.py
```

Then open http://localhost:7860 in your browser.

## Training Details

### Training Data

- **Dataset Size:** 2.6M+ Vietnamese audio samples
- **Audio Format:** WAV, 16kHz, mono
- **Text:** Vietnamese with diacritics
- **Train/Val Split:** 99.5% / 0.5%
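
A 99.5% / 0.5% split like the one listed above can be reproduced with a seeded shuffle. This is a minimal sketch; the function name, the seed, and the shuffle-based strategy are assumptions, since the card does not describe how the split was drawn:

```python
import random
from typing import List, Tuple


def split_indices(n_samples: int, val_fraction: float = 0.005,
                  seed: int = 0) -> Tuple[List[int], List[int]]:
    """Shuffle sample indices deterministically and hold out val_fraction of them."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = round(n_samples * val_fraction)
    return indices[n_val:], indices[:n_val]


train_idx, val_idx = split_indices(10_000)
print(len(train_idx), len(val_idx))
```

At 2.6M samples, a 0.5% hold-out corresponds to about 13,000 validation samples.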

### Training Configuration

- **Base Model:** neuphonic/neutts-air (Qwen2.5 0.5B)
- **Epochs:** 3
- **Batch Size:** 4 per device
- **Gradient Accumulation:** 2 steps (effective batch size: 8)
- **Learning Rate:** 4e-5
- **Optimizer:** AdamW (fused)
- **Precision:** BFloat16
- **Hardware:** NVIDIA RTX 3090 (24GB)
- **Training Time:** ~2.5-3 days
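
The number of optimizer steps implied by these settings can be worked out directly. The sketch below assumes a single GPU, exactly 2.6M samples (the card says "2.6M+", so this is a lower bound), and that a final partial batch still counts as a step:

```python
import math


def optimizer_steps(n_train: int, per_device_batch: int, grad_accum: int,
                    epochs: int, n_devices: int = 1) -> int:
    """Total optimizer updates over all epochs."""
    effective_batch = per_device_batch * grad_accum * n_devices
    steps_per_epoch = math.ceil(n_train / effective_batch)
    return steps_per_epoch * epochs


# 99.5% of 2.6M samples go to training (integer arithmetic avoids float rounding).
n_train = 2_600_000 * 995 // 1000
total_steps = optimizer_steps(n_train, per_device_batch=4, grad_accum=2, epochs=3)
print(f"{n_train:,} training samples -> {total_steps:,} optimizer steps")
```

With these numbers the run comes to 2,587,000 training samples and 970,125 optimizer steps across 3 epochs.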

### Optimizations

- ✅ **Pre-encoded Dataset** - 6x faster training
- ✅ **TF32 Precision** - 20% speedup on Ampere GPUs
- ✅ **Fused AdamW** - 10% faster optimizer
- ✅ **Dataloader Optimizations** - Pin memory, prefetch
- ✅ **Increased Batch Size** - Better GPU utilization

**Total Speedup:** 10-12x faster than baseline (30 days → 2.5-3 days)
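
As a sanity check on the total, the individually quoted factors can be multiplied together. This assumes the speedups compound multiplicatively, which the card does not state; the residual is what the dataloader and batch-size changes would have to contribute:

```python
import math

# Speedup factors quoted in the list above.
quoted = {"pre-encoded dataset": 6.0, "TF32": 1.2, "fused AdamW": 1.1}
combined = math.prod(quoted.values())

# Overall speedup implied by the reported wall-clock times (30 days baseline).
overall_low, overall_high = 30 / 3.0, 30 / 2.5

# Factor left over for the optimizations without a quoted number.
residual_low = overall_low / combined
residual_high = overall_high / combined

print(f"quoted factors combined: {combined:.2f}x")
print(f"overall from wall-clock: {overall_low:.0f}x to {overall_high:.0f}x")
print(f"implied remaining factor: {residual_low:.2f}x to {residual_high:.2f}x")
```

Under that assumption the three quoted factors give about 7.9x, leaving roughly 1.3x to 1.5x to be explained by the dataloader optimizations and the larger batch size.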

## Performance

### Audio Quality

- **Sample Rate:** 24kHz
- **Natural Prosody:** Yes
- **Voice Cloning:** Supported
- **Text Normalization:** Automatic (numbers, dates, abbreviations)

### Inference Speed

- **GPU (RTX 3090):** ~0.5s per sentence
- **CPU:** ~3-5s per sentence

## Limitations

- Requires reference audio for voice cloning
- Best results with clear, high-quality reference audio (3-10 seconds)
- May struggle with very long sentences (>100 words)
- Requires Vietnamese text with proper diacritics for best quality
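
One way to work around the long-sentence limitation is to split the input into chunks of at most 100 words, synthesize each chunk separately, and concatenate the audio. The sketch below covers only the splitting step; the function name and the sentence-boundary regex are assumptions, not part of the repository:

```python
import re
from typing import List


def split_long_text(text: str, max_words: int = 100) -> List[str]:
    """Group whole sentences into chunks of at most max_words words."""
    sentences = [s for s in re.split(r"(?<=[.!?…])\s+", text.strip()) if s]
    chunks: List[str] = []
    current: List[str] = []
    for sentence in sentences:
        words = sentence.split()
        # A single sentence above the limit is cut at word boundaries.
        while len(words) > max_words:
            if current:
                chunks.append(" ".join(current))
                current = []
            chunks.append(" ".join(words[:max_words]))
            words = words[max_words:]
        # Start a new chunk if this sentence would overflow the current one.
        if len(current) + len(words) > max_words:
            chunks.append(" ".join(current))
            current = []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks


for chunk in split_long_text("Xin chào. Đây là một ví dụ. Cảm ơn bạn!", max_words=6):
    print(chunk)
```

Splitting at sentence boundaries keeps prosody natural within each chunk; cutting mid-sentence is only a fallback for sentences that exceed the limit on their own.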

## Ethical Considerations

⚠️ **Voice Cloning Ethics:**
- Only use reference audio with proper consent
- Do not use for impersonation or fraud
- Respect privacy and intellectual property rights

⚠️ **Potential Misuse:**
- Deepfake audio generation
- Unauthorized voice cloning
- Misinformation campaigns

**Recommended Use:**
- Accessibility tools (text-to-speech for visually impaired users)
- Educational content
- Virtual assistants
- Audiobook narration (with consent)
- Language learning applications

## Citation

If you use this model, please cite:

```bibtex
@misc{neutts-air-vietnamese,
  author = {Thuan Dinh Nguyen},
  title = {NeuTTS-Air Vietnamese TTS},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/YOUR_USERNAME/neutts-air-vietnamese}},
}

@misc{neutts-air,
  author = {Neuphonic},
  title = {NeuTTS-Air: Scalable TTS with Qwen2.5},
  year = {2024},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/neuphonic/neutts-air}},
}
```

## Acknowledgments

- **Base Model:** [Neuphonic](https://github.com/neuphonic) for NeuTTS-Air
- **Backbone:** [Qwen Team](https://github.com/QwenLM) for Qwen2.5
- **Codec:** [Neuphonic](https://github.com/neuphonic) for NeuCodec
- **Phonemizer:** [espeak-ng](https://github.com/espeak-ng/espeak-ng)
- **Text Normalization:** [ViNorm](https://github.com/v-nhandt21/ViNorm)

## Repository

Full training and inference code: [https://github.com/iamdinhthuan/neutts-air-fintune](https://github.com/iamdinhthuan/neutts-air-fintune)

## License

Apache 2.0 - See [LICENSE](LICENSE) for details.

## Contact

For questions or issues, please open an issue on [GitHub](https://github.com/iamdinhthuan/neutts-air-fintune/issues).

---

**Model Card Authors:** Your Name  
**Last Updated:** 2025-01-01