---
language:
  - vi
license: apache-2.0
tags:
  - text-to-speech
  - tts
  - vietnamese
  - audio
  - speech-synthesis
  - neutts-air
  - qwen2.5
datasets:
  - custom
metrics:
  - wer
library_name: transformers
pipeline_tag: text-to-speech
---

# NeuTTS-Air Vietnamese TTS

Vietnamese Text-to-Speech model finetuned from NeuTTS-Air on 2.6M+ Vietnamese audio samples.

## Model Description

NeuTTS-Air Vietnamese is a Text-to-Speech (TTS) model for Vietnamese, fine-tuned from the NeuTTS-Air base model on a large dataset of 2.6M+ Vietnamese audio samples.

- **Base Model:** neuphonic/neutts-air (Qwen2.5 0.5B, 552M parameters)
- **Language:** Vietnamese (vi)
- **Task:** Text-to-Speech (TTS)
- **Training Data:** 2.6M+ Vietnamese audio samples
- **Audio Codec:** NeuCodec
- **Sample Rate:** 24kHz
- **License:** Apache 2.0

## Features

- **High-Quality Vietnamese TTS** - natural Vietnamese speech synthesis
- **Large-Scale Training** - trained on 2.6M+ samples
- **Voice Cloning** - clone a voice from reference audio
- **Text Normalization** - automatic Vietnamese text normalization with ViNorm
- **Fast Inference** - optimized for production use
- **Easy to Use** - simple API and Gradio UI

## Quick Start

### Installation

```bash
pip install torch transformers neucodec phonemizer librosa soundfile vinorm
```

Install espeak-ng:

```bash
# Ubuntu/Debian
sudo apt-get install espeak-ng

# macOS
brew install espeak-ng
```
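
To confirm that phonemizer can find espeak-ng, a quick sanity check using the same backend the usage example below relies on:

```python
from phonemizer.backend import EspeakBackend

# Raises an error if espeak-ng is not installed or not on PATH.
backend = EspeakBackend(language="vi")
print(backend.phonemize(["xin chào"]))
```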

### Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from neucodec import NeuCodec
from phonemizer.backend import EspeakBackend
from vinorm import TTSnorm
import soundfile as sf
import numpy as np

# Load model
model_id = "dinhthuan/neutts-air-vi"  
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to("cuda")
model.eval()

# Load codec
codec = NeuCodec.from_pretrained("neuphonic/neucodec").to("cuda")
codec.eval()

# Initialize phonemizer
phonemizer = EspeakBackend(language='vi', preserve_punctuation=True, with_stress=True)

# Normalize and phonemize text
text = "Xin chào, đây là mô hình text to speech tiếng Việt"
text_normalized = TTSnorm(text, punc=False, unknown=True, lower=False, rule=False)
phones = phonemizer.phonemize([text_normalized])[0]

# Encode reference audio (for voice cloning)
from librosa import load as librosa_load
ref_audio_path = "reference.wav"
ref_text = "Đây là văn bản tham chiếu"
ref_text_normalized = TTSnorm(ref_text, punc=False, unknown=True, lower=False, rule=False)
ref_phones = phonemizer.phonemize([ref_text_normalized])[0]

wav, _ = librosa_load(ref_audio_path, sr=16000, mono=True)
wav_tensor = torch.from_numpy(wav).float().unsqueeze(0).unsqueeze(0)
with torch.no_grad():
    ref_codes = codec.encode_code(audio_or_path=wav_tensor).squeeze(0).squeeze(0).cpu()

# Generate speech
codes_str = "".join([f"<|speech_{i}|>" for i in ref_codes.tolist()])
combined_phones = ref_phones + " " + phones
chat = f"""user: Convert the text to speech:<|TEXT_PROMPT_START|>{combined_phones}<|TEXT_PROMPT_END|>\nassistant:<|SPEECH_GENERATION_START|>{codes_str}"""

input_ids = tokenizer.encode(chat, return_tensors="pt").to("cuda")
speech_end_id = tokenizer.convert_tokens_to_ids("<|SPEECH_GENERATION_END|>")

with torch.no_grad():
    output = model.generate(
        input_ids,
        max_new_tokens=2048,
        do_sample=True,  # needed for temperature/top_k to take effect
        temperature=1.0,
        top_k=50,
        eos_token_id=speech_end_id,
        pad_token_id=tokenizer.eos_token_id,
    )

# Decode the generated token sequence back to text to recover the speech codes
output_text = tokenizer.decode(output[0], skip_special_tokens=False)
# Extract the generated <|speech_N|> codes from output_text and decode them with
# the codec to obtain the waveform `audio` (see the sketch below and the full
# implementation in the repository).

# Save audio (NeuCodec decodes to 24 kHz)
sf.write("output.wav", audio, 24000)
```
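
The code-extraction and decoding step is elided above. A rough sketch of what it could look like follows; the exact token handling and tensor shapes are assumptions, and `infer_vietnamese.py` in the repository is the reference implementation:

```python
import re

# Pull the <|speech_N|> tokens that follow the speech-generation marker,
# drop the reference-audio codes that were already part of the prompt,
# and decode the remainder with NeuCodec.
speech_section = output_text.split("<|SPEECH_GENERATION_START|>")[-1]
code_ids = [int(m) for m in re.findall(r"<\|speech_(\d+)\|>", speech_section)]
gen_ids = code_ids[len(ref_codes):]  # skip the reference codes from the prompt

codes = torch.tensor(gen_ids, dtype=torch.long, device="cuda").view(1, 1, -1)
with torch.no_grad():
    audio = codec.decode_code(codes).squeeze().cpu().numpy()  # 24 kHz waveform

sf.write("output.wav", audio, 24000)
```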

### Using the Inference Script

For a simpler workflow, use the provided inference script:

```bash
# Clone repository
git clone https://github.com/iamdinhthuan/neutts-air-fintune
cd neutts-air-fintune

# Install dependencies
pip install -r requirements.txt

# Run inference
python infer_vietnamese.py \
    --text "Xin chào Việt Nam" \
    --ref_audio "reference.wav" \
    --ref_text "Transcript of the reference audio" \
    --output "output.wav" \
    --checkpoint "path/to/checkpoint"
```

### Gradio UI

```bash
python gradio_app.py
```

Then open http://localhost:7860 in your browser.
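
The repository ships the full `gradio_app.py`; a minimal version might look roughly like the sketch below, where `synthesize` is a placeholder standing in for the generation code from the Usage section:

```python
import gradio as gr

def synthesize(text: str, ref_audio_path: str, ref_text: str):
    # Placeholder: wrap the Usage-section generation code here and return
    # (sample_rate, waveform) for Gradio's Audio output component.
    raise NotImplementedError

demo = gr.Interface(
    fn=synthesize,
    inputs=[
        gr.Textbox(label="Text to speak"),
        gr.Audio(type="filepath", label="Reference audio"),
        gr.Textbox(label="Reference transcript"),
    ],
    outputs=gr.Audio(label="Generated speech"),
    title="NeuTTS-Air Vietnamese TTS",
)
demo.launch(server_port=7860)
```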

## Training Details

### Training Data

- **Dataset Size:** 2.6M+ Vietnamese audio samples
- **Audio Format:** WAV, 16kHz, mono
- **Text:** Vietnamese with diacritics
- **Train/Val Split:** 99.5% / 0.5%

### Training Configuration

- **Base Model:** neuphonic/neutts-air (Qwen2.5 0.5B)
- **Epochs:** 3
- **Batch Size:** 4 per device
- **Gradient Accumulation:** 2 steps (effective batch size: 8)
- **Learning Rate:** 4e-5
- **Optimizer:** AdamW (fused)
- **Precision:** BFloat16
- **Hardware:** NVIDIA RTX 3090 (24GB)
- **Training Time:** ~2.5-3 days
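
As a rough illustration, these hyperparameters map onto `transformers.TrainingArguments` roughly as sketched below; the repository's training script is the actual reference and may use different names or extra options:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="checkpoints/neutts-air-vi",  # assumed output path
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,           # effective batch size 8
    learning_rate=4e-5,
    optim="adamw_torch_fused",               # fused AdamW
    bf16=True,                               # BFloat16 precision
    tf32=True,                               # TF32 matmuls on Ampere (RTX 3090)
    dataloader_pin_memory=True,
    logging_steps=100,                       # assumed
)
```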

### Optimizations

- **Pre-encoded Dataset** - 6x faster training
- **TF32 Precision** - 20% speedup on Ampere GPUs
- **Fused AdamW** - 10% faster optimizer
- **Dataloader Optimizations** - pin memory, prefetch
- **Increased Batch Size** - better GPU utilization

**Total Speedup:** 10-12x faster than baseline (30 days → 2.5-3 days)
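
The pre-encoding idea can be sketched roughly as below: run NeuCodec over every training clip once, offline, and save the integer codes so the training loop never touches raw audio. File layout and function names here are assumptions, not the repository's actual preprocessing script:

```python
from pathlib import Path

import torch
from librosa import load as librosa_load
from neucodec import NeuCodec

codec = NeuCodec.from_pretrained("neuphonic/neucodec").to("cuda").eval()

def pre_encode(wav_dir: str, out_dir: str) -> None:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for wav_path in sorted(Path(wav_dir).glob("*.wav")):
        wav, _ = librosa_load(str(wav_path), sr=16000, mono=True)
        wav_tensor = torch.from_numpy(wav).float().unsqueeze(0).unsqueeze(0)
        with torch.no_grad():
            codes = codec.encode_code(audio_or_path=wav_tensor).squeeze().cpu()
        # One .pt file of codec codes per training clip
        torch.save(codes, out / f"{wav_path.stem}.pt")

pre_encode("data/wavs", "data/codes")
```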

## Performance

### Audio Quality

- **Sample Rate:** 24kHz
- **Natural Prosody:** yes
- **Voice Cloning:** supported
- **Text Normalization:** automatic (numbers, dates, abbreviations)

### Inference Speed

- **GPU (RTX 3090):** ~0.5s per sentence
- **CPU:** ~3-5s per sentence

## Limitations

- Requires reference audio for voice cloning
- Best results with clear, high-quality reference audio (3-10 seconds)
- May struggle with very long sentences (>100 words)
- Requires Vietnamese text with proper diacritics for best quality

## Ethical Considerations

**⚠️ Voice Cloning Ethics:**

- Only use reference audio with proper consent
- Do not use for impersonation or fraud
- Respect privacy and intellectual property rights

**⚠️ Potential Misuse:**

- Deepfake audio generation
- Unauthorized voice cloning
- Misinformation campaigns

**Recommended Use:**

- Accessibility tools (text-to-speech for visually impaired users)
- Educational content
- Virtual assistants
- Audiobook narration (with consent)
- Language learning applications
## Citation

If you use this model, please cite:

```bibtex
@misc{neutts-air-vietnamese,
  author = {Thuan Dinh Nguyen},
  title = {NeuTTS-Air Vietnamese TTS},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/dinhthuan/neutts-air-vi}},
}

@misc{neutts-air,
  author = {Neuphonic},
  title = {NeuTTS-Air: Scalable TTS with Qwen2.5},
  year = {2024},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/neuphonic/neutts-air}},
}
```

## Acknowledgments

Thanks to Neuphonic for the NeuTTS-Air base model and the NeuCodec audio codec.

## Repository

Full training and inference code: https://github.com/iamdinhthuan/neutts-air-fintune

## License

Apache 2.0 - See LICENSE for details.

## Contact

For questions or issues, please open an issue on GitHub.


**Model Card Author:** Thuan Dinh Nguyen (dinhthuan)

**Last Updated:** 2025-01-01