---
language:
  - vi
license: apache-2.0
tags:
  - text-to-speech
  - tts
  - vietnamese
  - audio
  - speech-synthesis
  - neutts-air
  - qwen2.5
datasets:
  - custom
metrics:
  - wer
library_name: transformers
pipeline_tag: text-to-speech
---

# NeuTTS-Air Vietnamese TTS

Vietnamese Text-to-Speech model finetuned from NeuTTS-Air on 2.6M+ Vietnamese audio samples.

## Model Description

NeuTTS-Air Vietnamese is a Text-to-Speech (TTS) model for Vietnamese, fine-tuned from the NeuTTS-Air base model on a large dataset of 2.6M+ Vietnamese audio samples.

- **Base Model:** neuphonic/neutts-air (Qwen2.5 0.5B, 552M parameters)
- **Language:** Vietnamese (vi)
- **Task:** Text-to-Speech (TTS)
- **Training Data:** 2.6M+ Vietnamese audio samples
- **Audio Codec:** NeuCodec
- **Sample Rate:** 24kHz
- **License:** Apache 2.0

## Features

- **High-Quality Vietnamese TTS** - natural Vietnamese speech synthesis
- **Large-Scale Training** - trained on 2.6M+ samples
- **Voice Cloning** - clone a voice from reference audio
- **Text Normalization** - automatic Vietnamese text normalization with ViNorm
- **Fast Inference** - optimized for production use
- **Easy to Use** - simple API and Gradio UI

## Quick Start

### Installation

```bash
pip install torch transformers neucodec phonemizer librosa soundfile vinorm
```

Install espeak-ng:

```bash
# Ubuntu/Debian
sudo apt-get install espeak-ng

# macOS
brew install espeak-ng
```
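
To confirm that phonemizer can find espeak-ng, a quick sanity check using the same backend the usage example below relies on:

```python
from phonemizer.backend import EspeakBackend

# Raises an error if espeak-ng is not installed or not on PATH.
backend = EspeakBackend(language="vi")
print(backend.phonemize(["xin chào"]))
```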

### Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from neucodec import NeuCodec
from phonemizer.backend import EspeakBackend
from vinorm import TTSnorm
import soundfile as sf
import numpy as np

# Load model
model_id = "dinhthuan/neutts-air-vi"  
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to("cuda")
model.eval()

# Load codec
codec = NeuCodec.from_pretrained("neuphonic/neucodec").to("cuda")
codec.eval()

# Initialize phonemizer
phonemizer = EspeakBackend(language='vi', preserve_punctuation=True, with_stress=True)

# Normalize and phonemize text
text = "Xin chào, đây là mô hình text to speech tiếng Việt"
text_normalized = TTSnorm(text, punc=False, unknown=True, lower=False, rule=False)
phones = phonemizer.phonemize([text_normalized])[0]

# Encode reference audio (for voice cloning)
from librosa import load as librosa_load
ref_audio_path = "reference.wav"
ref_text = "Đây là văn bản tham chiếu"
ref_text_normalized = TTSnorm(ref_text, punc=False, unknown=True, lower=False, rule=False)
ref_phones = phonemizer.phonemize([ref_text_normalized])[0]

wav, _ = librosa_load(ref_audio_path, sr=16000, mono=True)
wav_tensor = torch.from_numpy(wav).float().unsqueeze(0).unsqueeze(0)
with torch.no_grad():
    ref_codes = codec.encode_code(audio_or_path=wav_tensor).squeeze(0).squeeze(0).cpu()

# Generate speech
codes_str = "".join([f"<|speech_{i}|>" for i in ref_codes.tolist()])
combined_phones = ref_phones + " " + phones
chat = f"""user: Convert the text to speech:<|TEXT_PROMPT_START|>{combined_phones}<|TEXT_PROMPT_END|>\nassistant:<|SPEECH_GENERATION_START|>{codes_str}"""

input_ids = tokenizer.encode(chat, return_tensors="pt").to("cuda")
speech_end_id = tokenizer.convert_tokens_to_ids("<|SPEECH_GENERATION_END|>")

with torch.no_grad():
    output = model.generate(
        input_ids,
        max_new_tokens=2048,
        do_sample=True,  # needed for temperature/top_k to take effect
        temperature=1.0,
        top_k=50,
        eos_token_id=speech_end_id,
        pad_token_id=tokenizer.eos_token_id,
    )

# Decode the generated token sequence back to text to recover the speech codes
output_text = tokenizer.decode(output[0], skip_special_tokens=False)
# Extract the generated <|speech_N|> codes from output_text and decode them with
# the codec to obtain the waveform `audio` (see the sketch below and the full
# implementation in the repository).

# Save audio (NeuCodec decodes to 24 kHz)
sf.write("output.wav", audio, 24000)
```
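
The code-extraction and decoding step is elided above. A rough sketch of what it could look like follows; the exact token handling and tensor shapes are assumptions, and `infer_vietnamese.py` in the repository is the reference implementation:

```python
import re

# Pull the <|speech_N|> tokens that follow the speech-generation marker,
# drop the reference-audio codes that were already part of the prompt,
# and decode the remainder with NeuCodec.
speech_section = output_text.split("<|SPEECH_GENERATION_START|>")[-1]
code_ids = [int(m) for m in re.findall(r"<\|speech_(\d+)\|>", speech_section)]
gen_ids = code_ids[len(ref_codes):]  # skip the reference codes from the prompt

codes = torch.tensor(gen_ids, dtype=torch.long, device="cuda").view(1, 1, -1)
with torch.no_grad():
    audio = codec.decode_code(codes).squeeze().cpu().numpy()  # 24 kHz waveform

sf.write("output.wav", audio, 24000)
```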

### Using the Inference Script

For a simpler workflow, use the provided inference script:

```bash
# Clone repository
git clone https://github.com/iamdinhthuan/neutts-air-fintune
cd neutts-air-fintune

# Install dependencies
pip install -r requirements.txt

# Run inference
python infer_vietnamese.py \
    --text "Xin chào Việt Nam" \
    --ref_audio "reference.wav" \
    --ref_text "Transcript of the reference audio" \
    --output "output.wav" \
    --checkpoint "path/to/checkpoint"
```

### Gradio UI

```bash
python gradio_app.py
```

Then open http://localhost:7860 in your browser.
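
The repository ships the full `gradio_app.py`; a minimal version might look roughly like the sketch below, where `synthesize` is a placeholder standing in for the generation code from the Usage section:

```python
import gradio as gr

def synthesize(text: str, ref_audio_path: str, ref_text: str):
    # Placeholder: wrap the Usage-section generation code here and return
    # (sample_rate, waveform) for Gradio's Audio output component.
    raise NotImplementedError

demo = gr.Interface(
    fn=synthesize,
    inputs=[
        gr.Textbox(label="Text to speak"),
        gr.Audio(type="filepath", label="Reference audio"),
        gr.Textbox(label="Reference transcript"),
    ],
    outputs=gr.Audio(label="Generated speech"),
    title="NeuTTS-Air Vietnamese TTS",
)
demo.launch(server_port=7860)
```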

## Training Details

### Training Data

- **Dataset Size:** 2.6M+ Vietnamese audio samples
- **Audio Format:** WAV, 16kHz, mono
- **Text:** Vietnamese with diacritics
- **Train/Val Split:** 99.5% / 0.5%

### Training Configuration

- **Base Model:** neuphonic/neutts-air (Qwen2.5 0.5B)
- **Epochs:** 3
- **Batch Size:** 4 per device
- **Gradient Accumulation:** 2 steps (effective batch size: 8)
- **Learning Rate:** 4e-5
- **Optimizer:** AdamW (fused)
- **Precision:** BFloat16
- **Hardware:** NVIDIA RTX 3090 (24GB)
- **Training Time:** ~2.5-3 days
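
As a rough illustration, these hyperparameters map onto `transformers.TrainingArguments` roughly as sketched below; the repository's training script is the actual reference and may use different names or extra options:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="checkpoints/neutts-air-vi",  # assumed output path
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,           # effective batch size 8
    learning_rate=4e-5,
    optim="adamw_torch_fused",               # fused AdamW
    bf16=True,                               # BFloat16 precision
    tf32=True,                               # TF32 matmuls on Ampere (RTX 3090)
    dataloader_pin_memory=True,
    logging_steps=100,                       # assumed
)
```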

### Optimizations

- **Pre-encoded Dataset** - 6x faster training
- **TF32 Precision** - 20% speedup on Ampere GPUs
- **Fused AdamW** - 10% faster optimizer
- **Dataloader Optimizations** - pin memory, prefetch
- **Increased Batch Size** - better GPU utilization

**Total Speedup:** 10-12x faster than baseline (30 days → 2.5-3 days)
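
The pre-encoding idea can be sketched roughly as below: run NeuCodec over every training clip once, offline, and save the integer codes so the training loop never touches raw audio. File layout and function names here are assumptions, not the repository's actual preprocessing script:

```python
from pathlib import Path

import torch
from librosa import load as librosa_load
from neucodec import NeuCodec

codec = NeuCodec.from_pretrained("neuphonic/neucodec").to("cuda").eval()

def pre_encode(wav_dir: str, out_dir: str) -> None:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for wav_path in sorted(Path(wav_dir).glob("*.wav")):
        wav, _ = librosa_load(str(wav_path), sr=16000, mono=True)
        wav_tensor = torch.from_numpy(wav).float().unsqueeze(0).unsqueeze(0)
        with torch.no_grad():
            codes = codec.encode_code(audio_or_path=wav_tensor).squeeze().cpu()
        # One .pt file of codec codes per training clip
        torch.save(codes, out / f"{wav_path.stem}.pt")

pre_encode("data/wavs", "data/codes")
```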

## Performance

### Audio Quality

- **Sample Rate:** 24kHz
- **Natural Prosody:** yes
- **Voice Cloning:** supported
- **Text Normalization:** automatic (numbers, dates, abbreviations)

### Inference Speed

- **GPU (RTX 3090):** ~0.5s per sentence
- **CPU:** ~3-5s per sentence

## Limitations

- Requires reference audio for voice cloning
- Best results with clear, high-quality reference audio (3-10 seconds)
- May struggle with very long sentences (>100 words)
- Requires Vietnamese text with proper diacritics for best quality

## Ethical Considerations

**⚠️ Voice Cloning Ethics:**

- Only use reference audio with proper consent
- Do not use for impersonation or fraud
- Respect privacy and intellectual property rights

**⚠️ Potential Misuse:**

- Deepfake audio generation
- Unauthorized voice cloning
- Misinformation campaigns

**Recommended Use:**

- Accessibility tools (text-to-speech for visually impaired users)
- Educational content
- Virtual assistants
- Audiobook narration (with consent)
- Language learning applications
## Citation

If you use this model, please cite:

```bibtex
@misc{neutts-air-vietnamese,
  author = {Thuan Dinh Nguyen},
  title = {NeuTTS-Air Vietnamese TTS},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/dinhthuan/neutts-air-vi}},
}

@misc{neutts-air,
  author = {Neuphonic},
  title = {NeuTTS-Air: Scalable TTS with Qwen2.5},
  year = {2024},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/neuphonic/neutts-air}},
}
```

## Acknowledgments

Thanks to Neuphonic for the NeuTTS-Air base model and the NeuCodec audio codec.

## Repository

Full training and inference code: https://github.com/iamdinhthuan/neutts-air-fintune

## License

Apache 2.0 - See LICENSE for details.

## Contact

For questions or issues, please open an issue on GitHub.


**Model Card Author:** Thuan Dinh Nguyen (dinhthuan)

**Last Updated:** 2025-01-01