	---
	language:
	- vi
	license: apache-2.0
	tags:
	- text-to-speech
	- tts
	- vietnamese
	- audio
	- speech-synthesis
	- neutts-air
	- qwen2.5
	datasets:
	- custom
	metrics:
	- wer
	library_name: transformers
	pipeline_tag: text-to-speech
	---

	# NeuTTS-Air Vietnamese TTS

	Vietnamese Text-to-Speech model finetuned from [NeuTTS-Air](https://huggingface.co/neuphonic/neutts-air) on 2.6M+ Vietnamese audio samples.

	## Model Description

	NeuTTS-Air Vietnamese is a Text-to-Speech (TTS) model for Vietnamese, finetuned from the NeuTTS-Air base model on a large dataset of 2.6M+ Vietnamese audio samples.

	- Base Model: [neuphonic/neutts-air](https://huggingface.co/neuphonic/neutts-air) (Qwen2.5 0.5B - 552M parameters)
	- Language: Vietnamese (vi)
	- Task: Text-to-Speech (TTS)
	- Training Data: 2.6M+ Vietnamese audio samples
	- Audio Codec: [NeuCodec](https://huggingface.co/neuphonic/neucodec)
	- Sample Rate: 24kHz
	- License: Apache 2.0

	## Features

	- ✅ High Quality Vietnamese TTS - Natural Vietnamese speech synthesis
	- ✅ Large-scale Training - Trained on 2.6M+ samples
	- ✅ Voice Cloning - Clone voice from reference audio
	- ✅ Text Normalization - Automatic Vietnamese text normalization with ViNorm
	- ✅ Fast Inference - Optimized for production use
	- ✅ Easy to Use - Simple API and Gradio UI

	## Quick Start

	### Installation

	```bash
	pip install torch transformers neucodec phonemizer librosa soundfile vinorm
	```

	Install espeak-ng:

	```bash
	# Ubuntu/Debian
	sudo apt-get install espeak-ng

	# macOS
	brew install espeak-ng
	```
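
Before phonemizing, it can help to confirm that espeak-ng is actually installed. The following is a minimal sketch, not part of the repository: `find_espeak` is a hypothetical helper, and a binary on `PATH` is only a proxy, since phonemizer may locate the espeak-ng library through other means.

```python
import shutil


def find_espeak():
    """Return the path of the first espeak binary found on PATH, or None."""
    for name in ("espeak-ng", "espeak"):
        path = shutil.which(name)
        if path is not None:
            return path
    return None


espeak_path = find_espeak()
print(espeak_path or "espeak-ng not found on PATH; install it before phonemizing")
```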

	### Usage

	```python
	import torch
	from transformers import AutoTokenizer, AutoModelForCausalLM
	from neucodec import NeuCodec
	from phonemizer.backend import EspeakBackend
	from vinorm import TTSnorm
	import soundfile as sf
	import numpy as np

	# Load model
	model_id = "dinhthuan/neutts-air-vi"
	tokenizer = AutoTokenizer.from_pretrained(model_id)
	model = AutoModelForCausalLM.from_pretrained(
	    model_id,
	    torch_dtype=torch.bfloat16,
	    trust_remote_code=True,
	).to("cuda")
	model.eval()

	# Load codec
	codec = NeuCodec.from_pretrained("neuphonic/neucodec").to("cuda")
	codec.eval()

	# Initialize phonemizer
	phonemizer = EspeakBackend(language='vi', preserve_punctuation=True, with_stress=True)

	# Normalize and phonemize text
	text = "Xin chào, đây là mô hình text to speech tiếng Việt"
	text_normalized = TTSnorm(text, punc=False, unknown=True, lower=False, rule=False)
	phones = phonemizer.phonemize([text_normalized])[0]

	# Encode reference audio (for voice cloning)
	from librosa import load as librosa_load
	ref_audio_path = "reference.wav"
	ref_text = "Đây là văn bản tham chiếu"
	ref_text_normalized = TTSnorm(ref_text, punc=False, unknown=True, lower=False, rule=False)
	ref_phones = phonemizer.phonemize([ref_text_normalized])[0]

	wav, _ = librosa_load(ref_audio_path, sr=16000, mono=True)
	wav_tensor = torch.from_numpy(wav).float().unsqueeze(0).unsqueeze(0)
	with torch.no_grad():
	    ref_codes = codec.encode_code(audio_or_path=wav_tensor).squeeze(0).squeeze(0).cpu()

	# Generate speech
	codes_str = "".join([f"<|speech_{i}|>" for i in ref_codes.tolist()])
	combined_phones = ref_phones + " " + phones
	chat = f"""user: Convert the text to speech:<|TEXT_PROMPT_START|>{combined_phones}<|TEXT_PROMPT_END|>\nassistant:<|SPEECH_GENERATION_START|>{codes_str}"""

	input_ids = tokenizer.encode(chat, return_tensors="pt").to("cuda")
	speech_end_id = tokenizer.convert_tokens_to_ids("<|SPEECH_GENERATION_END|>")

	with torch.no_grad():
	    output = model.generate(
	        input_ids,
	        max_new_tokens=2048,
	        do_sample=True,  # required for temperature/top_k to take effect
	        temperature=1.0,
	        top_k=50,
	        eos_token_id=speech_end_id,
	        pad_token_id=tokenizer.eos_token_id,
	    )

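	# Decode to audio: keep only the newly generated tokens (the prompt already
	# contains the reference codes), then parse the <|speech_N|> ids.
	import re
	new_tokens = output[0, input_ids.shape[-1]:]
	output_text = tokenizer.decode(new_tokens, skip_special_tokens=False)
	speech_ids = [int(n) for n in re.findall(r"<\|speech_(\d+)\|>", output_text)]

	# Assumption: NeuCodec.decode_code takes codes shaped (batch, 1, frames), as in
	# the upstream NeuTTS-Air code; see the repository for the full implementation.
	codes = torch.tensor(speech_ids, dtype=torch.long)[None, None, :].to("cuda")
	with torch.no_grad():
	    audio = codec.decode_code(codes).cpu().numpy()[0, 0, :]

	# Save audio
	sf.write("output.wav", audio, 24000)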
	```
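
The prompt layout used in the usage example can be isolated into pure functions and checked without loading the model. This is a sketch only: `build_prompt` and `extract_codes` are hypothetical helper names, and the code values in the example are placeholders rather than real NeuCodec output.

```python
import re

# Matches speech tokens such as <|speech_123|> and captures the integer id.
SPEECH_TOKEN = re.compile(r"<\|speech_(\d+)\|>")


def build_prompt(ref_phones, phones, ref_codes):
    """Assemble the chat-style prompt: reference phonemes, target phonemes,
    then the reference speech codes as a prefix for generation."""
    codes_str = "".join(f"<|speech_{i}|>" for i in ref_codes)
    combined = f"{ref_phones} {phones}"
    return (
        "user: Convert the text to speech:"
        f"<|TEXT_PROMPT_START|>{combined}<|TEXT_PROMPT_END|>"
        f"\nassistant:<|SPEECH_GENERATION_START|>{codes_str}"
    )


def extract_codes(text):
    """Recover integer speech codes from a string of speech tokens."""
    return [int(m) for m in SPEECH_TOKEN.findall(text)]


prompt = build_prompt("a b", "c d", [12, 7, 65535])
print(extract_codes(prompt))
```

The same `extract_codes` function can be applied to the decoded model output to obtain the codes that are passed to the codec.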

	### Using the Inference Script

	For easier usage, use the provided inference script:

	```bash
	# Clone repository
	git clone https://github.com/iamdinhthuan/neutts-air-fintune
	cd neutts-air-fintune

	# Install dependencies
	pip install -r requirements.txt

	# Run inference
	python infer_vietnamese.py \
	--text "Xin chào Việt Nam" \
	--ref_audio "reference.wav" \
	--ref_text "Text của reference audio" \
	--output "output.wav" \
	--checkpoint "path/to/checkpoint"
	```
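
To synthesize several sentences with the same reference voice, the command above can be assembled programmatically. The sketch below only builds and prints the commands; it uses the flags shown above, while `build_infer_command` and the `output_{i}.wav` naming are illustrative assumptions. Launching one process per sentence reloads the model each time, so this suits small batches only.

```python
import shlex


def build_infer_command(text, ref_audio, ref_text, output, checkpoint):
    """Return the argv list for one run of infer_vietnamese.py."""
    return [
        "python", "infer_vietnamese.py",
        "--text", text,
        "--ref_audio", ref_audio,
        "--ref_text", ref_text,
        "--output", output,
        "--checkpoint", checkpoint,
    ]


sentences = ["Xin chào Việt Nam", "Hẹn gặp lại"]
commands = [
    build_infer_command(
        sentence,
        "reference.wav",
        "Text của reference audio",
        f"output_{i}.wav",
        "path/to/checkpoint",
    )
    for i, sentence in enumerate(sentences)
]

for cmd in commands:
    # Print a shell-safe form; run with subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```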

	### Gradio UI

	```bash
	python gradio_app.py
	```

	Then open http://localhost:7860 in your browser.

	## Training Details

	### Training Data

	- Dataset Size: 2.6M+ Vietnamese audio samples
	- Audio Format: WAV, 16kHz, mono
	- Text: Vietnamese with diacritics
	- Train/Val Split: 99.5% / 0.5%

	### Training Configuration

	- Base Model: neuphonic/neutts-air (Qwen2.5 0.5B)
	- Epochs: 3
	- Batch Size: 4 per device
	- Gradient Accumulation: 2 steps (effective batch size: 8)
	- Learning Rate: 4e-5
	- Optimizer: AdamW (fused)
	- Precision: BFloat16
	- Hardware: NVIDIA RTX 3090 (24GB)
	- Training Time: ~2.5-3 days
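
The figures above imply a rough number of optimizer steps. The arithmetic below is a sketch that treats 2.6M as a lower bound (the card says "2.6M+") and assumes a single GPU, which matches the listed hardware but is not stated explicitly.

```python
import math

num_samples = 2_600_000   # lower bound; the card says "2.6M+"
train_fraction = 0.995    # 99.5% / 0.5% train/val split
per_device_batch = 4
grad_accum_steps = 2
num_devices = 1           # assumption: one RTX 3090
epochs = 3

train_samples = round(num_samples * train_fraction)
effective_batch = per_device_batch * grad_accum_steps * num_devices
steps_per_epoch = math.ceil(train_samples / effective_batch)
total_steps = steps_per_epoch * epochs

print(effective_batch)   # 8
print(steps_per_epoch)   # 323375
print(total_steps)       # 970125
```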

	### Optimizations

	- ✅ Pre-encoded Dataset - 6x faster training
	- ✅ TF32 Precision - 20% speedup on Ampere GPUs
	- ✅ Fused AdamW - 10% faster optimizer
	- ✅ Dataloader Optimizations - Pin memory, prefetch
	- ✅ Increased Batch Size - Better GPU utilization

	Total Speedup: 10-12x faster than baseline (30 days → 2.5-3 days)
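
One way to express the configuration and optimizations listed above is as keyword arguments for Hugging Face `TrainingArguments`. This is a sketch under assumptions: the card does not say the Hugging Face `Trainer` was used, and the dataloader worker and prefetch values are illustrative, not the author's settings. The last lines check the stated overall speedup against the quoted training times.

```python
# Keyword arguments that could be passed to transformers.TrainingArguments.
training_kwargs = {
    "num_train_epochs": 3,
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 2,
    "learning_rate": 4e-5,
    "bf16": True,                     # BFloat16 precision
    "tf32": True,                     # TF32 matmuls on Ampere GPUs
    "optim": "adamw_torch_fused",     # fused AdamW
    "dataloader_pin_memory": True,
    "dataloader_num_workers": 4,      # assumed value
    "dataloader_prefetch_factor": 2,  # assumed value
}

# With transformers installed, for example:
#   from transformers import TrainingArguments
#   args = TrainingArguments(output_dir="checkpoints", **training_kwargs)

baseline_days = 30
optimized_days = (2.5, 3.0)
speedups = [baseline_days / d for d in optimized_days]
print(speedups)  # [12.0, 10.0]
```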

	## Performance

	### Audio Quality

	- Sample Rate: 24kHz
	- Natural Prosody: Yes
	- Voice Cloning: Supported
	- Text Normalization: Automatic (numbers, dates, abbreviations)

	### Inference Speed

	- GPU (RTX 3090): ~0.5s per sentence
	- CPU: ~3-5s per sentence
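
Per-sentence latency depends on sentence length, so a real-time factor (synthesis time divided by audio duration) is often easier to compare across machines. The sketch below shows one way to measure it; `timed` and `real_time_factor` are hypothetical helpers, and the 3-second clip in the example is an illustrative assumption, not a measured result.

```python
import time

SAMPLE_RATE = 24_000  # output sample rate of the model


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def real_time_factor(synthesis_seconds, num_samples, sample_rate=SAMPLE_RATE):
    """Synthesis time divided by audio duration; below 1.0 is faster than real time."""
    audio_seconds = num_samples / sample_rate
    return synthesis_seconds / audio_seconds


# Example: 0.5 s of compute for a clip of 72,000 samples (3 s at 24 kHz).
rtf = real_time_factor(0.5, 72_000)
print(round(rtf, 3))
```

On GPU, call `torch.cuda.synchronize()` before reading the timer so that queued kernels are included in the measurement.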

	## Limitations

	- Requires reference audio for voice cloning
	- Best results with clear, high-quality reference audio (3-10 seconds)
	- May struggle with very long sentences (>100 words)
	- Requires Vietnamese text with proper diacritics for best quality
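
Two of these limits can be handled before calling the model: long inputs can be split into shorter chunks, and reference clips can be checked against the 3-10 second range. The following is a minimal sketch, not the repository's implementation; the function names, the punctuation set, and the hard cut for over-long sentences are assumptions.

```python
import re


def split_long_text(text, max_words=100):
    """Split text into chunks of at most max_words words, preferring sentence ends."""
    sentences = [s for s in re.split(r"(?<=[.!?…;])\s+", text.strip()) if s]
    chunks, current = [], []
    for sentence in sentences:
        words = sentence.split()
        # A single over-long sentence is cut hard at the word limit.
        while len(words) > max_words:
            if current:
                chunks.append(" ".join(current))
                current = []
            chunks.append(" ".join(words[:max_words]))
            words = words[max_words:]
        # Start a new chunk when this sentence would overflow the current one.
        if len(current) + len(words) > max_words:
            chunks.append(" ".join(current))
            current = []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks


def reference_duration_ok(num_samples, sample_rate, min_s=3.0, max_s=10.0):
    """Check that a reference clip falls in the recommended duration range."""
    return min_s <= num_samples / sample_rate <= max_s


chunks = split_long_text("Một hai ba. Bốn năm sáu bảy. Tám chín.", max_words=5)
print(chunks)
```

Each chunk can then be synthesized separately and the resulting waveforms concatenated.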

	## Ethical Considerations

	⚠️ Voice Cloning Ethics:
	- Only use reference audio with proper consent
	- Do not use for impersonation or fraud
	- Respect privacy and intellectual property rights

	⚠️ Potential Misuse:
	- Deepfake audio generation
	- Unauthorized voice cloning
	- Misinformation campaigns

	Recommended Use:
	- Accessibility tools (text-to-speech for visually impaired)
	- Educational content
	- Virtual assistants
	- Audiobook narration (with consent)
	- Language learning applications

	## Citation

	If you use this model, please cite:

	```bibtex
	@misc{neutts-air-vietnamese,
	author = {Thuan Dinh Nguyen},
	title = {NeuTTS-Air Vietnamese TTS},
	year = {2025},
	publisher = {HuggingFace},
	howpublished = {\url{https://huggingface.co/dinhthuan/neutts-air-vi}},
	}

	@misc{neutts-air,
	author = {Neuphonic},
	title = {NeuTTS-Air: Scalable TTS with Qwen2.5},
	year = {2024},
	publisher = {HuggingFace},
	howpublished = {\url{https://huggingface.co/neuphonic/neutts-air}},
	}
	```

	## Acknowledgments

	- Base Model: [Neuphonic](https://github.com/neuphonic) for NeuTTS-Air
	- Backbone: [Qwen Team](https://github.com/QwenLM) for Qwen2.5
	- Codec: [Neuphonic](https://github.com/neuphonic) for NeuCodec
	- Phonemizer: [espeak-ng](https://github.com/espeak-ng/espeak-ng)
	- Text Normalization: [ViNorm](https://github.com/v-nhandt21/ViNorm)

	## Repository

	Full training and inference code: [https://github.com/iamdinhthuan/neutts-air-fintune](https://github.com/iamdinhthuan/neutts-air-fintune)

	## License

	Apache 2.0 - See [LICENSE](LICENSE) for details.

	## Contact

	For questions or issues, please open an issue on [GitHub](https://github.com/iamdinhthuan/neutts-air-fintune/issues).

	---

	Model Card Authors: Thuan Dinh Nguyen
	Last Updated: 2025-01-01