---
license: mit
language:
- ar
- en
base_model: aoi-ot/VibeVoice-Large
tags:
- text-to-speech
- tts
- audio
- vibevoice
- lora
- arabic
pipeline_tag: text-to-speech
---

# VibeVoice Arabic LoRA

This is a LoRA (Low-Rank Adaptation) fine-tuned model for Arabic text-to-speech, based on `aoi-ot/VibeVoice-Large`.

## Model Description

- **Base Model**: [aoi-ot/VibeVoice-Large](https://huggingface.co/aoi-ot/VibeVoice-Large)
- **Training Method**: LoRA fine-tuning
- **Language**: Arabic
- **License**: MIT

## Requirements

### Hardware

- **Inference**:
  - VibeVoice-1.5B: 6GB+ VRAM
  - VibeVoice-Large (7B): 16GB+ VRAM
- **LoRA training**:
  - VibeVoice-1.5B: 16GB+ VRAM minimum
  - VibeVoice-Large (7B): 48GB+ VRAM minimum

### Software

```bash
git clone https://github.com/vibevoice-community/VibeVoice.git
cd VibeVoice/
pip install -e .
```

## Usage

### Quick Start with Gradio

```bash
python demo/gradio_demo.py \
    --model_path aoi-ot/VibeVoice-Large \
    --checkpoint_path ABDALLALSWAITI/vibevoice-arabic-Z
    # add --share to expose a public Gradio link
```

### Command Line Inference

```bash
python demo/inference_from_file.py \
    --model_path aoi-ot/VibeVoice-Large \
    --txt_path your_arabic_text.txt \
    --speaker_names Frank \
    --checkpoint_path ABDALLALSWAITI/vibevoice-arabic-Z
```
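
The file passed via `--txt_path` uses the same speaker-labeled lines as the dataset examples below. As a minimal sketch, it could be generated like this (the filename and the Arabic sentences are placeholders):

```python
# Write a sample speaker-labeled transcript for inference (UTF-8).
# Replace the placeholder sentences with your own script.
lines = [
    "Speaker 0: مرحبا بكم في هذا العرض التوضيحي.",
    "Speaker 0: هذا مثال على نص عربي لتحويله إلى كلام.",
]

with open("your_arabic_text.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(lines) + "\n")
```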

### Python API

```python
from vibevoice import VibeVoiceModel

# Load model with Arabic LoRA
model = VibeVoiceModel.from_pretrained(
    "aoi-ot/VibeVoice-Large",
    lora_path="ABDALLALSWAITI/vibevoice-arabic-Z"
)

# Generate speech
text = "Speaker 0: مرحبا، كيف حالك؟"
audio = model.generate(text, speaker_names=["Frank"])
```
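
To keep the result, you can write it to disk. The sketch below continues the snippet above and assumes `audio` is a waveform array at VibeVoice's 24 kHz output rate (check the return type of your installed version); it uses the `soundfile` package:

```python
import soundfile as sf

# Assumption: `audio` is a float waveform array sampled at 24 kHz.
# Adjust if your VibeVoice version returns a different structure.
sf.write("output_arabic.wav", audio, samplerate=24000)
```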

## Training Your Own LoRA

### 1. Installation

```bash
git clone https://github.com/voicepowered-ai/VibeVoice-finetuning
cd VibeVoice-finetuning
pip install -e .
pip uninstall -y transformers && pip install transformers==4.51.3
wandb login  # Optional
```

### 2. Prepare Dataset

### Method 1: Hugging Face Dataset

```python
from datasets import Dataset, Audio

data = {
    "text": [
        "Speaker 0: مرحبا بك.",
        "Speaker 0: كيف يمكنني مساعدتك؟"
    ],
    "audio": [
        "audio1.wav",
        "audio2.wav"
    ]
}

dataset = Dataset.from_dict(data)
dataset = dataset.cast_column("audio", Audio(sampling_rate=24000))
dataset.push_to_hub("your-username/arabic-tts-dataset")
```
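
For more than a handful of clips, the `data` dictionary can be built from a folder of paired files instead of hard-coded lists. A minimal sketch, assuming a hypothetical `my_arabic_clips/` folder where every `clip.wav` has a same-named `clip.txt` containing its speaker-labeled transcript:

```python
from pathlib import Path

from datasets import Dataset, Audio

# Assumption: a flat folder where every .wav has a matching .txt transcript
# already formatted as "Speaker 0: ...".
data_dir = Path("my_arabic_clips")

texts, audio_paths = [], []
for wav_path in sorted(data_dir.glob("*.wav")):
    texts.append(wav_path.with_suffix(".txt").read_text(encoding="utf-8").strip())
    audio_paths.append(str(wav_path))

dataset = Dataset.from_dict({"text": texts, "audio": audio_paths})
dataset = dataset.cast_column("audio", Audio(sampling_rate=24000))
dataset.push_to_hub("your-username/arabic-tts-dataset")
```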

Then train with:

```bash
# Base model: vibevoice/VibeVoice-1.5B (or aoi-ot/VibeVoice-Large)
python -m vibevoice.finetune.train_vibevoice \
    --model_name_or_path vibevoice/VibeVoice-1.5B \
    --dataset_name your-username/arabic-tts-dataset \
    --text_column_name text \
    --audio_column_name audio \
    --voice_prompts_column_name audio \
    --output_dir finetune_vibevoice_zac \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 16 \
    --learning_rate 2.5e-5 \
    --num_train_epochs 1 \
    --logging_steps 10 \
    --save_steps 100 \
    --eval_steps 100 \
    --report_to wandb \
    --remove_unused_columns False \
    --bf16 True \
    --do_train \
    --gradient_clipping \
    --gradient_checkpointing False \
    --ddpm_batch_mul 4 \
    --diffusion_loss_weight 1.4 \
    --train_diffusion_head True \
    --ce_loss_weight 0.04 \
    --voice_prompt_drop_rate 0.2 \
    --lora_target_modules q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj \
    --lr_scheduler_type cosine \
    --warmup_ratio 0.03 \
    --max_grad_norm 0.8
```

### Example Dataset

For a reference of the expected dataset format, see:

https://huggingface.co/datasets/vibevoice/jenny_vibevoice_formatted

### Method 2: Create a `prompts.jsonl` File

```json
{"text": "Speaker 0: مرحبا، هذا اختبار.", "audio": "audio1.wav"}
{"text": "Speaker 0: هذا مثال آخر.", "audio": "audio2.wav"}
```
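
A `prompts.jsonl` like this can also be generated programmatically. A minimal sketch, again assuming the hypothetical paired `.wav`/`.txt` layout described above:

```python
import json
from pathlib import Path

# Assumption: the same paired clip.wav / clip.txt folder layout as above.
data_dir = Path("my_arabic_clips")

with open("prompts.jsonl", "w", encoding="utf-8") as out:
    for wav_path in sorted(data_dir.glob("*.wav")):
        record = {
            "text": wav_path.with_suffix(".txt").read_text(encoding="utf-8").strip(),
            "audio": str(wav_path),
        }
        out.write(json.dumps(record, ensure_ascii=False) + "\n")
```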

Or use a Hugging Face dataset with the columns:

- `text`: Transcription with speaker labels
- `audio`: 24kHz audio files
- `voice_prompts`: (Optional) Reference voice clips

### 3. Train

```bash
python -m src.finetune_vibevoice_lora \
    --model_name_or_path aoi-ot/VibeVoice-Large \
    --processor_name_or_path src/vibevoice/processor \
    --train_jsonl prompts.jsonl \
    --text_column_name text \
    --audio_column_name audio \
    --output_dir output_arabic_lora \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 16 \
    --learning_rate 2.5e-5 \
    --num_train_epochs 5 \
    --logging_steps 10 \
    --save_steps 100 \
    --report_to wandb \
    --remove_unused_columns False \
    --bf16 True \
    --do_train \
    --gradient_clipping \
    --gradient_checkpointing False \
    --ddpm_batch_mul 4 \
    --diffusion_loss_weight 1.4 \
    --train_diffusion_head True \
    --ce_loss_weight 0.04 \
    --voice_prompt_drop_rate 0.2 \
    --lora_target_modules q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj \
    --lr_scheduler_type cosine \
    --warmup_ratio 0.03 \
    --max_grad_norm 0.8
```
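
Checkpoints are saved every `--save_steps` steps under `--output_dir`; as the next step shows, the resulting LoRA weights can then be loaded from a path such as `output_arabic_lora/lora/checkpoint-500`.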

### 4. Use Your Trained LoRA

```bash
python demo/gradio_demo.py \
    --model_path aoi-ot/VibeVoice-Large \
    --checkpoint_path output_arabic_lora/lora/checkpoint-500 \
    --share
```

## Dataset Format

### JSONL Format

**Single Speaker (auto-generated voice prompt):**

```json
{"text": "Speaker 0: النص العربي هنا.", "audio": "/path/to/audio.wav"}
```

**Single Speaker (custom voice prompt):**

```json
{"text": "Speaker 0: النص العربي هنا.", "audio": "/path/to/audio.wav", "voice_prompts": "/path/to/reference.wav"}
```

**Multi-Speaker:**

```json
{"text": "Speaker 0: كيف حالك؟\nSpeaker 1: أنا بخير، شكراً.", "audio": "/path/to/conversation.wav", "voice_prompts": ["/path/to/speaker0_ref.wav", "/path/to/speaker1_ref.wav"]}
```
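
Before launching a long run, it can be worth sanity-checking the JSONL. A minimal sketch using the `soundfile` package (the 24 kHz check matches the audio requirement stated above; `prompts.jsonl` is the file from Method 2):

```python
import json
from pathlib import Path

import soundfile as sf

# Check that every referenced audio file exists and is already at 24 kHz.
with open("prompts.jsonl", encoding="utf-8") as f:
    for line_no, line in enumerate(f, start=1):
        record = json.loads(line)
        prompts = record.get("voice_prompts", [])
        if isinstance(prompts, str):
            prompts = [prompts]
        for path in [record["audio"], *prompts]:
            if not Path(path).exists():
                print(f"line {line_no}: missing file {path}")
            else:
                rate = sf.info(path).samplerate
                if rate != 24000:
                    print(f"line {line_no}: {path} is {rate} Hz, expected 24000")
```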

## Training Parameters

| Parameter | Description | Recommended |
|-----------|-------------|-------------|
| `--model_name_or_path` | Base model | `aoi-ot/VibeVoice-Large` |
| `--per_device_train_batch_size` | Batch size per GPU | `8` |
| `--gradient_accumulation_steps` | Gradient accumulation steps | `16` |
| `--learning_rate` | Learning rate | `2.5e-5` |
| `--num_train_epochs` | Training epochs | `5-10` |
| `--diffusion_loss_weight` | Diffusion loss weight | `1.4` |
| `--ce_loss_weight` | Cross-entropy loss weight | `0.04` |
| `--voice_prompt_drop_rate` | Voice prompt dropout | `0.2` |
| `--lora_r` | LoRA rank | `8` |
| `--lora_alpha` | LoRA alpha | `32` |

## Memory Optimization

### For Limited VRAM (32-40GB)

```bash
# Override these flags in the training command above:
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 32 \
--gradient_checkpointing True
```
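
These values keep the effective batch size unchanged (4 × 32 = 128, the same as the recommended 8 × 16), while gradient checkpointing trades extra compute for lower memory use.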

### Use LoRA on Diffusion Head

```bash
# Replace --train_diffusion_head True with:
--lora_wrap_diffusion_head True
```

## Citation

```bibtex
@misc{vibevoice-arabic-lora,
  author = {ABDALLALSWAITI},
  title = {VibeVoice Arabic LoRA},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/ABDALLALSWAITI/vibevoice-arabic-Z}}
}
```

## Acknowledgements

- Thanks to **Juan Pablo Gallego** from VoicePowered AI for the unofficial training code
- Original VibeVoice by Microsoft Research
- Maintained by the VibeVoice community

## License

This model is released under the MIT License. See the [LICENSE](LICENSE) file for details.

---

### 💖 Support This Project

If you enjoy using this model and would like to support continued development, please consider [buying me a coffee](https://paypal.me/abdallalswaiti). Every contribution helps keep the project going and enables new features!
|