---
license: mit
language:
- ar
- en
base_model: aoi-ot/VibeVoice-Large
tags:
- text-to-speech
- tts
- audio
- vibevoice
- lora
- arabic
pipeline_tag: text-to-speech
---
# VibeVoice Arabic LoRA
This is a LoRA (Low-Rank Adaptation) fine-tuned model for Arabic text-to-speech, based on `aoi-ot/VibeVoice-Large`.
## Model Description
- **Base Model**: [aoi-ot/VibeVoice-Large](https://huggingface.co/aoi-ot/VibeVoice-Large)
- **Training Method**: LoRA fine-tuning
- **Language**: Arabic
- **License**: MIT
## Requirements
### Hardware
- **Inference**:
  - VibeVoice-1.5B: 6GB+ VRAM
  - VibeVoice-Large (7B): 16GB+ VRAM
- **Training** (LoRA):
  - VibeVoice-1.5B: 16GB+ VRAM minimum
  - VibeVoice-Large (7B): 48GB+ VRAM minimum
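As a quick sanity check before installing anything, the figures above can be put in a small lookup. This is only a sketch: the table copies the minimums listed in this section, and the helper name `feasible_workloads` is made up for illustration.

```python
# Minimum VRAM figures (in GB) copied from the hardware list in this README.
MIN_VRAM_GB = {
    ("VibeVoice-1.5B", "inference"): 6,
    ("VibeVoice-Large", "inference"): 16,
    ("VibeVoice-1.5B", "training"): 16,
    ("VibeVoice-Large", "training"): 48,
}


def feasible_workloads(vram_gb):
    """Return the (model, task) pairs whose stated minimum fits in vram_gb."""
    return sorted(key for key, need in MIN_VRAM_GB.items() if vram_gb >= need)


# Example: what a 24 GB card can run according to the stated minimums.
for model, task in feasible_workloads(24):
    print(f"{model}: {task}")
```

These are minimums, so real headroom depends on batch size and sequence length.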
### Software
```bash
git clone https://github.com/vibevoice-community/VibeVoice.git
cd VibeVoice/
pip install -e .
```
## Usage
### Quick Start with Gradio
```bash
python demo/gradio_demo.py \
--model_path aoi-ot/VibeVoice-Large \
--checkpoint_path ABDALLALSWAITI/vibevoice-arabic-Z
# Optional: append --share to the command above.
```
### Command Line Inference
```bash
python demo/inference_from_file.py \
--model_path aoi-ot/VibeVoice-Large \
--txt_path your_arabic_text.txt \
--speaker_names Frank \
--checkpoint_path ABDALLALSWAITI/vibevoice-arabic-Z
```
### Python API
```python
from vibevoice import VibeVoiceModel
# Load model with Arabic LoRA
model = VibeVoiceModel.from_pretrained(
    "aoi-ot/VibeVoice-Large",
    lora_path="ABDALLALSWAITI/vibevoice-arabic-Z",
)
# Generate speech
text = "Speaker 0: مرحبا، كيف حالك؟"
audio = model.generate(text, speaker_names=["Frank"])
```
## Training Your Own LoRA
### 1. Installation
```bash
git clone https://github.com/voicepowered-ai/VibeVoice-finetuning
cd VibeVoice-finetuning
pip install -e .
pip uninstall -y transformers && pip install transformers==4.51.3
wandb login # Optional
```
### 2. Prepare Dataset
### Method 1: Hugging Face Dataset
```python
from datasets import Dataset, Audio
data = {
    "text": [
        "Speaker 0: مرحبا بك.",
        "Speaker 0: كيف يمكنني مساعدتك؟",
    ],
    "audio": [
        "audio1.wav",
        "audio2.wav",
    ],
}
dataset = Dataset.from_dict(data)
dataset = dataset.cast_column("audio", Audio(sampling_rate=24000))
dataset.push_to_hub("your-username/arabic-tts-dataset")
```
Then train with:
```bash
# To train the 7B model, pass aoi-ot/VibeVoice-Large as --model_name_or_path instead.
python -m vibevoice.finetune.train_vibevoice \
--model_name_or_path vibevoice/VibeVoice-1.5B \
--dataset_name your-username/arabic-tts-dataset \
--text_column_name text \
--audio_column_name audio \
--voice_prompts_column_name audio \
--output_dir finetune_vibevoice_zac \
--per_device_train_batch_size 8 \
--gradient_accumulation_steps 16 \
--learning_rate 2.5e-5 \
--num_train_epochs 1 \
--logging_steps 10 \
--save_steps 100 \
--eval_steps 100 \
--report_to wandb \
--remove_unused_columns False \
--bf16 True \
--do_train \
--gradient_clipping \
--gradient_checkpointing False \
--ddpm_batch_mul 4 \
--diffusion_loss_weight 1.4 \
--train_diffusion_head True \
--ce_loss_weight 0.04 \
--voice_prompt_drop_rate 0.2 \
--lora_target_modules q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj \
--lr_scheduler_type cosine \
--warmup_ratio 0.03 \
--max_grad_norm 0.8
```
### Example Dataset
For an example of the expected layout, see [vibevoice/jenny_vibevoice_formatted](https://huggingface.co/datasets/vibevoice/jenny_vibevoice_formatted).
### Method 2: Local JSONL File
Create a `prompts.jsonl` file:
```json
{"text": "Speaker 0: مرحبا، هذا اختبار.", "audio": "audio1.wav"}
{"text": "Speaker 0: هذا مثال آخر.", "audio": "audio2.wav"}
```
Or use a Hugging Face dataset with columns:
- `text`: Transcription with speaker labels
- `audio`: 24kHz audio files
- `voice_prompts`: (Optional) Reference voice clips
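The `prompts.jsonl` file described above can be generated from a list of transcript and audio pairs. The following is a minimal sketch using only the standard library; the helper names `write_prompts_jsonl` and `read_prompts_jsonl` are illustrative and not part of the training code, and the demo writes to a temporary directory rather than your project folder.

```python
import json
import tempfile
from pathlib import Path


def write_prompts_jsonl(samples, out_path):
    """Write (text, audio_path) pairs as one JSON object per line."""
    out_path = Path(out_path)
    with out_path.open("w", encoding="utf-8") as f:
        for text, audio in samples:
            record = {"text": text, "audio": str(audio)}
            # ensure_ascii=False keeps the Arabic text readable in the file
            # instead of escaping it as \uXXXX sequences.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return out_path


def read_prompts_jsonl(path):
    """Read the file back, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


samples = [
    ("Speaker 0: مرحبا، هذا اختبار.", "audio1.wav"),
    ("Speaker 0: هذا مثال آخر.", "audio2.wav"),
]

with tempfile.TemporaryDirectory() as tmp:
    path = write_prompts_jsonl(samples, Path(tmp) / "prompts.jsonl")
    records = read_prompts_jsonl(path)

print(len(records), "records written")
```

Writing with an explicit `utf-8` encoding avoids platform-dependent defaults, which matters for Arabic transcripts.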
### 3. Train
```bash
python -m src.finetune_vibevoice_lora \
--model_name_or_path aoi-ot/VibeVoice-Large \
--processor_name_or_path src/vibevoice/processor \
--train_jsonl prompts.jsonl \
--text_column_name text \
--audio_column_name audio \
--output_dir output_arabic_lora \
--per_device_train_batch_size 8 \
--gradient_accumulation_steps 16 \
--learning_rate 2.5e-5 \
--num_train_epochs 5 \
--logging_steps 10 \
--save_steps 100 \
--report_to wandb \
--remove_unused_columns False \
--bf16 True \
--do_train \
--gradient_clipping \
--gradient_checkpointing False \
--ddpm_batch_mul 4 \
--diffusion_loss_weight 1.4 \
--train_diffusion_head True \
--ce_loss_weight 0.04 \
--voice_prompt_drop_rate 0.2 \
--lora_target_modules q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj \
--lr_scheduler_type cosine \
--warmup_ratio 0.03 \
--max_grad_norm 0.8
```
### 4. Use Your Trained LoRA
```bash
python demo/gradio_demo.py \
--model_path aoi-ot/VibeVoice-Large \
--checkpoint_path output_arabic_lora/lora/checkpoint-500 \
--share
```
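If training saved several checkpoints, you may want the most recent one rather than a hard-coded step. The sketch below assumes checkpoints are stored as `checkpoint-<step>` directories under `<output_dir>/lora/`, as the example path above suggests; `latest_checkpoint` is a hypothetical helper, and the demo builds a throwaway directory tree to show the behavior.

```python
import re
import tempfile
from pathlib import Path

# Assumed naming scheme, inferred from the example path: checkpoint-<step>
CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(lora_dir):
    """Return the checkpoint-<step> subdirectory with the highest step, or None."""
    best_step, best_path = -1, None
    for child in Path(lora_dir).iterdir():
        match = CHECKPOINT_RE.match(child.name)
        # Compare steps numerically so checkpoint-1000 beats checkpoint-500.
        if child.is_dir() and match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), child
    return best_path


with tempfile.TemporaryDirectory() as tmp:
    lora_dir = Path(tmp) / "output_arabic_lora" / "lora"
    for step in (100, 500, 1000):
        (lora_dir / f"checkpoint-{step}").mkdir(parents=True)
    latest = latest_checkpoint(lora_dir)

print(latest.name)
```

The resulting path can then be passed to `--checkpoint_path`.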
## Dataset Format
### JSONL Format
**Single Speaker (auto-generated voice prompt):**
```json
{"text": "Speaker 0: النص العربي هنا.", "audio": "/path/to/audio.wav"}
```
**Single Speaker (custom voice prompt):**
```json
{"text": "Speaker 0: النص العربي هنا.", "audio": "/path/to/audio.wav", "voice_prompts": "/path/to/reference.wav"}
```
**Multi-Speaker:**
```json
{"text": "Speaker 0: كيف حالك؟\nSpeaker 1: أنا بخير، شكراً.", "audio": "/path/to/conversation.wav", "voice_prompts": ["/path/to/speaker0_ref.wav", "/path/to/speaker1_ref.wav"]}
```
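Before starting a long training run, it can help to check each JSONL line against the formats above. This is a rough sketch, not the training script's own validation: `validate_record` is a hypothetical helper, and the rule that the number of voice prompts should equal the number of distinct speakers is an assumption inferred from the examples.

```python
import json
import re

# Matches a turn such as "Speaker 0: ..." and captures the speaker index.
SPEAKER_RE = re.compile(r"^Speaker (\d+):\s*\S")


def validate_record(line):
    """Return a list of problems found in one JSONL line (empty if none)."""
    record = json.loads(line)
    problems = []
    for key in ("text", "audio"):
        if not isinstance(record.get(key), str) or not record[key]:
            problems.append(f"missing or empty '{key}'")
    text = record.get("text") if isinstance(record.get("text"), str) else ""
    speakers = set()
    for turn in text.split("\n"):
        match = SPEAKER_RE.match(turn)
        if match:
            speakers.add(int(match.group(1)))
        elif turn.strip():
            problems.append(f"turn without a 'Speaker N:' label: {turn!r}")
    prompts = record.get("voice_prompts")
    if prompts is not None:
        # A single prompt may be given as a plain string (assumed from the examples).
        count = 1 if isinstance(prompts, str) else len(prompts)
        if count != len(speakers):
            problems.append(f"{count} voice prompt(s) for {len(speakers)} speaker(s)")
    return problems


good = json.dumps({
    "text": "Speaker 0: كيف حالك؟\nSpeaker 1: أنا بخير، شكراً.",
    "audio": "/path/to/conversation.wav",
    "voice_prompts": ["/path/to/speaker0_ref.wav", "/path/to/speaker1_ref.wav"],
})
bad = json.dumps({"text": "كيف حالك؟", "audio": "/path/to/audio.wav"})

print(validate_record(good))
print(validate_record(bad))
```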
## Training Parameters
| Parameter | Description | Recommended |
|-----------|-------------|-------------|
| `--model_name_or_path` | Base model | `aoi-ot/VibeVoice-Large` |
| `--per_device_train_batch_size` | Batch size per GPU | `8` |
| `--gradient_accumulation_steps` | Gradient accumulation | `16` |
| `--learning_rate` | Learning rate | `2.5e-5` |
| `--num_train_epochs` | Training epochs | `5-10` |
| `--diffusion_loss_weight` | Diffusion loss weight | `1.4` |
| `--ce_loss_weight` | Cross-entropy loss | `0.04` |
| `--voice_prompt_drop_rate` | Voice prompt dropout | `0.2` |
| `--lora_r` | LoRA rank | `8` |
| `--lora_alpha` | LoRA alpha | `32` |
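To get a feel for what `--lora_r` and `--lora_alpha` control, the arithmetic can be worked through directly. This sketch assumes the training script follows the standard LoRA convention, where the low-rank update is scaled by `alpha / r` and a layer of shape `d_out x d_in` gains `r * (d_in + d_out)` trainable weights. The 4096 x 4096 layer size is a hypothetical value chosen for illustration, not a figure taken from VibeVoice.

```python
def lora_scaling(lora_alpha, lora_r):
    """Scale applied to the low-rank update B @ A in standard LoRA."""
    return lora_alpha / lora_r


def lora_trainable_params(d_in, d_out, lora_r):
    """Weights in A (r x d_in) plus B (d_out x r) for one adapted layer."""
    return lora_r * (d_in + d_out)


# Recommended values from the table above.
scale = lora_scaling(lora_alpha=32, lora_r=8)

# Hypothetical layer size, for illustration only.
d_in = d_out = 4096
full = d_in * d_out
adapter = lora_trainable_params(d_in, d_out, lora_r=8)

print(f"scaling factor: {scale}")
print(f"adapter weights: {adapter} vs full layer: {full}")
print(f"fraction trained: {adapter / full:.4%}")
```

Raising `--lora_r` increases adapter capacity and memory use, and, if `--lora_alpha` is left unchanged, lowers the scaling factor.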
## Memory Optimization
### For Limited VRAM (32-40GB)
```bash
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 32 \
--gradient_checkpointing True
```
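Halving the per-device batch size while doubling gradient accumulation keeps the effective batch size unchanged, so the other recommended hyperparameters should not need retuning. A short check of the arithmetic, assuming a single GPU (`num_gpus` is an illustrative parameter, not a training flag):

```python
def effective_batch_size(per_device, grad_accum, num_gpus=1):
    """Samples contributing to each optimizer step."""
    return per_device * grad_accum * num_gpus


default = effective_batch_size(per_device=8, grad_accum=16)
low_vram = effective_batch_size(per_device=4, grad_accum=32)

print(default, low_vram)
```

Both settings yield 128 samples per optimizer step. The low-VRAM variant runs more forward and backward passes per step, and gradient checkpointing adds recomputation, so expect slower training.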
### Use LoRA on Diffusion Head
```bash
# Replace --train_diffusion_head True with:
--lora_wrap_diffusion_head True
```
## Citation
```bibtex
@misc{vibevoice-arabic-lora,
author = {ABDALLALSWAITI},
title = {VibeVoice Arabic LoRA},
year = {2025},
publisher = {HuggingFace},
howpublished = {\url{https://huggingface.co/ABDALLALSWAITI/vibevoice-arabic-Z}}
}
```
## Acknowledgements
- Thanks to **Juan Pablo Gallego** from VoicePowered AI for the unofficial training code
- Original VibeVoice by Microsoft Research
- The VibeVoice community, which maintains the [VibeVoice repository](https://github.com/vibevoice-community/VibeVoice) used above
## License
This model is released under the MIT License. See the [LICENSE](LICENSE) file for details.
---
### 💖 Support This Project
If you enjoy using this model and would like to support continued development, please consider [buying me a coffee](https://paypal.me/abdallalswaiti). Every contribution helps keep this project going.