---
license: mit
language:
- ar
- en
base_model: aoi-ot/VibeVoice-Large
tags:
- text-to-speech
- tts
- audio
- vibevoice
- lora
- arabic
pipeline_tag: text-to-speech
---

# VibeVoice Arabic LoRA

This repository provides a LoRA (Low-Rank Adaptation) adapter for Arabic text-to-speech, fine-tuned from `aoi-ot/VibeVoice-Large`.

## Model Description

- **Base Model**: [aoi-ot/VibeVoice-Large](https://huggingface.co/aoi-ot/VibeVoice-Large)
- **Training Method**: LoRA fine-tuning
- **Language**: Arabic 
- **License**: MIT

## Requirements

### Hardware
- **Inference**:
  - VibeVoice-1.5B: 6GB+ VRAM
  - VibeVoice-Large (7B): 16GB+ VRAM
- **Training** (LoRA):
  - VibeVoice-1.5B: 16GB+ VRAM minimum
  - VibeVoice-Large (7B): 48GB+ VRAM minimum

### Software
```bash
git clone https://github.com/vibevoice-community/VibeVoice.git
cd VibeVoice/
pip install -e .
```

## Usage

### Quick Start with Gradio

```bash
python demo/gradio_demo.py \
  --model_path aoi-ot/VibeVoice-Large \
  --checkpoint_path ABDALLALSWAITI/vibevoice-arabic-Z
# Optional: append --share to the command above to get a shareable Gradio link
```

### Command Line Inference

```bash
python demo/inference_from_file.py \
  --model_path aoi-ot/VibeVoice-Large \
  --txt_path your_arabic_text.txt \
  --speaker_names Frank \
  --checkpoint_path ABDALLALSWAITI/vibevoice-arabic-Z
```
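The input file for the command above can be written from Python. This is a minimal sketch, assuming the script accepts the same `Speaker N:` line format used elsewhere in this card; `write_script` is a hypothetical helper, not part of VibeVoice. The point it illustrates is saving Arabic text explicitly as UTF-8.

```python
import tempfile
from pathlib import Path


def write_script(lines, path):
    """Write (speaker_index, utterance) pairs as 'Speaker N: text' lines in UTF-8."""
    text = "\n".join(f"Speaker {spk}: {utt}" for spk, utt in lines) + "\n"
    # Explicit UTF-8 avoids platform-dependent default encodings for Arabic text
    Path(path).write_text(text, encoding="utf-8")
    return text


# Demo: write to a temporary directory (use your own path in practice)
txt_path = Path(tempfile.mkdtemp()) / "your_arabic_text.txt"
script = write_script([(0, "مرحبا، كيف حالك؟")], txt_path)
print(script, end="")
```

Pass the resulting file to `--txt_path`.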

### Python API

```python
from vibevoice import VibeVoiceModel

# Load model with Arabic LoRA
model = VibeVoiceModel.from_pretrained(
    "aoi-ot/VibeVoice-Large",
    lora_path="ABDALLALSWAITI/vibevoice-arabic-Z"
)

# Generate speech
text = "Speaker 0: مرحبا، كيف حالك؟"
audio = model.generate(text, speaker_names=["Frank"])
```

## Training Your Own LoRA

### 1. Installation

```bash
git clone https://github.com/voicepowered-ai/VibeVoice-finetuning
cd VibeVoice-finetuning
pip install -e .
pip uninstall -y transformers && pip install transformers==4.51.3
wandb login  # Optional
```

### 2. Prepare Dataset

### Hugging Face Dataset

```python
from datasets import Dataset, Audio

data = {
    "text": [
        "Speaker 0: مرحبا بك.",
        "Speaker 0: كيف يمكنني مساعدتك؟"
    ],
    "audio": [
        "audio1.wav",
        "audio2.wav"
    ]
}

dataset = Dataset.from_dict(data)
dataset = dataset.cast_column("audio", Audio(sampling_rate=24000))
dataset.push_to_hub("your-username/arabic-tts-dataset")
```

Then train with the command below. It targets `vibevoice/VibeVoice-1.5B`; pass `aoi-ot/VibeVoice-Large` instead to fine-tune the 7B model.

```bash
python -m vibevoice.finetune.train_vibevoice \
    --model_name_or_path vibevoice/VibeVoice-1.5B \
    --dataset_name your-username/arabic-tts-dataset \
    --text_column_name text \
    --audio_column_name audio \
    --voice_prompts_column_name audio \
    --output_dir finetune_vibevoice_zac \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 16 \
    --learning_rate 2.5e-5 \
    --num_train_epochs 1 \
    --logging_steps 10 \
    --save_steps 100 \
    --eval_steps 100 \
    --report_to wandb \
    --remove_unused_columns False \
    --bf16 True \
    --do_train \
    --gradient_clipping \
    --gradient_checkpointing False \
    --ddpm_batch_mul 4 \
    --diffusion_loss_weight 1.4 \
    --train_diffusion_head True \
    --ce_loss_weight 0.04 \
    --voice_prompt_drop_rate 0.2 \
    --lora_target_modules q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj \
    --lr_scheduler_type cosine \
    --warmup_ratio 0.03 \
    --max_grad_norm 0.8
```

### Example Dataset

For a reference layout, see [vibevoice/jenny_vibevoice_formatted](https://huggingface.co/datasets/vibevoice/jenny_vibevoice_formatted).

### JSONL File

Alternatively, create a `prompts.jsonl` file:

```json
{"text": "Speaker 0: مرحبا، هذا اختبار.", "audio": "audio1.wav"}
{"text": "Speaker 0: هذا مثال آخر.", "audio": "audio2.wav"}
```
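A file like the one above can be generated and sanity-checked with the standard library. This is a sketch only: `write_prompts_jsonl` and `read_prompts_jsonl` are hypothetical helpers, and the required keys (`text`, `audio`) are taken from the example records rather than from the training script itself.

```python
import json
import tempfile
from pathlib import Path


def write_prompts_jsonl(records, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            # ensure_ascii=False keeps Arabic characters readable instead of \uXXXX escapes
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")


def read_prompts_jsonl(path):
    """Read the file back and check that every record has the expected keys."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            rec = json.loads(line)
            missing = {"text", "audio"} - set(rec)
            if missing:
                raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
            records.append(rec)
    return records


records = [
    {"text": "Speaker 0: مرحبا، هذا اختبار.", "audio": "audio1.wav"},
    {"text": "Speaker 0: هذا مثال آخر.", "audio": "audio2.wav"},
]

# Demo: round-trip through a temporary file
out_path = Path(tempfile.mkdtemp()) / "prompts.jsonl"
write_prompts_jsonl(records, out_path)
loaded = read_prompts_jsonl(out_path)
print(len(loaded), "records written to", out_path)
```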

Or use a Hugging Face dataset with columns:
- `text`: Transcription with speaker labels
- `audio`: 24kHz audio files
- `voice_prompts`: (Optional) Reference voice clips
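Since the `audio` column is expected at 24kHz, it can help to check source files before building the dataset. The sketch below uses Python's built-in `wave` module, which only reads uncompressed PCM WAV files; `find_wrong_rate` and `write_silence` are hypothetical helpers, and the silent clips exist only to make the example self-contained.

```python
import tempfile
import wave
from pathlib import Path

EXPECTED_RATE = 24000  # 24kHz, as listed for the `audio` column


def wav_sample_rate(path):
    """Return the sample rate stored in a PCM WAV header."""
    with wave.open(str(path), "rb") as wav:
        return wav.getframerate()


def find_wrong_rate(paths, expected=EXPECTED_RATE):
    """Return (path, rate) pairs for files whose rate differs from `expected`."""
    mismatched = []
    for p in paths:
        rate = wav_sample_rate(p)
        if rate != expected:
            mismatched.append((str(p), rate))
    return mismatched


def write_silence(path, rate, seconds=0.1):
    """Demo helper: write a short mono 16-bit silent WAV at the given rate."""
    with wave.open(str(path), "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(rate * seconds))


tmp = Path(tempfile.mkdtemp())
good, bad = tmp / "audio1.wav", tmp / "audio2.wav"
write_silence(good, 24000)
write_silence(bad, 16000)

mismatched = find_wrong_rate([good, bad])
print(mismatched)
```

Files reported here need resampling, for example through the `Audio(sampling_rate=24000)` cast shown in the Hugging Face dataset example.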

### 3. Train

```bash
python -m src.finetune_vibevoice_lora \
  --model_name_or_path aoi-ot/VibeVoice-Large \
  --processor_name_or_path src/vibevoice/processor \
  --train_jsonl prompts.jsonl \
  --text_column_name text \
  --audio_column_name audio \
  --output_dir output_arabic_lora \
  --per_device_train_batch_size 8 \
  --gradient_accumulation_steps 16 \
  --learning_rate 2.5e-5 \
  --num_train_epochs 5 \
  --logging_steps 10 \
  --save_steps 100 \
  --report_to wandb \
  --remove_unused_columns False \
  --bf16 True \
  --do_train \
  --gradient_clipping \
  --gradient_checkpointing False \
  --ddpm_batch_mul 4 \
  --diffusion_loss_weight 1.4 \
  --train_diffusion_head True \
  --ce_loss_weight 0.04 \
  --voice_prompt_drop_rate 0.2 \
  --lora_target_modules q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj \
  --lr_scheduler_type cosine \
  --warmup_ratio 0.03 \
  --max_grad_norm 0.8
```

### 4. Use Your Trained LoRA

```bash
python demo/gradio_demo.py \
  --model_path aoi-ot/VibeVoice-Large \
  --checkpoint_path output_arabic_lora/lora/checkpoint-500 \
  --share
```
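If several checkpoints were saved, the newest one can be located programmatically. This sketch assumes the `output_dir/lora/checkpoint-<step>` layout implied by the path above; `latest_checkpoint` is a hypothetical helper. Steps are compared numerically because a plain string sort would rank `checkpoint-500` after `checkpoint-1000`.

```python
import re
import tempfile
from pathlib import Path

CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(lora_dir):
    """Return the checkpoint-<step> subdirectory with the highest step, or None."""
    best_step, best_path = -1, None
    for child in Path(lora_dir).iterdir():
        match = CHECKPOINT_RE.match(child.name)
        if child.is_dir() and match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), child
    return best_path


# Demo: build an empty directory tree that mimics the assumed layout
lora_dir = Path(tempfile.mkdtemp()) / "output_arabic_lora" / "lora"
for step in (100, 500, 1000):
    (lora_dir / f"checkpoint-{step}").mkdir(parents=True)

print(latest_checkpoint(lora_dir).name)  # checkpoint-1000
```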

## Dataset Format

### JSONL Format

**Single Speaker (auto-generated voice prompt):**
```json
{"text": "Speaker 0: النص العربي هنا.", "audio": "/path/to/audio.wav"}
```

**Single Speaker (custom voice prompt):**
```json
{"text": "Speaker 0: النص العربي هنا.", "audio": "/path/to/audio.wav", "voice_prompts": "/path/to/reference.wav"}
```

**Multi-Speaker:**
```json
{"text": "Speaker 0: كيف حالك؟\nSpeaker 1: أنا بخير، شكراً.", "audio": "/path/to/conversation.wav", "voice_prompts": ["/path/to/speaker0_ref.wav", "/path/to/speaker1_ref.wav"]}
```
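The three record shapes above can be checked before training with a small validator. This is an illustrative sketch, not part of the training code: `check_record` is a hypothetical helper, and the rule that the number of voice prompts should equal the number of distinct speakers is an assumption drawn from the multi-speaker example.

```python
import re

# Matches a "Speaker N:" label at the start of any line in the transcript
SPEAKER_RE = re.compile(r"^Speaker (\d+):", re.MULTILINE)


def check_record(rec):
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    speakers = sorted({int(n) for n in SPEAKER_RE.findall(rec.get("text", ""))})
    if not speakers:
        problems.append("text has no 'Speaker N:' label")
    if not rec.get("audio"):
        problems.append("missing audio path")
    prompts = rec.get("voice_prompts")
    if prompts is not None:
        # A single reference is a string; several references are a list
        prompt_list = [prompts] if isinstance(prompts, str) else list(prompts)
        # Assumption: one reference clip per distinct speaker
        if speakers and len(prompt_list) != len(speakers):
            problems.append(
                f"{len(prompt_list)} voice prompts for {len(speakers)} speakers"
            )
    return problems


multi = {
    "text": "Speaker 0: كيف حالك؟\nSpeaker 1: أنا بخير، شكراً.",
    "audio": "/path/to/conversation.wav",
    "voice_prompts": ["/path/to/speaker0_ref.wav", "/path/to/speaker1_ref.wav"],
}
# Same conversation, but with only one reference clip
bad = dict(multi, voice_prompts="/path/to/speaker0_ref.wav")

print(check_record(multi))
print(check_record(bad))
```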


## Training Parameters

| Parameter | Description | Recommended |
|-----------|-------------|-------------|
| `--model_name_or_path` | Base model | `aoi-ot/VibeVoice-Large` |
| `--per_device_train_batch_size` | Batch size per GPU | `8` |
| `--gradient_accumulation_steps` | Gradient accumulation | `16` |
| `--learning_rate` | Learning rate | `2.5e-5` |
| `--num_train_epochs` | Training epochs | `5-10` |
| `--diffusion_loss_weight` | Diffusion loss weight | `1.4` |
| `--ce_loss_weight` | Cross-entropy loss | `0.04` |
| `--voice_prompt_drop_rate` | Voice prompt dropout | `0.2` |
| `--lora_r` | LoRA rank | `8` |
| `--lora_alpha` | LoRA alpha | `32` |
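To make the two LoRA settings concrete: in the standard LoRA formulation the low-rank update is scaled by `alpha / r`, and each adapted linear layer adds `r * (d_in + d_out)` trainable parameters. The sketch below assumes the training script uses this standard scaling; the 4096 x 4096 layer size is a hypothetical value for illustration, not a VibeVoice dimension.

```python
def lora_scale(alpha: float, r: int) -> float:
    """Scaling factor applied to the low-rank update in standard LoRA."""
    return alpha / r


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters for one adapted layer: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


print(lora_scale(alpha=32, r=8))      # 4.0
# Hypothetical layer size, chosen only to show the arithmetic
print(lora_params(4096, 4096, r=8))   # 65536
```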

## Memory Optimization

### For Limited VRAM (32-40GB)

```bash
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 32 \
--gradient_checkpointing True
```
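Halving the per-device batch while doubling gradient accumulation keeps the effective batch size unchanged, so the optimizer still sees the same number of samples per update. A quick check of the arithmetic, assuming a single GPU (`effective_batch_size` is a hypothetical helper):

```python
def effective_batch_size(per_device: int, grad_accum: int, num_gpus: int = 1) -> int:
    """Samples contributing to each optimizer step."""
    return per_device * grad_accum * num_gpus


default = effective_batch_size(per_device=8, grad_accum=16)
low_vram = effective_batch_size(per_device=4, grad_accum=32)
print(default, low_vram)  # 128 128
```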

### Use LoRA on Diffusion Head

```bash
# Replace --train_diffusion_head True with:
--lora_wrap_diffusion_head True
```


## Citation

```bibtex
@misc{vibevoice-arabic-lora,
  author = {ABDALLALSWAITI},
  title = {VibeVoice Arabic LoRA},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/ABDALLALSWAITI/vibevoice-arabic-Z}}
}
```

## Acknowledgements

- Thanks to **Juan Pablo Gallego** from VoicePowered AI for the unofficial training code
- Original VibeVoice by Microsoft Research
- Codebase maintained by the VibeVoice community

## License

This model is released under the MIT License. See the [LICENSE](LICENSE) file for details.

---


### 💖 Support This Project
If you enjoy using this model and would like to support continued development, please consider [buying me a coffee](https://paypal.me/abdallalswaiti). Every contribution helps keep this project going and enables new features!