VibeVoice Arabic LoRA
This is a LoRA (Low-Rank Adaptation) fine-tuned model for Arabic text-to-speech, based on aoi-ot/VibeVoice-Large.
Model Description
- Base Model: aoi-ot/VibeVoice-Large
- Training Method: LoRA fine-tuning
- Language: Arabic
- License: MIT
Requirements
Hardware
- Inference:
  - VibeVoice-1.5B: 6GB+ VRAM
  - VibeVoice-Large (7B): 16GB+ VRAM
- Training (LoRA):
  - VibeVoice-1.5B: 16GB+ VRAM minimum
  - VibeVoice-Large (7B): 48GB+ VRAM minimum
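As a quick sanity check before downloading weights, the figures above can be turned into a small lookup. This is only a sketch: the thresholds are the minimums stated in this README, and `supported_workloads` is a hypothetical helper name, not part of VibeVoice.

```python
# Minimum VRAM figures (in GB), copied from the hardware list above.
MIN_VRAM_GB = {
    ("VibeVoice-1.5B", "inference"): 6,
    ("VibeVoice-1.5B", "lora_training"): 16,
    ("VibeVoice-Large", "inference"): 16,
    ("VibeVoice-Large", "lora_training"): 48,
}


def supported_workloads(vram_gb: float) -> list[tuple[str, str]]:
    """Return the (model, task) pairs whose stated minimum fits in vram_gb."""
    return sorted(key for key, need in MIN_VRAM_GB.items() if vram_gb >= need)


# Example: a hypothetical 24 GB card
for model, task in supported_workloads(24):
    print(f"{model}: {task}")
```

If PyTorch is installed, `torch.cuda.get_device_properties(0).total_memory / 1024**3` gives the value to pass in. Meeting a minimum does not guarantee a given batch size will fit.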
Software
git clone https://github.com/vibevoice-community/VibeVoice.git
cd VibeVoice/
pip install -e .
Usage
Quick Start with Gradio
python demo/gradio_demo.py \
--model_path aoi-ot/VibeVoice-Large \
--checkpoint_path ABDALLALSWAITI/vibevoice-arabic-Z
# Append --share to the command above to get a public Gradio link
Command Line Inference
python demo/inference_from_file.py \
--model_path aoi-ot/VibeVoice-Large \
--txt_path your_arabic_text.txt \
--speaker_names Frank \
--checkpoint_path ABDALLALSWAITI/vibevoice-arabic-Z
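The input file for `--txt_path` has to be saved as UTF-8, otherwise the Arabic text can be corrupted before it reaches the model. The sketch below writes such a file. It assumes the file uses the same `Speaker N:` line format as the other examples in this README, which is not documented here for the text file itself, and `write_script` is a hypothetical helper.

```python
from pathlib import Path


def write_script(lines: list[tuple[int, str]], path: str) -> str:
    """Write (speaker_index, utterance) pairs as 'Speaker N: text' lines in UTF-8."""
    body = "\n".join(f"Speaker {idx}: {text.strip()}" for idx, text in lines)
    # An explicit encoding avoids platform defaults that may not be UTF-8.
    Path(path).write_text(body + "\n", encoding="utf-8")
    return body
```

For example, `write_script([(0, "مرحبا، كيف حالك؟")], "your_arabic_text.txt")` produces a one-line file that can be passed to the command above.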
Python API
from vibevoice import VibeVoiceModel
# Load model with Arabic LoRA
model = VibeVoiceModel.from_pretrained(
    "aoi-ot/VibeVoice-Large",
    lora_path="ABDALLALSWAITI/vibevoice-arabic-Z"
)
# Generate speech
text = "Speaker 0: مرحبا، كيف حالك؟"  # "Hello, how are you?"
audio = model.generate(text, speaker_names=["Frank"])
Training Your Own LoRA
1. Installation
git clone https://github.com/voicepowered-ai/VibeVoice-finetuning
cd VibeVoice-finetuning
pip install -e .
pip uninstall -y transformers && pip install transformers==4.51.3
wandb login # Optional
2. Prepare Dataset
Hugging Face Dataset
from datasets import Dataset, Audio
data = {
    "text": [
        "Speaker 0: مرحبا بك.",  # "Welcome."
        "Speaker 0: كيف يمكنني مساعدتك؟",  # "How can I help you?"
    ],
    "audio": [
        "audio1.wav",
        "audio2.wav",
    ],
}
dataset = Dataset.from_dict(data)
dataset = dataset.cast_column("audio", Audio(sampling_rate=24000))
dataset.push_to_hub("your-username/arabic-tts-dataset")
Then train with:
# To train the 7B model instead, pass aoi-ot/VibeVoice-Large as --model_name_or_path
python -m vibevoice.finetune.train_vibevoice \
--model_name_or_path vibevoice/VibeVoice-1.5B \
--dataset_name your-username/arabic-tts-dataset \
--text_column_name text \
--audio_column_name audio \
--voice_prompts_column_name audio \
--output_dir finetune_vibevoice_zac \
--per_device_train_batch_size 8 \
--gradient_accumulation_steps 16 \
--learning_rate 2.5e-5 \
--num_train_epochs 1 \
--logging_steps 10 \
--save_steps 100 \
--eval_steps 100 \
--report_to wandb \
--remove_unused_columns False \
--bf16 True \
--do_train \
--gradient_clipping \
--gradient_checkpointing False \
--ddpm_batch_mul 4 \
--diffusion_loss_weight 1.4 \
--train_diffusion_head True \
--ce_loss_weight 0.04 \
--voice_prompt_drop_rate 0.2 \
--lora_target_modules q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj \
--lr_scheduler_type cosine \
--warmup_ratio 0.03 \
--max_grad_norm 0.8
For an example of the expected dataset layout, see:
https://huggingface.co/datasets/vibevoice/jenny_vibevoice_formatted
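The dataset snippet above casts the audio column to 24 kHz, and `Audio(sampling_rate=24000)` resamples clips when they are decoded. It can still be useful to know in advance which source files are not already at 24 kHz. A minimal sketch using only the standard library follows. It assumes uncompressed PCM WAV files, and both function names are hypothetical.

```python
import wave


def wav_sample_rate(path: str) -> int:
    """Return the sample rate (Hz) stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate()


def files_needing_resample(paths: list[str], target_hz: int = 24000) -> list[str]:
    """List the WAV files whose sample rate differs from target_hz."""
    return [p for p in paths if wav_sample_rate(p) != target_hz]
```

Calling `files_needing_resample(["audio1.wav", "audio2.wav"])` returns the subset that will be resampled on load.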
Second method: create a prompts.jsonl file with one JSON object per line:
{"text": "Speaker 0: مرحبا، هذا اختبار.", "audio": "audio1.wav"}
{"text": "Speaker 0: هذا مثال آخر.", "audio": "audio2.wav"}
Or use a Hugging Face dataset with columns:
- text: Transcription with speaker labels
- audio: 24kHz audio files
- voice_prompts: (Optional) Reference voice clips
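A prompts.jsonl file like the one described above can be generated from Python rather than written by hand. The sketch below uses `json.dumps(..., ensure_ascii=False)` so the Arabic stays readable in the file. The helper names `to_jsonl` and `write_jsonl` are hypothetical, and the audio paths are placeholders.

```python
import json


def to_jsonl(records: list[dict]) -> str:
    """Serialize records as JSON Lines, keeping Arabic characters unescaped."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records) + "\n"


def write_jsonl(records: list[dict], path: str) -> None:
    """Write records to path with an explicit UTF-8 encoding."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(to_jsonl(records))


records = [
    {"text": "Speaker 0: مرحبا، هذا اختبار.", "audio": "audio1.wav"},
    {"text": "Speaker 0: هذا مثال آخر.", "audio": "audio2.wav"},
]
print(to_jsonl(records), end="")
```

`write_jsonl(records, "prompts.jsonl")` then creates the file passed to `--train_jsonl`.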
3. Train
python -m src.finetune_vibevoice_lora \
--model_name_or_path aoi-ot/VibeVoice-Large \
--processor_name_or_path src/vibevoice/processor \
--train_jsonl prompts.jsonl \
--text_column_name text \
--audio_column_name audio \
--output_dir output_arabic_lora \
--per_device_train_batch_size 8 \
--gradient_accumulation_steps 16 \
--learning_rate 2.5e-5 \
--num_train_epochs 5 \
--logging_steps 10 \
--save_steps 100 \
--report_to wandb \
--remove_unused_columns False \
--bf16 True \
--do_train \
--gradient_clipping \
--gradient_checkpointing False \
--ddpm_batch_mul 4 \
--diffusion_loss_weight 1.4 \
--train_diffusion_head True \
--ce_loss_weight 0.04 \
--voice_prompt_drop_rate 0.2 \
--lora_target_modules q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj \
--lr_scheduler_type cosine \
--warmup_ratio 0.03 \
--max_grad_norm 0.8
4. Use Your Trained LoRA
python demo/gradio_demo.py \
--model_path aoi-ot/VibeVoice-Large \
--checkpoint_path output_arabic_lora/lora/checkpoint-500 \
--share
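With `--save_steps 100`, a run leaves several checkpoints behind. The sketch below picks the one with the highest step number. It assumes checkpoints are stored as `checkpoint-<step>` directories under `output_arabic_lora/lora`, as the path in the command above suggests, and `latest_checkpoint` is a hypothetical helper.

```python
import re
from pathlib import Path
from typing import Optional


def latest_checkpoint(lora_dir: str) -> Optional[Path]:
    """Return the checkpoint-<step> subdirectory with the highest step, or None."""
    best_step, best_path = -1, None
    for child in Path(lora_dir).iterdir():
        match = re.fullmatch(r"checkpoint-(\d+)", child.name)
        # Compare steps numerically so checkpoint-1000 beats checkpoint-500.
        if child.is_dir() and match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), child
    return best_path
```

The returned path can be passed as `--checkpoint_path`. The most recent checkpoint is not necessarily the best-sounding one, so comparing a few by ear is still worthwhile.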
Dataset Format
JSONL Format
Single Speaker (auto-generated voice prompt):
{"text": "Speaker 0: النص العربي هنا.", "audio": "/path/to/audio.wav"}
Single Speaker (custom voice prompt):
{"text": "Speaker 0: النص العربي هنا.", "audio": "/path/to/audio.wav", "voice_prompts": "/path/to/reference.wav"}
Multi-Speaker:
{"text": "Speaker 0: كيف حالك؟\nSpeaker 1: أنا بخير، شكراً.", "audio": "/path/to/conversation.wav", "voice_prompts": ["/path/to/speaker0_ref.wav", "/path/to/speaker1_ref.wav"]}
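Records in the formats above can be checked before a long training run. The sketch below encodes two assumptions inferred from the examples rather than from a documented rule: every line of `text` starts with a `Speaker N:` label, and a list-valued `voice_prompts` has one entry per distinct speaker. `check_record` is a hypothetical helper.

```python
import re

SPEAKER_LINE = re.compile(r"^Speaker (\d+): \S")


def check_record(record: dict) -> list[str]:
    """Return a list of problems found in one JSONL record (empty if none)."""
    problems = []
    speakers = set()
    for line in record.get("text", "").split("\n"):
        match = SPEAKER_LINE.match(line)
        if match:
            speakers.add(int(match.group(1)))
        else:
            problems.append(f"line lacks a 'Speaker N:' label: {line!r}")
    if not record.get("audio"):
        problems.append("missing 'audio' path")
    prompts = record.get("voice_prompts")
    # Assumption: one reference clip per speaker when a list is given.
    if isinstance(prompts, list) and len(prompts) != len(speakers):
        problems.append(f"{len(prompts)} voice prompts for {len(speakers)} speakers")
    return problems
```

Running it over each parsed line of prompts.jsonl and printing any non-empty result catches missing labels and mismatched prompt lists early.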
Training Parameters
| Parameter | Description | Recommended |
|---|---|---|
| --model_name_or_path | Base model | aoi-ot/VibeVoice-Large |
| --per_device_train_batch_size | Batch size per GPU | 8 |
| --gradient_accumulation_steps | Gradient accumulation | 16 |
| --learning_rate | Learning rate | 2.5e-5 |
| --num_train_epochs | Training epochs | 5-10 |
| --diffusion_loss_weight | Diffusion loss weight | 1.4 |
| --ce_loss_weight | Cross-entropy loss | 0.04 |
| --voice_prompt_drop_rate | Voice prompt dropout | 0.2 |
| --lora_r | LoRA rank | 8 |
| --lora_alpha | LoRA alpha | 32 |
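To give a feel for the two LoRA values in the table, the sketch below works through the arithmetic. It assumes the training script follows the standard LoRA formulation, in which the low-rank update is scaled by alpha / r. The 4096 x 4096 layer size is a hypothetical figure chosen for illustration, not a dimension taken from VibeVoice.

```python
def lora_scaling(lora_alpha: int, lora_r: int) -> float:
    """Scaling factor applied to the low-rank update in standard LoRA: alpha / r."""
    return lora_alpha / lora_r


def lora_params(d_in: int, d_out: int, lora_r: int) -> int:
    """Trainable parameters added to one d_in x d_out linear layer: r * (d_in + d_out)."""
    return lora_r * (d_in + d_out)


print(lora_scaling(32, 8))         # 4.0
# Hypothetical 4096 x 4096 projection, for illustration only
print(lora_params(4096, 4096, 8))  # 65536 trainable weights
print(4096 * 4096)                 # 16777216 frozen weights in the same layer
```

Under these assumptions, rank 8 trains well under 1% of the weights of each wrapped layer. This is why LoRA fits in less memory than full fine-tuning.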
Memory Optimization
For Limited VRAM (32-40GB)
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 32 \
--gradient_checkpointing True
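The limited-VRAM settings halve the per-device batch and double the accumulation steps, so the number of samples per optimizer update stays the same as in the recommended configuration. A short sketch of that arithmetic follows. The step estimate is approximate, the 10,000-sample dataset is hypothetical, and both function names are illustrative.

```python
import math


def effective_batch_size(per_device: int, grad_accum: int, num_gpus: int = 1) -> int:
    """Samples consumed per optimizer update."""
    return per_device * grad_accum * num_gpus


def approx_steps_per_epoch(num_samples: int, per_device: int, grad_accum: int,
                           num_gpus: int = 1) -> int:
    """Rough optimizer steps per epoch, ignoring how a partial last batch is handled."""
    return math.ceil(num_samples / effective_batch_size(per_device, grad_accum, num_gpus))


print(effective_batch_size(8, 16))            # 128 (recommended settings)
print(effective_batch_size(4, 32))            # 128 (limited-VRAM settings)
print(approx_steps_per_epoch(10_000, 8, 16))  # 79 steps for a hypothetical 10,000 clips
```

Because the effective batch size is unchanged, the learning rate and schedule from the main command should not need retuning. Expect slower steps, since gradient checkpointing trades compute for memory.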
Use LoRA on Diffusion Head
# Replace --train_diffusion_head True with:
--lora_wrap_diffusion_head True
Citation
@misc{vibevoice-arabic-lora,
  author = {ABDALLALSWAITI},
  title = {VibeVoice Arabic LoRA},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/ABDALLALSWAITI/vibevoice-arabic-Z}}
}
Acknowledgements
- Thanks to Juan Pablo Gallego from VoicePowered AI for the unofficial training code
- Original VibeVoice by Microsoft Research
- The VibeVoice community for maintaining the codebase
License
This model is released under the MIT License. See the LICENSE file for details.