VibeVoice Arabic LoRA

This is a LoRA (Low-Rank Adaptation) fine-tuned model for Arabic text-to-speech, based on aoi-ot/VibeVoice-Large.

Model Description

This repository contains LoRA adapter weights that adapt aoi-ot/VibeVoice-Large for Arabic speech synthesis. The adapter is loaded on top of the base model at inference time; the base model weights are not included here.

Requirements

Hardware

  • Inference:
    • VibeVoice-1.5B: 6GB+ VRAM
    • VibeVoice-Large (7B): 16GB+ VRAM
  • Training (LoRA):
    • VibeVoice-1.5B: 16GB+ VRAM minimum
    • VibeVoice-Large (7B): 48GB+ VRAM minimum

Software

git clone https://github.com/vibevoice-community/VibeVoice.git
cd VibeVoice/
pip install -e .

Usage

Quick Start with Gradio

python demo/gradio_demo.py \
  --model_path aoi-ot/VibeVoice-Large \
  --checkpoint_path ABDALLALSWAITI/vibevoice-arabic-Z
  # add --share to expose a public Gradio link

Command Line Inference

python demo/inference_from_file.py \
  --model_path aoi-ot/VibeVoice-Large \
  --txt_path your_arabic_text.txt \
  --speaker_names Frank \
  --checkpoint_path ABDALLALSWAITI/vibevoice-arabic-Z
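
The input file uses the same speaker-labeled format shown throughout this card; for example, your_arabic_text.txt could contain:

Speaker 0: مرحبا، كيف حالك؟
Speaker 0: كيف يمكنني مساعدتك؟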

Python API

from vibevoice import VibeVoiceModel

# Load model with Arabic LoRA
model = VibeVoiceModel.from_pretrained(
    "aoi-ot/VibeVoice-Large",
    lora_path="ABDALLALSWAITI/vibevoice-arabic-Z"
)

# Generate speech
text = "Speaker 0: ู…ุฑุญุจุงุŒ ูƒูŠู ุญุงู„ูƒุŸ"
audio = model.generate(text, speaker_names=["Frank"])
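
The snippet above assumes generate returns a waveform array at the 24kHz rate used throughout this card. A minimal sketch for saving it with the soundfile library (an assumption; not part of the documented VibeVoice API):

import soundfile as sf

# assumes `audio` is a 1-D float waveform at 24kHz
sf.write("output.wav", audio, 24000)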

Training Your Own LoRA

1. Installation

git clone https://github.com/voicepowered-ai/VibeVoice-finetuning
cd VibeVoice-finetuning
pip install -e .
pip uninstall -y transformers && pip install transformers==4.51.3
wandb login  # Optional
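
To confirm the pinned transformers version is active:

python -c "import transformers; print(transformers.__version__)"  # expected: 4.51.3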

2. Prepare Dataset

Hugging Face Dataset

from datasets import Dataset, Audio

data = {
    "text": [
        "Speaker 0: ู…ุฑุญุจุง ุจูƒ.",
        "Speaker 0: ูƒูŠู ูŠู…ูƒู†ู†ูŠ ู…ุณุงุนุฏุชูƒุŸ"
    ],
    "audio": [
        "audio1.wav",
        "audio2.wav"
    ]
}

dataset = Dataset.from_dict(data)
dataset = dataset.cast_column("audio", Audio(sampling_rate=24000))
dataset.push_to_hub("your-username/arabic-tts-dataset")
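
Note that push_to_hub requires an authenticated Hugging Face session:

huggingface-cli login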

Then train with (substitute aoi-ot/VibeVoice-Large for vibevoice/VibeVoice-1.5B to fine-tune the 7B model):

python -m vibevoice.finetune.train_vibevoice \
    --model_name_or_path vibevoice/VibeVoice-1.5B \
    --dataset_name your-username/arabic-tts-dataset \
    --text_column_name text \
    --audio_column_name audio \
    --voice_prompts_column_name audio \
    --output_dir finetune_vibevoice_zac \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 16 \
    --learning_rate 2.5e-5 \
    --num_train_epochs 1 \
    --logging_steps 10 \
    --save_steps 100 \
    --eval_steps 100 \
    --report_to wandb \
    --remove_unused_columns False \
    --bf16 True \
    --do_train \
    --gradient_clipping \
    --gradient_checkpointing False \
    --ddpm_batch_mul 4 \
    --diffusion_loss_weight 1.4 \
    --train_diffusion_head True \
    --ce_loss_weight 0.04 \
    --voice_prompt_drop_rate 0.2 \
    --lora_target_modules q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj \
    --lr_scheduler_type cosine \
    --warmup_ratio 0.03 \
    --max_grad_norm 0.8

For an example of a correctly formatted dataset, see:

https://huggingface.co/datasets/vibevoice/jenny_vibevoice_formatted

JSONL File

Alternatively, create a prompts.jsonl file:

{"text": "Speaker 0: ู…ุฑุญุจุงุŒ ู‡ุฐุง ุงุฎุชุจุงุฑ.", "audio": "audio1.wav"}
{"text": "Speaker 0: ู‡ุฐุง ู…ุซุงู„ ุขุฎุฑ.", "audio": "audio2.wav"}

Or use a Hugging Face dataset with columns:

  • text: Transcription with speaker labels
  • audio: 24kHz audio files
  • voice_prompts: (Optional) Reference voice clips

3. Train

python -m src.finetune_vibevoice_lora \
  --model_name_or_path aoi-ot/VibeVoice-Large \
  --processor_name_or_path src/vibevoice/processor \
  --train_jsonl prompts.jsonl \
  --text_column_name text \
  --audio_column_name audio \
  --output_dir output_arabic_lora \
  --per_device_train_batch_size 8 \
  --gradient_accumulation_steps 16 \
  --learning_rate 2.5e-5 \
  --num_train_epochs 5 \
  --logging_steps 10 \
  --save_steps 100 \
  --report_to wandb \
  --remove_unused_columns False \
  --bf16 True \
  --do_train \
  --gradient_clipping \
  --gradient_checkpointing False \
  --ddpm_batch_mul 4 \
  --diffusion_loss_weight 1.4 \
  --train_diffusion_head True \
  --ce_loss_weight 0.04 \
  --voice_prompt_drop_rate 0.2 \
  --lora_target_modules q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj \
  --lr_scheduler_type cosine \
  --warmup_ratio 0.03 \
  --max_grad_norm 0.8

4. Use Your Trained LoRA

python demo/gradio_demo.py \
  --model_path aoi-ot/VibeVoice-Large \
  --checkpoint_path output_arabic_lora/lora/checkpoint-500 \
  --share

Dataset Format

JSONL Format

Single Speaker (auto-generated voice prompt):

{"text": "Speaker 0: ุงู„ู†ุต ุงู„ุนุฑุจูŠ ู‡ู†ุง.", "audio": "/path/to/audio.wav"}

Single Speaker (custom voice prompt):

{"text": "Speaker 0: ุงู„ู†ุต ุงู„ุนุฑุจูŠ ู‡ู†ุง.", "audio": "/path/to/audio.wav", "voice_prompts": "/path/to/reference.wav"}

Multi-Speaker:

{"text": "Speaker 0: ูƒูŠู ุญุงู„ูƒุŸ\nSpeaker 1: ุฃู†ุง ุจุฎูŠุฑุŒ ุดูƒุฑุงู‹.", "audio": "/path/to/conversation.wav", "voice_prompts": ["/path/to/speaker0_ref.wav", "/path/to/speaker1_ref.wav"]}

Training Parameters

| Parameter | Description | Recommended |
| --- | --- | --- |
| --model_name_or_path | Base model | aoi-ot/VibeVoice-Large |
| --per_device_train_batch_size | Batch size per GPU | 8 |
| --gradient_accumulation_steps | Gradient accumulation steps | 16 |
| --learning_rate | Learning rate | 2.5e-5 |
| --num_train_epochs | Training epochs | 5-10 |
| --diffusion_loss_weight | Diffusion loss weight | 1.4 |
| --ce_loss_weight | Cross-entropy loss weight | 0.04 |
| --voice_prompt_drop_rate | Voice prompt dropout | 0.2 |
| --lora_r | LoRA rank | 8 |
| --lora_alpha | LoRA alpha | 32 |
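
With the recommended settings, one optimizer step covers 8 × 16 = 128 samples per GPU (per_device_train_batch_size × gradient_accumulation_steps).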

Memory Optimization

For Limited VRAM (32-40GB)

--per_device_train_batch_size 4 \
--gradient_accumulation_steps 32 \
--gradient_checkpointing True
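
These values keep the effective batch size at 4 × 32 = 128, matching the recommended configuration, while gradient checkpointing trades extra compute for lower activation memory.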

Use LoRA on Diffusion Head

# Replace --train_diffusion_head True with:
--lora_wrap_diffusion_head True
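
This applies a low-rank adapter to the diffusion head instead of training it in full, which should reduce optimizer and gradient memory at some potential cost in output quality.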

Citation

@misc{vibevoice-arabic-lora,
  author = {ABDALLALSWAITI},
  title = {VibeVoice Arabic LoRA},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/ABDALLALSWAITI/vibevoice-arabic-Z}}
}

Acknowledgements

  • Thanks to Juan Pablo Gallego from VoicePowered AI for the unofficial training code
  • Original VibeVoice by Microsoft Research
  • Community repository maintained by the VibeVoice community

License

This model is released under the MIT License. See the LICENSE file for details.


๐Ÿ’– Support This Project

If you enjoy using this model and would like to support continued development, please consider buying me a coffee. Every contribution helps keep this project going and enables new features!
