VibeVoice Arabic LoRA

This is a LoRA (Low-Rank Adaptation) fine-tuned model for Arabic text-to-speech, based on aoi-ot/VibeVoice-Large.

Model Description

This repository contains LoRA adapter weights that adapt aoi-ot/VibeVoice-Large for Arabic speech synthesis. The adapter is loaded on top of the base model at inference time; the base model weights are not included here.

Requirements

Hardware

  • Inference:
    • VibeVoice-1.5B: 6GB+ VRAM
    • VibeVoice-Large (7B): 16GB+ VRAM
  • Training (LoRA):
    • VibeVoice-1.5B: 16GB+ VRAM minimum
    • VibeVoice-Large (7B): 48GB+ VRAM minimum

Software

git clone https://github.com/vibevoice-community/VibeVoice.git
cd VibeVoice/
pip install -e .

Usage

Quick Start with Gradio

python demo/gradio_demo.py \
  --model_path aoi-ot/VibeVoice-Large \
  --checkpoint_path ABDALLALSWAITI/vibevoice-arabic-Z
  # add --share to expose a public Gradio link

Command Line Inference

python demo/inference_from_file.py \
  --model_path aoi-ot/VibeVoice-Large \
  --txt_path your_arabic_text.txt \
  --speaker_names Frank \
  --checkpoint_path ABDALLALSWAITI/vibevoice-arabic-Z
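
The input file uses the same speaker-labeled format shown throughout this card; for example, your_arabic_text.txt could contain:

Speaker 0: مرحبا، كيف حالك؟
Speaker 0: كيف يمكنني مساعدتك؟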

Python API

from vibevoice import VibeVoiceModel

# Load model with Arabic LoRA
model = VibeVoiceModel.from_pretrained(
    "aoi-ot/VibeVoice-Large",
    lora_path="ABDALLALSWAITI/vibevoice-arabic-Z"
)

# Generate speech
text = "Speaker 0: ู…ุฑุญุจุงุŒ ูƒูŠู ุญุงู„ูƒุŸ"
audio = model.generate(text, speaker_names=["Frank"])
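
The snippet above assumes generate returns a waveform array at the 24kHz rate used throughout this card. A minimal sketch for saving it with the soundfile library (an assumption; not part of the documented VibeVoice API):

import soundfile as sf

# assumes `audio` is a 1-D float waveform at 24kHz
sf.write("output.wav", audio, 24000)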

Training Your Own LoRA

1. Installation

git clone https://github.com/voicepowered-ai/VibeVoice-finetuning
cd VibeVoice-finetuning
pip install -e .
pip uninstall -y transformers && pip install transformers==4.51.3
wandb login  # Optional
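
To confirm the pinned transformers version is active:

python -c "import transformers; print(transformers.__version__)"  # expected: 4.51.3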

2. Prepare Dataset

Hugging Face Dataset

from datasets import Dataset, Audio

data = {
    "text": [
        "Speaker 0: ู…ุฑุญุจุง ุจูƒ.",
        "Speaker 0: ูƒูŠู ูŠู…ูƒู†ู†ูŠ ู…ุณุงุนุฏุชูƒุŸ"
    ],
    "audio": [
        "audio1.wav",
        "audio2.wav"
    ]
}

dataset = Dataset.from_dict(data)
dataset = dataset.cast_column("audio", Audio(sampling_rate=24000))
dataset.push_to_hub("your-username/arabic-tts-dataset")
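
Note that push_to_hub requires an authenticated Hugging Face session:

huggingface-cli login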

Then train with (substitute aoi-ot/VibeVoice-Large for vibevoice/VibeVoice-1.5B to fine-tune the 7B model):

python -m vibevoice.finetune.train_vibevoice \
    --model_name_or_path vibevoice/VibeVoice-1.5B \
    --dataset_name your-username/arabic-tts-dataset \
    --text_column_name text \
    --audio_column_name audio \
    --voice_prompts_column_name audio \
    --output_dir finetune_vibevoice_zac \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 16 \
    --learning_rate 2.5e-5 \
    --num_train_epochs 1 \
    --logging_steps 10 \
    --save_steps 100 \
    --eval_steps 100 \
    --report_to wandb \
    --remove_unused_columns False \
    --bf16 True \
    --do_train \
    --gradient_clipping \
    --gradient_checkpointing False \
    --ddpm_batch_mul 4 \
    --diffusion_loss_weight 1.4 \
    --train_diffusion_head True \
    --ce_loss_weight 0.04 \
    --voice_prompt_drop_rate 0.2 \
    --lora_target_modules q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj \
    --lr_scheduler_type cosine \
    --warmup_ratio 0.03 \
    --max_grad_norm 0.8

For an example of a correctly formatted dataset, see:

https://huggingface.co/datasets/vibevoice/jenny_vibevoice_formatted

JSONL File

Alternatively, create a prompts.jsonl file:

{"text": "Speaker 0: ู…ุฑุญุจุงุŒ ู‡ุฐุง ุงุฎุชุจุงุฑ.", "audio": "audio1.wav"}
{"text": "Speaker 0: ู‡ุฐุง ู…ุซุงู„ ุขุฎุฑ.", "audio": "audio2.wav"}

Or use a Hugging Face dataset with columns:

  • text: Transcription with speaker labels
  • audio: 24kHz audio files
  • voice_prompts: (Optional) Reference voice clips

3. Train

python -m src.finetune_vibevoice_lora \
  --model_name_or_path aoi-ot/VibeVoice-Large \
  --processor_name_or_path src/vibevoice/processor \
  --train_jsonl prompts.jsonl \
  --text_column_name text \
  --audio_column_name audio \
  --output_dir output_arabic_lora \
  --per_device_train_batch_size 8 \
  --gradient_accumulation_steps 16 \
  --learning_rate 2.5e-5 \
  --num_train_epochs 5 \
  --logging_steps 10 \
  --save_steps 100 \
  --report_to wandb \
  --remove_unused_columns False \
  --bf16 True \
  --do_train \
  --gradient_clipping \
  --gradient_checkpointing False \
  --ddpm_batch_mul 4 \
  --diffusion_loss_weight 1.4 \
  --train_diffusion_head True \
  --ce_loss_weight 0.04 \
  --voice_prompt_drop_rate 0.2 \
  --lora_target_modules q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj \
  --lr_scheduler_type cosine \
  --warmup_ratio 0.03 \
  --max_grad_norm 0.8

4. Use Your Trained LoRA

python demo/gradio_demo.py \
  --model_path aoi-ot/VibeVoice-Large \
  --checkpoint_path output_arabic_lora/lora/checkpoint-500 \
  --share

Dataset Format

JSONL Format

Single Speaker (auto-generated voice prompt):

{"text": "Speaker 0: ุงู„ู†ุต ุงู„ุนุฑุจูŠ ู‡ู†ุง.", "audio": "/path/to/audio.wav"}

Single Speaker (custom voice prompt):

{"text": "Speaker 0: ุงู„ู†ุต ุงู„ุนุฑุจูŠ ู‡ู†ุง.", "audio": "/path/to/audio.wav", "voice_prompts": "/path/to/reference.wav"}

Multi-Speaker:

{"text": "Speaker 0: ูƒูŠู ุญุงู„ูƒุŸ\nSpeaker 1: ุฃู†ุง ุจุฎูŠุฑุŒ ุดูƒุฑุงู‹.", "audio": "/path/to/conversation.wav", "voice_prompts": ["/path/to/speaker0_ref.wav", "/path/to/speaker1_ref.wav"]}

Training Parameters

| Parameter | Description | Recommended |
| --- | --- | --- |
| --model_name_or_path | Base model | aoi-ot/VibeVoice-Large |
| --per_device_train_batch_size | Batch size per GPU | 8 |
| --gradient_accumulation_steps | Gradient accumulation steps | 16 |
| --learning_rate | Learning rate | 2.5e-5 |
| --num_train_epochs | Training epochs | 5-10 |
| --diffusion_loss_weight | Diffusion loss weight | 1.4 |
| --ce_loss_weight | Cross-entropy loss weight | 0.04 |
| --voice_prompt_drop_rate | Voice prompt dropout | 0.2 |
| --lora_r | LoRA rank | 8 |
| --lora_alpha | LoRA alpha | 32 |
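
With the recommended settings, one optimizer step covers 8 × 16 = 128 samples per GPU (per_device_train_batch_size × gradient_accumulation_steps).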

Memory Optimization

For Limited VRAM (32-40GB)

--per_device_train_batch_size 4 \
--gradient_accumulation_steps 32 \
--gradient_checkpointing True
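
These values keep the effective batch size at 4 × 32 = 128, matching the recommended configuration, while gradient checkpointing trades extra compute for lower activation memory.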

Use LoRA on Diffusion Head

# Replace --train_diffusion_head True with:
--lora_wrap_diffusion_head True
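
This applies a low-rank adapter to the diffusion head instead of training it in full, which should reduce optimizer and gradient memory at some potential cost in output quality.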

Citation

@misc{vibevoice-arabic-lora,
  author = {ABDALLALSWAITI},
  title = {VibeVoice Arabic LoRA},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/ABDALLALSWAITI/vibevoice-arabic-Z}}
}

Acknowledgements

  • Thanks to Juan Pablo Gallego from VoicePowered AI for the unofficial training code
  • Original VibeVoice by Microsoft Research
  • Community repository maintained by the VibeVoice community

License

This model is released under the MIT License. See the LICENSE file for details.


๐Ÿ’– Support This Project

If you enjoy using this model and would like to support continued development, please consider buying me a coffee. Every contribution helps keep this project going and enables new features!
