---
license: mit
language:
  - ar
  - en
base_model: aoi-ot/VibeVoice-Large
tags:
  - text-to-speech
  - tts
  - audio
  - vibevoice
  - lora
  - arabic
pipeline_tag: text-to-speech
---

# VibeVoice Arabic LoRA

## Model Description

This is a LoRA (Low-Rank Adaptation) fine-tuned model for Arabic text-to-speech, based on [aoi-ot/VibeVoice-Large](https://huggingface.co/aoi-ot/VibeVoice-Large).

## Requirements

### Hardware

- **Inference:**
  - VibeVoice-1.5B: 6 GB+ VRAM
  - VibeVoice-Large (7B): 16 GB+ VRAM
- **Training (LoRA):**
  - VibeVoice-1.5B: 16 GB+ VRAM minimum
  - VibeVoice-Large (7B): 48 GB+ VRAM minimum
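
VRAM is easy to check programmatically before picking a model size. A minimal sketch using PyTorch (assuming a single CUDA device; the thresholds simply mirror the inference figures above):

```python
import torch

# Report total VRAM and suggest a model size based on the table above.
if torch.cuda.is_available():
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    if total_gb >= 16:
        print(f"{total_gb:.0f} GB VRAM: VibeVoice-Large (7B) inference should fit.")
    elif total_gb >= 6:
        print(f"{total_gb:.0f} GB VRAM: use VibeVoice-1.5B for inference.")
    else:
        print(f"{total_gb:.0f} GB VRAM: below the documented minimums.")
else:
    print("No CUDA device detected.")
```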

### Software

```bash
git clone https://github.com/vibevoice-community/VibeVoice.git
cd VibeVoice/
pip install -e .
```

## Usage

### Quick Start with Gradio

```bash
python demo/gradio_demo.py \
  --model_path aoi-ot/VibeVoice-Large \
  --checkpoint_path ABDALLALSWAITI/vibevoice-arabic-Z
# add --share to expose a public Gradio link
```

### Command-Line Inference

```bash
python demo/inference_from_file.py \
  --model_path aoi-ot/VibeVoice-Large \
  --txt_path your_arabic_text.txt \
  --speaker_names Frank \
  --checkpoint_path ABDALLALSWAITI/vibevoice-arabic-Z
```
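
The input file is expected to use the same `Speaker N:` labelling convention as the examples in this card; a sample `your_arabic_text.txt` might look like:

```text
Speaker 0: مرحبا، هذا اختبار لتوليد الكلام باللغة العربية.
Speaker 0: يمكنك كتابة عدة أسطر في نفس الملف.
```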

### Python API

```python
from vibevoice import VibeVoiceModel

# Load the base model with the Arabic LoRA applied
model = VibeVoiceModel.from_pretrained(
    "aoi-ot/VibeVoice-Large",
    lora_path="ABDALLALSWAITI/vibevoice-arabic-Z"
)

# Generate speech from speaker-labelled text
text = "Speaker 0: مرحبا، كيف حالك؟"
audio = model.generate(text, speaker_names=["Frank"])
```
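
To persist the result, a hedged follow-up (assuming `generate` returns a 24 kHz waveform array and that the `soundfile` package is installed):

```python
import soundfile as sf

# VibeVoice operates at 24 kHz; convert `audio` to a NumPy array first
# if your build returns a tensor.
sf.write("output.wav", audio, samplerate=24000)
```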

## Training Your Own LoRA

### 1. Installation

```bash
git clone https://github.com/voicepowered-ai/VibeVoice-finetuning
cd VibeVoice-finetuning
pip install -e .
pip uninstall -y transformers && pip install transformers==4.51.3
wandb login  # Optional
```

### 2. Prepare Dataset

#### Option A: Hugging Face Dataset

```python
from datasets import Dataset, Audio

data = {
    "text": [
        "Speaker 0: مرحبا بك.",
        "Speaker 0: كيف يمكنني مساعدتك؟"
    ],
    "audio": [
        "audio1.wav",
        "audio2.wav"
    ]
}

dataset = Dataset.from_dict(data)
dataset = dataset.cast_column("audio", Audio(sampling_rate=24000))
dataset.push_to_hub("your-username/arabic-tts-dataset")
```
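
If your corpus lives on disk as paired files instead (hypothetical layout: each `clips/*.wav` next to a `clips/*.txt` transcript), a sketch that builds the same dataset automatically:

```python
from pathlib import Path
from datasets import Dataset, Audio

texts, audios = [], []
for wav in sorted(Path("clips").glob("*.wav")):
    transcript = wav.with_suffix(".txt").read_text(encoding="utf-8").strip()
    # Prepend the speaker label expected by the trainer if it is missing.
    if not transcript.startswith("Speaker"):
        transcript = f"Speaker 0: {transcript}"
    texts.append(transcript)
    audios.append(str(wav))

dataset = Dataset.from_dict({"text": texts, "audio": audios})
dataset = dataset.cast_column("audio", Audio(sampling_rate=24000))
dataset.push_to_hub("your-username/arabic-tts-dataset")
```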

Then train with:

```bash
# Use aoi-ot/VibeVoice-Large instead of vibevoice/VibeVoice-1.5B to train the 7B model.
python -m vibevoice.finetune.train_vibevoice \
    --model_name_or_path vibevoice/VibeVoice-1.5B \
    --dataset_name your-username/arabic-tts-dataset \
    --text_column_name text \
    --audio_column_name audio \
    --voice_prompts_column_name audio \
    --output_dir finetune_vibevoice_zac \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 16 \
    --learning_rate 2.5e-5 \
    --num_train_epochs 1 \
    --logging_steps 10 \
    --save_steps 100 \
    --eval_steps 100 \
    --report_to wandb \
    --remove_unused_columns False \
    --bf16 True \
    --do_train \
    --gradient_clipping \
    --gradient_checkpointing False \
    --ddpm_batch_mul 4 \
    --diffusion_loss_weight 1.4 \
    --train_diffusion_head True \
    --ce_loss_weight 0.04 \
    --voice_prompt_drop_rate 0.2 \
    --lora_target_modules q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj \
    --lr_scheduler_type cosine \
    --warmup_ratio 0.03 \
    --max_grad_norm 0.8
```

For an example of a correctly formatted dataset, see [vibevoice/jenny_vibevoice_formatted](https://huggingface.co/datasets/vibevoice/jenny_vibevoice_formatted).

#### Option B: JSONL File

Create a `prompts.jsonl` file:

{"text": "Speaker 0: مرحبا، هذا اختبار.", "audio": "audio1.wav"}
{"text": "Speaker 0: هذا مثال آخر.", "audio": "audio2.wav"}

Or use a Hugging Face dataset with columns:

- `text`: transcription with speaker labels
- `audio`: 24 kHz audio files
- `voice_prompts`: (optional) reference voice clips

### 3. Train

```bash
python -m src.finetune_vibevoice_lora \
  --model_name_or_path aoi-ot/VibeVoice-Large \
  --processor_name_or_path src/vibevoice/processor \
  --train_jsonl prompts.jsonl \
  --text_column_name text \
  --audio_column_name audio \
  --output_dir output_arabic_lora \
  --per_device_train_batch_size 8 \
  --gradient_accumulation_steps 16 \
  --learning_rate 2.5e-5 \
  --num_train_epochs 5 \
  --logging_steps 10 \
  --save_steps 100 \
  --report_to wandb \
  --remove_unused_columns False \
  --bf16 True \
  --do_train \
  --gradient_clipping \
  --gradient_checkpointing False \
  --ddpm_batch_mul 4 \
  --diffusion_loss_weight 1.4 \
  --train_diffusion_head True \
  --ce_loss_weight 0.04 \
  --voice_prompt_drop_rate 0.2 \
  --lora_target_modules q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj \
  --lr_scheduler_type cosine \
  --warmup_ratio 0.03 \
  --max_grad_norm 0.8
```

### 4. Use Your Trained LoRA

```bash
python demo/gradio_demo.py \
  --model_path aoi-ot/VibeVoice-Large \
  --checkpoint_path output_arabic_lora/lora/checkpoint-500 \
  --share
```

## Dataset Format

### JSONL Format

**Single Speaker (auto-generated voice prompt):**

```json
{"text": "Speaker 0: النص العربي هنا.", "audio": "/path/to/audio.wav"}
```

**Single Speaker (custom voice prompt):**

```json
{"text": "Speaker 0: النص العربي هنا.", "audio": "/path/to/audio.wav", "voice_prompts": "/path/to/reference.wav"}
```

**Multi-Speaker:**

```json
{"text": "Speaker 0: كيف حالك؟\nSpeaker 1: أنا بخير، شكراً.", "audio": "/path/to/conversation.wav", "voice_prompts": ["/path/to/speaker0_ref.wav", "/path/to/speaker1_ref.wav"]}
```

## Training Parameters

| Parameter | Description | Recommended |
|---|---|---|
| `--model_name_or_path` | Base model | `aoi-ot/VibeVoice-Large` |
| `--per_device_train_batch_size` | Batch size per GPU | 8 |
| `--gradient_accumulation_steps` | Gradient accumulation steps | 16 |
| `--learning_rate` | Learning rate | 2.5e-5 |
| `--num_train_epochs` | Training epochs | 5-10 |
| `--diffusion_loss_weight` | Diffusion loss weight | 1.4 |
| `--ce_loss_weight` | Cross-entropy loss weight | 0.04 |
| `--voice_prompt_drop_rate` | Voice prompt dropout | 0.2 |
| `--lora_r` | LoRA rank | 8 |
| `--lora_alpha` | LoRA alpha | 32 |
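
With the recommended values on a single GPU, the effective batch size works out to `per_device_train_batch_size × gradient_accumulation_steps × num_gpus` = 8 × 16 × 1 = 128 samples per optimizer step.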

## Memory Optimization

### For Limited VRAM (32-40 GB)

```bash
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 32 \
--gradient_checkpointing True
```
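
Halving the per-device batch while doubling accumulation keeps the effective batch size at 4 × 32 = 128, so training dynamics should be largely unchanged; enabling gradient checkpointing then trades extra compute for a further cut in activation memory.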

### Use LoRA on the Diffusion Head

```bash
# Replace --train_diffusion_head True with:
--lora_wrap_diffusion_head True
```

## Citation

```bibtex
@misc{vibevoice-arabic-lora,
  author = {ABDALLALSWAITI},
  title = {VibeVoice Arabic LoRA},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/ABDALLALSWAITI/vibevoice-arabic-Z}}
}
```

## Acknowledgements

- Thanks to Juan Pablo Gallego from VoicePowered AI for the unofficial training code
- Original VibeVoice by Microsoft Research
- Community fork maintained by the VibeVoice community

## License

This model is released under the MIT License. See the LICENSE file for details.


## 💖 Support This Project

If you enjoy using this model and would like to support continued development, please consider buying me a coffee. Every contribution helps keep the project going and enables new features!