---
license: mit
language:
- ar
- en
base_model: aoi-ot/VibeVoice-Large
tags:
- text-to-speech
- tts
- audio
- vibevoice
- lora
- arabic
pipeline_tag: text-to-speech
---

# VibeVoice Arabic LoRA

This is a LoRA (Low-Rank Adaptation) adapter fine-tuned for Arabic text-to-speech on top of `aoi-ot/VibeVoice-Large`.

## Model Description

- **Base Model**: [aoi-ot/VibeVoice-Large](https://huggingface.co/aoi-ot/VibeVoice-Large)
- **Training Method**: LoRA fine-tuning
- **Language**: Arabic
- **License**: MIT

## Requirements

### Hardware

- **Inference**:
  - VibeVoice-1.5B: 6GB+ VRAM
  - VibeVoice-Large (7B): 16GB+ VRAM
- **Training**:
  - VibeVoice-1.5B LoRA: 16GB+ VRAM minimum
  - VibeVoice-Large (7B) LoRA: 48GB+ VRAM minimum

### Software

```bash
git clone https://github.com/vibevoice-community/VibeVoice.git
cd VibeVoice/
pip install -e .
```

## Usage

### Quick Start with Gradio

```bash
# Optional: append --share to the command below.
python demo/gradio_demo.py \
  --model_path aoi-ot/VibeVoice-Large \
  --checkpoint_path ABDALLALSWAITI/vibevoice-arabic-Z
```

### Command Line Inference

```bash
python demo/inference_from_file.py \
  --model_path aoi-ot/VibeVoice-Large \
  --txt_path your_arabic_text.txt \
  --speaker_names Frank \
  --checkpoint_path ABDALLALSWAITI/vibevoice-arabic-Z
```

### Python API

```python
from vibevoice import VibeVoiceModel

# Load model with Arabic LoRA
model = VibeVoiceModel.from_pretrained(
    "aoi-ot/VibeVoice-Large",
    lora_path="ABDALLALSWAITI/vibevoice-arabic-Z"
)

# Generate speech
text = "Speaker 0: مرحبا، كيف حالك؟"
audio = model.generate(text, speaker_names=["Frank"])
```

## Training Your Own LoRA

### 1. Installation

```bash
git clone https://github.com/voicepowered-ai/VibeVoice-finetuning
cd VibeVoice-finetuning
pip install -e .
pip uninstall -y transformers && pip install transformers==4.51.3
wandb login  # Optional
```

### 2. Prepare Dataset

#### Option A: Hugging Face Dataset

```python
from datasets import Dataset, Audio

data = {
    "text": [
        "Speaker 0: مرحبا بك.",
        "Speaker 0: كيف يمكنني مساعدتك؟"
    ],
    "audio": [
        "audio1.wav",
        "audio2.wav"
    ]
}

dataset = Dataset.from_dict(data)
# Decode the audio column at 24 kHz
dataset = dataset.cast_column("audio", Audio(sampling_rate=24000))
dataset.push_to_hub("your-username/arabic-tts-dataset")
```

Then train with:

```bash
# To train the 7B model, pass aoi-ot/VibeVoice-Large as --model_name_or_path.
python -m vibevoice.finetune.train_vibevoice \
  --model_name_or_path vibevoice/VibeVoice-1.5B \
  --dataset_name your-username/arabic-tts-dataset \
  --text_column_name text \
  --audio_column_name audio \
  --voice_prompts_column_name audio \
  --output_dir finetune_vibevoice_zac \
  --per_device_train_batch_size 8 \
  --gradient_accumulation_steps 16 \
  --learning_rate 2.5e-5 \
  --num_train_epochs 1 \
  --logging_steps 10 \
  --save_steps 100 \
  --eval_steps 100 \
  --report_to wandb \
  --remove_unused_columns False \
  --bf16 True \
  --do_train \
  --gradient_clipping \
  --gradient_checkpointing False \
  --ddpm_batch_mul 4 \
  --diffusion_loss_weight 1.4 \
  --train_diffusion_head True \
  --ce_loss_weight 0.04 \
  --voice_prompt_drop_rate 0.2 \
  --lora_target_modules q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj \
  --lr_scheduler_type cosine \
  --warmup_ratio 0.03 \
  --max_grad_norm 0.8
```

For an example of a correctly formatted dataset, see [vibevoice/jenny_vibevoice_formatted](https://huggingface.co/datasets/vibevoice/jenny_vibevoice_formatted).

#### Option B: JSONL File

Create a `prompts.jsonl` file:

```json
{"text": "Speaker 0: مرحبا، هذا اختبار.", "audio": "audio1.wav"}
{"text": "Speaker 0: هذا مثال آخر.", "audio": "audio2.wav"}
```

Or use a Hugging Face dataset with columns:

- `text`: Transcription with speaker labels
- `audio`: 24kHz audio files
- `voice_prompts`: (Optional) Reference voice clips

### 3. Train

```bash
python -m src.finetune_vibevoice_lora \
  --model_name_or_path aoi-ot/VibeVoice-Large \
  --processor_name_or_path src/vibevoice/processor \
  --train_jsonl prompts.jsonl \
  --text_column_name text \
  --audio_column_name audio \
  --output_dir output_arabic_lora \
  --per_device_train_batch_size 8 \
  --gradient_accumulation_steps 16 \
  --learning_rate 2.5e-5 \
  --num_train_epochs 5 \
  --logging_steps 10 \
  --save_steps 100 \
  --report_to wandb \
  --remove_unused_columns False \
  --bf16 True \
  --do_train \
  --gradient_clipping \
  --gradient_checkpointing False \
  --ddpm_batch_mul 4 \
  --diffusion_loss_weight 1.4 \
  --train_diffusion_head True \
  --ce_loss_weight 0.04 \
  --voice_prompt_drop_rate 0.2 \
  --lora_target_modules q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj \
  --lr_scheduler_type cosine \
  --warmup_ratio 0.03 \
  --max_grad_norm 0.8
```

### 4. Use Your Trained LoRA

```bash
python demo/gradio_demo.py \
  --model_path aoi-ot/VibeVoice-Large \
  --checkpoint_path output_arabic_lora/lora/checkpoint-500 \
  --share
```

## Dataset Format

### JSONL Format

**Single Speaker (auto-generated voice prompt):**

```json
{"text": "Speaker 0: النص العربي هنا.", "audio": "/path/to/audio.wav"}
```

**Single Speaker (custom voice prompt):**

```json
{"text": "Speaker 0: النص العربي هنا.", "audio": "/path/to/audio.wav", "voice_prompts": "/path/to/reference.wav"}
```

**Multi-Speaker:**

```json
{"text": "Speaker 0: كيف حالك؟\nSpeaker 1: أنا بخير، شكراً.", "audio": "/path/to/conversation.wav", "voice_prompts": ["/path/to/speaker0_ref.wav", "/path/to/speaker1_ref.wav"]}
```

## Training Parameters

| Parameter | Description | Recommended |
|-----------|-------------|-------------|
| `--model_name_or_path` | Base model | `aoi-ot/VibeVoice-Large` |
| `--per_device_train_batch_size` | Batch size per GPU | `8` |
| `--gradient_accumulation_steps` | Gradient accumulation steps | `16` |
| `--learning_rate` | Learning rate | `2.5e-5` |
| `--num_train_epochs` | Training epochs | `5-10` |
| `--diffusion_loss_weight` | Diffusion loss weight | `1.4` |
| `--ce_loss_weight` | Cross-entropy loss weight | `0.04` |
| `--voice_prompt_drop_rate` | Voice prompt dropout rate | `0.2` |
| `--lora_r` | LoRA rank | `8` |
| `--lora_alpha` | LoRA alpha | `32` |

## Memory Optimization

### For Limited VRAM (32-40GB)

Halving the batch size while doubling gradient accumulation keeps the effective batch size per GPU at 128 (4 × 32 instead of 8 × 16):

```bash
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 32 \
--gradient_checkpointing True
```

### Use LoRA on Diffusion Head

```bash
# Replace --train_diffusion_head True with:
--lora_wrap_diffusion_head True
```

## Citation

```bibtex
@misc{vibevoice-arabic-lora,
  author = {ABDALLALSWAITI},
  title = {VibeVoice Arabic LoRA},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/ABDALLALSWAITI/vibevoice-arabic-Z}}
}
```

## Acknowledgements

- Thanks to **Juan Pablo Gallego** from VoicePowered AI for the unofficial training code
- Original VibeVoice by Microsoft Research
- Community-maintained VibeVoice code by the VibeVoice community

## License

This model is released under the MIT License. See the [LICENSE](LICENSE) file for details.

---

### 💖 Support This Project

If you enjoy using this project and would like to support continued development, please consider [buying me a coffee](https://paypal.me/abdallalswaiti). Every contribution helps keep this project going and enables new features!
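
## Appendix: Checking a JSONL Dataset

The JSONL layout described under Dataset Format is easy to get subtly wrong (a missing `Speaker N:` label, or a reference clip missing for one speaker), so a quick check before training can save a failed run. The following is a minimal sketch, not part of the VibeVoice tooling: the names `validate_record` and `validate_jsonl` are hypothetical, and the rule that a `voice_prompts` list holds one clip per speaker is an assumption inferred from the multi-speaker example.

```python
import json
import re

# Each line of "text" is expected to start with a label such as "Speaker 0: ".
SPEAKER_LINE = re.compile(r"^Speaker (\d+): \S")


def validate_record(record):
    """Return a list of problems found in one JSONL record (empty if none)."""
    text = record.get("text")
    if not isinstance(text, str) or not text.strip():
        return ["'text' must be a non-empty string"]

    problems = []
    speakers = set()
    for line in text.split("\n"):
        match = SPEAKER_LINE.match(line)
        if match is None:
            problems.append(f"line lacks a 'Speaker N:' label: {line!r}")
        else:
            speakers.add(int(match.group(1)))

    if not isinstance(record.get("audio"), str):
        problems.append("'audio' must be a path string")

    # Assumption: a list of voice prompts carries one reference clip per speaker.
    prompts = record.get("voice_prompts")
    if isinstance(prompts, list) and len(prompts) != len(speakers):
        problems.append(
            f"{len(prompts)} voice prompt(s) for {len(speakers)} speaker(s)"
        )
    return problems


def validate_jsonl(path):
    """Map 1-based line numbers to the problems of every invalid record."""
    report = {}
    with open(path, encoding="utf-8") as handle:
        for number, raw in enumerate(handle, start=1):
            if not raw.strip():
                continue  # skip blank lines
            problems = validate_record(json.loads(raw))
            if problems:
                report[number] = problems
    return report
```

Calling `validate_jsonl("prompts.jsonl")` returns an empty dictionary when every record passes these checks. It does not verify that the audio files exist or that they are sampled at 24kHz.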