---
license: mit
language:
- ar
- en
base_model: aoi-ot/VibeVoice-Large
tags:
- text-to-speech
- tts
- audio
- vibevoice
- lora
- arabic
pipeline_tag: text-to-speech
---

# VibeVoice Arabic LoRA

This is a LoRA (Low-Rank Adaptation) fine-tuned model for Arabic text-to-speech, based on `aoi-ot/VibeVoice-Large`.

## Model Description

- **Base Model**: [aoi-ot/VibeVoice-Large](https://huggingface.co/aoi-ot/VibeVoice-Large)
- **Training Method**: LoRA fine-tuning
- **Language**: Arabic
- **License**: MIT

## Requirements

### Hardware

- **Inference**:
  - VibeVoice-1.5B: 6GB+ VRAM
  - VibeVoice-Large (7B): 16GB+ VRAM
- **LoRA training**:
  - VibeVoice-1.5B: 16GB+ VRAM minimum
  - VibeVoice-Large (7B): 48GB+ VRAM minimum

### Software

```bash
git clone https://github.com/vibevoice-community/VibeVoice.git
cd VibeVoice/
pip install -e .
```

## Usage

### Quick Start with Gradio

```bash
python demo/gradio_demo.py \
    --model_path aoi-ot/VibeVoice-Large \
    --checkpoint_path ABDALLALSWAITI/vibevoice-arabic-Z
    # add --share to expose a public Gradio link
```

### Command Line Inference

```bash
python demo/inference_from_file.py \
    --model_path aoi-ot/VibeVoice-Large \
    --txt_path your_arabic_text.txt \
    --speaker_names Frank \
    --checkpoint_path ABDALLALSWAITI/vibevoice-arabic-Z
```
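
The file passed via `--txt_path` uses the same speaker-labeled lines as the dataset examples below. As a minimal sketch, it could be generated like this (the filename and the Arabic sentences are placeholders):

```python
# Write a sample speaker-labeled transcript for inference (UTF-8).
# Replace the placeholder sentences with your own script.
lines = [
    "Speaker 0: مرحبا بكم في هذا العرض التوضيحي.",
    "Speaker 0: هذا مثال على نص عربي لتحويله إلى كلام.",
]

with open("your_arabic_text.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(lines) + "\n")
```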

### Python API

```python
from vibevoice import VibeVoiceModel

# Load model with Arabic LoRA
model = VibeVoiceModel.from_pretrained(
    "aoi-ot/VibeVoice-Large",
    lora_path="ABDALLALSWAITI/vibevoice-arabic-Z"
)

# Generate speech
text = "Speaker 0: مرحبا، كيف حالك؟"
audio = model.generate(text, speaker_names=["Frank"])
```
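
To keep the result, you can write it to disk. The sketch below continues the snippet above and assumes `audio` is a waveform array at VibeVoice's 24 kHz output rate (check the return type of your installed version); it uses the `soundfile` package:

```python
import soundfile as sf

# Assumption: `audio` is a float waveform array sampled at 24 kHz.
# Adjust if your VibeVoice version returns a different structure.
sf.write("output_arabic.wav", audio, samplerate=24000)
```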

## Training Your Own LoRA

### 1. Installation

```bash
git clone https://github.com/voicepowered-ai/VibeVoice-finetuning
cd VibeVoice-finetuning
pip install -e .
pip uninstall -y transformers && pip install transformers==4.51.3
wandb login  # Optional
```

### 2. Prepare Dataset

### Method 1: Hugging Face Dataset

```python
from datasets import Dataset, Audio

data = {
    "text": [
        "Speaker 0: مرحبا بك.",
        "Speaker 0: كيف يمكنني مساعدتك؟"
    ],
    "audio": [
        "audio1.wav",
        "audio2.wav"
    ]
}

dataset = Dataset.from_dict(data)
dataset = dataset.cast_column("audio", Audio(sampling_rate=24000))
dataset.push_to_hub("your-username/arabic-tts-dataset")
```
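
For more than a handful of clips, the `data` dictionary can be built from a folder of paired files instead of hard-coded lists. A minimal sketch, assuming a hypothetical `my_arabic_clips/` folder where every `clip.wav` has a same-named `clip.txt` containing its speaker-labeled transcript:

```python
from pathlib import Path

from datasets import Dataset, Audio

# Assumption: a flat folder where every .wav has a matching .txt transcript
# already formatted as "Speaker 0: ...".
data_dir = Path("my_arabic_clips")

texts, audio_paths = [], []
for wav_path in sorted(data_dir.glob("*.wav")):
    texts.append(wav_path.with_suffix(".txt").read_text(encoding="utf-8").strip())
    audio_paths.append(str(wav_path))

dataset = Dataset.from_dict({"text": texts, "audio": audio_paths})
dataset = dataset.cast_column("audio", Audio(sampling_rate=24000))
dataset.push_to_hub("your-username/arabic-tts-dataset")
```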

Then train with:

```bash
# Base model: vibevoice/VibeVoice-1.5B (or aoi-ot/VibeVoice-Large)
python -m vibevoice.finetune.train_vibevoice \
    --model_name_or_path vibevoice/VibeVoice-1.5B \
    --dataset_name your-username/arabic-tts-dataset \
    --text_column_name text \
    --audio_column_name audio \
    --voice_prompts_column_name audio \
    --output_dir finetune_vibevoice_zac \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 16 \
    --learning_rate 2.5e-5 \
    --num_train_epochs 1 \
    --logging_steps 10 \
    --save_steps 100 \
    --eval_steps 100 \
    --report_to wandb \
    --remove_unused_columns False \
    --bf16 True \
    --do_train \
    --gradient_clipping \
    --gradient_checkpointing False \
    --ddpm_batch_mul 4 \
    --diffusion_loss_weight 1.4 \
    --train_diffusion_head True \
    --ce_loss_weight 0.04 \
    --voice_prompt_drop_rate 0.2 \
    --lora_target_modules q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj \
    --lr_scheduler_type cosine \
    --warmup_ratio 0.03 \
    --max_grad_norm 0.8
```

### Example Dataset

For a reference of the expected dataset format, see:

https://huggingface.co/datasets/vibevoice/jenny_vibevoice_formatted

### Method 2: Create a `prompts.jsonl` File

```json
{"text": "Speaker 0: مرحبا، هذا اختبار.", "audio": "audio1.wav"}
{"text": "Speaker 0: هذا مثال آخر.", "audio": "audio2.wav"}
```
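
A `prompts.jsonl` like this can also be generated programmatically. A minimal sketch, again assuming the hypothetical paired `.wav`/`.txt` layout described above:

```python
import json
from pathlib import Path

# Assumption: the same paired clip.wav / clip.txt folder layout as above.
data_dir = Path("my_arabic_clips")

with open("prompts.jsonl", "w", encoding="utf-8") as out:
    for wav_path in sorted(data_dir.glob("*.wav")):
        record = {
            "text": wav_path.with_suffix(".txt").read_text(encoding="utf-8").strip(),
            "audio": str(wav_path),
        }
        out.write(json.dumps(record, ensure_ascii=False) + "\n")
```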

Or use a Hugging Face dataset with the columns:

- `text`: Transcription with speaker labels
- `audio`: 24kHz audio files
- `voice_prompts`: (Optional) Reference voice clips

### 3. Train

```bash
python -m src.finetune_vibevoice_lora \
    --model_name_or_path aoi-ot/VibeVoice-Large \
    --processor_name_or_path src/vibevoice/processor \
    --train_jsonl prompts.jsonl \
    --text_column_name text \
    --audio_column_name audio \
    --output_dir output_arabic_lora \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 16 \
    --learning_rate 2.5e-5 \
    --num_train_epochs 5 \
    --logging_steps 10 \
    --save_steps 100 \
    --report_to wandb \
    --remove_unused_columns False \
    --bf16 True \
    --do_train \
    --gradient_clipping \
    --gradient_checkpointing False \
    --ddpm_batch_mul 4 \
    --diffusion_loss_weight 1.4 \
    --train_diffusion_head True \
    --ce_loss_weight 0.04 \
    --voice_prompt_drop_rate 0.2 \
    --lora_target_modules q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj \
    --lr_scheduler_type cosine \
    --warmup_ratio 0.03 \
    --max_grad_norm 0.8
```
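
Checkpoints are saved every `--save_steps` steps under `--output_dir`; as the next step shows, the resulting LoRA weights can then be loaded from a path such as `output_arabic_lora/lora/checkpoint-500`.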

### 4. Use Your Trained LoRA

```bash
python demo/gradio_demo.py \
    --model_path aoi-ot/VibeVoice-Large \
    --checkpoint_path output_arabic_lora/lora/checkpoint-500 \
    --share
```

## Dataset Format

### JSONL Format

**Single Speaker (auto-generated voice prompt):**

```json
{"text": "Speaker 0: النص العربي هنا.", "audio": "/path/to/audio.wav"}
```

**Single Speaker (custom voice prompt):**

```json
{"text": "Speaker 0: النص العربي هنا.", "audio": "/path/to/audio.wav", "voice_prompts": "/path/to/reference.wav"}
```

**Multi-Speaker:**

```json
{"text": "Speaker 0: كيف حالك؟\nSpeaker 1: أنا بخير، شكراً.", "audio": "/path/to/conversation.wav", "voice_prompts": ["/path/to/speaker0_ref.wav", "/path/to/speaker1_ref.wav"]}
```
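
Before launching a long run, it can be worth sanity-checking the JSONL. A minimal sketch using the `soundfile` package (the 24 kHz check matches the audio requirement stated above; `prompts.jsonl` is the file from Method 2):

```python
import json
from pathlib import Path

import soundfile as sf

# Check that every referenced audio file exists and is already at 24 kHz.
with open("prompts.jsonl", encoding="utf-8") as f:
    for line_no, line in enumerate(f, start=1):
        record = json.loads(line)
        prompts = record.get("voice_prompts", [])
        if isinstance(prompts, str):
            prompts = [prompts]
        for path in [record["audio"], *prompts]:
            if not Path(path).exists():
                print(f"line {line_no}: missing file {path}")
            else:
                rate = sf.info(path).samplerate
                if rate != 24000:
                    print(f"line {line_no}: {path} is {rate} Hz, expected 24000")
```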

## Training Parameters

| Parameter | Description | Recommended |
|-----------|-------------|-------------|
| `--model_name_or_path` | Base model | `aoi-ot/VibeVoice-Large` |
| `--per_device_train_batch_size` | Batch size per GPU | `8` |
| `--gradient_accumulation_steps` | Gradient accumulation steps | `16` |
| `--learning_rate` | Learning rate | `2.5e-5` |
| `--num_train_epochs` | Training epochs | `5-10` |
| `--diffusion_loss_weight` | Diffusion loss weight | `1.4` |
| `--ce_loss_weight` | Cross-entropy loss weight | `0.04` |
| `--voice_prompt_drop_rate` | Voice prompt dropout | `0.2` |
| `--lora_r` | LoRA rank | `8` |
| `--lora_alpha` | LoRA alpha | `32` |

## Memory Optimization

### For Limited VRAM (32-40GB)

```bash
# Override these flags in the training command above:
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 32 \
--gradient_checkpointing True
```
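
These values keep the effective batch size unchanged (4 × 32 = 128, the same as the recommended 8 × 16), while gradient checkpointing trades extra compute for lower memory use.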

### Use LoRA on Diffusion Head

```bash
# Replace --train_diffusion_head True with:
--lora_wrap_diffusion_head True
```

## Citation

```bibtex
@misc{vibevoice-arabic-lora,
  author = {ABDALLALSWAITI},
  title = {VibeVoice Arabic LoRA},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/ABDALLALSWAITI/vibevoice-arabic-Z}}
}
```

## Acknowledgements

- Thanks to **Juan Pablo Gallego** from VoicePowered AI for the unofficial training code
- Original VibeVoice by Microsoft Research
- Maintained by the VibeVoice community

## License

This model is released under the MIT License. See the [LICENSE](LICENSE) file for details.

---

### 💖 Support This Project

If you enjoy using this model and would like to support continued development, please consider [buying me a coffee](https://paypal.me/abdallalswaiti). Every contribution helps keep the project going and enables new features!
|