---
license: mit
language:
- ar
- en
base_model: aoi-ot/VibeVoice-Large
tags:
- text-to-speech
- tts
- audio
- vibevoice
- lora
- arabic
pipeline_tag: text-to-speech
---

# VibeVoice Arabic LoRA

This is a LoRA (Low-Rank Adaptation) adapter for Arabic text-to-speech, fine-tuned on top of `aoi-ot/VibeVoice-Large`.

## Model Description

- **Base Model**: [aoi-ot/VibeVoice-Large](https://huggingface.co/aoi-ot/VibeVoice-Large)
- **Training Method**: LoRA fine-tuning
- **Language**: Arabic 
- **License**: MIT

## Requirements

### Hardware
- **Inference**:
  - VibeVoice-1.5B: 6GB+ VRAM
  - VibeVoice-Large (7B): 16GB+ VRAM
- **Training** (LoRA):
  - VibeVoice-1.5B: 16GB+ VRAM minimum
  - VibeVoice-Large (7B): 48GB+ VRAM minimum



### Software
```bash
git clone https://github.com/vibevoice-community/VibeVoice.git
cd VibeVoice/
pip install -e .
```

## Usage

### Quick Start with Gradio

```bash
# Append --share to expose a public Gradio link
python demo/gradio_demo.py \
  --model_path aoi-ot/VibeVoice-Large \
  --checkpoint_path ABDALLALSWAITI/vibevoice-arabic-Z
```

### Command Line Inference

```bash
python demo/inference_from_file.py \
  --model_path aoi-ot/VibeVoice-Large \
  --txt_path your_arabic_text.txt \
  --speaker_names Frank \
  --checkpoint_path ABDALLALSWAITI/vibevoice-arabic-Z
```

### Python API

```python
from vibevoice import VibeVoiceModel

# Load model with Arabic LoRA
model = VibeVoiceModel.from_pretrained(
    "aoi-ot/VibeVoice-Large",
    lora_path="ABDALLALSWAITI/vibevoice-arabic-Z"
)

# Generate speech
text = "Speaker 0: مرحبا، كيف حالك؟"
audio = model.generate(text, speaker_names=["Frank"])
```

## Training Your Own LoRA

### 1. Installation

```bash
git clone https://github.com/voicepowered-ai/VibeVoice-finetuning
cd VibeVoice-finetuning
pip install -e .
pip uninstall -y transformers && pip install transformers==4.51.3
wandb login  # Optional
```

### 2. Prepare Dataset

### First method: Hugging Face dataset

```python
from datasets import Dataset, Audio

data = {
    "text": [
        "Speaker 0: مرحبا بك.",
        "Speaker 0: كيف يمكنني مساعدتك؟"
    ],
    "audio": [
        "audio1.wav",
        "audio2.wav"
    ]
}

dataset = Dataset.from_dict(data)
dataset = dataset.cast_column("audio", Audio(sampling_rate=24000))
dataset.push_to_hub("your-username/arabic-tts-dataset")
```
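Casting the column to `Audio(sampling_rate=24000)` resamples clips when they are decoded, but it can still be useful to know which source files are not natively 24 kHz. The following is a minimal sketch using only the standard library; the helper names are hypothetical, and the `wave` module only reads PCM-encoded WAV files.

```python
import wave

TARGET_RATE = 24000  # sampling rate used in the dataset example above


def wav_sample_rate(path):
    """Return the sampling rate (Hz) stored in a PCM WAV file header."""
    with wave.open(str(path), "rb") as wav:
        return wav.getframerate()


def files_needing_resample(paths, target_rate=TARGET_RATE):
    """List the files whose native sampling rate differs from target_rate."""
    return [str(p) for p in paths if wav_sample_rate(p) != target_rate]
```

Calling `files_needing_resample(["audio1.wav", "audio2.wav"])` returns the subset of paths that are not already at 24 kHz.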

Then train with:
```bash
# Base model: vibevoice/VibeVoice-1.5B or aoi-ot/VibeVoice-Large
python -m vibevoice.finetune.train_vibevoice \
    --model_name_or_path vibevoice/VibeVoice-1.5B \
    --dataset_name your-username/arabic-tts-dataset \
    --text_column_name text \
    --audio_column_name audio \
    --voice_prompts_column_name audio \
    --output_dir finetune_vibevoice_zac \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 16 \
    --learning_rate 2.5e-5 \
    --num_train_epochs 1 \
    --logging_steps 10 \
    --save_steps 100 \
    --eval_steps 100 \
    --report_to wandb \
    --remove_unused_columns False \
    --bf16 True \
    --do_train \
    --gradient_clipping \
    --gradient_checkpointing False \
    --ddpm_batch_mul 4 \
    --diffusion_loss_weight 1.4 \
    --train_diffusion_head True \
    --ce_loss_weight 0.04 \
    --voice_prompt_drop_rate 0.2 \
    --lora_target_modules q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj \
    --lr_scheduler_type cosine \
    --warmup_ratio 0.03 \
    --max_grad_norm 0.8
```

### Example dataset

For a dataset that is already in this format, see https://huggingface.co/datasets/vibevoice/jenny_vibevoice_formatted

### Second method: create a `prompts.jsonl` file

```json
{"text": "Speaker 0: مرحبا، هذا اختبار.", "audio": "audio1.wav"}
{"text": "Speaker 0: هذا مثال آخر.", "audio": "audio2.wav"}
```

Or use a Hugging Face dataset with columns:
- `text`: Transcription with speaker labels
- `audio`: 24kHz audio files
- `voice_prompts`: (Optional) Reference voice clips
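If the transcripts and audio paths already live in Python, a `prompts.jsonl` file in the shape shown above can be generated rather than written by hand. This is a minimal sketch; `write_prompts_jsonl` is a hypothetical helper, not part of the training repository.

```python
import json
from pathlib import Path


def write_prompts_jsonl(pairs, out_path):
    """Write (text, audio_path) pairs as one JSON object per line."""
    out_path = Path(out_path)
    with out_path.open("w", encoding="utf-8") as f:
        for text, audio in pairs:
            record = {"text": text, "audio": str(audio)}
            # ensure_ascii=False keeps the Arabic text readable in the file
            # instead of escaping it as \uXXXX sequences.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return out_path
```

For example, `write_prompts_jsonl([("Speaker 0: مرحبا، هذا اختبار.", "audio1.wav")], "prompts.jsonl")` produces a one-line file matching the first record above.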

### 3. Train

```bash
python -m src.finetune_vibevoice_lora \
  --model_name_or_path aoi-ot/VibeVoice-Large \
  --processor_name_or_path src/vibevoice/processor \
  --train_jsonl prompts.jsonl \
  --text_column_name text \
  --audio_column_name audio \
  --output_dir output_arabic_lora \
  --per_device_train_batch_size 8 \
  --gradient_accumulation_steps 16 \
  --learning_rate 2.5e-5 \
  --num_train_epochs 5 \
  --logging_steps 10 \
  --save_steps 100 \
  --report_to wandb \
  --remove_unused_columns False \
  --bf16 True \
  --do_train \
  --gradient_clipping \
  --gradient_checkpointing False \
  --ddpm_batch_mul 4 \
  --diffusion_loss_weight 1.4 \
  --train_diffusion_head True \
  --ce_loss_weight 0.04 \
  --voice_prompt_drop_rate 0.2 \
  --lora_target_modules q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj \
  --lr_scheduler_type cosine \
  --warmup_ratio 0.03 \
  --max_grad_norm 0.8
```

### 4. Use Your Trained LoRA

```bash
python demo/gradio_demo.py \
  --model_path aoi-ot/VibeVoice-Large \
  --checkpoint_path output_arabic_lora/lora/checkpoint-500 \
  --share
```
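With `--save_steps 100`, several checkpoints accumulate during a run. The sketch below picks the highest-numbered one to pass as `--checkpoint_path`. It assumes the `output_dir/lora/checkpoint-N` layout implied by the command above; `latest_checkpoint` is a hypothetical helper.

```python
import re
from pathlib import Path


def latest_checkpoint(lora_dir):
    """Return the checkpoint-N subdirectory with the largest N, or None."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    candidates = []
    for child in Path(lora_dir).iterdir():
        match = pattern.match(child.name)
        if child.is_dir() and match:
            # Compare step numbers as integers so that 1000 sorts after 500.
            candidates.append((int(match.group(1)), child))
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]
```

For example, `latest_checkpoint("output_arabic_lora/lora")` returns the path of the most recent checkpoint directory, if any exist.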

## Dataset Format

### JSONL Format

**Single Speaker (auto-generated voice prompt):**
```json
{"text": "Speaker 0: النص العربي هنا.", "audio": "/path/to/audio.wav"}
```

**Single Speaker (custom voice prompt):**
```json
{"text": "Speaker 0: النص العربي هنا.", "audio": "/path/to/audio.wav", "voice_prompts": "/path/to/reference.wav"}
```

**Multi-Speaker:**
```json
{"text": "Speaker 0: كيف حالك؟\nSpeaker 1: أنا بخير، شكراً.", "audio": "/path/to/conversation.wav", "voice_prompts": ["/path/to/speaker0_ref.wav", "/path/to/speaker1_ref.wav"]}
```
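Before starting a long training run, it can help to sanity-check each JSONL line against the formats above. The following is a sketch only: `check_record` is a hypothetical helper, and the rule that the number of `voice_prompts` equals the number of distinct speakers is an assumption inferred from the examples, not a documented requirement of the training code.

```python
import json
import re

# Each turn is expected to look like "Speaker <N>: <text>".
SPEAKER_LINE = re.compile(r"^Speaker (\d+): \S")


def check_record(line):
    """Return a list of problems found in one JSONL line (empty if none)."""
    problems = []
    record = json.loads(line)
    for key in ("text", "audio"):
        if not isinstance(record.get(key), str) or not record[key]:
            problems.append(f"missing or empty '{key}'")
    text = record.get("text")
    speakers = set()
    if isinstance(text, str) and text:
        for turn in text.split("\n"):
            match = SPEAKER_LINE.match(turn)
            if match:
                speakers.add(int(match.group(1)))
            else:
                problems.append(f"turn without 'Speaker N:' label: {turn!r}")
    prompts = record.get("voice_prompts")
    if prompts is not None:
        # A single reference clip may be given as a bare string.
        if isinstance(prompts, str):
            prompts = [prompts]
        # Assumption: one reference clip per distinct speaker.
        if len(prompts) != len(speakers):
            problems.append(
                f"{len(prompts)} voice prompt(s) for {len(speakers)} speaker(s)"
            )
    return problems
```

Running `check_record` over every line of `prompts.jsonl` and printing any non-empty result is usually enough to catch missing labels or mismatched reference clips early.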


## Training Parameters

| Parameter | Description | Recommended |
|-----------|-------------|-------------|
| `--model_name_or_path` | Base model | `aoi-ot/VibeVoice-Large` |
| `--per_device_train_batch_size` | Batch size per GPU | `8` |
| `--gradient_accumulation_steps` | Gradient accumulation | `16` |
| `--learning_rate` | Learning rate | `2.5e-5` |
| `--num_train_epochs` | Training epochs | `5-10` |
| `--diffusion_loss_weight` | Diffusion loss weight | `1.4` |
| `--ce_loss_weight` | Cross-entropy loss | `0.04` |
| `--voice_prompt_drop_rate` | Voice prompt dropout | `0.2` |
| `--lora_r` | LoRA rank | `8` |
| `--lora_alpha` | LoRA alpha | `32` |
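To see what the recommended `--lora_r` and `--lora_alpha` imply, the sketch below works through the arithmetic. It assumes the training code follows the standard LoRA convention, where the low-rank update is scaled by `alpha / r`; the 4096 x 4096 layer size is a hypothetical example, not a figure taken from the VibeVoice architecture.

```python
def lora_scaling(lora_r, lora_alpha):
    """Scale applied to the low-rank update B @ A in standard LoRA."""
    return lora_alpha / lora_r


def lora_param_count(d_in, d_out, lora_r):
    """Trainable parameters LoRA adds to one d_in -> d_out linear layer."""
    # A has shape (r, d_in) and B has shape (d_out, r).
    return lora_r * (d_in + d_out)


scaling = lora_scaling(lora_r=8, lora_alpha=32)             # 4.0
added = lora_param_count(d_in=4096, d_out=4096, lora_r=8)   # 65536
full = 4096 * 4096                                          # 16777216
print(scaling, added, round(100 * added / full, 2))         # 4.0 65536 0.39
```

Under these assumptions, each adapted layer trains well under 1% of the parameters of the full weight matrix.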

## Memory Optimization

### For Limited VRAM (32-40GB)

```bash
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 32 \
--gradient_checkpointing True
```
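Halving the per-device batch size while doubling gradient accumulation keeps the number of samples per optimizer step unchanged, so the learning rate does not need to be retuned for that reason. A quick check of the arithmetic (the helper name is illustrative):

```python
def effective_batch_size(per_device_batch, grad_accum_steps, num_gpus=1):
    """Samples contributing to each optimizer step."""
    return per_device_batch * grad_accum_steps * num_gpus


recommended = effective_batch_size(8, 16)   # settings from the table: 128
limited = effective_batch_size(4, 32)       # limited-VRAM settings: 128
print(recommended, limited)                 # 128 128
```

Enabling `--gradient_checkpointing True` trades extra compute for lower activation memory and does not change this number.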

### Use LoRA on Diffusion Head

```bash
# Replace --train_diffusion_head True with:
--lora_wrap_diffusion_head True
```


## Citation

```bibtex
@misc{vibevoice-arabic-lora,
  author = {ABDALLALSWAITI},
  title = {VibeVoice Arabic LoRA},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/ABDALLALSWAITI/vibevoice-arabic-Z}}
}
```

## Acknowledgements

- Thanks to **Juan Pablo Gallego** from VoicePowered AI for the unofficial training code
- Original VibeVoice by Microsoft Research
- Community codebase maintained by the VibeVoice community

## License

This model is released under the MIT License. See the [LICENSE](LICENSE) file for details.

---


### 💖 Support This Project
If you enjoy using this model and would like to support continued development, please consider [buying me a coffee](https://paypal.me/abdallalswaiti). Every contribution helps keep this project going and enables new features!