{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": [],
      "gpuType": "T4"
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    },
    "accelerator": "GPU"
  },
  "cells": [
    {
      "cell_type": "markdown",
      "source": [
        "# [VoiceCore](https://huggingface.co/webbigdata/VoiceCore) Demo.\n",
        "\n",
        "webbigdata/VoiceCoreをColab上で無料で動かすサンプルスクリプトです  \n",
        "This is a sample script that runs webbigdata/VoiceCore for free on Colab.  \n",
        "\n",
        "Enter your Japanese text and we'll create voice wave file.  \n",
        "日本語のテキストを入力すると、その文章を音声にしたWAF fileを作成します  \n",
        "\n",
        "\n",
        "## How to run/動かし方\n",
        "\n",
        "If you are on a github page, click the Open in Colab button at the top of the screen to launch Colab.\n",
        "\n",
        "あなたが見ているのがgithubのページである場合、画面上部に表示されているOpen in Colabボタンを押してColabを起動してください\n",
        "\n",
        "![github.png](data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAH8AAAAgCAIAAAC0BzBmAAAAAXNSR0IArs4c6QAAAARnQU1BAACxjwv8YQUAAAAJcEhZcwAADsMAAA7DAcdvqGQAAAdxSURBVGhD7ZZrTFNnGMeZlrsO0U2ny+IWvyyZmszph424LYuXLJsgIiAwKioXUbGKQCkoV93EbG4LTB1KCxLvLnNSULmVeY8i3m0pApkpdMLswYhc9MO75730cE571C1DG0f/efLk5f8+z3ue/s7poS7IKcfJSd+RctJ3pJz0HSknfUfKSd+RkqavVqtjY2MXD7ZGbzwzZuOZoZnfyjun0N5mfK2SoF9UVCSXyyMjIwc9++ac8s09hTONobdeXdbEKBNJ0I+OjgZYz0OjyBBDOb/5zRlGmUiC/lfPTT7Zv/tkn/TJInmorhllIgn6EU9W9urZB3On7M6clpEwl1lEuYpZB3KnwtaWdZ8yS0p4jiEfjDKRBP1wKa1cNv/Gjtf7D7nw0VLksyQyJDNhduPO0UL/+o6xCdEBrE2skZm6kZl1IzNIpmthHnT/ONfVa0mz923yv/HTWh53tdy293HO0JV0Iold8TmMMpEE/TA7RYUHX/h+ghAxBND/IdnPxqRxMn3isogg1iwQnkMc8GHYVREyNNju/tdQ7fFavGeEjfmkOGbpYoOAHtcdsysgkWbs6zIabEw+SszoKbs02BWIJOgvstOJ+Emt63yaC31Mxd6FadO3Jvk1/DQuNX6OebcXxd1W4n0g573a796+v092MXPCn2u9KuMnsWaBvDfUjthQy2eMvueOMnSTB44regQPTpNNzQvKBL1eRyfZ5FFyp6ujXbIS6HNGvb1Pc7EZPWWXZkaZSIJ+aGgokOLzygh/oAlhXOMbLQ/k/eyEzyj6pl2j5BHB1I+RB9xW+NL62PAA4TmQvdfXeG+oseY2A+rT7Sm1Ooc9dZ2o5y/V+hoV3JWObgMbp7vY2lXcySzU0WY94bGhg3178J0TnV/jDUx7LSpYVwDcbkMHLYQvmaCG5OIOxN2o8ky0OspSj5jD7ATW1F1MKlVAv1GPay51sx320JBzgH7PI+bCpcVXoZntEknQDxFra9SHlObemCnMItqdMZXSL0iZwSyi0uiptD5H/gmzrPJMr/ZKr2a5vIND9zQpAudXE4e6NOnVqY19CFk0IbluIbkaE2YNu6nNj5H5MjjuIblqMzJcgq5W+Lpw+jpw3EpaOfQAegdOg1xm5h6aU2ENC3i0T+NeN3yPLWAKKo26h0h/tlzUizPcXfhC0C4L0IQumI1rvIV312zHPp7nsh49qivHXRozQm3XiF8FZ3Y1G+3OrGaUiSToBwcHAyk+b186ndIsXDpN6O/Lmkzpb1N+IPR3LptG67MjMX3eh+yZVuWZXsXyUSBi0Ug5mL7plgfxPU53ood3U9Or8AcTiGsyeqa16FGv7pAW9yY36W1Og8zo05M7NQriwz0GU1R5C9M/L+6FXH6X625V0i7FLT3q15Xj2YA+qTHqetkwCL7EZbgLhuT0V8muVnmzF5nvDJxmzayDSIL+QrFyImdSmg2rxjGLaLPiI0r/6rbXmEV0YeV4Wr8k1J9ZVnmoKgciCXj16bQC5wK8HkxKVaUS6Le3MPO8BT00g6lpR9z1SreFOSyWl3mogH6f7ihth7VFQ1v4AOikFy/4Xd4ciDJNG7yvGwUOCVEluxbMxjXeBEcD77G2BjJMAz8GHpLsQog+hSAYZaJn0w8PDmxcPRpo1ue8YfjZd7tyWn7y9PM/jl+9dG5nqTvQP5s0oSZ+4tdyv8Kl719aNY6ir1z+DusXyF11wl1Vac0HUq73oZ578PGIY8KvkZsXYU3nJjWNuh5qVqbo4Ul7oKG92nu6etilROhpmL5adP4Jd4yvHZ9/tB3os11YY6aiSrf98OJCenwmcbT3uA6Te5IRRtLXE6f+Pur+IyUJzwZ8wVEDaP0VXK+9yxH6sMYm2aUfR3/qAD1fmBllIgn6C+ykWDTrzEbbX5z1+WOzVvidS2dPujBMa0bEhHzOmgVyUx4XRVxe8s1+dlWYtTZLFvcb+CkGeO9b1dZATShWw/8Apv7aX+CEZqBfe4SeBmuLmj+ZxhEA3Z5CF/wubwojUSPTtMANsKpLHbXfTbnfdcC0qMM0UAmzcYYbsHDda93q6Qf6dAygz4u7fsJ1lfgqJNg20T+iD1qxeM61bWOE9E3FXvJFX6ZHfNyi8BGihz83RMxkbWK5Ko/B5SG78jkub3hQFoQMYhXzMX1TA/Oj9g/UR2GHRJ5rIjj7ZEF5skS6C+sCqBGdD0zD1HidqKa7bB2m4WsG6nENf37+wAliBx4CWdwRUX1YHoxB5hFOiJ8k0fnWzCgTSdCfP39+YGCgZE6Lm7kn492itMmpMTMXWP3wBV98K5+xa8mUoiVTtkTOCA+aJ9kLWZZcIUupeGZOBvrtt59e8/JmRplImv5zEp4guWK4Xbb1YzYPW7xXwn9S/UvlM8pEEvThp2FAQIC/v/+g52HrtMOStDjToGth/r/7r6qe9d4vKCgAWM9Dr5AhhnKOOXiNUSaSoA/Kz8+HH4jzBlsuiUddEstc1pJM18L8v/ZHpFZEH7zK+FolTd+pFyMnfUfKSd+RctJ3pJz0HSknfccJob8Br6uwCghi0o8AAAAASUVORK5CYII=)\n",
        "\n",
        "Next, run each cell one by one (i.e. click the \"\" in order as shown in the image below).  \n",
        "次に、セルを1つずつ実行(つまり、以下の画像のような「▷」を順番にクリック)してください  \n",
        "\n",
        "![cell.png](data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAANAAAAAtCAIAAABQynLkAAAAAXNSR0IArs4c6QAAAARnQU1BAACxjwv8YQUAAAAJcEhZcwAADsMAAA7DAcdvqGQAAAUqSURBVHhe7Zy9axRBFMD92yTkQCRycKSwCZLTqwXFFBJyILGRVCE2lyLXCCFYibkUEfwDLiJJkeqKWF2EQAipUsT3Zt7sztd+zOzeGuP7sZy7s/O18377dk9NHtwyTIOwcEyjsHBMo7BwTKOwcEyjsHBMo0QKd3Nzc319fXl5ecEwIcQIB6pRa4YJJFi4q6srasow4YQJx7mNqUiAcPDeRo0YJpYA4Ti9MdUJEI6/kzLVCRCOWjBMBWYi3Pn5+e7u7urqak8AO3AIhXSa+Y+pXzhwq91uP3SAQjhFlf4Ox4Pu/NoBHfxzHA+eLQ2O5ScVEaO1Vn9f7PnOOhz057rbRZVmRc3CbW5ukl8ZQAWqWsDkcOv9px90UBMiMCfbS635Od8W6yJ0+2xwQgclKTIjdUgDbhgYSH5SkaBAuP2+faXupq4durJPGVt/JOtFU6dwkMBIq1zK5bm4bDQ93XvbeZTREFSDO1t+UlEKRMVphXPQlztDkbLCgQ1JwIKFc9KSqVEqnFaYMeH7keHgFc16ko5Go+XlZTrQgGol3ufChZse7bxalAvtbUgx1oTTo24JJyNn1Rcxdt1yhdPyhJYSsM9o4eTo/gSTVvZ3m53Uaauct8pTm3BuepPlw+FwYWGBihQlkpwj3GTyi/YygBu383Lny0dYXJ9w6kkkBNo3MwGsuC6csE3F28qIGDxTBU84RQWnZgnh7JxKG03ANVtSJJzF/chw8FWUbFLQiYuLs7OzjY0NKhVAZTqXiSnc9HC9M995N8pz7uRoPKXwe4STWkC8YSc/w1miQHjMBAA1sZXskDbXA5y/FfsSwilMJ2hEswdMeOkEUuG0Qq+dOcJpiTl7y5t2CWoTrtfrkU0KOqEYj8crKyvyFFSm0kycDPdz+PxxkXNAhnAYs66IVpFwSXQlHjMsBXFEO7QYeCeoqEg54bC5PkQ6ojfJ5We4UhoZwyVY914d3GnhXu9NDL59eNoqzHM+4YRk+BiFZZXCiWpquTGESjgYV4+o4z1QKBxU8MVJVzlfuMR+QhvRPiVmmFxFfrdUubs9MG8qG7wiMf9kpz7u9CNVCWFti+vf4dmZgTDJUgTCACWUNqCCFI4yEBmmAplmC8S34tRPAtbRhfMmIcBw1zGDpiFwejjo236Iy5QLsjYwcra2UHreojynJpBeJjYxkjHOk4ZLLj/jForgzn5p8HD6+c2TVnvt64SOvYhIWMJJygmnB97ISQrDGwRH9BpmYsbMEg4GSqNuZVCnsgVdl9q3a+K4UjXzXlLlYk2OkzY4k2T0RDg5iuFlJLUJV/dfi9iUsg0oL5xcbtx04bT4+W5rwwwiK6XpYILRVdDMQIO1Ph2hPSU6xuge4VJM4URl6wKxRL86TTgAr91ZkFBqEw5wk5yXmPQ2Pdp6sVhsG1BSOCpLSYSTARYimosrU4KnLdbPyUDUYZI2BNIMEULjFE0SSOZpRt0kx2MXEo4GlaskLoqauCvjDG0bGU6dwgH1/dOWw2/6M5pUOLHc7paVRbAhVsi5uTHweldqw9yTOqTh79MMsKqTNTEc1DoFTXzCGdNze/Ok7eSBW00vl5qFAyCBWc9WCRTGvbrVBQXevY8FWoZjZkj9wgHwigZuwVfRngB24DDivY25f8xEOIbJIkA4/i/mTHUChOMfomGqEyAc/5ggU50A4QBOckxFwoQD+Fc9MFUIFg7gPMdEEyMcwL+ui4kjUjiGiYOFYxqFhWMahYVjGoWFYxqFhWMahYVjGuT29g//C8PBV5aV1wAAAABJRU5ErkJggg==)\n"
      ],
      "metadata": {
        "id": "k-Rs1yFEdLdo"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## 1. Install Required Libraries"
      ],
      "metadata": {
        "id": "UbdUkAusy1_N"
      }
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": true,
        "cellView": "form",
        "id": "lyrygqjF6-09"
      },
      "outputs": [],
      "source": [
        "%%capture\n",
        "%%shell\n",
        "#@title Install Required Libraries\n",
        "\n",
        "pip install snac transformers scipy"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## 2. Setting Up\n",
        "\n",
        "2つのモデルをダウンロードするためやや時間がかかります  \n",
        "This will take some time as two models will be downloaded.  "
      ],
      "metadata": {
        "id": "3w85X9ciyzlz"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "%%capture\n",
        "#@title (1)Dependent Libraries and Utility Functions/依存ライブラリとユーティリティ関数\n",
        "# ======== セル1: 依存ライブラリとユーティリティ関数 ========\n",
        "\n",
        "import torch\n",
        "from transformers import AutoTokenizer, AutoModelForCausalLM\n",
        "\n",
        "model_name = \"webbigdata/VoiceCore\"\n",
        "\n",
        "# bfloat16が利用可能かチェックして適切なデータ型を選択\n",
        "if torch.cuda.is_available() and torch.cuda.is_bf16_supported():\n",
        "    dtype = torch.bfloat16\n",
        "else:\n",
        "    dtype = torch.float16\n",
        "\n",
        "model = AutoModelForCausalLM.from_pretrained(\n",
        "  model_name,\n",
        "  torch_dtype=dtype,\n",
        "  device_map=\"auto\",\n",
        "  use_cache=True,\n",
        ")\n",
        "tokenizer = AutoTokenizer.from_pretrained(model_name)\n",
        "\n",
        "import locale\n",
        "import torchaudio.transforms as T\n",
        "import os\n",
        "import torch\n",
        "from snac import SNAC\n",
        "locale.getpreferredencoding = lambda: \"UTF-8\"\n",
        "\n",
        "snac_model = SNAC.from_pretrained(\"hubertsiuzdak/snac_24khz\")\n",
        "snac_model.to(\"cpu\")\n"
      ],
      "metadata": {
        "id": "al8F1n-Fmpq7",
        "cellView": "form"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## 3. Run VoiceCore"
      ],
      "metadata": {
        "id": "Fh8DAKfM3xE0"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "  各声の用途制限、連絡・クレジット表記義務については[webbigdata/VoiceCore](https://huggingface.co/webbigdata/VoiceCore)を参照してください。現Versionでは女性の声はプレビュー版の位置づけです。高音域でノイズが乗ってしまう傾向があります。  \n",
        "  Please refer to [webbigdata/VoiceCore](https://huggingface.co/webbigdata/VoiceCore) for usage restrictions and contact/credit obligations for each voice. In the current version, the female voice is a preview version. There is a tendency for noise to be added in the high range."
      ],
      "metadata": {
        "id": "g-CC4lcWMW5w"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "#@title (1)声の選択とテキストの入力/Voice select and text input\n",
        "# 音声選択\n",
        "voice_type = 'matsukaze_male (さわやかな男性) (c)松風' #@param [\"amitaro_female (明るい女の子 (c)あみたろの声素材工房)\", \"matsukaze_male (さわやかな男性) (c)松風\", \"naraku_female (落ち着いた女性) (c)極楽唯\", \"shiguu_male (大人びた少年) (c)刻鳴時雨(CV:丸ころ)\", \"sayoko_female (一般81歳女性) (c)Fusic サヨ子音声コーパス\", \"dahara1_male (一般男性)\"]\n",
        "\n",
        "# 発声テキスト入力\n",
        "speech_text = \"こんにちは、今日もよろしくお願いします。\" #@param {type:\"string\"}"
      ],
      "metadata": {
        "cellView": "form",
        "id": "LfYTVtZr2trR"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "#@title (2)声の生成 / Generate voice\n",
        "# voice_typeから実際の音声名を抽出\n",
        "chosen_voice = voice_type.split(' (')[0] + \"[neutral]\"\n",
        "prompts = [speech_text]\n",
        "\n",
        "print(f\"選択された音声: {chosen_voice}\")\n",
        "print(f\"テキスト: {speech_text}\")\n",
        "\n",
        "# 音声生成処理\n",
        "prompts_ = [(f\"{chosen_voice}: \" + p) if chosen_voice else p for p in prompts]\n",
        "all_input_ids = []\n",
        "for prompt in prompts_:\n",
        "  input_ids = tokenizer(prompt, return_tensors=\"pt\").input_ids\n",
        "  all_input_ids.append(input_ids)\n",
        "\n",
        "start_token = torch.tensor([[ 128259]], dtype=torch.int64) # Start of human\n",
        "end_tokens = torch.tensor([[128009, 128260, 128261]], dtype=torch.int64) # End of text, End of human\n",
        "\n",
        "all_modified_input_ids = []\n",
        "for input_ids in all_input_ids:\n",
        "  modified_input_ids = torch.cat([start_token, input_ids, end_tokens], dim=1) # SOH SOT Text EOT EOH\n",
        "  all_modified_input_ids.append(modified_input_ids)\n",
        "\n",
        "all_padded_tensors = []\n",
        "all_attention_masks = []\n",
        "max_length = max([modified_input_ids.shape[1] for modified_input_ids in all_modified_input_ids])\n",
        "\n",
        "for modified_input_ids in all_modified_input_ids:\n",
        "  padding = max_length - modified_input_ids.shape[1]\n",
        "  padded_tensor = torch.cat([torch.full((1, padding), 128263, dtype=torch.int64), modified_input_ids], dim=1)\n",
        "  attention_mask = torch.cat([torch.zeros((1, padding), dtype=torch.int64), torch.ones((1, modified_input_ids.shape[1]), dtype=torch.int64)], dim=1)\n",
        "  all_padded_tensors.append(padded_tensor)\n",
        "  all_attention_masks.append(attention_mask)\n",
        "\n",
        "all_padded_tensors = torch.cat(all_padded_tensors, dim=0)\n",
        "all_attention_masks = torch.cat(all_attention_masks, dim=0)\n",
        "\n",
        "input_ids = all_padded_tensors.to(\"cuda\")\n",
        "attention_mask = all_attention_masks.to(\"cuda\")\n",
        "\n",
        "generated_ids = model.generate(\n",
        "      input_ids=input_ids,\n",
        "      attention_mask=attention_mask,\n",
        "      max_new_tokens=8196,\n",
        "      do_sample=True,\n",
        "      temperature=0.6,\n",
        "      top_p=0.90,\n",
        "      repetition_penalty=1.1,\n",
        "      eos_token_id=128258,\n",
        "      use_cache=True\n",
        "  )\n",
        "\n",
        "token_to_find = 128257\n",
        "token_to_remove = 128258\n",
        "#print(generated_ids)\n",
        "\n",
        "token_indices = (generated_ids == token_to_find).nonzero(as_tuple=True)\n",
        "if len(token_indices[1]) > 0:\n",
        "    last_occurrence_idx = token_indices[1][-1].item()\n",
        "    cropped_tensor = generated_ids[:, last_occurrence_idx+1:]\n",
        "else:\n",
        "    cropped_tensor = generated_ids\n",
        "\n",
        "mask = cropped_tensor != token_to_remove\n",
        "processed_rows = []\n",
        "for row in cropped_tensor:\n",
        "    masked_row = row[row != token_to_remove]\n",
        "    processed_rows.append(masked_row)\n",
        "\n",
        "code_lists = []\n",
        "for row in processed_rows:\n",
        "    row_length = row.size(0)\n",
        "    new_length = (row_length // 7) * 7\n",
        "    trimmed_row = row[:new_length]\n",
        "    trimmed_row = [t - 128266 for t in trimmed_row]\n",
        "    code_lists.append(trimmed_row)\n",
        "\n",
        "def redistribute_codes(code_list):\n",
        "  layer_1 = []\n",
        "  layer_2 = []\n",
        "  layer_3 = []\n",
        "  for i in range((len(code_list)+6)//7):\n",
        "    layer_1.append(code_list[7*i])\n",
        "    layer_2.append(code_list[7*i+1]-4096)\n",
        "    layer_3.append(code_list[7*i+2]-(2*4096))\n",
        "    layer_3.append(code_list[7*i+3]-(3*4096))\n",
        "    layer_2.append(code_list[7*i+4]-(4*4096))\n",
        "    layer_3.append(code_list[7*i+5]-(5*4096))\n",
        "    layer_3.append(code_list[7*i+6]-(6*4096))\n",
        "  codes = [torch.tensor(layer_1).unsqueeze(0),\n",
        "         torch.tensor(layer_2).unsqueeze(0),\n",
        "         torch.tensor(layer_3).unsqueeze(0)]\n",
        "  audio_hat = snac_model.decode(codes)\n",
        "  return audio_hat\n",
        "\n",
        "my_samples = []\n",
        "for code_list in code_lists:\n",
        "  samples = redistribute_codes(code_list)\n",
        "  my_samples.append(samples)\n",
        "\n",
        "# 音声ファイル保存と再生\n",
        "import scipy.io.wavfile as wavfile\n",
        "from IPython.display import Audio, display\n",
        "import numpy as np\n",
        "\n",
        "if len(prompts) != len(my_samples):\n",
        "  raise Exception(\"Number of prompts and samples do not match\")\n",
        "else:\n",
        "  for i in range(len(my_samples)):\n",
        "    print(f\"プロンプト: {prompts[i]}\")\n",
        "    samples = my_samples[i]\n",
        "    sample_np = samples.detach().squeeze().to(\"cpu\").numpy()\n",
        "\n",
        "    # ファイル名を設定\n",
        "    filename = f\"audio_{i}_{prompts[i][:20].replace(' ', '_').replace('/', '_')}.wav\"\n",
        "\n",
        "    # WAVファイルとして保存(サンプリングレート: 24000Hz)\n",
        "    wavfile.write(filename, 24000, sample_np)\n",
        "\n",
        "    # Colab上で再生\n",
        "    print(f\"生成された音声ファイル: {filename}\")\n",
        "    display(Audio(sample_np, rate=24000))"
      ],
      "metadata": {
        "cellView": "form",
        "id": "NocLpdwcYyJa"
      },
      "execution_count": null,
      "outputs": []
    },
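    {
      "cell_type": "markdown",
      "source": [
        "(Optional) The cell above saves the generated audio as `audio_*.wav` files in the Colab working directory. The cell below is a minimal sketch that downloads them to your browser using Colab's standard `google.colab.files` helper."
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "#@title (3)(Optional) Download generated WAV files / 生成したWAVファイルのダウンロード\n",
        "# A minimal sketch: collect the audio_*.wav files written by the previous cell\n",
        "# and offer each one as a browser download via Colab's files.download helper.\n",
        "import glob\n",
        "from google.colab import files\n",
        "\n",
        "for wav_path in sorted(glob.glob(\"audio_*.wav\")):\n",
        "    print(f\"Downloading {wav_path}\")\n",
        "    files.download(wav_path)"
      ],
      "metadata": {
        "cellView": "form"
      },
      "execution_count": null,
      "outputs": []
    },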
    {
      "cell_type": "markdown",
      "source": [
        "## 謝辞 / Acknowledgment\n",
        "全ての合成音声の研究者/愛好家/声データ提供者の皆様。彼らの研究成果/データ/熱意がなけなければ、このモデルは完成できなかったでしょう。直接使用しなかったデータ/知識などにも大いに影響/励ましを受けました。  \n",
        "To all researchers and enthusiasts of synthetic speech, Voice data provider. Without their research results, data, and enthusiasm, this model would not have been completed. I was also greatly influenced and encouraged by data and knowledge that I did not directly use.  \n",
        "\n",
        "- [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct)\n",
        "- [canopylabs/orpheus-tts](https://huggingface.co/collections/canopylabs/orpheus-tts-67d9ea3f6c05a941c06ad9d2)\n",
        "- [hubertsiuzdak/snac_24khz](https://huggingface.co/hubertsiuzdak/snac_24khz)\n",
        "- [Unsloth](https://unsloth.ai/) for Traing script.\n",
        "- [Huggingface](https://huggingface.co/) for storage."
      ],
      "metadata": {
        "id": "G19mXDdBLeon"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Developer/開発\n",
        "\n",
        "- **Developed by:** dahara1@webbigdata\n",
        "- **Model type:** text audio generation\n",
        "- **Language(s) (NLP):** Japanese\n",
        "- **model :** [webbigdata/VoiceCore](https://huggingface.co/webbigdata/VoiceCore)"
      ],
      "metadata": {
        "id": "0kZ8Jo4s6S01"
      }
    }
  ]
}