{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": [],
"gpuType": "T4"
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
},
"accelerator": "GPU"
},
"cells": [
{
"cell_type": "markdown",
"source": [
"# [VoiceCore](https://huggingface.co/webbigdata/VoiceCore) Demo.\n",
"\n",
"webbigdata/VoiceCoreをColab上で無料で動かすサンプルスクリプトです \n",
"This is a sample script that runs webbigdata/VoiceCore for free on Colab. \n",
"\n",
"Enter your Japanese text and we'll create voice wave file. \n",
"日本語のテキストを入力すると、その文章を音声にしたWAF fileを作成します \n",
"\n",
"\n",
"## How to run/動かし方\n",
"\n",
"If you are on a github page, click the Open in Colab button at the top of the screen to launch Colab.\n",
"\n",
"あなたが見ているのがgithubのページである場合、画面上部に表示されているOpen in Colabボタンを押してColabを起動してください\n",
"\n",
"\n",
"\n",
"Next, run each cell one by one (i.e. click the \"▷\" in order as shown in the image below). \n",
"次に、セルを1つずつ実行(つまり、以下の画像のような「▷」を順番にクリック)してください \n",
"\n",
"\n"
],
"metadata": {
"id": "k-Rs1yFEdLdo"
}
},
{
"cell_type": "markdown",
"source": [
"## 1. Install Required Libraries"
],
"metadata": {
"id": "UbdUkAusy1_N"
}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"cellView": "form",
"id": "lyrygqjF6-09"
},
"outputs": [],
"source": [
"%%capture\n",
"%%shell\n",
"#@title Install Required Libraries\n",
"\n",
"pip install snac transformers scipy"
]
},
{
"cell_type": "markdown",
"source": [
"## 2. Setting Up\n",
"\n",
"2つのモデルをダウンロードするためやや時間がかかります \n",
"This will take some time as two models will be downloaded. "
],
"metadata": {
"id": "3w85X9ciyzlz"
}
},
{
"cell_type": "code",
"source": [
"%%capture\n",
"#@title (1)Dependent Libraries and Utility Functions/依存ライブラリとユーティリティ関数\n",
"# ======== セル1: 依存ライブラリとユーティリティ関数 ========\n",
"\n",
"import torch\n",
"from transformers import AutoTokenizer, AutoModelForCausalLM\n",
"\n",
"model_name = \"webbigdata/VoiceCore\"\n",
"\n",
"# bfloat16が利用可能かチェックして適切なデータ型を選択\n",
"if torch.cuda.is_available() and torch.cuda.is_bf16_supported():\n",
" dtype = torch.bfloat16\n",
"else:\n",
" dtype = torch.float16\n",
"\n",
"model = AutoModelForCausalLM.from_pretrained(\n",
" model_name,\n",
" torch_dtype=dtype,\n",
" device_map=\"auto\",\n",
" use_cache=True,\n",
")\n",
"tokenizer = AutoTokenizer.from_pretrained(model_name)\n",
"\n",
"import locale\n",
"import torchaudio.transforms as T\n",
"import os\n",
"import torch\n",
"from snac import SNAC\n",
"locale.getpreferredencoding = lambda: \"UTF-8\"\n",
"\n",
"snac_model = SNAC.from_pretrained(\"hubertsiuzdak/snac_24khz\")\n",
"snac_model.to(\"cpu\")\n"
],
"metadata": {
"id": "al8F1n-Fmpq7",
"cellView": "form"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"## 3. Run VoiceCore"
],
"metadata": {
"id": "Fh8DAKfM3xE0"
}
},
{
"cell_type": "markdown",
"source": [
" 各声の用途制限、連絡・クレジット表記義務については[webbigdata/VoiceCore](https://huggingface.co/webbigdata/VoiceCore)を参照してください。現Versionでは女性の声はプレビュー版の位置づけです。高音域でノイズが乗ってしまう傾向があります。 \n",
" Please refer to [webbigdata/VoiceCore](https://huggingface.co/webbigdata/VoiceCore) for usage restrictions and contact/credit obligations for each voice. In the current version, the female voice is a preview version. There is a tendency for noise to be added in the high range."
],
"metadata": {
"id": "g-CC4lcWMW5w"
}
},
{
"cell_type": "code",
"source": [
"#@title (1)声の選択とテキストの入力/Voice select and text input\n",
"# 音声選択\n",
"voice_type = 'matsukaze_male (さわやかな男性) (c)松風' #@param [\"amitaro_female (明るい女の子 (c)あみたろの声素材工房)\", \"matsukaze_male (さわやかな男性) (c)松風\", \"naraku_female (落ち着いた女性) (c)極楽唯\", \"shiguu_male (大人びた少年) (c)刻鳴時雨(CV:丸ころ)\", \"sayoko_female (一般81歳女性) (c)Fusic サヨ子音声コーパス\", \"dahara1_male (一般男性)\"]\n",
"\n",
"# 発声テキスト入力\n",
"speech_text = \"こんにちは、今日もよろしくお願いします。\" #@param {type:\"string\"}"
],
"metadata": {
"cellView": "form",
"id": "LfYTVtZr2trR"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"#@title (2)声の生成 / Generate voice\n",
"# voice_typeから実際の音声名を抽出\n",
"chosen_voice = voice_type.split(' (')[0] + \"[neutral]\"\n",
"prompts = [speech_text]\n",
"\n",
"print(f\"選択された音声: {chosen_voice}\")\n",
"print(f\"テキスト: {speech_text}\")\n",
"\n",
"# 音声生成処理\n",
"prompts_ = [(f\"{chosen_voice}: \" + p) if chosen_voice else p for p in prompts]\n",
"all_input_ids = []\n",
"for prompt in prompts_:\n",
" input_ids = tokenizer(prompt, return_tensors=\"pt\").input_ids\n",
" all_input_ids.append(input_ids)\n",
"\n",
"start_token = torch.tensor([[ 128259]], dtype=torch.int64) # Start of human\n",
"end_tokens = torch.tensor([[128009, 128260, 128261]], dtype=torch.int64) # End of text, End of human\n",
"\n",
"all_modified_input_ids = []\n",
"for input_ids in all_input_ids:\n",
" modified_input_ids = torch.cat([start_token, input_ids, end_tokens], dim=1) # SOH SOT Text EOT EOH\n",
" all_modified_input_ids.append(modified_input_ids)\n",
"\n",
"all_padded_tensors = []\n",
"all_attention_masks = []\n",
"max_length = max([modified_input_ids.shape[1] for modified_input_ids in all_modified_input_ids])\n",
"\n",
"for modified_input_ids in all_modified_input_ids:\n",
" padding = max_length - modified_input_ids.shape[1]\n",
" padded_tensor = torch.cat([torch.full((1, padding), 128263, dtype=torch.int64), modified_input_ids], dim=1)\n",
" attention_mask = torch.cat([torch.zeros((1, padding), dtype=torch.int64), torch.ones((1, modified_input_ids.shape[1]), dtype=torch.int64)], dim=1)\n",
" all_padded_tensors.append(padded_tensor)\n",
" all_attention_masks.append(attention_mask)\n",
"\n",
"all_padded_tensors = torch.cat(all_padded_tensors, dim=0)\n",
"all_attention_masks = torch.cat(all_attention_masks, dim=0)\n",
"\n",
"input_ids = all_padded_tensors.to(\"cuda\")\n",
"attention_mask = all_attention_masks.to(\"cuda\")\n",
"\n",
"generated_ids = model.generate(\n",
" input_ids=input_ids,\n",
" attention_mask=attention_mask,\n",
" max_new_tokens=8196,\n",
" do_sample=True,\n",
" temperature=0.6,\n",
" top_p=0.90,\n",
" repetition_penalty=1.1,\n",
" eos_token_id=128258,\n",
" use_cache=True\n",
" )\n",
"\n",
"token_to_find = 128257\n",
"token_to_remove = 128258\n",
"#print(generated_ids)\n",
"\n",
"token_indices = (generated_ids == token_to_find).nonzero(as_tuple=True)\n",
"if len(token_indices[1]) > 0:\n",
" last_occurrence_idx = token_indices[1][-1].item()\n",
" cropped_tensor = generated_ids[:, last_occurrence_idx+1:]\n",
"else:\n",
" cropped_tensor = generated_ids\n",
"\n",
"mask = cropped_tensor != token_to_remove\n",
"processed_rows = []\n",
"for row in cropped_tensor:\n",
" masked_row = row[row != token_to_remove]\n",
" processed_rows.append(masked_row)\n",
"\n",
"code_lists = []\n",
"for row in processed_rows:\n",
" row_length = row.size(0)\n",
" new_length = (row_length // 7) * 7\n",
" trimmed_row = row[:new_length]\n",
" trimmed_row = [t - 128266 for t in trimmed_row]\n",
" code_lists.append(trimmed_row)\n",
"\n",
"def redistribute_codes(code_list):\n",
" layer_1 = []\n",
" layer_2 = []\n",
" layer_3 = []\n",
" for i in range((len(code_list)+6)//7):\n",
" layer_1.append(code_list[7*i])\n",
" layer_2.append(code_list[7*i+1]-4096)\n",
" layer_3.append(code_list[7*i+2]-(2*4096))\n",
" layer_3.append(code_list[7*i+3]-(3*4096))\n",
" layer_2.append(code_list[7*i+4]-(4*4096))\n",
" layer_3.append(code_list[7*i+5]-(5*4096))\n",
" layer_3.append(code_list[7*i+6]-(6*4096))\n",
" codes = [torch.tensor(layer_1).unsqueeze(0),\n",
" torch.tensor(layer_2).unsqueeze(0),\n",
" torch.tensor(layer_3).unsqueeze(0)]\n",
" audio_hat = snac_model.decode(codes)\n",
" return audio_hat\n",
"\n",
"my_samples = []\n",
"for code_list in code_lists:\n",
" samples = redistribute_codes(code_list)\n",
" my_samples.append(samples)\n",
"\n",
"# 音声ファイル保存と再生\n",
"import scipy.io.wavfile as wavfile\n",
"from IPython.display import Audio, display\n",
"import numpy as np\n",
"\n",
"if len(prompts) != len(my_samples):\n",
" raise Exception(\"Number of prompts and samples do not match\")\n",
"else:\n",
" for i in range(len(my_samples)):\n",
" print(f\"プロンプト: {prompts[i]}\")\n",
" samples = my_samples[i]\n",
" sample_np = samples.detach().squeeze().to(\"cpu\").numpy()\n",
"\n",
" # ファイル名を設定\n",
" filename = f\"audio_{i}_{prompts[i][:20].replace(' ', '_').replace('/', '_')}.wav\"\n",
"\n",
" # WAVファイルとして保存(サンプリングレート: 24000Hz)\n",
" wavfile.write(filename, 24000, sample_np)\n",
"\n",
" # Colab上で再生\n",
" print(f\"生成された音声ファイル: {filename}\")\n",
" display(Audio(sample_np, rate=24000))"
],
"metadata": {
"cellView": "form",
"id": "NocLpdwcYyJa"
},
"execution_count": null,
"outputs": []
},
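{
"cell_type": "markdown",
"source": [
"## 4. (Optional) Download the generated WAV files\n",
"\n",
"生成されたWAVファイルをローカルにダウンロードします(任意) \n",
"The next cell is a minimal sketch for saving the generated WAV files to your local machine. It assumes a Colab runtime (it uses the `google.colab.files` API) and that the `audio_*.wav` files from the previous cell exist in the working directory. Outside Colab, copy the files manually instead.\n"
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"#@title (3)Download generated files / 生成ファイルのダウンロード (optional)\n",
"# Minimal sketch, Colab only: download every generated audio_*.wav file.\n",
"import glob\n",
"from google.colab import files\n",
"\n",
"for wav_path in sorted(glob.glob(\"audio_*.wav\")):\n",
" print(f\"Downloading {wav_path}\")\n",
" files.download(wav_path)"
],
"metadata": {
"cellView": "form"
},
"execution_count": null,
"outputs": []
},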
{
"cell_type": "markdown",
"source": [
"## 謝辞 / Acknowledgment\n",
"全ての合成音声の研究者/愛好家/声データ提供者の皆様。彼らの研究成果/データ/熱意がなけなければ、このモデルは完成できなかったでしょう。直接使用しなかったデータ/知識などにも大いに影響/励ましを受けました。 \n",
"To all researchers and enthusiasts of synthetic speech, Voice data provider. Without their research results, data, and enthusiasm, this model would not have been completed. I was also greatly influenced and encouraged by data and knowledge that I did not directly use. \n",
"\n",
"- [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct)\n",
"- [canopylabs/orpheus-tts](https://huggingface.co/collections/canopylabs/orpheus-tts-67d9ea3f6c05a941c06ad9d2)\n",
"- [hubertsiuzdak/snac_24khz](https://huggingface.co/hubertsiuzdak/snac_24khz)\n",
"- [Unsloth](https://unsloth.ai/) for Traing script.\n",
"- [Huggingface](https://huggingface.co/) for storage."
],
"metadata": {
"id": "G19mXDdBLeon"
}
},
{
"cell_type": "markdown",
"source": [
"## Developer/開発\n",
"\n",
"- **Developed by:** dahara1@webbigdata\n",
"- **Model type:** text audio generation\n",
"- **Language(s) (NLP):** Japanese\n",
"- **model :** [webbigdata/VoiceCore](https://huggingface.co/webbigdata/VoiceCore)"
],
"metadata": {
"id": "0kZ8Jo4s6S01"
}
}
]
} |