{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": [],
"gpuType": "T4"
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
},
"accelerator": "GPU"
},
"cells": [
{
"cell_type": "markdown",
"source": [
"# [VoiceCore](https://huggingface.co/webbigdata/VoiceCore) Demo\n",
"\n",
"This is a sample script that runs webbigdata/VoiceCore for free on Colab.\n",
"\n",
"Enter Japanese text and the script generates a WAV file of that text read aloud.\n",
"\n",
"## How to run\n",
"\n",
"If you are viewing this on a GitHub page, click the Open in Colab button at the top of the screen to launch Colab.\n",
"\n",
"Next, run the cells one at a time by clicking each \"▷\" button in order.\n"
],
"metadata": {
"id": "k-Rs1yFEdLdo"
}
},
{
"cell_type": "markdown",
"source": [
"## 1. Install Required Libraries"
],
"metadata": {
"id": "UbdUkAusy1_N"
}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"cellView": "form",
"id": "lyrygqjF6-09"
},
"outputs": [],
"source": [
"%%capture\n",
"%%shell\n",
"#@title Install Required Libraries\n",
"\n",
"pip install snac transformers scipy"
]
},
{
"cell_type": "markdown",
"source": [
"## 2. Setting Up\n",
"\n",
"This step takes a while because two models are downloaded."
],
"metadata": {
"id": "3w85X9ciyzlz"
}
},
{
"cell_type": "code",
"source": [
"%%capture\n",
"#@title (1) Dependent Libraries and Utility Functions\n",
"# ======== Cell 1: dependent libraries and utility functions ========\n",
"\n",
"import locale\n",
"\n",
"import torch\n",
"from snac import SNAC\n",
"from transformers import AutoTokenizer, AutoModelForCausalLM\n",
"\n",
"# Make locale.getpreferredencoding() always report UTF-8\n",
"locale.getpreferredencoding = lambda: \"UTF-8\"\n",
"\n",
"model_name = \"webbigdata/VoiceCore\"\n",
"\n",
"# Use bfloat16 when the GPU supports it; otherwise fall back to float16\n",
"if torch.cuda.is_available() and torch.cuda.is_bf16_supported():\n",
"    dtype = torch.bfloat16\n",
"else:\n",
"    dtype = torch.float16\n",
"\n",
"# Text-to-audio-token language model\n",
"model = AutoModelForCausalLM.from_pretrained(\n",
"    model_name,\n",
"    torch_dtype=dtype,\n",
"    device_map=\"auto\",\n",
"    use_cache=True,\n",
")\n",
"tokenizer = AutoTokenizer.from_pretrained(model_name)\n",
"\n",
"# SNAC decoder that turns the generated codes into a 24 kHz waveform (kept on the CPU)\n",
"snac_model = SNAC.from_pretrained(\"hubertsiuzdak/snac_24khz\")\n",
"snac_model.to(\"cpu\")\n"
],
"metadata": {
"id": "al8F1n-Fmpq7",
"cellView": "form"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"## 3. Run VoiceCore"
],
"metadata": {
"id": "Fh8DAKfM3xE0"
}
},
{
"cell_type": "markdown",
"source": [
"See [webbigdata/VoiceCore](https://huggingface.co/webbigdata/VoiceCore) for the usage restrictions and contact/credit obligations that apply to each voice. In the current version, the female voices are at a preview stage and tend to pick up noise in the high range."
],
"metadata": {
"id": "g-CC4lcWMW5w"
}
},
{
"cell_type": "code",
"source": [
"#@title (1) Voice selection and text input\n",
"# Voice selection\n",
"voice_type = 'matsukaze_male (refreshing male voice) (c)松風' #@param [\"amitaro_female (cheerful girl) (c)あみたろの声素材工房\", \"matsukaze_male (refreshing male voice) (c)松風\", \"naraku_female (calm woman) (c)極楽唯\", \"shiguu_male (mature-sounding boy) (c)刻鳴時雨(CV:丸ころ)\", \"sayoko_female (81-year-old woman) (c)Fusic サヨ子音声コーパス\", \"dahara1_male (ordinary man)\"]\n",
"\n",
"# Text to speak (Japanese)\n",
"speech_text = \"こんにちは、今日もよろしくお願いします。\" #@param {type:\"string\"}"
],
"metadata": {
"cellView": "form",
"id": "LfYTVtZr2trR"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"#@title (2) Generate voice\n",
"# Extract the actual voice name from voice_type and append the \"[neutral]\" tag\n",
"chosen_voice = voice_type.split(' (')[0] + \"[neutral]\"\n",
"prompts = [speech_text]\n",
"\n",
"print(f\"Selected voice: {chosen_voice}\")\n",
"print(f\"Text: {speech_text}\")\n",
"\n",
"# Prefix each prompt with the voice name, then tokenize it\n",
"prompts_ = [(f\"{chosen_voice}: \" + p) if chosen_voice else p for p in prompts]\n",
"all_input_ids = []\n",
"for prompt in prompts_:\n",
"    input_ids = tokenizer(prompt, return_tensors=\"pt\").input_ids\n",
"    all_input_ids.append(input_ids)\n",
"\n",
"start_token = torch.tensor([[128259]], dtype=torch.int64)  # Start of human\n",
"# End of text, End of human; 128261 is presumably Start of AI (Orpheus convention)\n",
"end_tokens = torch.tensor([[128009, 128260, 128261]], dtype=torch.int64)\n",
"\n",
"all_modified_input_ids = []\n",
"for input_ids in all_input_ids:\n",
"    modified_input_ids = torch.cat([start_token, input_ids, end_tokens], dim=1)  # SOH SOT Text EOT EOH\n",
"    all_modified_input_ids.append(modified_input_ids)\n",
"\n",
"# Left-pad shorter prompts with token 128263 and mask out the padding\n",
"all_padded_tensors = []\n",
"all_attention_masks = []\n",
"max_length = max([modified_input_ids.shape[1] for modified_input_ids in all_modified_input_ids])\n",
"\n",
"for modified_input_ids in all_modified_input_ids:\n",
"    padding = max_length - modified_input_ids.shape[1]\n",
"    padded_tensor = torch.cat([torch.full((1, padding), 128263, dtype=torch.int64), modified_input_ids], dim=1)\n",
"    attention_mask = torch.cat([torch.zeros((1, padding), dtype=torch.int64), torch.ones((1, modified_input_ids.shape[1]), dtype=torch.int64)], dim=1)\n",
"    all_padded_tensors.append(padded_tensor)\n",
"    all_attention_masks.append(attention_mask)\n",
"\n",
"all_padded_tensors = torch.cat(all_padded_tensors, dim=0)\n",
"all_attention_masks = torch.cat(all_attention_masks, dim=0)\n",
"\n",
"# Move the inputs to whichever device the model was loaded on\n",
"input_ids = all_padded_tensors.to(model.device)\n",
"attention_mask = all_attention_masks.to(model.device)\n",
"\n",
"generated_ids = model.generate(\n",
"    input_ids=input_ids,\n",
"    attention_mask=attention_mask,\n",
"    max_new_tokens=8196,\n",
"    do_sample=True,\n",
"    temperature=0.6,\n",
"    top_p=0.90,\n",
"    repetition_penalty=1.1,\n",
"    eos_token_id=128258,\n",
"    use_cache=True,\n",
")\n",
"\n",
"token_to_find = 128257\n",
"token_to_remove = 128258\n",
"\n",
"# Keep only the tokens after the last occurrence of token_to_find\n",
"token_indices = (generated_ids == token_to_find).nonzero(as_tuple=True)\n",
"if len(token_indices[1]) > 0:\n",
"    last_occurrence_idx = token_indices[1][-1].item()\n",
"    cropped_tensor = generated_ids[:, last_occurrence_idx+1:]\n",
"else:\n",
"    cropped_tensor = generated_ids\n",
"\n",
"# Drop every token_to_remove (the EOS id passed to generate) from each row\n",
"processed_rows = []\n",
"for row in cropped_tensor:\n",
"    masked_row = row[row != token_to_remove]\n",
"    processed_rows.append(masked_row)\n",
"\n",
"# Trim each row to a multiple of 7 tokens and subtract the 128266 offset\n",
"code_lists = []\n",
"for row in processed_rows:\n",
"    row_length = row.size(0)\n",
"    new_length = (row_length // 7) * 7\n",
"    trimmed_row = row[:new_length]\n",
"    # .tolist() yields plain Python ints, so torch.tensor() below builds CPU tensors\n",
"    trimmed_row = (trimmed_row - 128266).tolist()\n",
"    code_lists.append(trimmed_row)\n",
"\n",
"def redistribute_codes(code_list):\n",
"    # Each 7-token frame carries 1 code for layer 1, 2 for layer 2, and 4 for layer 3;\n",
"    # position k within the frame is offset by k * 4096\n",
"    layer_1 = []\n",
"    layer_2 = []\n",
"    layer_3 = []\n",
"    for i in range((len(code_list)+6)//7):\n",
"        layer_1.append(code_list[7*i])\n",
"        layer_2.append(code_list[7*i+1]-4096)\n",
"        layer_3.append(code_list[7*i+2]-(2*4096))\n",
"        layer_3.append(code_list[7*i+3]-(3*4096))\n",
"        layer_2.append(code_list[7*i+4]-(4*4096))\n",
"        layer_3.append(code_list[7*i+5]-(5*4096))\n",
"        layer_3.append(code_list[7*i+6]-(6*4096))\n",
"    codes = [torch.tensor(layer_1).unsqueeze(0),\n",
"             torch.tensor(layer_2).unsqueeze(0),\n",
"             torch.tensor(layer_3).unsqueeze(0)]\n",
"    # Decode the three code layers into a waveform without tracking gradients\n",
"    with torch.inference_mode():\n",
"        audio_hat = snac_model.decode(codes)\n",
"    return audio_hat\n",
"\n",
"my_samples = []\n",
"for code_list in code_lists:\n",
"    samples = redistribute_codes(code_list)\n",
"    my_samples.append(samples)\n",
"\n",
"# Save the audio files and play them back\n",
"import scipy.io.wavfile as wavfile\n",
"from IPython.display import Audio, display\n",
"\n",
"if len(prompts) != len(my_samples):\n",
"    raise Exception(\"Number of prompts and samples do not match\")\n",
"else:\n",
"    for i in range(len(my_samples)):\n",
"        print(f\"Prompt: {prompts[i]}\")\n",
"        samples = my_samples[i]\n",
"        sample_np = samples.detach().squeeze().to(\"cpu\").numpy()\n",
"\n",
"        # Build the file name from the index and the first 20 characters of the prompt\n",
"        filename = f\"audio_{i}_{prompts[i][:20].replace(' ', '_').replace('/', '_')}.wav\"\n",
"\n",
"        # Save as a WAV file (sampling rate: 24000 Hz)\n",
"        wavfile.write(filename, 24000, sample_np)\n",
"\n",
"        # Play back in Colab\n",
"        print(f\"Generated audio file: {filename}\")\n",
"        display(Audio(sample_np, rate=24000))"
],
"metadata": {
"cellView": "form",
"id": "NocLpdwcYyJa"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"## Acknowledgment\n",
"Thanks to all the researchers, enthusiasts, and voice data providers in the synthetic speech community. Without their research results, data, and enthusiasm, this model could not have been completed. I was also greatly influenced and encouraged by data and knowledge that I did not use directly.\n",
"\n",
"- [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct)\n",
"- [canopylabs/orpheus-tts](https://huggingface.co/collections/canopylabs/orpheus-tts-67d9ea3f6c05a941c06ad9d2)\n",
"- [hubertsiuzdak/snac_24khz](https://huggingface.co/hubertsiuzdak/snac_24khz)\n",
"- [Unsloth](https://unsloth.ai/) for the training script.\n",
"- [Hugging Face](https://huggingface.co/) for storage."
],
"metadata": {
"id": "G19mXDdBLeon"
}
},
{
"cell_type": "markdown",
"source": [
"## Developer\n",
"\n",
"- **Developed by:** dahara1@webbigdata\n",
"- **Model type:** text-to-audio generation\n",
"- **Language(s) (NLP):** Japanese\n",
"- **Model:** [webbigdata/VoiceCore](https://huggingface.co/webbigdata/VoiceCore)"
],
"metadata": {
"id": "0kZ8Jo4s6S01"
}
}
]
}