---
license: apache-2.0
language:
- ja
base_model:
- IndexTeam/IndexTTS-2
---

DNSMOS scores

Sample scores for human voice recordings. The WAV files are from the Tsukuyomi-chan corpus dataset, which was not in the training data.

Average scores for a Japanese speaker's human voice ↓

[tsukuyomui_corpus_sample](./doc/tsukuyomui_corpus_sample.csv)

```
,filename,len_in_sec,sr,num_hops,OVRL_raw,SIG_raw,BAK_raw,OVRL,SIG,BAK,P808_MOS
0,.\test\VOICEACTRESS100_017.wav,5.0001875,16000,1,3.2000952,3.5052588,3.896164,2.9227098969128984,3.2528423553812953,3.8747408680754774,3.3338459
1,.\test\VOICEACTRESS100_012.wav,5.1775,16000,1,2.7019486,2.9619849,3.500944,2.565979346828354,2.8846291622132165,3.6237025378548644,3.7821674
2,.\test\VOICEACTRESS100_003.wav,4.8096875,16000,1,2.1873345,2.6199408,2.827399,2.1621915352276515,2.6273744596126547,3.1010927877041423,3.4254858
3,.\test\VOICEACTRESS100_013.wav,4.8898125,16000,1,2.9094923,3.0740592,4.1051044,2.7186855172813935,2.964647590293625,3.99083596341334,3.0937512
4,.\test\VOICEACTRESS100_009.wav,5.734375,16000,2,2.7630239,3.0332813,3.5787196,2.611414337554783,2.9348871896737814,3.676297288529258,3.829452
5,.\test\VOICEACTRESS100_004.wav,5.4488125,16000,1,2.751989,2.9214494,3.7092264,2.6033311787824274,2.855168354030632,3.761127332034449,3.520155
6,.\test\VOICEACTRESS100_005.wav,10.55375,16000,1,3.3857923,3.6571841,3.9042385,3.047098137709349,3.3469432742323764,3.879440925353043,3.8805087
7,.\test\VOICEACTRESS100_030.wav,4.98825,16000,1,2.7880013,3.2666614,3.3197627,2.6300024481893955,3.0972332957037687,3.4948681538919852,3.9011118
8,.\test\VOICEACTRESS100_016.wav,4.91025,16000,1,3.122046,3.4049766,3.9595494,2.8690361983886943,3.1886048068513237,3.91117498264337,3.331627
9,.\test\VOICEACTRESS100_020.wav,6.2341875,16000,3,2.1416757,2.421304,3.2206821,2.1243002049505084,2.468773497435098,3.419672666404756,3.5495617
```

And average scores for TTS-synthesized voice ↓

The synthesized samples come from a single-speaker model fine-tuned on Amitaro's corpus. That model is not uploaded, but you can fine-tune on a single speaker yourself and get nearly the same scores.

[dnsmos_out_sample](./doc/dnsmos_out_sample.csv)

```
,filename,len_in_sec,sr,num_hops,OVRL_raw,SIG_raw,BAK_raw,OVRL,SIG,BAK,P808_MOS
0,.\test\amitaro dataset (raw human voice) emoNormal002.wav,2.761,16000,2,3.8337939,4.0836735,4.4308057,3.3279373705739044,3.5903295083053113,4.148855150656194,3.5687275
1,.\test\amitaro dataset (raw human voice)emoNormal003.wav,3.1690625,16000,3,3.8572223,4.1315084,4.4619784,3.3418322115351735,3.615698270734455,4.162368712074678,3.7228901
2,.\test\amitaro dataset (raw human voice) emoNormal001.wav,1.637125,16000,4,3.26939,3.466292,4.3564534,2.9692884277592313,3.227623923756912,4.1147632738098014,3.3102732
3,.\test\sbv2 amitaro.wav,4.2376875,16000,7,3.1194606,3.6841893,3.3271327,2.867083568671068,3.36297831453127,3.4996973662018243,3.2683856
6,.\test\fix rev17 我輩は猫である(pd).wav,64.0678125,16000,55,3.467322,3.843662,3.9634974,3.095989628156305,3.454443241861787,3.900091444817868,3.7221756
```

[Final version of the generated audio (rev17)](https://huggingface.co/WariHima/index-tts-japanese-prosody/blob/main/synthesized_wav/fix%20rev17%20%E6%88%91%E8%BC%A9%E3%81%AF%E7%8C%AB%E3%81%A7%E3%81%82%E3%82%8B(pd).wav), and other audio is in the [generated audio dir](https://huggingface.co/WariHima/index-tts-japanese-prosody/blob/main/synthesized_wav/).

---

The training and inference (web GUI) code is in this fork. It uses less VRAM than the original, and this model only works with this repo:
https://github.com/q9uri/index-tts-ja

Pretraining used the JVNV corpus, created by Shinnosuke Takamichi-sensei and Japanese voice actors!

It also used a denoised version of ReazonSpeech v2. The original ReazonSpeech was created by the Reazon team. The source audio is WAV files from Japanese TV, used under the exception provisions of Japanese copyright law. It was denoised by fishaudio using UVR5 and re-uploaded to Hugging Face by litagin02.

The text transcripts were created with anime-whisper-0.3. Kanji were put in the suppress tokens, so the transcripts are nearly kana-only text.

The model was trained on a single RTX 3060 (power limit set to 60% of max, so the power is about like an RTX A2000) with batch size 1 and without AMP (haha, I forgot. I recommend using AMP).

The GPU is one I got through the ローカルLLMに向き合う会 hackathon. Thanks to サルドラ (@saldra) and ゆづき.

Would you like to support me? You can buy me a GPU from my amazon.jp [wish list](https://www.amazon.jp/hz/wishlist/ls/1P9G04X6Z55BK?ref_=wl_shar).

Download the custom pretrained models from https://huggingface.co/WariHima/index-tts-japanese-prosody. The other original IndexTTS-2 weight files are needed too.

Inference needs CUDA 12.8 and 8 GB of VRAM. Generation takes about 2x the length of the created voice, in seconds.

Rename 36000+6000x8x6cycle_steps.pth to gpt.pth and copy it to ./checkpoints.
Copy japanese-bpe.model to ./checkpoints too, but don't rename it.

Run the web UI with `python webui.py`.
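
The rename and copy steps above can be scripted. Below is a minimal sketch, not part of the repo: `install_checkpoints` and its `download_dir` argument are hypothetical names, and it assumes both files have already been downloaded into one local folder. Only the two file names and the `./checkpoints` target come from the instructions above.

```python
import shutil
from pathlib import Path

# File names taken from the setup instructions in this card.
GPT_SOURCE_NAME = "36000+6000x8x6cycle_steps.pth"
BPE_NAME = "japanese-bpe.model"


def install_checkpoints(download_dir, checkpoints_dir="./checkpoints"):
    """Copy the downloaded files into the checkpoints directory.

    `download_dir` is a hypothetical folder where you saved both files.
    Returns the two destination paths.
    """
    src = Path(download_dir).expanduser()
    dst = Path(checkpoints_dir).expanduser()
    dst.mkdir(parents=True, exist_ok=True)

    # The GPT weights are renamed to gpt.pth while copying.
    gpt_target = dst / "gpt.pth"
    shutil.copy2(src / GPT_SOURCE_NAME, gpt_target)

    # The BPE model keeps its original file name.
    bpe_target = dst / BPE_NAME
    shutil.copy2(src / BPE_NAME, bpe_target)

    return gpt_target, bpe_target
```

For example, `install_checkpoints("~/Downloads")` run from the root of the fork would fill `./checkpoints`, after which `python webui.py` can be started. The original IndexTTS-2 weight files still have to be placed there separately.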
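
The DNSMOS tables near the top of this card list one row per file. If you want a single number per metric, something like the sketch below could work. `mean_scores` is a hypothetical helper, not part of the DNSMOS tooling, and it only assumes the column names shown in the CSV headers above.

```python
import csv
import io
from statistics import mean

# Metric columns as they appear in the CSV headers in this card.
METRICS = ("OVRL", "SIG", "BAK", "P808_MOS")


def mean_scores(csv_text, metrics=METRICS):
    """Return the mean of each metric column over all rows of the CSV text."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return {m: mean(float(row[m]) for row in rows) for m in metrics}
```

Usage would be along the lines of `mean_scores(open("./doc/dnsmos_out_sample.csv", encoding="utf-8").read())`. Note that the synthesized-voice CSV mixes raw human recordings and TTS output, so you may want to filter rows by filename before averaging.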