---
license: apache-2.0
language:
- ja
base_model:
- IndexTeam/IndexTTS-2
---

DNSMOS scores

Reference scores for human speech. The WAV files are in the Tsukuyomi-chan corpus, which was not part of the training data.  
Scores for a Japanese speaker's human voice ↓  
[tsukuyomui_corpus_sample](./doc/tsukuyomui_corpus_sample.csv)

```
 ,filename,len_in_sec,sr,num_hops,OVRL_raw,SIG_raw,BAK_raw,OVRL,SIG,BAK,P808_MOS
0,.\test\VOICEACTRESS100_017.wav,5.0001875,16000,1,3.2000952,3.5052588,3.896164,2.9227098969128984,3.2528423553812953,3.8747408680754774,3.3338459
1,.\test\VOICEACTRESS100_012.wav,5.1775,16000,1,2.7019486,2.9619849,3.500944,2.565979346828354,2.8846291622132165,3.6237025378548644,3.7821674
2,.\test\VOICEACTRESS100_003.wav,4.8096875,16000,1,2.1873345,2.6199408,2.827399,2.1621915352276515,2.6273744596126547,3.1010927877041423,3.4254858
3,.\test\VOICEACTRESS100_013.wav,4.8898125,16000,1,2.9094923,3.0740592,4.1051044,2.7186855172813935,2.964647590293625,3.99083596341334,3.0937512
4,.\test\VOICEACTRESS100_009.wav,5.734375,16000,2,2.7630239,3.0332813,3.5787196,2.611414337554783,2.9348871896737814,3.676297288529258,3.829452
5,.\test\VOICEACTRESS100_004.wav,5.4488125,16000,1,2.751989,2.9214494,3.7092264,2.6033311787824274,2.855168354030632,3.761127332034449,3.520155
6,.\test\VOICEACTRESS100_005.wav,10.55375,16000,1,3.3857923,3.6571841,3.9042385,3.047098137709349,3.3469432742323764,3.879440925353043,3.8805087
7,.\test\VOICEACTRESS100_030.wav,4.98825,16000,1,2.7880013,3.2666614,3.3197627,2.6300024481893955,3.0972332957037687,3.4948681538919852,3.9011118
8,.\test\VOICEACTRESS100_016.wav,4.91025,16000,1,3.122046,3.4049766,3.9595494,2.8690361983886943,3.1886048068513237,3.91117498264337,3.331627
9,.\test\VOICEACTRESS100_020.wav,6.2341875,16000,3,2.1416757,2.421304,3.2206821,2.1243002049505084,2.468773497435098,3.419672666404756,3.5495617
```
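
To compare a set of files at a glance, the per-file rows of a DNSMOS CSV can be reduced to per-column averages. The following is a minimal sketch, assuming the CSV layout shown in this README. The function name `average_scores` and the two sample rows are illustrative placeholders, not measurements.

```python
import csv
import io
from statistics import mean

# Columns to average; names follow the CSV header shown in this README.
SCORE_COLUMNS = ("OVRL", "SIG", "BAK", "P808_MOS")


def average_scores(csv_text, columns=SCORE_COLUMNS):
    """Return the mean of each score column over all rows of a DNSMOS CSV."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        raise ValueError("the CSV contains no data rows")
    return {col: mean(float(row[col]) for row in rows) for col in columns}


# Placeholder rows (made-up values) that only mimic the real layout.
sample = """,filename,len_in_sec,sr,num_hops,OVRL_raw,SIG_raw,BAK_raw,OVRL,SIG,BAK,P808_MOS
0,a.wav,5.0,16000,1,3.2,3.5,3.9,2.9,3.3,3.9,3.3
1,b.wav,5.2,16000,1,2.7,3.0,3.5,2.6,2.9,3.6,3.8
"""

averages = average_scores(sample)
for name, value in averages.items():
    print(f"{name}: {value:.3f}")
```

For a real file, pass its contents instead of `sample`, for example `average_scores(open("./doc/dnsmos_out_sample.csv", encoding="utf-8").read())`.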

Scores for TTS-synthesized voice ↓  

Single-speaker model fine-tuned on the Amitaro corpus. This fine-tuned model is not uploaded, but you can fine-tune on a single speaker yourself and should get similar scores.  
[dnsmos_out_sample](./doc/dnsmos_out_sample.csv)  

```
,filename,len_in_sec,sr,num_hops,OVRL_raw,SIG_raw,BAK_raw,OVRL,SIG,BAK,P808_MOS
0,.\test\amitaro dataset (raw human voice) emoNormal002.wav,2.761,16000,2,3.8337939,4.0836735,4.4308057,3.3279373705739044,3.5903295083053113,4.148855150656194,3.5687275
1,.\test\amitaro dataset (raw human voice)emoNormal003.wav,3.1690625,16000,3,3.8572223,4.1315084,4.4619784,3.3418322115351735,3.615698270734455,4.162368712074678,3.7228901
2,.\test\amitaro dataset (raw human voice) emoNormal001.wav,1.637125,16000,4,3.26939,3.466292,4.3564534,2.9692884277592313,3.227623923756912,4.1147632738098014,3.3102732
3,.\test\sbv2 amitaro.wav,4.2376875,16000,7,3.1194606,3.6841893,3.3271327,2.867083568671068,3.36297831453127,3.4996973662018243,3.2683856
6,.\test\fix rev17 我輩は猫である(pd).wav,64.0678125,16000,55,3.467322,3.843662,3.9634974,3.095989628156305,3.454443241861787,3.900091444817868,3.7221756
```

[Final version of the generated audio (rev17)](https://huggingface.co/WariHima/index-tts-japanese-prosody/blob/main/synthesized_wav/fix%20rev17%20%E6%88%91%E8%BC%A9%E3%81%AF%E7%8C%AB%E3%81%A7%E3%81%82%E3%82%8B(pd).wav)

Other audio samples are in the [generated audio directory](https://huggingface.co/WariHima/index-tts-japanese-prosody/blob/main/synthesized_wav/).  
  
---  

The training and inference (web GUI) code is in the fork below.  
It uses less VRAM than the original, and this model only works with that repo.  
https://github.com/q9uri/index-tts-ja

The model was pretrained on the JVNV corpus, created by Takamichi Shinnosuke sensei and Japanese voice actors!

It was also pretrained on reazon-speech-v2-denoised.

The original ReazonSpeech was created by the Reazon team.  
The source audio comes from Japanese TV broadcasts.  
The WAV files are used under the exception provisions of Japanese copyright law.

The audio was denoised by fishaudio using UVR5  
and re-uploaded to Hugging Face by litagin02.

The text transcripts were created with anime-whisper-0.3.  
Kanji were put in the suppressed tokens, so the transcripts are almost entirely kana.

The model was trained on a single RTX 3060 (power limit set to 60% of the maximum, so roughly comparable to an RTX A2000).  
Batch size was 1 and AMP was not used (haha, I forgot; I recommend using AMP).

I got the GPU at the ローカルLLMに向き合う会 hackathon. Thanks to サルドラ (@saldra) and ゆづき.

Would you like to support me?  
You can buy me a GPU from my Amazon.jp [wishlist](https://www.amazon.jp/hz/wishlist/ls/1P9G04X6Z55BK?ref_=wl_shar).

Download the custom pretrained models:  
https://huggingface.co/WariHima/index-tts-japanese-prosody  
  
The other original IndexTTS-2 weight files are also required.  
  
Inference requires CUDA 12.8 and 8 GB of VRAM.  
Generation takes roughly twice the length of the generated audio (in seconds).
  
Rename 36000+6000x8x6cycle_steps.pth to gpt.pth and copy it to ./checkpoints.  
Copy japanese-bpe.model to ./checkpoints as well, without renaming it.  
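
The two copy steps can also be scripted. This is a minimal sketch, assuming both files have already been downloaded into one local folder. `install_checkpoints` and `download_dir` are hypothetical names used for illustration and are not part of the fork.

```python
import shutil
from pathlib import Path

# File names as given in this README.
GPT_SOURCE_NAME = "36000+6000x8x6cycle_steps.pth"
GPT_TARGET_NAME = "gpt.pth"
BPE_NAME = "japanese-bpe.model"


def install_checkpoints(download_dir, checkpoints_dir="./checkpoints"):
    """Copy the downloaded files into the checkpoints folder.

    The GPT weights are renamed to gpt.pth; the BPE model keeps its name.
    """
    src = Path(download_dir)
    dst = Path(checkpoints_dir)
    dst.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src / GPT_SOURCE_NAME, dst / GPT_TARGET_NAME)
    shutil.copy2(src / BPE_NAME, dst / BPE_NAME)
    return dst


# Example call (the download folder is an assumed location):
# install_checkpoints("./downloads")
```

The original IndexTTS-2 weight files still have to be placed alongside these two files.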

Run the web UI:  
python webui.py