Can we improve this further?

by kikouousya - opened Jul 26

Jul 26

I really like this model! Its performance on some pornographic lines is better than that of many traditional models. If possible, I would like to discuss with you how to build a better version, including longer samples, better short-text support, tone stability, annotated input (like Fish Audio), and so on.

I would like to provide massive computing resources for these areas. Would you be interested?

OmniAICreator

Owner Jul 27

Thank you very much!

To be honest, I'm a complete beginner in audio and TTS development—this model was actually my first attempt at creating a TTS model. Unfortunately, I don't yet have the skills or knowledge needed for more advanced development in this area. My primary experience is with developing large language models (LLMs), and because this model shares many similarities with LLMs, I was able to build it relatively easily. However, for serious and advanced audio-related development, I lack the necessary expertise. Therefore, I currently don't have many concrete ideas or strategies to further improve this model.

If you're still interested in discussing this, I'd be happy to talk further!

kikouousya

Jul 27

Well, actually, we can:

1. Try a larger model, such as Llasa-8B.
2. Pre-train first on a general galgame, anime game, and ASMR dataset without semantic tags (the more data, the better).
3. Use semantic tags like (sad), (sexual), (screaming), ...

Fish Audio has already done this, but because Fish Audio only open-sourced a 0.5B distilled model, we can only plan to use that distilled model to generate semantic labels for the dataset.
Its quality on erotic content is also bad.

4. Use an LLM to re-transcribe and re-label the data from stage 2.

In this case, we could even use structured data like this (less data is needed, but it also comes at a big cost):

{
  "audio_filepath": "dataset/folder1/sound_001.wav",
  "speaker_id": "character_A",
  "emotion_primary": "sadness",
  "emotion_secondary": "pain",
  "events": [
    {
      "start_time": 0.5,
      "end_time": 1.8,
      "type": "speech",
      "transcript": "なんで、そんなひどいこと…するの？"
    },
    {
      "start_time": 1.9,
      "end_time": 2.5,
      "type": "vocalization",
      "description": "(泣き)",
      "intensity": 0.7 
    },
    {
      "start_time": 2.8,
      "end_time": 4.2,
      "type": "vocalization",
      "description": "(悔やんだ嘆き)",
      "intensity": 0.9 
    }
  ]
}
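To make the idea a bit more concrete, here is a rough Python sketch of how a record in this shape could be checked and flattened into one inline-tagged training string. Everything in it is an assumption for illustration only: the helper names `validate_events` and `to_tagged_text`, the rule that events must not overlap, the 0 to 1 range for `intensity`, and the choice to prefix the text with `emotion_primary`. It is not how this model or Fish Audio actually processes data.

```python
import json

# Hypothetical record in the structure proposed above (shortened for the example).
RECORD_JSON = """
{
  "audio_filepath": "dataset/folder1/sound_001.wav",
  "speaker_id": "character_A",
  "emotion_primary": "sadness",
  "events": [
    {"start_time": 0.5, "end_time": 1.8, "type": "speech",
     "transcript": "なんで、そんなひどいこと…するの？"},
    {"start_time": 1.9, "end_time": 2.5, "type": "vocalization",
     "description": "(泣き)", "intensity": 0.7}
  ]
}
"""


def validate_events(events):
    """Raise ValueError if events are malformed, out of order, or overlapping."""
    previous_end = 0.0
    for index, event in enumerate(events):
        start, end = event["start_time"], event["end_time"]
        if not 0 <= start < end:
            raise ValueError(f"event {index}: invalid time range {start}-{end}")
        if start < previous_end:
            raise ValueError(f"event {index}: overlaps the previous event")
        if event["type"] == "speech":
            if not event.get("transcript"):
                raise ValueError(f"event {index}: speech needs a transcript")
        elif event["type"] == "vocalization":
            if not event.get("description"):
                raise ValueError(f"event {index}: vocalization needs a description")
            # Assumed convention: intensity, when present, lies in [0, 1].
            if not 0.0 <= event.get("intensity", 0.0) <= 1.0:
                raise ValueError(f"event {index}: intensity out of range")
        else:
            raise ValueError(f"event {index}: unknown type {event['type']!r}")
        previous_end = end


def to_tagged_text(record):
    """Flatten one record into a single inline-tagged string, in time order."""
    parts = [f"({record['emotion_primary']})"]
    for event in record["events"]:
        if event["type"] == "speech":
            parts.append(event["transcript"])
        else:
            parts.append(event["description"])
    return " ".join(parts)


record = json.loads(RECORD_JSON)
validate_events(record["events"])
tagged = to_tagged_text(record)
print(tagged)
```

A flattened string like this would let the same data be used for the simpler tag-based stage as well, while the full record keeps the timing and intensity information for later experiments.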

But, to be honest, I am also new to audio model training and don't have much time. You can add me on Discord (kikouousya) to discuss further.
