Can we improve this further?

#1
by kikouousya - opened

I really like this model! Its performance on some erotic lines is better than that of many traditional models. If possible, I would like to discuss with you how to build a better version, including longer samples, better short-text support, tone stability, annotated input (like FishAudio), and so on.

I would like to provide massive computing resources for these areas. Would you be interested?

Thank you very much!

To be honest, I'm a complete beginner in audio and TTS development—this model was actually my first attempt at creating a TTS model. Unfortunately, I don't yet have the skills or knowledge needed for more advanced development in this area. My primary experience is with developing large language models (LLMs), and because this model shares many similarities with LLMs, I was able to build it relatively easily. However, for serious and advanced audio-related development, I lack the necessary expertise. Therefore, I currently don't have many concrete ideas or strategies to further improve this model.

If you're still interested in discussing this, I'd be happy to talk further!

Well, actually, we can:

  1. Try a larger model, like Llasa-8B.
  2. Pre-train first on a general gal game, anime game, and ASMR dataset without semantic tags (more is better).
  3. Use semantic tags like (sad), (sexual), (screaming), ...
    • Fish Audio has already done this, but since Fish Audio has only open-sourced a 0.5B distilled model, we can only plan to use that distilled model to generate semantic labels for the dataset.
    • Its quality on erotic content is also bad.
  4. Use an LLM to re-transcribe and re-label the data from stage 2.
    • In this case, we can even use structured data like this (less data is needed, but the cost is also high):
{
  "audio_filepath": "dataset/folder1/sound_001.wav",
  "speaker_id": "character_A",
  "emotion_primary": "sadness",
  "emotion_secondary": "pain",
  "events": [
    {
      "start_time": 0.5,
      "end_time": 1.8,
      "type": "speech",
      "transcript": "なんで、そんなひどいこと…するの?"
    },
    {
      "start_time": 1.9,
      "end_time": 2.5,
      "type": "vocalization",
      "description": "(泣き)",
      "intensity": 0.7 
    },
    {
      "start_time": 2.8,
      "end_time": 4.2,
      "type": "vocalization",
      "description": "(悔やんだ嘆き)",
      "intensity": 0.9 
    }
  ]
}
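
A record in this shape could be sanity-checked and flattened into a single tagged training string roughly as follows. This is only a sketch: the names `parse_events` and `to_tagged_text`, the validation rules, and the "(emotion) text (vocalization)" output format are my own assumptions for illustration, not something Fish Audio or Llasa define. The `SAMPLE` record is a shortened, made-up stand-in that uses the same field names as the JSON above.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict, List, Optional

# Hypothetical sample using the same field names as the record above.
SAMPLE = """
{
  "emotion_primary": "sadness",
  "events": [
    {"start_time": 1.9, "end_time": 2.5, "type": "vocalization",
     "description": "(sobbing)", "intensity": 0.7},
    {"start_time": 0.5, "end_time": 1.8, "type": "speech",
     "transcript": "Why would you do that?"}
  ]
}
"""


@dataclass
class Event:
    start_time: float
    end_time: float
    kind: str                      # "speech" or "vocalization"
    text: str                      # transcript or vocalization description
    intensity: Optional[float] = None


def parse_events(record: Dict[str, Any]) -> List[Event]:
    """Validate the events of one record and return them sorted by start time."""
    events: List[Event] = []
    for item in record["events"]:
        if item["type"] == "speech":
            text = item["transcript"]
        elif item["type"] == "vocalization":
            text = item["description"]
        else:
            raise ValueError(f"unknown event type: {item['type']!r}")
        if item["end_time"] <= item["start_time"]:
            raise ValueError(f"event ends before it starts: {item}")
        events.append(Event(item["start_time"], item["end_time"],
                            item["type"], text, item.get("intensity")))

    events.sort(key=lambda e: e.start_time)
    # Assumption: events within one clip should not overlap in time.
    for prev, cur in zip(events, events[1:]):
        if cur.start_time < prev.end_time:
            raise ValueError(f"overlapping events at {cur.start_time}s")
    return events


def to_tagged_text(record: Dict[str, Any]) -> str:
    """Flatten a record into '(emotion) text (vocalization) ...' form."""
    parts = [f"({record['emotion_primary']})"]
    parts.extend(e.text for e in parse_events(record))
    return " ".join(parts)


tagged = to_tagged_text(json.loads(SAMPLE))
```

Rejecting malformed timestamps up front seems worthwhile here because the labels would come from an LLM re-labeling pass, which can produce inconsistent spans.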

But, to be honest, I am also new to audio model training and don't have much time. You can add me on Discord (kikouousya) to discuss further.
