Anime-Speech-Japanese-Refiner

This model is a fine-tuned version of Qwen/Qwen3-Omni-30B-A3B-Instruct.

This is an audio-understanding model specialized for Japanese anime-style and game-style speech. Given an audio clip and its original transcription (text), it generates a detailed description of the speech (emotion, speaker profile, etc.) and a refined transcription that includes non-speech events (e.g., breaths, sighs).

It was fine-tuned using the NandemoGHS/Galgame_Gemini_Captions dataset.

The training was conducted using the ms-swift library with the Megatron backend.

Demo: https://huggingface.co/spaces/OmniAICreator/Anime-Speech-Japanese-Refiner-Demo

Intended Use and Limitations

This model is specifically designed for Japanese game-style or anime-style speech.

Due to the nature of its training data, it is not expected to perform well on:

  • Languages other than Japanese.
  • General conversational speech (e.g., meetings, casual dialogue).

How to Use (Inference)

We recommend using vLLM for inference.

vLLM Installation Requirements

This model requires building vLLM from a recent development commit, as it is not yet supported in the latest stable release (v0.11.0 as of this writing).

It has been tested and confirmed to work with commit 18961c5ea62976efc50525b72e40337993c5e4f9. You must build vLLM from source:

git clone https://github.com/vllm-project/vllm.git
cd vllm
uv pip install . --torch-backend=auto -v --prerelease=allow

This requirement will likely be unnecessary after the v0.11.1 release.
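
To confirm that a development build is installed, you can check the version string vLLM reports. A minimal sketch (the exact pre-release version format may vary):

import vllm

# A source build from a development commit reports a pre-release version
# string (e.g. a 0.11.1 dev build) rather than the stable "0.11.0".
print(vllm.__version__)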

Inference Example

Here is a simple inference script using vLLM:

import os
import torch

from vllm import LLM, SamplingParams
from transformers import Qwen3OmniMoeProcessor
from qwen_omni_utils import process_mm_info

if __name__ == '__main__':
    # The vLLM v1 engine does not support this model yet; force the v0 engine
    os.environ['VLLM_USE_V1'] = '0'

    MODEL_PATH = "NandemoGHS/Anime-Speech-Japanese-Refiner-FP8-DYNAMIC"

    llm = LLM(
            model=MODEL_PATH, trust_remote_code=True, gpu_memory_utilization=0.95,
            tensor_parallel_size=torch.cuda.device_count(),
            limit_mm_per_prompt={'audio': 1},
            max_num_seqs=8,
            max_model_len=8192,
            seed=100,
    )

    sampling_params = SamplingParams(
        temperature=0.6,
        top_p=0.95,
        top_k=20,
        max_tokens=4096,
    )

    processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL_PATH)

    audio_path = "https://huggingface.co/NandemoGHS/Anime-Speech-Japanese-Refiner/resolve/main/examples/example1.wav"

    original_transcription = "あっ、あぁんっ、好き、大好きですわ…。もっと…はぁ、んんっ、はぁんっ、もっとぉ!"

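    # The Japanese prompt below instructs the model to annotate the clip's
    # voice characteristics (speaker profile, mood, speed, prosody,
    # pitch/timbre, speaking style, an emotion tag from a fixed English list,
    # and notes), summarize them in a short Japanese caption, and output a
    # refined transcription with non-speech event tags inserted where needed.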
    prompt = f"""これから与えられる音声クリップとその文字起こしについて、声の特徴と読み上げスタイル、感情などをアノテーションしたうえで、日本語の短いキャプションで要約してください。
出力には以下の項目を含めてください。

profile: 話者プロファイル(例: お姉さん的な女性声/落ち着いた男性声/少女声 等)
mood: 感情・ムード(例: 明るい/落ち着いた/緊張/怒り/恐怖/悲しみ/快楽 等)
speed: 話速(例: とても遅い/やや速い/一定/(1.2×) 等)
prosody: 抑揚・リズム(例: 平坦/メリハリ/語尾上げ下げ/ため息混じり 等)
pitch_timbre: ピッチ/声質(例: 高め/低め/息多め/張りのある/囁き 等)
style: 発話スタイル(例: ナレーション風/会話調/朗読調/プレゼン調/囁き/喘ぎ/嗚咽/叫び 等)
emotion: 感情タグ(次のリストから1つ選択: ["angry", "sad", "disdainful", "excited", "surprised", "satisfied", "unhappy", "anxious", "hysterical", "delighted", "scared", "worried", "indifferent", "upset", "impatient", "nervous", "guilty", "scornful", "frustrated", "depressed", "panicked", "furious", "empathetic", "embarrassed", "reluctant", "disgusted", "keen", "moved", "proud", "relaxed", "grateful", "confident", "interested", "curious", "confused", "joyful", "disapproving", "negative", "denying", "astonished", "serious", "sarcastic", "conciliative", "comforting", "sincere", "sneering", "hesitating", "yielding", "painful", "awkward", "amused", "loving", "dating", "longing", "aroused", "seductive", "ecstatic", "shy"])
notes: 特記事項(間の取り方、笑い・ため・ブレス、ノイズ感、キス音、効果音、チュパ音 等)
caption: 上記を1〜2文・全角30〜80文字で自然文に要約
refined_text: 元の文字起こしテキストに、必要に応じて特殊タグを音声中のイベントの描写として文章のどこかに挿入したもの(必要なければ元テキストをそのまま出力)。

元の文字起こしテキスト: {original_transcription}
元の音声クリップ:"""

    messages = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "audio", "audio": audio_path},
            ],
        }
    ]

    text = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    audios, _, _ = process_mm_info(messages, use_audio_in_video=False)

    inputs = {
        'prompt': text,
        'multi_modal_data': {},
    }

    if audios is not None:
        inputs['multi_modal_data']['audio'] = audios

    outputs = llm.generate([inputs], sampling_params=sampling_params)

    print(outputs[0].outputs[0].text)

Example Output

This is the output generated for the example audio and transcription above. In refined_text, the inserted tags (喘ぎ) (moaning) and (吐息) (breath) are the model's annotations of non-speech events heard in the clip.

emotion: ecstatic
profile: お嬢様風の女性声
mood: 快楽、絶頂
speed: 途切れ途切れ
prosody: 喘ぎながら話す
pitch_timbre: 高く、息多め、裏返り気味
style: 喘ぎ
notes: 激しい喘ぎ声と荒い息遣いが混じる。性的な行為の最中を強く示唆する。
caption: お嬢様風の女性が快楽に喘いでいる。高く裏返った声で、息遣い荒く途切れ途切れに話す。絶頂に近い興奮状態。
refined_text: (喘ぎ)あっ、あぁんっ、好き、大好きですわ…。(吐息)もっと…はぁ、んんっ、はぁんっ、もっとぉ!

Notebook Example

For a more detailed walkthrough, please see the inference_example.ipynb notebook. (Note: You will need to adapt the prompt for this Refiner model).

Output Format

The model outputs a structured description of the audio in Japanese, following this format:

emotion: {Emotion of the speech}
profile: {Speaker profile}
mood: {Mood of the speech}
speed: {Speaking speed}
prosody: {Prosody, rhythm}
pitch_timbre: {Pitch, voice quality}
style: {Style of utterance}
notes: {Other relevant notes}
caption: {A comprehensive caption integrating all elements}
refined_text: {Original transcription with added event tags (e.g., (吐息))}
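
Because the output consists of plain key: value lines, it can be parsed into a dictionary with a few lines of Python. Below is a minimal sketch (parse_refiner_output is a hypothetical helper, not part of this repository), reusing the outputs object from the inference script above:

def parse_refiner_output(text: str) -> dict:
    # Split the output into "key: value" pairs; values may themselves
    # contain text, so only the first ":" on each line is the separator.
    fields = {}
    for line in text.strip().splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    return fields

result = parse_refiner_output(outputs[0].outputs[0].text)
print(result.get("emotion"))       # e.g. "ecstatic"
print(result.get("refined_text"))  # transcription with non-speech event tags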

License

This model is licensed under the CC-BY-NC-4.0 license.

Furthermore, the training data utilized outputs from Gemini 2.5 Pro; therefore, any use that competes with Gemini or violates its terms of service is strictly prohibited.
