# J-Moshi: A Japanese Full-duplex Spoken Dialogue System

[![Japanese](https://img.shields.io/badge/README-Japanese-red.svg)](README.md) [![License](https://img.shields.io/badge/License-CC_BY--NC_4.0-blue.svg)](LICENSE)

[📑 **Paper**](http://arxiv.org/abs/2506.02979) | [🤗 **Model**](https://huggingface.co/nu-dialogue/j-moshi-ext/blob/main/README-en.md) | [🖥️ **Demo**](https://nu-dialogue.github.io/j-moshi?lang=en) | [🔧 **Training Code**](https://github.com/nu-dialogue/moshi-finetune)

J-Moshi is a full-duplex spoken dialogue model for Japanese. It is built on [Moshi](https://arxiv.org/abs/2410.00037), an English 7B-parameter full-duplex spoken dialogue model, and was further trained on Japanese spoken dialogue data. The model produces natural turn-taking behaviors such as speech overlaps and backchannels in real time, similar to human-to-human conversation. For more details, please refer to [our paper](http://arxiv.org/abs/2506.02979).

This repository provides the trained J-Moshi models and instructions for interacting with them. [Audio samples](https://nu-dialogue.github.io/j-moshi?lang=en) generated by J-Moshi and the [training codebase](https://github.com/nu-dialogue/moshi-finetune) used to build it are also available.


## Models
Two variants of J-Moshi are publicly available:
- [nu-dialogue/j-moshi](https://huggingface.co/nu-dialogue/j-moshi)
  - A model based on [kyutai/moshiko-pytorch-bf16](https://huggingface.co/kyutai/moshiko-pytorch-bf16), trained on large-scale Japanese spoken dialogue data.
- [nu-dialogue/j-moshi-ext](https://huggingface.co/nu-dialogue/j-moshi-ext)
  - A model based on [kyutai/moshiko-pytorch-bf16](https://huggingface.co/kyutai/moshiko-pytorch-bf16), trained on large-scale Japanese spoken dialogue data and augmented data synthesized using multi-stream TTS.

Each repository contains the following three model files (a minimal loading sketch follows the list):
- `model.safetensors`
  - The main J-Moshi model weights.
- `tokenizer_spm_32k_3.model`
  - Text tokenizer: the Japanese SentencePiece model from [rinna/japanese-gpt2-medium](https://huggingface.co/rinna/japanese-gpt2-medium).
- `tokenizer-e351c8d8-checkpoint125.safetensors`
  - Audio tokenizer: the Mimi model from [kyutai/moshiko-pytorch-bf16](https://huggingface.co/kyutai/moshiko-pytorch-bf16).

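If you want to use the weights programmatically rather than through the web UI described below, the files can be fetched from the Hub and loaded with the `moshi` package's loaders. The snippet below is a minimal sketch, assuming a `moshi` PyTorch package version (such as the `<=0.2.2` range installed below) that exposes `moshi.models.loaders.get_mimi` and `get_moshi_lm`; the loader API may differ in other versions, so check the release you install.

```python
# Minimal loading sketch (assumes moshi<=0.2.2; loader API may differ across versions).
import torch
import sentencepiece
from huggingface_hub import hf_hub_download
from moshi.models import loaders

repo = "nu-dialogue/j-moshi-ext"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Download the three files listed above from the Hugging Face Hub.
moshi_weight = hf_hub_download(repo, "model.safetensors")
mimi_weight = hf_hub_download(repo, "tokenizer-e351c8d8-checkpoint125.safetensors")
text_tokenizer_path = hf_hub_download(repo, "tokenizer_spm_32k_3.model")

# Audio tokenizer (Mimi) and the J-Moshi language model.
mimi = loaders.get_mimi(mimi_weight, device=device)
lm = loaders.get_moshi_lm(moshi_weight, device=device)

# Japanese SentencePiece text tokenizer.
text_tokenizer = sentencepiece.SentencePieceProcessor(model_file=text_tokenizer_path)
```

For real-time conversation, however, the simplest path is the interactive demo described next.
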
## Interactive Demo
You can interact with J-Moshi using the official [Moshi PyTorch implementation](https://github.com/kyutai-labs/moshi/tree/main/moshi) from Kyutai. For implementation details, please refer to the original Moshi repository, [kyutai-labs/moshi](https://github.com/kyutai-labs/moshi).

### Installation
Python 3.10 or higher is required.

```bash
pip install "moshi<=0.2.2"
```

### Usage
You can launch the web UI by running `moshi.server`. Specify the J-Moshi 🤗 Hugging Face Hub repository ([nu-dialogue/j-moshi](https://huggingface.co/nu-dialogue/j-moshi) or [nu-dialogue/j-moshi-ext](https://huggingface.co/nu-dialogue/j-moshi-ext)) with the `--hf-repo` option.

```bash
python -m moshi.server --hf-repo nu-dialogue/j-moshi-ext
```
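
Once the server has started, open the local URL it prints to the console in a browser (typically `http://localhost:8998`, though the address and port may differ depending on your `moshi` version and options) and allow microphone access to begin a conversation.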

### Tips
- A Linux GPU machine with at least 24GB of VRAM is required; macOS is not supported.
- To prevent the model's speech output from echoing back into the microphone, use earphones or headphones instead of speakers during a dialogue. Audio devices can be configured in the browser when accessing the web UI.


## Training Details
J-Moshi was trained on the spoken dialogue corpora listed below. In addition, J-Moshi-ext was trained on augmented data synthesized from the text dialogue corpora listed below.

- Spoken dialogue corpora
  - [J-CHAT](https://arxiv.org/abs/2407.15828)
  - [Japanese Callhome](https://catalog.ldc.upenn.edu/LDC96S37)
  - [CSJ](https://www.isca-archive.org/sspr_2003/maekawa03_sspr.html#)
  - [Travel Agency Dialogue Corpus](https://dl.acm.org/doi/10.1145/3675166)
  - Casual Dialogue Corpus (in-house)
  - Consultation Dialogue Corpus (in-house)

- Text dialogue corpora
  - [Japanese PersonaChat](https://arxiv.org/abs/2109.05217)
  - [Japanese EmpatheticDialogues](https://arxiv.org/abs/2109.05217)
  - [Japanese Daily Dialogue Corpus](https://github.com/jqk09a/japanese-daily-dialogue)
  - [RealPersonaChat](https://aclanthology.org/2023.paclic-1.85/)

Training was conducted using 128 NVIDIA V100 32GB GPUs.


## Terms of Use
J-Moshi is released under [CC BY-NC 4.0](LICENSE) and is intended for research purposes. This model is not intended for any malicious use, including impersonation or fraud. Additionally, the model's outputs may contain biases or inaccurate or offensive information derived from the training data. We assume no responsibility for any damages arising from its use.


## Acknowledgements
This research was supported by JST Moonshot R&D, Grant Number JPMJMS2011. The casual dialogue corpus and consultation dialogue corpus were constructed in joint research with AISIN Corporation. We used the computational resources of the supercomputer "Flow" at the Information Technology Center, Nagoya University. Finally, we would like to thank Kyutai Labs for releasing Moshi's technical paper and model.

<a href="https://avatar-ss.org"><img src="https://nu-dialogue.github.io/j-moshi/static/image/moonshot_logo.svg" width="200"></a>


## Citation
```bibtex
@inproceedings{ohashi2025jmoshi,
  title={Towards a Japanese Full-duplex Spoken Dialogue System},
  author={Ohashi, Atsumoto and Iizuka, Shinya and Jiang, Jingjing and Higashinaka, Ryuichiro},
  booktitle={Proceedings of the 26th Interspeech Conference},
  year={2025},
}
```