# J-Moshi: A Japanese Full-duplex Spoken Dialogue System

[![Japanese](https://img.shields.io/badge/README-Japanese-red.svg)](README.md) [![License](https://img.shields.io/badge/License-CC_BY--NC_4.0-blue.svg)](LICENSE)

[📑 **Paper**](http://arxiv.org/abs/2506.02979) | [🤗 **Model**](https://huggingface.co/nu-dialogue/j-moshi-ext/blob/main/README-en.md) | [🖥️ **Demo**](https://nu-dialogue.github.io/j-moshi?lang=en) | [🔧 **Training Code**](https://github.com/nu-dialogue/moshi-finetune)

J-Moshi is a full-duplex spoken dialogue model for Japanese. It is built on [Moshi](https://arxiv.org/abs/2410.00037), an English 7B-parameter full-duplex spoken dialogue model, and was further trained on Japanese spoken dialogue data. The model produces natural turn-taking behaviors such as speech overlaps and backchannels in real time, similar to human-to-human conversation. For more details, please refer to [our paper](http://arxiv.org/abs/2506.02979).

This repository provides the trained J-Moshi models and instructions for interacting with them. [Audio samples](https://nu-dialogue.github.io/j-moshi?lang=en) generated by J-Moshi and the [training codebase](https://github.com/nu-dialogue/moshi-finetune) used to build it are also available.


## Models
Two variants of J-Moshi are publicly available:
- [nu-dialogue/j-moshi](https://huggingface.co/nu-dialogue/j-moshi)
  - A model based on [kyutai/moshiko-pytorch-bf16](https://huggingface.co/kyutai/moshiko-pytorch-bf16), trained on large-scale Japanese spoken dialogue data.
- [nu-dialogue/j-moshi-ext](https://huggingface.co/nu-dialogue/j-moshi-ext)
  - A model based on [kyutai/moshiko-pytorch-bf16](https://huggingface.co/kyutai/moshiko-pytorch-bf16), trained on large-scale Japanese spoken dialogue data and augmented data synthesized using multi-stream TTS.

Each repository contains the following three model files (a minimal loading sketch follows the list):
- `model.safetensors`
  - The main J-Moshi model weights.
- `tokenizer_spm_32k_3.model`
  - Text tokenizer: the Japanese SentencePiece model from [rinna/japanese-gpt2-medium](https://huggingface.co/rinna/japanese-gpt2-medium).
- `tokenizer-e351c8d8-checkpoint125.safetensors`
  - Audio tokenizer: the Mimi model from [kyutai/moshiko-pytorch-bf16](https://huggingface.co/kyutai/moshiko-pytorch-bf16).

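If you want to use the weights programmatically rather than through the web UI described below, the files can be fetched from the Hub and loaded with the `moshi` package's loaders. The snippet below is a minimal sketch, assuming a `moshi` PyTorch package version (such as the `<=0.2.2` range installed below) that exposes `moshi.models.loaders.get_mimi` and `get_moshi_lm`; the loader API may differ in other versions, so check the release you install.

```python
# Minimal loading sketch (assumes moshi<=0.2.2; loader API may differ across versions).
import torch
import sentencepiece
from huggingface_hub import hf_hub_download
from moshi.models import loaders

repo = "nu-dialogue/j-moshi-ext"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Download the three files listed above from the Hugging Face Hub.
moshi_weight = hf_hub_download(repo, "model.safetensors")
mimi_weight = hf_hub_download(repo, "tokenizer-e351c8d8-checkpoint125.safetensors")
text_tokenizer_path = hf_hub_download(repo, "tokenizer_spm_32k_3.model")

# Audio tokenizer (Mimi) and the J-Moshi language model.
mimi = loaders.get_mimi(mimi_weight, device=device)
lm = loaders.get_moshi_lm(moshi_weight, device=device)

# Japanese SentencePiece text tokenizer.
text_tokenizer = sentencepiece.SentencePieceProcessor(model_file=text_tokenizer_path)
```

For real-time conversation, however, the simplest path is the interactive demo described next.
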
## Interactive Demo
You can interact with J-Moshi using the official [Moshi PyTorch implementation](https://github.com/kyutai-labs/moshi/tree/main/moshi) from Kyutai. For implementation details, please refer to the original Moshi repository, [kyutai-labs/moshi](https://github.com/kyutai-labs/moshi).

### Installation
Python 3.10 or higher is required.

```bash
pip install "moshi<=0.2.2"
```

### Usage
You can launch the web UI by running `moshi.server`. Specify the J-Moshi 🤗 Hugging Face Hub repository ([nu-dialogue/j-moshi](https://huggingface.co/nu-dialogue/j-moshi) or [nu-dialogue/j-moshi-ext](https://huggingface.co/nu-dialogue/j-moshi-ext)) with the `--hf-repo` option.

```bash
python -m moshi.server --hf-repo nu-dialogue/j-moshi-ext
```
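
Once the server has started, open the local URL it prints to the console in a browser (typically `http://localhost:8998`, though the address and port may differ depending on your `moshi` version and options) and allow microphone access to begin a conversation.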

### Tips
- A Linux GPU machine with at least 24GB of VRAM is required; macOS is not supported.
- To prevent the model's speech output from echoing back into the microphone, use earphones or headphones instead of speakers during a dialogue. Audio devices can be configured in the browser when accessing the web UI.


## Training Details
J-Moshi was trained on the spoken dialogue corpora listed below. In addition, J-Moshi-ext was trained on augmented data synthesized from the text dialogue corpora listed below.

- Spoken dialogue corpora
  - [J-CHAT](https://arxiv.org/abs/2407.15828)
  - [Japanese Callhome](https://catalog.ldc.upenn.edu/LDC96S37)
  - [CSJ](https://www.isca-archive.org/sspr_2003/maekawa03_sspr.html#)
  - [Travel Agency Dialogue Corpus](https://dl.acm.org/doi/10.1145/3675166)
  - Casual Dialogue Corpus (in-house)
  - Consultation Dialogue Corpus (in-house)

- Text dialogue corpora
  - [Japanese PersonaChat](https://arxiv.org/abs/2109.05217)
  - [Japanese EmpatheticDialogues](https://arxiv.org/abs/2109.05217)
  - [Japanese Daily Dialogue Corpus](https://github.com/jqk09a/japanese-daily-dialogue)
  - [RealPersonaChat](https://aclanthology.org/2023.paclic-1.85/)

Training was conducted using 128 NVIDIA V100 32GB GPUs.


## Terms of Use
J-Moshi is released under [CC BY-NC 4.0](LICENSE) and is intended for research purposes. This model is not intended for any malicious use, including impersonation or fraud. Additionally, the model's outputs may contain biases or inaccurate or offensive information derived from the training data. We assume no responsibility for any damages arising from its use.


## Acknowledgements
This research was supported by JST Moonshot R&D, Grant Number JPMJMS2011. The casual dialogue corpus and consultation dialogue corpus were constructed in joint research with AISIN Corporation. We used the computational resources of the supercomputer "Flow" at the Information Technology Center, Nagoya University. Finally, we would like to thank Kyutai Labs for releasing Moshi's technical paper and model.

<a href="https://avatar-ss.org"><img src="https://nu-dialogue.github.io/j-moshi/static/image/moonshot_logo.svg" width="200"></a>


## Citation
```bibtex
@inproceedings{ohashi2025jmoshi,
  title={Towards a Japanese Full-duplex Spoken Dialogue System},
  author={Ohashi, Atsumoto and Iizuka, Shinya and Jiang, Jingjing and Higashinaka, Ryuichiro},
  booktitle={Proceedings of the 26th Interspeech Conference},
  year={2025},
}
```