---
library_name: transformers
license: cc-by-nc-sa-4.0
language:
- en
base_model:
- ryota-komatsu/bigvgan
---

# Model Card for Model ID

## Model Details

### Model Description

- **Model type:** Flow matching-based Diffusion Transformer with a BigVGAN vocoder

### Model Sources

- **Repository:** [Code](https://github.com/ryota-komatsu/speaker_disentangled_hubert)
- **Demo:** [Project page](https://ryota-komatsu.github.io/speaker_disentangled_hubert)

## How to Get Started with the Model

Use the code below to get started with the model.

```sh
git clone https://github.com/ryota-komatsu/speaker_disentangled_hubert.git
cd speaker_disentangled_hubert

sudo apt install git-lfs  # for UTMOS

conda create -y -n py310 -c pytorch -c nvidia -c conda-forge python=3.10.18 pip=24.0 faiss-gpu=1.11.0
conda activate py310
pip install -r requirements/requirements.txt

sh scripts/setup.sh
```

```python
import torchaudio

from src.flow_matching import FlowMatchingWithBigVGan
from src.s5hubert import S5HubertForSyllableDiscovery

wav_path = "/path/to/wav"

# download pretrained models from the Hugging Face Hub
encoder = S5HubertForSyllableDiscovery.from_pretrained("ryota-komatsu/s5-hubert", device_map="cuda")
decoder = FlowMatchingWithBigVGan.from_pretrained("ryota-komatsu/s5-hubert-decoder", device_map="cuda")

# load a waveform and resample it to 16 kHz
waveform, sr = torchaudio.load(wav_path)
waveform = torchaudio.functional.resample(waveform, sr, 16000)

# encode the waveform into syllabic units
outputs = encoder(waveform.to(encoder.device))

# syllabic units, e.g. [3950, 67, ..., 503]
units = outputs[0]["units"]
units = units.unsqueeze(0)  # add a batch dimension

# unit-to-speech synthesis
audio_values = decoder(units)
```

## Training Hyperparameters

- **Training regime:** fp16 mixed precision