---
license: mit
tags:
- audio
library_name: pytorch
---

# Vocos

#### Note: This repo has no affiliation with the author of Vocos.

Pretrained Vocos model with a 48 kHz sampling rate, as opposed to the 24 kHz of the official model.

## Usage

Make sure the Vocos library is installed:

```bash
pip install vocos
```

Then load the model as usual:

```python
from vocos import Vocos

vocos = Vocos.from_pretrained("kittn/vocos-mel-48khz-alpha1")
```

For more detailed examples, see [github.com/charactr-platform/vocos#usage](https://github.com/charactr-platform/vocos#usage).

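
Because this checkpoint targets 48 kHz audio, input should be resampled to 48 kHz rather than the 24 kHz used in the official examples. The following is a minimal copy-synthesis sketch adapted from the official usage pattern. It assumes `torchaudio` is installed alongside `vocos`; the function name `copy_synthesis` and the file paths are hypothetical and only for illustration.

```python
MODEL_ID = "kittn/vocos-mel-48khz-alpha1"
TARGET_SAMPLE_RATE = 48_000  # sampling rate stated for this checkpoint


def copy_synthesis(in_path: str, out_path: str) -> None:
    """Extract features from a recording and resynthesize it with the vocoder."""
    # Heavy dependencies are imported lazily so the module can be imported
    # without them being present.
    import torch
    import torchaudio
    from vocos import Vocos

    vocos = Vocos.from_pretrained(MODEL_ID)

    y, sr = torchaudio.load(in_path)  # y has shape (channels, samples)
    if y.size(0) > 1:
        # Mix multi-channel audio down to mono.
        y = y.mean(dim=0, keepdim=True)
    # Resample to the rate the checkpoint was trained for.
    y = torchaudio.functional.resample(y, orig_freq=sr, new_freq=TARGET_SAMPLE_RATE)

    with torch.inference_mode():
        # Calling the model on raw audio runs feature extraction and decoding.
        y_hat = vocos(y)

    torchaudio.save(out_path, y_hat.cpu(), TARGET_SAMPLE_RATE)
```

For example, `copy_synthesis("input.wav", "output.wav")` would write a reconstructed 48 kHz file, where both paths are placeholders for your own files.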
## Evals

TODO

## Training details

TODO

## What is Vocos?

Here's a summary from the [official repo](https://github.com/charactr-platform/vocos):

> Vocos is a fast neural vocoder designed to synthesize audio waveforms from acoustic features. Trained using a Generative Adversarial Network (GAN) objective, Vocos can generate waveforms in a single forward pass. Unlike other typical GAN-based vocoders, Vocos does not model audio samples in the time domain. Instead, it generates spectral coefficients, facilitating rapid audio reconstruction through inverse Fourier transform.

For more details and other variants, check out the repo link above.

## Model summary

```text
=================================================================
Layer (type:depth-idx)                   Param #
=================================================================
Vocos                                    --
├─MelSpectrogramFeatures: 1-1            --
│    └─MelSpectrogram: 2-1               --
│    │    └─Spectrogram: 3-1             --
│    │    └─MelScale: 3-2                --
├─VocosBackbone: 1-2                     --
│    └─Conv1d: 2-2                       918,528
│    └─LayerNorm: 2-3                    2,048
│    └─ModuleList: 2-4                   --
│    │    └─ConvNeXtBlock: 3-3           4,208,640
│    │    └─ConvNeXtBlock: 3-4           4,208,640
│    │    └─ConvNeXtBlock: 3-5           4,208,640
│    │    └─ConvNeXtBlock: 3-6           4,208,640
│    │    └─ConvNeXtBlock: 3-7           4,208,640
│    │    └─ConvNeXtBlock: 3-8           4,208,640
│    │    └─ConvNeXtBlock: 3-9           4,208,640
│    │    └─ConvNeXtBlock: 3-10          4,208,640
│    └─LayerNorm: 2-5                    2,048
├─ISTFTHead: 1-3                         --
│    └─Linear: 2-6                       2,101,250
│    └─ISTFT: 2-7                        --
=================================================================
Total params: 36,692,994
Trainable params: 36,692,994
Non-trainable params: 0
=================================================================
```
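
The per-layer counts in the summary can be cross-checked against the reported total, and they also hint at the model's hyperparameters. The sketch below does that arithmetic. The inferred values are not stated anywhere in this repo; they follow only under the labeled assumptions that the layers match the official Vocos implementation (7-tap convolutions, a layer-scale vector in each ConvNeXt block, and a head that outputs `n_fft + 2` values).

```python
# Per-layer parameter counts copied from the model summary.
EMBED_CONV = 918_528
LAYER_NORM = 2_048
CONVNEXT_BLOCK = 4_208_640
NUM_BLOCKS = 8
HEAD_LINEAR = 2_101_250
REPORTED_TOTAL = 36_692_994

# Assumption: kernel size 7, as in the official Vocos backbone.
KERNEL_SIZE = 7

# Two backbone LayerNorms (2-3 and 2-5) plus conv, blocks, and head.
total = EMBED_CONV + 2 * LAYER_NORM + NUM_BLOCKS * CONVNEXT_BLOCK + HEAD_LINEAR

# LayerNorm stores one weight and one bias per channel.
dim = LAYER_NORM // 2

# Conv1d: n_mels * kernel * dim weights plus dim biases.
n_mels = (EMBED_CONV - dim) // (KERNEL_SIZE * dim)

# Linear head: (dim + 1) * out_features.
# Assumption: out_features = n_fft + 2 (magnitude and phase per frequency bin).
head_out = HEAD_LINEAR // (dim + 1)
n_fft = head_out - 2

# ConvNeXt block (assumed layout): depthwise conv (dim * k + dim),
# LayerNorm (2 * dim), two pointwise linears (dim * I + I and I * dim + dim),
# and a layer-scale vector (dim).
fixed = dim * KERNEL_SIZE + dim + 2 * dim + dim + dim
intermediate_dim = (CONVNEXT_BLOCK - fixed) // (2 * dim + 1)

print(total == REPORTED_TOTAL)  # True
print(dim, n_mels, n_fft, intermediate_dim)  # 1024 128 2048 2048
```

Every division above comes out exact, which is consistent with a hidden width of 1024, 128 mel bins, an FFT size of 2048, and an intermediate width of 2048, but treat these as inferences rather than documented settings.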