---
library_name: transformers
tags:
- vocoder
- audio
license: mit
---

# Vocos-Encodec-24kHz: EnCodec-Based Neural Vocoder (Transformers-compatible version)

The Vocos model was proposed in [Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis](https://huggingface.co/papers/2306.00814). This model outputs 24kHz audio from [EnCodec](https://huggingface.co/facebook/encodec_24khz) codes.

This checkpoint is a Transformers-compatible version of [charactr/vocos-encodec-24khz](https://huggingface.co/charactr/vocos-encodec-24khz).

# 🔊 Audio samples below 👇

## Example usage

```python
from datasets import load_dataset, Audio
from transformers import VocosModel, VocosProcessor
from scipy.io.wavfile import write as write_wav

# can be chosen from [1.5, 3, 6, 12]
bandwidth = 6.0

# load model and processor
model_id = "hf-audio/vocos-encodec-24khz"
processor = VocosProcessor.from_pretrained(model_id)
model = VocosModel.from_pretrained(model_id, device_map="auto")
sampling_rate = processor.feature_extractor.sampling_rate

# load audio sample
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
ds = ds.cast_column("audio", Audio(sampling_rate=sampling_rate))
audio = ds[0]["audio"]["array"]

inputs = processor(audio=audio, bandwidth=bandwidth, sampling_rate=sampling_rate).to(model.device)
print(inputs.input_features.shape)
# -- (batch, codes, frame): [1, 128, 440]

outputs = model(**inputs)
audio = outputs.audio
print(audio.shape)
# -- (batch, time): [1, 140800]

# save audio to file
write_wav("vocos_encodec.wav", sampling_rate, audio[0].detach().cpu().numpy())
```

**Original**

**Mel-based Vocos ([hf-audio/vocos-mel-24khz](https://huggingface.co/hf-audio/vocos-mel-24khz))**

**EnCodec-based Vocos (this model)**

## Reconstructing audio from Bark tokens

The EnCodec variant can also process precomputed RVQ codes directly. You provide quantized audio codes as input to the processor, which converts them into embeddings for the Vocos model.
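For instance, the codes can come from [EnCodec](https://huggingface.co/facebook/encodec_24khz) itself. The snippet below is a minimal sketch, not part of the original examples: it assumes the standard Transformers `EncodecModel.encode` API, and that indexing its `audio_codes` output as `[chunk][batch]` yields a `(codebooks, frames)` tensor like the one `VocosProcessor` expects via its `codes` argument.

```python
import torch
from datasets import Audio, load_dataset
from transformers import EncodecFeatureExtractor, EncodecModel, VocosModel, VocosProcessor

bandwidth = 6.0  # 6 kbps corresponds to 8 codebooks

# encode an audio sample into RVQ codes with EnCodec
encodec = EncodecModel.from_pretrained("facebook/encodec_24khz")
feature_extractor = EncodecFeatureExtractor.from_pretrained("facebook/encodec_24khz")

ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
ds = ds.cast_column("audio", Audio(sampling_rate=feature_extractor.sampling_rate))
encodec_inputs = feature_extractor(
    raw_audio=ds[0]["audio"]["array"],
    sampling_rate=feature_extractor.sampling_rate,
    return_tensors="pt",
)

with torch.no_grad():
    encoder_outputs = encodec.encode(encodec_inputs["input_values"], bandwidth=bandwidth)
# assumption: audio_codes is (chunks, batch, codebooks, frames); the 24kHz model uses a single chunk
codes = encoder_outputs.audio_codes[0][0]

# decode the codes with Vocos instead of EnCodec's decoder
processor = VocosProcessor.from_pretrained("hf-audio/vocos-encodec-24khz")
vocos = VocosModel.from_pretrained("hf-audio/vocos-encodec-24khz")
inputs = processor(codes=codes, bandwidth=bandwidth)
audio = vocos(**inputs).audio  # -- (batch, time)
```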
Bark is a text-to-speech model that encodes input text into discrete EnCodec RVQ codes, then uses EnCodec to convert those codes into an audio waveform. Vocos is often paired with Bark in place of EnCodec's decoder to improve audio quality. Below is an example using the Transformers implementation of [Bark](./bark) to generate quantized codes from text, then decoding them with `VocosProcessor` and `VocosModel`:

```python
from transformers import VocosModel, VocosProcessor, BarkProcessor, BarkModel
from transformers.models.bark.generation_configuration_bark import (
    BarkSemanticGenerationConfig,
    BarkCoarseGenerationConfig,
    BarkFineGenerationConfig,
)
from scipy.io.wavfile import write as write_wav

bandwidth = 6.0

# load the Bark model and processor
bark_id = "suno/bark-small"
bark_processor = BarkProcessor.from_pretrained(bark_id)
bark = BarkModel.from_pretrained(bark_id, device_map="auto")

text_prompt = "We've been messing around with this new model called Vocos."
bark_inputs = bark_processor(text_prompt, return_tensors="pt").to(bark.device)

# building generation configs for each stage
semantic_generation_config = BarkSemanticGenerationConfig(**bark.generation_config.semantic_config)
coarse_generation_config = BarkCoarseGenerationConfig(**bark.generation_config.coarse_acoustics_config)
fine_generation_config = BarkFineGenerationConfig(**bark.generation_config.fine_acoustics_config)

# generating the RVQ codes
semantic_tokens = bark.semantic.generate(
    **bark_inputs,
    semantic_generation_config=semantic_generation_config,
)
coarse_tokens = bark.coarse_acoustics.generate(
    semantic_tokens,
    semantic_generation_config=semantic_generation_config,
    coarse_generation_config=coarse_generation_config,
    codebook_size=bark.generation_config.codebook_size,
)
fine_tokens = bark.fine_acoustics.generate(
    coarse_tokens,
    semantic_generation_config=semantic_generation_config,
    coarse_generation_config=coarse_generation_config,
    fine_generation_config=fine_generation_config,
    codebook_size=bark.generation_config.codebook_size,
)
codes = fine_tokens.squeeze(0)
# -- `codes` has shape (codebooks, frames): here (8, *)

# Reconstruct audio with Vocos from codes
vocos_id = "hf-audio/vocos-encodec-24khz"
processor = VocosProcessor.from_pretrained(vocos_id)
vocos_model = VocosModel.from_pretrained(vocos_id, device_map="auto")
sampling_rate = processor.feature_extractor.sampling_rate

# generate audio
inputs = processor(codes=codes.to("cpu"), bandwidth=bandwidth).to(vocos_model.device)
audio = vocos_model(**inputs).audio

# save audio to file
write_wav("vocos_bark.wav", sampling_rate, audio[0].detach().cpu().numpy())
```

**Output from Bark tokens**

## Batch processing

For batch processing, the `padding_mask` returned by `VocosProcessor` can be used to recover the same outputs as single-file processing.

```python
from datasets import Audio, load_dataset
from scipy.io.wavfile import write as write_wav
from transformers import VocosModel, VocosProcessor

n_audio = 2  # number of audio samples to process in a batch
bandwidth = 12  # can be chosen from [1.5, 3, 6, 12]

# load model and processor
model_id = "hf-audio/vocos-encodec-24khz"
processor = VocosProcessor.from_pretrained(model_id)
model = VocosModel.from_pretrained(model_id, device_map="auto")
sampling_rate = processor.feature_extractor.sampling_rate

# load audio samples
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
ds = ds.cast_column("audio", Audio(sampling_rate=sampling_rate))
audio = [audio_sample["array"] for audio_sample in ds[-n_audio:]["audio"]]
print(f"Input audio shape: {[_sample.shape for _sample in audio]}")
# Input audio shape: [(170760,), (107520,)]

# prepare batch
inputs = processor(audio=audio, bandwidth=bandwidth, sampling_rate=sampling_rate, device=model.device)
print(inputs.input_features.shape)
# torch.Size([2, 128, 534])

# apply model
outputs = model(**inputs)
audio_vocos = outputs.audio
print(audio_vocos.shape)
# torch.Size([2, 170880])

# save audio to file
for i in range(n_audio):
    # remove padding
    padding_mask = inputs.padding_mask[i].bool()
    valid_audio = audio_vocos[i][padding_mask].detach().cpu().numpy()
    print(f"Output audio shape {i}: {valid_audio.shape}")
    # Output audio shape 0: (170760,)
    # Output audio shape 1: (107520,)
    write_wav(f"vocos_encodec_{i}.wav", sampling_rate, valid_audio)

# save original audio to file
for i in range(n_audio):
    write_wav(f"original_{i}.wav", sampling_rate, audio[i])
```
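To sanity-check that equivalence, the sketch below (reusing the variables from the batch example above; the `atol` tolerance is an assumption to absorb floating-point noise) runs the shorter sample on its own and compares it against its trimmed batched output:

```python
import torch

# run the second sample through the model on its own
single_inputs = processor(audio=audio[1], bandwidth=bandwidth, sampling_rate=sampling_rate).to(model.device)
single_audio = model(**single_inputs).audio[0]

# corresponding batched output with padding removed
batched_audio = audio_vocos[1][inputs.padding_mask[1].bool()]

# compare over the common length; a small tolerance absorbs numerical noise
n = min(single_audio.shape[-1], batched_audio.shape[-1])
print(torch.allclose(single_audio[:n], batched_audio[:n], atol=1e-4))
```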