---
library_name: "pytorch"
language:
- en
tags:
- audio
- diffusion
- editing
license: "other"
---

# SAO-Instruct: Free-form Audio Editing using Natural Language Instructions

[Paper](https://www.arxiv.org/abs/2510.22795) | [Sample Page](https://eth-disco.github.io/sao-instruct) | [Code](https://github.com/eth-disco/sao-instruct)

![SAO-Instruct Overview](assets/sao-instruct.png)

SAO-Instruct is a model based on Stable Audio Open capable of editing audio clips using free-form natural language instructions. To train our model, we create a dataset of audio editing triplets (input audio, edit instruction, output audio) using Prompt-to-Prompt, DDPM inversion, and a manual editing pipeline. Although partially trained on synthetic data, our model generalizes well to real in-the-wild audio clips and unseen edit instructions.

## Inference

To get started, clone the repository and install the dependencies:

```shell
git clone https://github.com/ETH-DISCO/sao-instruct.git
pip install -r model/requirements.txt && pip install model/stable-audio-tools
```

Use the following script to perform inference with the SAO-Instruct weights from 🤗 Hugging Face. When `encode_audio` is set to `True`, the provided audio is encoded into the latent space and used as a starting point for generation. You can control the amount of noise added to the encoded audio using the `encoded_audio_noise` parameter. Experiment with different configurations to achieve optimal results.

```python
import torch
from IPython.display import Audio, display
from model.sao_instruct import SAOInstruct

device = "cuda" if torch.cuda.is_available() else "cpu"
model = SAOInstruct.from_pretrained("disco-eth/sao-instruct").eval().to(device)

audio_path = "path/to/audio.wav"
edited_clips = model.edit_audio(
    instructions=["add a cat meowing"],
    audio_path=audio_path,
    encode_audio=True,
    cfg_scale=6,
    encoded_audio_noise=4
)

display(Audio(audio_path))
for clip in edited_clips:
    display(Audio(clip, rate=model.sample_rate, normalize=False))
```

## Data Generation

The required files to generate audio editing triplets are in the `dataset/` folder.

### Prompt Generation

The script `generate_prompts.py` can be used for prompt generation. It accepts a `.jsonl` file as input in the following form:

```jsonl
{"caption": "Audio Caption", "metadata": {}}
```

This input `.jsonl` file can be created using the `prepare_captions.py` script for AudioCaps, WavCaps, and AudioSetSL. If you download audio clips from captioning datasets (e.g., if you want to use DDPM inversion for paired sample generation), the `metadata` field can be used to match each caption to its corresponding filename.

The output of this script is a `.jsonl` file of processed prompts, each containing the input caption, edit instruction, and output caption.

### Paired Sample Generation

#### Prompt-to-Prompt

After generating prompts, you can use Prompt-to-Prompt to generate a synthetic dataset of edited audio pairs. The Prompt-to-Prompt pipeline consists of two parts:

- Candidate search: search for suitable candidates (CFG scale, seed) for all prompts in the prompt file.
- Sample generation: generate the edited audio pairs using the candidates found in the previous step.

Use the script `generate_candidates.py` for the candidate search. The script `generate_samples.py` can be used for Prompt-to-Prompt sample generation (use the mode `p2p`).
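As a rough sketch of that two-step flow (the exact command-line arguments are defined in the scripts themselves, so the flag names below are placeholders; only the `p2p` mode name comes from the description above):

```shell
# Hypothetical invocations -- flag names are placeholders; check each script's argument parser.
# Step 1: search for a suitable (CFG scale, seed) candidate for each prompt.
python generate_candidates.py --prompt_file prompts.jsonl --output_file candidates.jsonl

# Step 2: generate the edited audio pairs using those candidates.
python generate_samples.py --mode p2p --candidate_file candidates.jsonl --output_dir p2p_samples/
```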
We have included the source code of [Stable Audio Open](https://github.com/Stability-AI/stable-audio-tools) with the adaptations made for Prompt-to-Prompt in `audio_generation/p2p/stable-audio-tools` (particularly in `audio_generation/p2p/stable-audio-tools/models/transformer.py`). You can install its requirements using:

```shell
pip install audio_generation/p2p/stable-audio-tools
```

Make sure that the `k_diffusion` package injects the same noise into every element of the batch at each sampling step. Change the function `sample_dpmpp_3m_sde` in the `k_diffusion/sampling.py` file to:

```python
if eta:
    # Draw a single noise sample and repeat it across the batch so that
    # every element (e.g. both clips of a Prompt-to-Prompt pair) shares it.
    noise = noise_sampler(sigmas[i], sigmas[i + 1])[0].unsqueeze(dim=0)
    noise = noise.repeat(x.shape[0], 1, 1)
    x = x + noise * sigmas[i + 1] * (-2 * h * eta).expm1().neg().sqrt() * s_noise
```

#### DDPM Inversion

The script `generate_samples.py` can be used to create samples using DDPM inversion (use the mode `edit`). We follow the implementation from the paper [Zero-Shot Unsupervised and Text-Based Audio Editing Using DDPM Inversion](https://github.com/HilaManor/AudioEditingCode/). Clone the repository and install its dependencies using:

```shell
cd audio_generation && git clone https://github.com/HilaManor/AudioEditingCode.git
cd AudioEditingCode && pip install -r requirements.txt
```

#### Manual Edits

To generate manual edits, use the script `manual_edits/generate_manual_samples.py`.

## Fine-tuning Stable Audio Open

We provide training and data loading scripts to enable fine-tuning on audio editing triplets (an illustrative launch command is sketched at the end of this README):

- `model/stable-audio-tools/train_edit.py` - Modified training script for audio editing tasks
- `model/stable-audio-tools/stable_audio_tools/data/dataset_edit.py` - Custom dataset loader for editing triplets
- `model/stable-audio-tools/stable_audio_tools/configs` - Configuration files for both the model and dataset

Otherwise, follow the official recommendations from [Stable Audio Open](https://github.com/Stability-AI/stable-audio-tools) to fine-tune the model.

## Attribution and License

This repository builds upon **Stable Audio Open**, a model developed by [Stability AI](https://stability.ai). It uses checkpoints and components from [`stabilityai/stable-audio-open-1.0`](https://huggingface.co/stabilityai/stable-audio-open-1.0) that are licensed under the **[Stability AI Community License](./LICENSE-StabilityAI)**. Please see the [NOTICE](./NOTICE) file for required attribution.

**Powered by Stability AI**

This repository and its contents are released for **academic research and non-commercial use only**.
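Illustrative fine-tuning launch (referenced from the fine-tuning section above). Upstream stable-audio-tools starts training via `train.py` with `--dataset-config`, `--model-config`, and `--name`; the sketch below assumes `train_edit.py` keeps that interface, and the config file names are placeholders:

```shell
# Hypothetical launch, assuming train_edit.py mirrors the upstream stable-audio-tools arguments;
# substitute the actual config files from stable_audio_tools/configs.
python model/stable-audio-tools/train_edit.py \
  --dataset-config model/stable-audio-tools/stable_audio_tools/configs/dataset_edit.json \
  --model-config model/stable-audio-tools/stable_audio_tools/configs/model_edit.json \
  --name sao-instruct-finetune
```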