# Evaluation

This directory contains the evaluation code to reproduce the results from the SAM-Audio paper. The evaluation framework supports multiple datasets, prompting modes (text-only, span, visual), and metrics.

## Setup

Before running evaluation, ensure you have:

1. Installed the SAM-Audio package and its dependencies
2. Authenticated with Hugging Face to access the model checkpoints (see the main [README](../README.md))

## Quick Start

Run evaluation on the default setting (instr-pro):

```bash
python main.py
```

You can also use multiple GPUs to speed up evaluation:

```bash
torchrun --nproc_per_node=<num_gpus> main.py
```

Evaluate on a specific setting:

```bash
python main.py --setting sfx
```

Evaluate on multiple settings:

```bash
python main.py --setting sfx speech music
```

## Available Evaluation Settings

Run `python main.py --help` to see all available settings.

## Command Line Options

```bash
python main.py [OPTIONS]
```

### Options:

- `-s, --setting` - Which setting(s) to evaluate (default: `instr-pro`)
  - Choices: see the available settings above
  - Multiple settings can be specified: `--setting sfx speech music`
- `--cache-path` - Where to cache downloaded datasets (default: `~/.cache/sam_audio`)
- `-p, --checkpoint-path` - Model checkpoint to evaluate (default: `facebook/sam-audio-1b`)
  - Accepts a local path or a Hugging Face model ID
- `-b, --batch-size` - Batch size for evaluation (default: `1`)
- `-w, --num-workers` - Number of data loading workers (default: `4`)
- `-c, --candidates` - Number of reranking candidates (default: `8`)

An example that combines several of these options is shown at the end of this README.

## Evaluation Metrics

The evaluation framework computes the following metrics:

- **Judge** - SAM Audio Judge quality assessment metric
- **Aesthetic** - Aesthetic quality metric
- **CLAP** - Audio-text alignment metric (CLAP similarity)
- **ImageBind** - Audio-video alignment metric (visual settings only)

## Output

Results are saved to the `results/` directory as JSON files, one per setting:

```
results/
├── sfx.json
├── speech.json
└── music.json
```

Each JSON file contains the averaged metric scores across all samples in that setting. Example output:

```json
{
  "JudgeOverall": "4.386",
  "JudgeFaithfulness": "4.708",
  "JudgeRecall": "4.934",
  "JudgePrecision": "4.451",
  "ContentEnjoyment": "5.296",
  "ContentUsefulness": "6.903",
  "ProductionComplexity": "4.301",
  "ProductionQuality": "7.100",
  "CLAPSimilarity": "0.271"
}
```
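Putting the command-line options above together, an invocation might look like the following. The flag names and defaults come from the options list; the specific setting names and the larger batch size here are illustrative, not recommended values:

```bash
# Illustrative example: evaluate two settings with a larger batch size.
# Replace the values with ones appropriate for your hardware and checkpoint.
python main.py \
    --setting sfx speech \
    --checkpoint-path facebook/sam-audio-1b \
    --cache-path ~/.cache/sam_audio \
    --batch-size 4 \
    --num-workers 4 \
    --candidates 8
```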
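Since each results file is a flat JSON object mapping metric names to score strings, the per-setting outputs can be summarized with a short script. The sketch below is not part of the evaluation framework; it assumes the `results/` layout and key format shown in the example output above:

```python
import json
from pathlib import Path

# Assumes the results/ layout shown above: one JSON file per setting,
# each mapping metric names to score strings (e.g. "JudgeOverall": "4.386").
results_dir = Path("results")

for result_file in sorted(results_dir.glob("*.json")):
    setting = result_file.stem  # e.g. "sfx", "speech", "music"
    with result_file.open() as f:
        scores = json.load(f)
    print(f"== {setting} ==")
    for metric, value in scores.items():
        # Scores are stored as strings in the example output; cast for formatting.
        print(f"  {metric:<22} {float(value):.3f}")
```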