ARCH is a framework for benchmarking audio representations. It gives researchers a unified way to compare audio representations and gives the community a common benchmark for evaluating models. The project is currently in its first release. Details about the datasets and models are available in the GitHub repository.

Datasets are grouped by domain: Audio Events (ESC-50, US8K, FSD50K, VIVAE), Music (FMA, MTT, IRMAS, MS-DB), and Speech (RAVDESS, A-MNIST, SLURP, EMOVO).

| Model | Size | ESC-50 | US8K | FSD50K | VIVAE | FMA | MTT | IRMAS | MS-DB | RAVDESS | A-MNIST | SLURP | EMOVO |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| facebook/wav2vec2-base | B | 45.73 | 55.48 | 19.39 | 31.47 | 50.54 | 37.56 | 35.14 | 66.06 | 55.32 | 86.38 | 14.37 | 31.80 |
| microsoft/wavlm-base | B | 49.88 | 61.84 | 17.63 | 36.31 | 48.71 | 34.93 | 32.62 | 54.18 | 67.94 | 99.50 | 30.98 | 43.08 |
| microsoft/wavlm-base-plus | B | 58.73 | 64.07 | 21.57 | 36.17 | 56.17 | 38.24 | 35.76 | 57.51 | 52.20 | 99.63 | 28.06 | 36.73 |
| facebook/hubert-base-ls960 | B | 58.90 | 67.28 | 24.53 | 40.48 | 54.63 | 38.78 | 36.65 | 58.46 | 65.28 | 99.58 | 33.75 | 40.48 |
| facebook/data2vec-audio-base | B | 23.63 | 45.63 | 10.06 | 30.19 | 40.58 | 27.60 | 25.87 | 50.74 | 48.03 | 99.06 | 43.57 | 27.27 |
| ALM/wav2vec2-base-audioset | B | 52.61 | 70.48 | 21.29 | 31.26 | 59.50 | 37.92 | 35.85 | 64.61 | 45.94 | 88.09 | 11.00 | 30.83 |
| ALM/hubert-base-audioset | B | 68.80 | 79.09 | 31.05 | 40.06 | 65.87 | 43.44 | 47.67 | 67.81 | 63.54 | 98.84 | 20.53 | 33.39 |
| facebook/wav2vec2-large-robust | L | 13.13 | 42.70 | 5.80 | 22.01 | 41.71 | 20.95 | 19.91 | 50.23 | 11.57 | 45.74 | 7.33 | 19.27 |
| facebook/wav2vec2-xls-r-300m | L | 51.28 | 69.96 | 23.71 | 36.28 | 56.96 | 38.28 | 38.42 | 66.71 | 31.48 | 98.88 | 12.74 | 20.35 |
| microsoft/wavlm-large | L | 67.20 | 70.92 | 32.21 | 42.51 | 61.13 | 41.29 | 42.53 | 68.00 | 71.76 | 99.75 | 42.34 | 45.29 |
| facebook/hubert-large-ll60k | L | 63.98 | 70.00 | 29.51 | 40.95 | 54.79 | 38.36 | 36.81 | 64.08 | 72.57 | 99.95 | 45.26 | 43.76 |
| facebook/data2vec-audio-large | L | 25.35 | 49.15 | 10.82 | 30.57 | 43.46 | 28.52 | 27.08 | 44.20 | 45.14 | 99.15 | 28.60 | 23.07 |
| ALM/wav2vec2-large-audioset | L | 74.39 | 79.00 | 37.58 | 39.65 | 66.58 | 44.51 | 49.87 | 76.90 | 59.49 | 99.42 | 17.74 | 38.20 |
| ALM/hubert-large-audioset | L | 71.52 | 75.63 | 37.41 | 44.28 | 67.54 | 43.35 | 50.46 | 77.82 | 73.26 | 99.59 | 20.46 | 38.61 |
| facebook/wav2vec2-xls-r-1b | XL | 66.95 | 75.90 | 31.61 | 40.41 | 62.79 | 41.99 | 43.57 | 69.79 | 55.44 | 99.86 | 25.14 | 34.58 |
| facebook/hubert-xlarge-ll60k | XL | 63.40 | 69.66 | 29.32 | 42.72 | 56.25 | 37.76 | 37.30 | 64.71 | 75.69 | 99.95 | 47.81 | 47.17 |

The best-performing model per size is highlighted in bold. The best-performing model overall is underlined.
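
The best-per-size and best-overall picks can be recomputed from the scores themselves. The following is a minimal sketch and not part of ARCH: it copies only three columns (ESC-50, FMA, RAVDESS) from the table above, the helper names `best_per_size` and `best_overall` are hypothetical, and ties are assumed to go to the first row listed.

```python
# Illustrative only: a three-column subset of the results table above.
DATASETS = ("ESC-50", "FMA", "RAVDESS")

# (model, size, scores in the same order as DATASETS)
ROWS = [
    ("facebook/wav2vec2-base", "B", (45.73, 50.54, 55.32)),
    ("microsoft/wavlm-base", "B", (49.88, 48.71, 67.94)),
    ("microsoft/wavlm-base-plus", "B", (58.73, 56.17, 52.20)),
    ("facebook/hubert-base-ls960", "B", (58.90, 54.63, 65.28)),
    ("facebook/data2vec-audio-base", "B", (23.63, 40.58, 48.03)),
    ("ALM/wav2vec2-base-audioset", "B", (52.61, 59.50, 45.94)),
    ("ALM/hubert-base-audioset", "B", (68.80, 65.87, 63.54)),
    ("facebook/wav2vec2-large-robust", "L", (13.13, 41.71, 11.57)),
    ("facebook/wav2vec2-xls-r-300m", "L", (51.28, 56.96, 31.48)),
    ("microsoft/wavlm-large", "L", (67.20, 61.13, 71.76)),
    ("facebook/hubert-large-ll60k", "L", (63.98, 54.79, 72.57)),
    ("facebook/data2vec-audio-large", "L", (25.35, 43.46, 45.14)),
    ("ALM/wav2vec2-large-audioset", "L", (74.39, 66.58, 59.49)),
    ("ALM/hubert-large-audioset", "L", (71.52, 67.54, 73.26)),
    ("facebook/wav2vec2-xls-r-1b", "XL", (66.95, 62.79, 55.44)),
    ("facebook/hubert-xlarge-ll60k", "XL", (63.40, 56.25, 75.69)),
]


def best_per_size(rows, datasets):
    """Return {size: {dataset: (model, score)}} with the top score per size."""
    best = {}
    for model, size, scores in rows:
        per_size = best.setdefault(size, {})
        for dataset, score in zip(datasets, scores):
            # Strict '>' keeps the first row listed when scores tie.
            if dataset not in per_size or score > per_size[dataset][1]:
                per_size[dataset] = (model, score)
    return best


def best_overall(rows, datasets):
    """Return {dataset: (model, score)} with the top score across all sizes."""
    best = {}
    for model, _size, scores in rows:
        for dataset, score in zip(datasets, scores):
            if dataset not in best or score > best[dataset][1]:
                best[dataset] = (model, score)
    return best


if __name__ == "__main__":
    for size, winners in best_per_size(ROWS, DATASETS).items():
        for dataset, (model, score) in winners.items():
            print(f"best {size:>2} on {dataset}: {model} ({score:.2f})")
    for dataset, (model, score) in best_overall(ROWS, DATASETS).items():
        print(f"best overall on {dataset}: {model} ({score:.2f})")
```

Extending `DATASETS` and the score tuples with the remaining columns would cover the full table in the same way.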