A newer version of this model is available: HiDolen/Mini-BS-RoFormer-V2-46.8M
Model Card for Mini-BS-RoFormer-18M
A model for the music source separation task, adapted from the existing BS-RoFormer implementation.
Model Details
Model hyperparameters:
- depth = 8
- hidden_size = 256
- intermediate_size = 256 * 3
The model has 17.9M parameters in total and reaches an average SDR of 9.0 on the MUSDB18HQ validation set. Per-stem SDR:
- bass: 8.31
- drums: 9.55
- other: 8.14
- vocals: 10.03
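To double-check these hyperparameters against the released checkpoint, the configuration can be inspected directly. This is a minimal sketch; the exact attribute names exposed by the custom config class are not documented in this card.

from transformers import AutoConfig

# Print the full remote config; attribute names such as depth / hidden_size /
# intermediate_size are assumed to mirror the hyperparameters listed above.
config = AutoConfig.from_pretrained(
    "HiDolen/Mini-BS-RoFormer-18M",
    trust_remote_code=True,
)
print(config)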
Uses
The transformers library version used is 4.55.4. To run the model, the soundfile, einops, and librosa packages also need to be installed.
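One way to set up the environment (only the transformers version is pinned in this card; the other packages are left unpinned):

pip install transformers==4.55.4 soundfile einops librosa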
GPU inference:
from transformers import AutoModel
import soundfile
import torch
import librosa

model_name = "HiDolen/Mini-BS-RoFormer-18M"
model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
)
model.to("cuda")

# Load the audio
file = "./Bruno Mars - Runaway Baby.mp3"
waveform, sr = librosa.load(file, sr=44100, mono=False)
waveform = torch.tensor(waveform).float()
waveform = waveform.to("cuda")

# Run inference
result = model.separate(
    waveform,
    chunk_size=44100 * 6,
    overlap_size=44100 * 3,
    gap_size=0,
    batch_size=2,
    verbose=True,
)

# Save the separated stems
for i in range(result.shape[0]):
    soundfile.write(f"separated_stem_{i}.wav", result[i].cpu().numpy().T, 44100)
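If CUDA is not available, or the input file happens to be mono, the loading step can be adapted as below. This is a sketch under the assumption that the model expects a 2-channel (channels, samples) waveform, which this card does not state explicitly.

import torch
import librosa

device = "cuda" if torch.cuda.is_available() else "cpu"
# model.to(device) would then replace model.to("cuda") above

waveform, sr = librosa.load("./Bruno Mars - Runaway Baby.mp3", sr=44100, mono=False)
waveform = torch.tensor(waveform).float()

# librosa returns a 1-D array for mono files; duplicate the channel so the
# tensor has the assumed (channels, samples) layout.
if waveform.ndim == 1:
    waveform = waveform.unsqueeze(0).repeat(2, 1)

waveform = waveform.to(device)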
To separate only 2 stems (instrumental and vocals) instead of 4:
from transformers import AutoModel
import soundfile
import torch
import librosa

model_name = "HiDolen/Mini-BS-RoFormer-18M"
model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
)
model.to("cuda")

# Load the audio
file = "./Bruno Mars - Runaway Baby.mp3"
waveform, sr = librosa.load(file, sr=44100, mono=False)
waveform = torch.tensor(waveform).float()
waveform = waveform.to("cuda")

# Run inference
result = model.separate(
    waveform,
    chunk_size=44100 * 6,
    overlap_size=44100 * 3,
    gap_size=0,
    batch_size=2,
    verbose=True,
)

# Sum bass, drums and other into an instrumental track, keep vocals separate
instrumental = result[0] + result[1] + result[2]
vocals = result[3]
result = torch.stack([instrumental, vocals], dim=0)

for i in range(result.shape[0]):
    soundfile.write(f"separated_stem_{i}.wav", result[i].cpu().numpy().T, 44100)
Training Details
Trained on the MUSDB18HQ dataset.
The Multi-STFT loss term from the original paper is not used, in order to speed up training.
Learning rate 5e-4, trained for 200k steps with batch_size=6.
Some tips found during training:
- More data augmentation is not always better. Removing the pitch-shift and time-stretch augmentations led to faster convergence and lower CPU usage.
- The Multi-STFT loss can be dropped: it does not hurt training and greatly speeds it up (see the sketch after this list).
- The MaskEstimator can be a single linear layer; it still fits the data and saves a large number of parameters.
- Moderately reducing the number of freq_transformer layers barely affects overall performance; it mainly affects frequency-rich stems such as vocals.
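As a rough illustration of the simplified objective described above (time-domain loss only, no Multi-STFT term), a minimal sketch is shown below. The L1 formulation is an assumption carried over from the BS-RoFormer paper; the exact loss used for this checkpoint is not published in this card.

import torch
import torch.nn.functional as F

def separation_loss(pred_stems: torch.Tensor, target_stems: torch.Tensor) -> torch.Tensor:
    # Hypothetical time-domain L1 loss over predicted vs. reference stems,
    # with tensors shaped (batch, stems, channels, samples); the Multi-STFT
    # term from the original paper is deliberately omitted.
    return F.l1_loss(pred_stems, target_stems)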
Acknowledgments
- https://github.com/lucidrains/BS-RoFormer
- https://arxiv.org/abs/2309.02612 (Music Source Separation with Band-Split RoPE Transformer)