hubertsiuzdak/snac_24khz

Performance on high-pitch speech

#2

by matthen - opened Jul 2

matthen

Jul 2

Hi, thanks for the great model!

I'm finding that for some high-pitched speech, the decoded output sounds like the voice is cracking. I've attached an example input, the decoded output, and a screenshot of the spectrograms below.

Has anyone seen this before? And are there any ideas for improving performance, such as some pre-processing on the waveform?

Thanks!

Input file:

Encoded then decoded with hubertsiuzdak/snac_24khz:

Spectrograms (I circled the part of the decoded output where the formants look disconnected):
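
For anyone who wants to locate this kind of divergence numerically rather than by eye, here is a minimal sketch that compares log-magnitude spectrograms of an input and a decoded waveform frame by frame. It is not the poster's method. The function names, the FFT settings, and the synthetic 400 Hz tone with a simulated dropout are all assumptions for illustration; in practice you would pass in the real input and the SNAC-decoded audio as NumPy arrays.

```python
import numpy as np

N_FFT = 1024  # assumed analysis window length (samples)
HOP = 256     # assumed hop between frames (samples)


def log_spectrogram(wave, n_fft=N_FFT, hop=HOP):
    """Log10-magnitude STFT with a Hann window, shape (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack(
        [wave[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.log10(np.abs(np.fft.rfft(frames, axis=1)) + 1e-8)


def frame_distance(reference, decoded, n_fft=N_FFT, hop=HOP):
    """Per-frame RMS difference between the two log spectrograms."""
    n = min(len(reference), len(decoded))  # decoded audio may differ in length
    diff = log_spectrogram(reference[:n], n_fft, hop) - log_spectrogram(
        decoded[:n], n_fft, hop
    )
    return np.sqrt(np.mean(diff ** 2, axis=1))


# Synthetic stand-in for real audio: a 400 Hz tone, with a short stretch
# zeroed out in the "decoded" copy to mimic a local artifact.
sample_rate = 24_000
t = np.arange(sample_rate) / sample_rate
reference = np.sin(2 * np.pi * 400 * t)
decoded = reference.copy()
decoded[12_000:14_000] = 0.0

dist = frame_distance(reference, decoded)
worst = int(np.argmax(dist))
print(f"largest mismatch near {worst * HOP / sample_rate:.3f} s")
```

Frames where the distance spikes are the ones worth inspecting in the spectrogram.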

Aug 22

Same

Oct 8

I tried fixing this with finetuning, but I don't think it's possible. IMO, one l0 code per roughly 85 ms (the coarsest level runs at about 12 codes per second) just isn't enough temporal resolution.
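
A rough back-of-the-envelope check on the temporal resolution of each code level. This is only a sketch: the encoder hop of 512 samples and the per-level strides of 4, 2, 1 are my assumptions about the snac_24khz configuration (they are consistent with token rates of about 12, 23 and 47 Hz), and the 400 Hz pitch is a hypothetical example of a high-pitched voice.

```python
SAMPLE_RATE_HZ = 24_000
ENCODER_HOP = 512        # assumed total encoder downsampling factor
VQ_STRIDES = [4, 2, 1]   # assumed extra stride per level, coarsest (l0) first
EXAMPLE_F0_HZ = 400.0    # hypothetical pitch of a high voice


def code_durations_ms(sample_rate_hz, encoder_hop, vq_strides):
    """Milliseconds of audio covered by one code at each level."""
    return [1000.0 * encoder_hop * stride / sample_rate_hz for stride in vq_strides]


def pitch_periods_per_code(f0_hz, duration_ms):
    """How many pitch periods fit inside one code's time span."""
    return duration_ms * f0_hz / 1000.0


durations = code_durations_ms(SAMPLE_RATE_HZ, ENCODER_HOP, VQ_STRIDES)
for level, ms in enumerate(durations):
    periods = pitch_periods_per_code(EXAMPLE_F0_HZ, ms)
    print(f"l{level}: {ms:.1f} ms per code, ~{periods:.0f} pitch periods at {EXAMPLE_F0_HZ:.0f} Hz")
```

Under these assumptions a single l0 code has to summarize a few dozen pitch periods of a 400 Hz voice, which fits the impression that the coarsest level is too slow for fast pitch movement.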

If 22 kHz is fine, the NeMo codec seems like a better option for voices like this:

https://huggingface.co/spaces/Gapeleon/nemo_codec_test
