Performance on high-pitch speech

#2
by matthen - opened

Hi, thanks for the great model!

I'm finding that for some high-pitched speech, the decoded output sounds like the voice is sort of cracking. Below I have attached an example input, decoded output, and a screenshot of the spectrograms.

Has anyone seen this before? And are there any ideas for how to improve performance, maybe some pre-processing on the waveform that could help?

Thanks!

input file:

encoded then decoded with hubertsiuzdak/snac_24khz:

spectrograms (I circled the part in the decoded output where the formants are kind of disconnected):
image.png

I tried fixing this with fine-tuning, but I don't think it's possible. IMO, one L0 code per roughly 80 ms just isn't enough temporal resolution.
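To put rough numbers on the temporal-resolution point, here is a back-of-the-envelope sketch. It assumes the approximate token rates of 12, 23 and 47 Hz for the three codebook levels of snac_24khz (an assumption taken from memory of the model card, not measured here), and the 400 Hz fundamental is just an illustrative value for a high-pitched voice.

```python
# Assumed approximate token rates (Hz) for the three snac_24khz codebook
# levels, coarse (L0) to fine (L2). These are assumptions, not measurements.
ASSUMED_TOKEN_RATES_HZ = [12.0, 23.0, 47.0]

# Illustrative fundamental frequency for a high-pitched voice (hypothetical).
EXAMPLE_F0_HZ = 400.0


def code_duration_ms(rate_hz: float) -> float:
    """Milliseconds of audio covered by one code at the given token rate."""
    return 1000.0 / rate_hz


def pitch_periods_per_code(rate_hz: float, f0_hz: float) -> float:
    """Number of fundamental periods at f0_hz that fall inside one code."""
    return f0_hz / rate_hz


for level, rate in enumerate(ASSUMED_TOKEN_RATES_HZ):
    print(
        f"L{level}: {code_duration_ms(rate):.1f} ms per code, "
        f"{pitch_periods_per_code(rate, EXAMPLE_F0_HZ):.1f} pitch periods "
        f"at {EXAMPLE_F0_HZ:.0f} Hz"
    )
```

Under these assumptions a single coarse code has to summarize dozens of pitch periods, so fast pitch or formant movement has to be carried by the finer levels.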

If 22 kHz is fine, NeMo seems like a better option for voices like this:

https://huggingface.co/spaces/Gapeleon/nemo_codec_test
