Performance on high-pitch speech
Hi, thanks for the great model!
I'm finding that for some high-pitched speech, the decoded output sounds like the voice is cracking. Below I've attached an example input, the decoded output, and a screenshot of the spectrograms.
Has anyone seen this before? And are there any ideas for how to improve performance, maybe some pre-processing on the waveform that could help?
Thanks!
input file:
encoded then decoded with hubertsiuzdak/snac_24khz:
spectrograms (I circled the part of the decoded output where the formants look disconnected):
Same
I tried fixing this with fine-tuning, but I don't think it's possible. IMO, one l0 code per roughly 80 ms just isn't enough temporal resolution.
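To put rough numbers on the temporal-resolution point, here is a small sketch that computes how long one code lasts and how many pitch periods it has to cover. The token rates (about 12, 23, and 47 Hz for the 24 kHz model) are my reading of the model card and should be treated as assumptions; the two f0 values are hypothetical examples, not measurements from the attached file.

```python
# Assumed approximate token rates (Hz) for the 24 kHz model, coarsest first.
# These are not read from the checkpoint; adjust them if your model differs.
TOKEN_RATES_HZ = {"l0": 12.0, "l1": 23.0, "l2": 47.0}


def code_duration_ms(token_rate_hz: float) -> float:
    """Time span covered by a single code, in milliseconds."""
    return 1000.0 / token_rate_hz


def pitch_periods_per_code(f0_hz: float, token_rate_hz: float) -> float:
    """Number of pitch periods at f0_hz that fall inside one code's span."""
    return f0_hz / token_rate_hz


if __name__ == "__main__":
    # Hypothetical fundamental frequencies for a low and a high voice.
    for f0 in (120.0, 400.0):
        for level, rate in TOKEN_RATES_HZ.items():
            print(
                f"f0={f0:.0f} Hz, {level}: "
                f"{code_duration_ms(rate):.1f} ms per code, "
                f"{pitch_periods_per_code(f0, rate):.1f} pitch periods per code"
            )
```

Under these assumptions, a single l0 code has to summarize several times more pitch periods for a high voice than for a low one, which would be consistent with the cracking showing up mainly on high-pitched speech.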
If 22 kHz is fine, NeMo seems like a better option for voices like this: