A few insights from running Chatterbox fully offline on iPhone and Mac

#42
by leowangxyz - opened

I've been playing with the Chatterbox model and put together an iOS and macOS app called Chinny that runs it fully on device. The core idea is to export the model to ONNX (fp32) with a few targeted optimizations for on-device inference (I will release the scripts). Peak RAM sits around 3.2 GB, and interestingly, performance is coming in better than some reports from users on PC. Chatterbox has been genuinely impressive in my tests, and it feels like it has HUGE potential on consumer hardware as things continue to improve.
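
For context, the export step itself is nothing exotic. Here's a minimal sketch of the pattern, using a toy stand-in module (the real model is split into several components and each is exported separately; all names here are placeholders, not Chatterbox's actual API):

```python
import torch

# Toy stand-in for a single Chatterbox component; in practice each
# component (text encoder, s3gen decoder, ...) is exported separately.
model = torch.nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
model.eval()

dummy = torch.randn(1, 128, 512)  # (batch, seq_len, hidden)

torch.onnx.export(
    model,
    (dummy,),
    "component_fp32.onnx",        # fp32 is the default export precision
    input_names=["hidden_states"],
    output_names=["output"],
    dynamic_axes={"hidden_states": {1: "seq_len"}},  # variable-length input
    opset_version=17,
)
```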

If you want to fully leverage CoreML, I suggest rewriting the model where needed to use only CoreML-supported ops, and exporting directly to CoreML format. Right now many ops get offloaded to the CPU, and in my tests a mixed CPU/CoreML backend adds a lot of overhead and can even double the inference time.
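
Here's roughly what I mean, using coremltools; the module is again a toy placeholder, and these are the flags I'd reach for, not a drop-in Chatterbox converter:

```python
import torch
import coremltools as ct

model = torch.nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True).eval()
example = torch.randn(1, 128, 512)
traced = torch.jit.trace(model, example)

# Converting straight to an ML Program keeps the whole graph inside
# CoreML instead of splitting work between CPU and ANE at runtime.
mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="hidden_states", shape=example.shape)],
    convert_to="mlprogram",
    compute_units=ct.ComputeUnit.CPU_AND_NE,  # prefer the Neural Engine
    minimum_deployment_target=ct.target.iOS16,
)
mlmodel.save("component.mlpackage")
```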

Similarly, don't mix precisions: the precision conversions add significant overhead.
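
As a concrete example, coremltools lets you pin the whole ML Program to a single precision (toy module again, just to show the relevant flag):

```python
import torch
import coremltools as ct

toy = torch.nn.Linear(512, 512).eval()  # placeholder module
traced = torch.jit.trace(toy, torch.randn(1, 512))

# One precision end to end: this avoids the fp16<->fp32 cast ops that
# appear when only part of the graph gets converted to half precision.
mlmodel_fp16 = ct.convert(
    traced,
    inputs=[ct.TensorType(name="x", shape=(1, 512))],
    convert_to="mlprogram",
    compute_precision=ct.precision.FLOAT16,
)
mlmodel_fp16.save("component_fp16.mlpackage")
```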

The conditional decoder is the slowest step on an iPhone, with attention-related ops as the main bottleneck. I brought its latency down by about 90 percent, though it is still on the slow side.
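
In case it helps anyone, this is how I'd locate that kind of bottleneck with ONNX Runtime's built-in profiler (cond_decoder.onnx is a placeholder filename, and the script assumes float32 inputs):

```python
import json
import numpy as np
import onnxruntime as ort

opts = ort.SessionOptions()
opts.enable_profiling = True  # write a per-op timing trace to disk

sess = ort.InferenceSession("cond_decoder.onnx", opts)  # placeholder path

# Feed dummy data for every input, using 1 for symbolic dimensions.
feed = {}
for inp in sess.get_inputs():
    dims = [d if isinstance(d, int) else 1 for d in inp.shape]
    feed[inp.name] = np.random.randn(*dims).astype(np.float32)  # assumes float inputs
sess.run(None, feed)

profile_path = sess.end_profiling()  # Chrome-trace JSON file
events = json.load(open(profile_path))

# Aggregate time per op type; attention-related ops dominated in my tests.
totals = {}
for e in events:
    if e.get("cat") == "Node":
        op = e.get("args", {}).get("op_name", "?")
        totals[op] = totals.get(op, 0) + e.get("dur", 0)
for op, us in sorted(totals.items(), key=lambda kv: -kv[1])[:10]:
    print(f"{op}: {us / 1000:.1f} ms")
```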

I shared a demo at https://www.reddit.com/r/LocalLLaMA/comments/1o4y3b7/chinny_the_unlimited_ondevice_voice_cloner_just/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button. Feel free to try the app (https://apps.apple.com/us/app/chinny-offline-voice-cloner/id6753816417). It is offline, free, and unlimited. The model is packed into the app, with no ads and no permissions required. It's just a project to see how far on-device AI can go.

Resemble AI org

Hi, thanks for sharing - this is really interesting.

On the "conditioning decoder" (did you mean encoder? I'm assuming that here) - this creates the speaker conditioning latents for the model. However, you can pre-compute those latents and re-use them for your speakers.
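
For illustration, a caching pattern along these lines works (the graph file and the input/output names are made up; substitute the actual conditioning component):

```python
import numpy as np
import onnxruntime as ort

# Hypothetical names: a separate export of the speaker-conditioning step,
# with a single "reference_audio" input and the latents as output.
cond_sess = ort.InferenceSession("speaker_conditioning.onnx")
latent_cache: dict[str, np.ndarray] = {}

def get_speaker_latents(speaker_id: str, ref_wav: np.ndarray) -> np.ndarray:
    """Run the conditioning step once per speaker, then reuse the result."""
    if speaker_id not in latent_cache:
        (latents,) = cond_sess.run(None, {"reference_audio": ref_wav})
        latent_cache[speaker_id] = latents
    return latent_cache[speaker_id]

ref = np.random.randn(1, 16000).astype(np.float32)  # stand-in reference audio
get_speaker_latents("alice", ref)  # computed once
get_speaker_latents("alice", ref)  # cache hit, effectively free
```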

I'm not on the inference/production side of things, so I don't have much insight beyond that.

Resemble AI org

Hi @leowangxyz
I'm Jin, and I work on inference at Resemble AI. This is really interesting, and thank you for your contribution.
I saw your post, and it seems like you want to make it open source. Could you share an email address so we can reach out about a collaboration? Alternatively, you can send a mail to [email protected] or message me at https://www.linkedin.com/in/deokjin-seo-276a1520b/

Thanks

Sorry, it's a typo: it should be "conditional decoder" (the ConditionalDecoder class in the s3gen decoder.py), not the encoder.

leowangxyz changed discussion status to closed

I sent you an email. Looking forward to hearing from you!

leowangxyz changed discussion status to open

Resemble AI org

FWIW, you can check out our ONNX models, which should work well for cross-platform deployment (see the readme for usage instructions).
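
For example, on Apple platforms you can ask ONNX Runtime for the CoreML execution provider and let it fall back to CPU (the model path is a placeholder, and the CoreML EP is only available in builds that include it):

```python
import onnxruntime as ort

# Providers are tried in order; anything the CoreML EP can't run
# falls back to the CPU provider. "model.onnx" is a placeholder.
sess = ort.InferenceSession(
    "model.onnx",
    providers=["CoreMLExecutionProvider", "CPUExecutionProvider"],
)
print(sess.get_providers())  # check which providers were actually enabled
```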

@Xenova @vladislavbro Excellent work! I took a similar approach, but your code is much, much clearer (I'll probably rewrite and borrow some of your code when I release mine 😀). I'll test the performance of your cond decoder ONNX model on iOS, and hopefully it's faster.
