A few insights from running Chatterbox fully offline on iPhone and Mac

#42
by leowangxyz - opened

I've been playing with the Chatterbox model and put together an iOS and macOS app called Chinny that runs it fully on device. The core idea is to export the model to ONNX (fp32) with a few targeted optimizations for on-device inference (I will release the scripts). Peak RAM sits around 3.2 GB, and interestingly, performance is coming in better than some reports from users on PC. Chatterbox has been genuinely impressive in my tests, and it feels like it has HUGE potential on consumer hardware as things continue to improve.
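
For context, the export step itself is nothing exotic. Here's a minimal sketch of the pattern, using a toy stand-in module (the real model is split into several components and each is exported separately; all names here are placeholders, not Chatterbox's actual API):

```python
import torch

# Toy stand-in for a single Chatterbox component; in practice each
# component (text encoder, s3gen decoder, ...) is exported separately.
model = torch.nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
model.eval()

dummy = torch.randn(1, 128, 512)  # (batch, seq_len, hidden)

torch.onnx.export(
    model,
    (dummy,),
    "component_fp32.onnx",        # fp32 is the default export precision
    input_names=["hidden_states"],
    output_names=["output"],
    dynamic_axes={"hidden_states": {1: "seq_len"}},  # variable-length input
    opset_version=17,
)
```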

If you want to fully leverage CoreML, I suggest rewriting the model where needed to use only CoreML-supported ops, and exporting directly to CoreML format. Right now many ops get offloaded to the CPU, and in my tests a mixed CPU/CoreML backend adds a lot of overhead and can even double the inference time.
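
Here's roughly what I mean, using coremltools; the module is again a toy placeholder, and these are the flags I'd reach for, not a drop-in Chatterbox converter:

```python
import torch
import coremltools as ct

model = torch.nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True).eval()
example = torch.randn(1, 128, 512)
traced = torch.jit.trace(model, example)

# Converting straight to an ML Program keeps the whole graph inside
# CoreML instead of splitting work between CPU and ANE at runtime.
mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="hidden_states", shape=example.shape)],
    convert_to="mlprogram",
    compute_units=ct.ComputeUnit.CPU_AND_NE,  # prefer the Neural Engine
    minimum_deployment_target=ct.target.iOS16,
)
mlmodel.save("component.mlpackage")
```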

Similarly, don't mix precisions: the precision conversions add significant overhead.
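
As a concrete example, coremltools lets you pin the whole ML Program to a single precision (toy module again, just to show the relevant flag):

```python
import torch
import coremltools as ct

toy = torch.nn.Linear(512, 512).eval()  # placeholder module
traced = torch.jit.trace(toy, torch.randn(1, 512))

# One precision end to end: this avoids the fp16<->fp32 cast ops that
# appear when only part of the graph gets converted to half precision.
mlmodel_fp16 = ct.convert(
    traced,
    inputs=[ct.TensorType(name="x", shape=(1, 512))],
    convert_to="mlprogram",
    compute_precision=ct.precision.FLOAT16,
)
mlmodel_fp16.save("component_fp16.mlpackage")
```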

The conditional decoder is the slowest step on an iPhone, with attention-related ops as the main bottleneck. I brought its latency down by about 90 percent, though it is still on the slow side.
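
In case it helps anyone, this is how I'd locate that kind of bottleneck with ONNX Runtime's built-in profiler (cond_decoder.onnx is a placeholder filename, and the script assumes float32 inputs):

```python
import json
import numpy as np
import onnxruntime as ort

opts = ort.SessionOptions()
opts.enable_profiling = True  # write a per-op timing trace to disk

sess = ort.InferenceSession("cond_decoder.onnx", opts)  # placeholder path

# Feed dummy data for every input, using 1 for symbolic dimensions.
feed = {}
for inp in sess.get_inputs():
    dims = [d if isinstance(d, int) else 1 for d in inp.shape]
    feed[inp.name] = np.random.randn(*dims).astype(np.float32)  # assumes float inputs
sess.run(None, feed)

profile_path = sess.end_profiling()  # Chrome-trace JSON file
events = json.load(open(profile_path))

# Aggregate time per op type; attention-related ops dominated in my tests.
totals = {}
for e in events:
    if e.get("cat") == "Node":
        op = e.get("args", {}).get("op_name", "?")
        totals[op] = totals.get(op, 0) + e.get("dur", 0)
for op, us in sorted(totals.items(), key=lambda kv: -kv[1])[:10]:
    print(f"{op}: {us / 1000:.1f} ms")
```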

I shared a demo at https://www.reddit.com/r/LocalLLaMA/comments/1o4y3b7/chinny_the_unlimited_ondevice_voice_cloner_just/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button. Feel free to try the app (https://apps.apple.com/us/app/chinny-offline-voice-cloner/id6753816417). It is offline, free, and unlimited. The model is packed into the app, with no ads and no permissions required. It's just a project to see how far on-device AI can go.

Resemble AI org

Hi, thanks for sharing - this is really interesting.

On the "conditioning decoder" (did you mean encoder? I'm assuming that here) - this creates the speaker conditioning latents for the model. However, you can pre-compute those latents and re-use them for your speakers.
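
For illustration, a caching pattern along these lines works (the graph file and the input/output names are made up; substitute the actual conditioning component):

```python
import numpy as np
import onnxruntime as ort

# Hypothetical names: a separate export of the speaker-conditioning step,
# with a single "reference_audio" input and the latents as output.
cond_sess = ort.InferenceSession("speaker_conditioning.onnx")
latent_cache: dict[str, np.ndarray] = {}

def get_speaker_latents(speaker_id: str, ref_wav: np.ndarray) -> np.ndarray:
    """Run the conditioning step once per speaker, then reuse the result."""
    if speaker_id not in latent_cache:
        (latents,) = cond_sess.run(None, {"reference_audio": ref_wav})
        latent_cache[speaker_id] = latents
    return latent_cache[speaker_id]

ref = np.random.randn(1, 16000).astype(np.float32)  # stand-in reference audio
get_speaker_latents("alice", ref)  # computed once
get_speaker_latents("alice", ref)  # cache hit, effectively free
```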

I'm not on the inference/production side of things, so I don't have much insight beyond that.

Resemble AI org

Hi @leowangxyz
I'm Jin, and I work on inference at Resemble AI. This is really interesting, and thank you for your contribution.
I saw your post, and it seems like you want to make it open source. Could you share an email address so we can reach out about a collaboration? Alternatively, you can send a mail to [email protected] or message me at https://www.linkedin.com/in/deokjin-seo-276a1520b/

Thanks

Sorry, it's a typo: it should be "conditional decoder" (the ConditionalDecoder class in the s3gen decoder.py), not the encoder.

leowangxyz changed discussion status to closed

I sent you an email. Looking forward to hearing from you!

leowangxyz changed discussion status to open

Resemble AI org

FWIW, you can check out our ONNX models, which should work well for cross-platform deployment (see the readme for usage instructions).
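
For example, on Apple platforms you can ask ONNX Runtime for the CoreML execution provider and let it fall back to CPU (the model path is a placeholder, and the CoreML EP is only available in builds that include it):

```python
import onnxruntime as ort

# Providers are tried in order; anything the CoreML EP can't run
# falls back to the CPU provider. "model.onnx" is a placeholder.
sess = ort.InferenceSession(
    "model.onnx",
    providers=["CoreMLExecutionProvider", "CPUExecutionProvider"],
)
print(sess.get_providers())  # check which providers were actually enabled
```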

@Xenova @vladislavbro Excellent work! I took a similar approach, but your code is much, much clearer (I'll probably rewrite and borrow some of your code when I release mine 😀). I'll test the performance of your cond decoder ONNX model on iOS, and hopefully it's faster.
