Does the model really need to be run in FP32?
The file size of this model is double that of its predecessor, despite having around the same number of parameters. Is this intended, or is it a mistake?
This is because of the auxiliary CTC model inside the .nemo file, which is there for timestamp support.
That does explain the discrepancy. But now it feels a bit disingenuous to call it 1 billion parameters when its VRAM footprint is double that of its predecessor.
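For a rough sense of scale, here is a back-of-the-envelope sketch (assuming the headline ~1B parameter count and that weights dominate the checkpoint; vocab/config overhead ignored):

```python
# Rough checkpoint/VRAM sizes for a ~1B-parameter model.
params = 1_000_000_000

fp32_gb = params * 4 / 1e9   # 4 bytes per parameter in FP32
bf16_gb = params * 2 / 1e9   # 2 bytes per parameter in bfloat16

print(f"FP32 weights: ~{fp32_gb:.1f} GB")  # ~4.0 GB
print(f"bf16 weights: ~{bf16_gb:.1f} GB")  # ~2.0 GB
# Bundling a second model of similar size on top of this would be
# consistent with the observed doubling of the file size, even though
# the headline parameter count refers only to the main model.
```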
It's just two models bundled together. It's OK to drop the aux CTC model if you don't need timestamps.
But how do I do that? They are bundled together in a single file, and taking a quick glance at the model page, I do not see a way to only load the main ASR/STT part of the model.
I think something like this should work:

```shell
mkdir tmp-ckpt && cd tmp-ckpt
tar xf ../canary-1b-v2.nemo
rm *timestamps_asr_model*
tar cf ../canary-1b-v2-notimestamp.nemo ./
```
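If you'd rather do it from Python, the same repack can be sketched with the standard `tarfile` module (assuming, as above, that the auxiliary model's member names contain `timestamps_asr_model`):

```python
import tarfile

def strip_members(src, dst, needle="timestamps_asr_model"):
    """Copy a tar archive, skipping members whose names contain `needle`."""
    with tarfile.open(src, "r") as tin, tarfile.open(dst, "w") as tout:
        for member in tin.getmembers():
            if needle in member.name:
                continue  # drop the auxiliary timestamp model
            fileobj = tin.extractfile(member) if member.isfile() else None
            tout.addfile(member, fileobj)

# strip_members("canary-1b-v2.nemo", "canary-1b-v2-notimestamp.nemo")
```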
Here's the timestamp model loading logic for reference:
https://github.com/NVIDIA-NeMo/NeMo/blob/d2067cbf07e087eb98dd7b8e2ad0a36dfee1234d/nemo/collections/asr/models/aed_multitask_models.py#L1289-L1320
Thanks for the suggestion. May I suggest including this in the model card? I think it would be useful to others.
Added to model card: https://huggingface.co/nvidia/canary-1b-v2#transcribing-with-timestamps Thanks.
To answer your original question: no, you can run this in bfloat16 and save additional memory. The default is FP32 for backward compatibility with all GPUs.
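Part of why the bfloat16 cast is so cheap: bfloat16 keeps float32's 8-bit exponent and simply drops the lower 16 mantissa bits, so range is preserved and memory halves. A quick stdlib sketch of the truncation (illustrative only; frameworks typically round to nearest rather than truncate):

```python
import struct

def float32_to_bfloat16_bits(x: float) -> int:
    """Truncate an IEEE-754 float32 to bfloat16 (its upper 16 bits)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits >> 16

def bfloat16_bits_to_float32(b: int) -> float:
    """Widen bfloat16 bits back to float32 by zero-padding the mantissa."""
    (x,) = struct.unpack(">f", struct.pack(">I", b << 16))
    return x

v = 3.14159
b = float32_to_bfloat16_bits(v)       # 2 bytes of payload instead of 4
print(bfloat16_bits_to_float32(b))    # → 3.140625 (slightly rounded)
```

In PyTorch-based stacks like NeMo the cast itself is typically just `model.to(torch.bfloat16)`.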