---
license: apache-2.0
base_model:
- openai/whisper-large-v3-turbo
pipeline_tag: automatic-speech-recognition
tags:
- qualcomm
- qnn
- whisper
- ONNX
---

Exported with a hacked-up version of the ai-hub-apps repo:

```
python .\export.py --target-runtime onnx --device "Snapdragon X Elite CRD" --skip-profiling --skip-inferencing
```

Patched `whisper/model.py`:

```python
# The number of Mel features per audio context
# N_MELS = 80
# For Whisper V3 Turbo
N_MELS = 128

## Commented out for now, as we want to use this for Whisper V3 Turbo
# # Audio embedding length
# AUDIO_EMB_LEN = int(N_SAMPLES / N_MELS / 4)
# # Audio length per MEL feature
# MELS_AUDIO_LEN = AUDIO_EMB_LEN * 2

# Number of frames in the input mel spectrogram (e.g. 3000 for 30s audio at 160 hop_length).
# This corresponds to the 'n_frames' dimension of the mel spectrogram input to the
# Whisper AudioEncoder.
MELS_AUDIO_LEN = N_SAMPLES // HOP_LENGTH

# Length of the audio embedding from the encoder output (e.g. 1500).
# This corresponds to 'n_audio_ctx' in Whisper, which is MELS_AUDIO_LEN // 2
# due to the strided convolution in the encoder. This length is used for the
# cross-attention key/value cache from the encoder.
AUDIO_EMB_LEN = MELS_AUDIO_LEN // 2
```

and the model class definition:

```python
WHISPER_VERSION = "large-v3-turbo"

# N_MELS_LARGE_V3_TURBO = 128
# DEFAULT_INPUT_SEQ_LEN = 3000


@CollectionModel.add_component(WhisperEncoderInf)
@CollectionModel.add_component(WhisperDecoderInf)
class WhisperV3Turbo(BaseWhisper):
    @classmethod
    def from_pretrained(cls):
        return super().from_pretrained(WHISPER_VERSION)
```

You also need to patch the `large-v3-turbo` checkpoint into the `_MODELS` map of the whisper package that the ai-hub library pulls in:
https://github.com/openai/whisper/blob/dd985ac4b90cafeef8712f2998d62c59c3e62d22/whisper/__init__.py#L30
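A minimal sketch of that patch, assuming the usual `_MODELS` layout; the `<sha256>` placeholder is not a real hash — copy the actual `large-v3-turbo` entry from current upstream openai/whisper (newer versions of the package also key `_ALIGNMENT_HEADS` by model name, which needs the same addition):

```python
# whisper/__init__.py — sketch only; replace <sha256> with the real checkpoint
# hash from current upstream openai/whisper before using.
_MODELS = {
    # ... existing entries kept as-is ...
    "large-v3-turbo": "https://openaipublic.azureedge.net/main/whisper/models/<sha256>/large-v3-turbo.pt",
}
```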
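For reference, the new shape constants in the `whisper/model.py` patch above can be sanity-checked from the stock openai-whisper audio constants (`whisper/audio.py`):

```python
# Stock openai-whisper audio constants (whisper/audio.py).
SAMPLE_RATE = 16000
CHUNK_LENGTH = 30                          # seconds of audio per context window
N_SAMPLES = CHUNK_LENGTH * SAMPLE_RATE     # 480000 samples
HOP_LENGTH = 160                           # STFT hop, 10 ms at 16 kHz

MELS_AUDIO_LEN = N_SAMPLES // HOP_LENGTH   # 3000 mel frames
AUDIO_EMB_LEN = MELS_AUDIO_LEN // 2        # 1500; the encoder's stride-2 conv halves it

assert MELS_AUDIO_LEN == 3000
assert AUDIO_EMB_LEN == 1500

# The old formula only agreed by coincidence at N_MELS = 80, which is why it
# had to go for V3 Turbo's 128 mel bins:
assert int(N_SAMPLES / 80 / 4) == 1500     # large-v2 and earlier
assert int(N_SAMPLES / 128 / 4) == 937     # wrong for Whisper V3 Turbo
```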
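Finally, a hypothetical smoke test of the exported encoder with onnxruntime. The file name `WhisperEncoderInf.onnx` and the `(1, 128, 3000)` mel input shape are assumptions — check the files and I/O metadata the export actually produced:

```python
# Hypothetical smoke test — file name and input shape are assumptions.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "WhisperEncoderInf.onnx",
    providers=["QNNExecutionProvider", "CPUExecutionProvider"],
)

inp = session.get_inputs()[0]
print(inp.name, inp.shape)  # expected: a (1, 128, 3000) mel-spectrogram input

# Dummy mel spectrogram matching N_MELS = 128 and MELS_AUDIO_LEN = 3000.
mel = np.zeros((1, 128, 3000), dtype=np.float32)
outputs = session.run(None, {inp.name: mel})
print([o.shape for o in outputs])  # cross-attention K/V caches sized by AUDIO_EMB_LEN = 1500
```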