Cahya Whisper Medium ONNX
ONNX-optimized version of the Cahya Whisper Medium model for Indonesian speech recognition.
Model Description
This repository contains the quantized ONNX version of the cahya/whisper-medium-id model, optimized for faster inference while maintaining transcription quality for Indonesian speech.
Model Files
encoder_model_quantized.onnx- Quantized encoder model (313 MB)decoder_model_quantized.onnx- Quantized decoder model (512 MB)config.json- Model configurationgeneration_config.json- Generation parametersexample.py- Usage example script
Performance Characteristics
- Model Size: ~825 MB (vs ~1GB original)
- Inference Speed: 20-40% faster than original
- Memory Usage: 15-30% lower memory consumption
- Quality: Minimal degradation in transcription accuracy
Installation
pip install -r requirements.txt
Usage
Basic Example
from example import CahyaWhisperONNX
# Initialize model
model = CahyaWhisperONNX("./")
# Transcribe audio file
transcription = model.transcribe("audio.wav")
print(transcription)
Command Line Usage
python example.py --audio path/to/audio.wav
Advanced Usage
import librosa
from example import CahyaWhisperONNX
# Initialize model
model = CahyaWhisperONNX("./")
# Load audio manually
audio, sr = librosa.load("audio.wav", sr=16000)
# Transcribe with custom parameters
transcription = model.transcribe(audio, max_new_tokens=256)
print(f"Transcription: {transcription}")
# Get model information
info = model.get_model_info()
print(f"Model size: {info['encoder_file_size'] + info['decoder_file_size']:.1f} MB")
Supported Audio Formats
- WAV, MP3, M4A, FLAC
- Recommended: 16kHz sample rate
- Maximum duration: 30 seconds (configurable)
Requirements
- Python 3.8+
- onnxruntime >= 1.16.0
- transformers >= 4.35.0
- librosa >= 0.10.0
Model Details
| Parameter | Value |
|---|---|
| Architecture | Whisper Medium |
| Language | Indonesian (ID) |
| Parameters | ~769M |
| Quantization | INT8 |
| Sample Rate | 16kHz |
| Context Length | 30s |
Benchmark Results
Performance comparison with original cahya/whisper-medium-id:
| Metric | Original | ONNX Quantized | Improvement |
|---|---|---|---|
| Model Size | 1024 MB | 825 MB | 19% smaller |
| Inference Time | 2.34s | 1.86s | 21% faster |
| Memory Usage | 45.2 MB | 38.7 MB | 14% lower |
| WER | 0.045 | 0.048 | -6% (minimal) |
Benchmarked on CPU with typical Indonesian speech samples
Limitations
- Quantization Effects: Slight quality degradation compared to original
- Hardware Compatibility: Some quantized operations may not work on all hardware
- Language Support: Optimized specifically for Indonesian language
- Context Window: Limited to 30-second audio segments
Troubleshooting
Common Issues
"Could not find an implementation for ConvInteger" Error
- This indicates missing quantization operator support
- Try updating onnxruntime:
pip install -U onnxruntime - Consider using onnxruntime-gpu if available
Out of Memory Error
- Reduce audio length to <30 seconds
- Use CPU execution provider: modify
providers=['CPUExecutionProvider']
Poor Transcription Quality
- Ensure audio is 16kHz sample rate
- Check audio quality and volume
- Try preprocessing audio (noise reduction, normalization)
Performance Tips
Faster Inference:
- Use shorter audio clips
- Reduce
max_new_tokensparameter - Use GPU if available with
onnxruntime-gpu
Better Quality:
- Preprocess audio (normalize volume, reduce noise)
- Use high-quality audio sources
- Ensure clear speech without background noise
Citation
@misc{cahya-whisper-medium-onnx,
title={Cahya Whisper Medium ONNX},
author={Indonesian Speech Recognition Community},
year={2024},
url={https://huggingface.co/asmud/cahya-whisper-medium-onnx}
}
License
Same license as the original Cahya Whisper model.
Related Models
- Original: cahya/whisper-medium-id
- Base model: openai/whisper-medium
- Downloads last month
- -
Model tree for asmud/cahya-whisper-medium-onnx
Evaluation results
- Word Error Rate on Indonesian Speech Test Setself-reported0.048
- Character Error Rate on Indonesian Speech Test Setself-reported0.025