**NB/Warning:** I am aware of the inference challenges and will find some time to work on this. In the meantime, please check the alternative model [adoamesh/whisper-small-swh-finetuned](https://huggingface.co/adoamesh/whisper-small-swh-finetuned); note that it is **not** quantized.

**Note for Windows users:** `_empty_affine_quantized` is not implemented on Windows, so loading the quantized model fails immediately. Use a UNIX/Linux system instead.
# Model Card for Whisper Tiny Swahili Distilled 8-bit (CPU Support)

## Model Description
This model is a quantized version of a distilled Whisper Tiny model specifically trained for Swahili speech recognition. The model has been compressed using dynamic quantization (int8) to reduce its size while maintaining performance.
- Model Type: Automatic Speech Recognition (ASR)
- Language: Swahili
- Base Model: OpenAI Whisper Tiny
- Quantization: Dynamic quantization (int8); runs on CPU only
- Distillation: Knowledge distillation from Whisper Small
## Intended Uses & Limitations

### Intended Uses
- Transcription of Swahili audio to text
- Speech-to-text applications for Swahili language
- Research in low-resource language speech recognition
### Limitations
- May perform poorly on dialects or accents not well-represented in training data
- May struggle with technical terminology or domain-specific vocabulary
- Performance may degrade with noisy audio recordings
- Quantization may introduce slight degradation in transcription quality compared to the full-precision model
## Training Data

The model was trained on Swahili speech data from the Fleurs-SLU dataset (see the citations below):
- Training set: 3.62 hours of Swahili audio
- Test set: 0.60 hours of Swahili audio
- Total dataset: 4.22 hours of Swahili audio
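For orientation, here is a minimal sketch of loading the Swahili FLEURS split from the Hugging Face Hub. The dataset ID `google/fleurs` with config `sw_ke` is an assumption: it is the underlying FLEURS corpus, not necessarily the exact Fleurs-SLU packaging used for training.

```python
# Sketch: load Swahili FLEURS audio from the Hub.
# Assumption: google/fleurs ("sw_ke" config) approximates the training data.
from datasets import load_dataset

fleurs_sw = load_dataset("google/fleurs", "sw_ke", split="train")
print(fleurs_sw[0]["transcription"])           # ground-truth Swahili text
print(fleurs_sw[0]["audio"]["sampling_rate"])  # 16 kHz audio
```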
## Training Procedure

### Knowledge Distillation
The model was trained using knowledge distillation with a Whisper Small model as the teacher and Whisper Tiny as the student. The distillation process combines:
- Soft targets from the teacher model (temperature-scaled logits)
- Hard targets from the ground truth transcriptions
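As an illustration of how these two terms combine, here is a minimal PyTorch sketch of the distillation loss. The `temperature` and `alpha` hyperparameters are illustrative, not the values used to train this model.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Sketch: weighted sum of soft-target KL loss and hard-target CE.

    temperature and alpha are illustrative hyperparameters, not the
    values used for this model.
    """
    # Soft targets: KL divergence between temperature-scaled distributions,
    # scaled by T^2 to keep gradient magnitudes comparable
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    # Hard targets: cross-entropy against the ground-truth token labels
    hard_loss = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,  # mask padded positions
    )

    return alpha * soft_loss + (1 - alpha) * hard_loss
```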
### Quantization
After distillation, the model was quantized using PyTorch's dynamic quantization:
- Linear layers were quantized to int8
- Quantization was applied post-training (PTQ)
- Model size was reduced while maintaining most of the performance
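A minimal sketch of this post-training step, assuming PyTorch's standard `quantize_dynamic` API (the exact call used to produce this checkpoint is not shown in this card; the local path is illustrative):

```python
import torch
from transformers import WhisperForConditionalGeneration

# Load the full-precision distilled model (illustrative local path)
model = WhisperForConditionalGeneration.from_pretrained(
    "./whisper-tiny-swahili-distilled"
)

# Post-training dynamic quantization: nn.Linear weights are stored as int8;
# activations are quantized on the fly at inference time (CPU only)
quantized_model = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```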
## Evaluation Results

### Distilled Model Performance

The distilled model was evaluated every 250 steps over 1,000 training steps, with the following metrics:
| Step | Training Loss | Validation Loss | WER (%) |
|---|---|---|---|
| 250 | 0.120300 | 0.711645 | 37.0963 |
| 500 | 0.014100 | 0.770001 | 33.0668 |
| 750 | 0.001000 | 0.792062 | 32.2354 |
| 1000 | 0.000700 | 0.803786 | 32.0754 |
### Comparison with Original Model
| Model | WER (%) | Improvement |
|---|---|---|
| Original Whisper-Small (OpenAI) | 103.10 | - |
| Fine-tuned Whisper-Small (Swahili) | 32.07 | ~69% relative improvement (WER dropped by ~71 points) |
Raw final evaluation output:

```python
{'eval_loss': 0.8037856817245483, 'eval_wer': 32.075471698113205, 'eval_runtime': 74.3925, 'eval_samples_per_second': 2.07, 'eval_steps_per_second': 0.269, 'epoch': 19.607843137254903}
```
### Training Statistics
- Training runtime: 8323.69 seconds
- Training samples per second: 3.844
- Training steps per second: 0.24
- Total FLOPs: 9.17e+18
- Epochs: 39.22
### Evaluation Statistics
- Evaluation loss: 0.8844
- Evaluation WER: 32.1394%
- Evaluation runtime: 73.19 seconds
- Evaluation samples per second: 2.104
- Evaluation steps per second: 0.273
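The WER figures above can be reproduced in principle with Hugging Face's `evaluate` library; in this sketch, `predictions` and `references` are illustrative stand-ins for the model's decoded outputs and the ground-truth transcripts.

```python
import evaluate

wer_metric = evaluate.load("wer")

# Illustrative placeholders for decoded outputs and references
predictions = ["ni limwambia yule bindi"]
references = ["nilimwambia yule binti"]

# compute() returns a fraction; multiply by 100 for percent
wer = 100 * wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer:.2f}%")
```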
## Example Transcription: Swahili Whisper Speech-to-Text

### Code Implementation
```python
import torch
import torchaudio
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# --- 1️⃣ Load processor (shared between models) ---
processor = WhisperProcessor.from_pretrained("openai/whisper-small")

# --- 2️⃣ Load models ---
# Fine-tuned Swahili Whisper-Small
finetuned_model_path = "./Ex02/whisper-small-swh/checkpoint-1000"
finetuned_model = WhisperForConditionalGeneration.from_pretrained(finetuned_model_path)
finetuned_model.generation_config.language = "swahili"
finetuned_model.generation_config.task = "transcribe"
finetuned_model.eval()

# Original OpenAI Whisper-Small
original_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
original_model.generation_config.language = "swahili"
original_model.generation_config.task = "transcribe"
original_model.eval()

# --- 3️⃣ Load audio file ---
audio_path = "./Ex02/Recording.wav"
waveform, sample_rate = torchaudio.load(audio_path)

# Resample to 16 kHz if necessary
if sample_rate != 16000:
    resampler = torchaudio.transforms.Resample(sample_rate, 16000)
    waveform = resampler(waveform)

# Prepare input features
input_features = processor(
    waveform.numpy()[0],
    sampling_rate=16000,
    return_tensors="pt"
).input_features

# --- 4️⃣ Transcribe ---
with torch.no_grad():
    # Fine-tuned
    predicted_ids_finetuned = finetuned_model.generate(input_features)
    transcription_finetuned = processor.batch_decode(predicted_ids_finetuned, skip_special_tokens=True)[0]

    # Original
    predicted_ids_original = original_model.generate(input_features)
    transcription_original = processor.batch_decode(predicted_ids_original, skip_special_tokens=True)[0]

# --- 5️⃣ Print results ---
print("\n=== Transcriptions ===")
print(f"Fine-tuned Swahili Whisper-Small: {transcription_finetuned}")
print(f"Original Whisper-Small (OpenAI): {transcription_original}")
```
## How to Use
```python
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torchaudio
from huggingface_hub import login

# Optional: log in if the model is private (i.e., you get 401 errors)
login(token="hf_somekey")

# ============================================================
# 1️⃣ Load Processor and Quantized Model
# ============================================================
MODEL_ID = "adoamesh/whisper-tiny-swahili-distilled-8bit"

processor = WhisperProcessor.from_pretrained(MODEL_ID)
model = WhisperForConditionalGeneration.from_pretrained(MODEL_ID)
model.eval()

# Force CPU execution (the quantized model runs on CPU only)
device = torch.device("cpu")
model.to(device)

# ============================================================
# 2️⃣ Load and Preprocess Audio
# ============================================================
audio_path = "Recording.wav"
waveform, sample_rate = torchaudio.load(audio_path)

# Resample to 16 kHz if necessary
if sample_rate != 16000:
    resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)
    waveform = resampler(waveform)

# Prepare input features
input_features = processor(
    waveform.squeeze().numpy(),  # remove channel dim if present
    sampling_rate=16000,
    return_tensors="pt"
).input_features.to(device)

# ============================================================
# 3️⃣ Generate Transcription
# ============================================================
with torch.no_grad():
    predicted_ids = model.generate(input_features)

# Decode text
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(f"\n🗣️ Transcription:\n{transcription}\n")
```
### Results (Ranked by Human Evaluation)
=== Transcriptions ===
- Fine-tuned Swahili Whisper-Small: Ni limwambia yulebindi kwamba na mpenda. 1️⃣
- Original Whisper-Small (OpenAI): Nili mwabi aiyule bindi kwa mba nampe dha. 2️⃣
## Model Architecture
The model follows the Whisper Tiny architecture with:
- Encoder-decoder transformer structure
- 4 encoder layers and 4 decoder layers
- Model dimension: 384
- Feed-forward dimension: 1536
- 6 attention heads
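These dimensions can be verified directly from the model configuration; a minimal sketch using the standard `WhisperConfig` attribute names from Transformers:

```python
from transformers import WhisperConfig

config = WhisperConfig.from_pretrained("adoamesh/whisper-tiny-swahili-distilled-8bit")
print(config.encoder_layers, config.decoder_layers)  # encoder/decoder depth
print(config.d_model)                                # model dimension
print(config.encoder_ffn_dim)                        # feed-forward dimension
print(config.encoder_attention_heads)                # attention heads
```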
## Hardware Requirements
- CPU: Any modern CPU
- RAM: minimum 2 GB
- Storage: ~150 MB for the model
## Ethical Considerations
- The model should not be used for surveillance or discriminatory purposes
- Users should be aware of potential biases in the training data
- The model may not perform equally well across all Swahili dialects
## Citation
If you use this model, please cite:
```bibtex
@misc{whisper-tiny-swahili-distilled-8bit,
  author = {Daniel Amemba Odhiambo},
  title = {Whisper Tiny Swahili Distilled 8-bit},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/adoamesh/whisper-tiny-swahili-distilled-8bit}}
}
```
### Dataset Citations
#### Fleurs-SLU for the Swahili speech data
```bibtex
@misc{schmidt2025fleursslumassivelymultilingualbenchmark,
  title={Fleurs-SLU: A Massively Multilingual Benchmark for Spoken Language Understanding},
  author={Fabian David Schmidt and Ivan Vulić and Goran Glavaš and David Ifeoluwa Adelani},
  year={2025},
  eprint={2501.06117},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2501.06117}
}

@misc{adelani2023sib200,
  title={SIB-200: A Simple, Inclusive, and Big Evaluation Dataset for Topic Classification in 200+ Languages and Dialects},
  author={David Ifeoluwa Adelani and Hannah Liu and Xiaoyu Shen and Nikita Vassilyev and Jesujoba O. Alabi and Yanke Mao and Haonan Gao and Annie En-Shiun Lee},
  year={2023},
  eprint={2309.07445},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```
## Acknowledgments
- OpenAI for the original Whisper model
- Fleurs-SLU for the Swahili speech data
- Hugging Face for the Transformers library and model hosting