NB/WARNING: I am aware of the inference challenges and will find time to work on this. In the meantime, please check this alternative model: adoamesh/whisper-small-swh-finetuned. Note that it is NOT quantized.

On Windows, `_empty_affine_quantized` is not implemented in PyTorch, so loading this quantized model fails immediately. Use Linux or another UNIX-like OS instead.
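A minimal guard you could add before loading, to fail fast with a clearer message (the check itself is illustrative, not part of this repo):

```python
import platform

# Dynamic int8 ops (_empty_affine_quantized) are missing on Windows
# builds of PyTorch, so bail out early with a clear message.
if platform.system() == "Windows":
    raise RuntimeError(
        "This int8-quantized Whisper model requires Linux/UNIX: "
        "_empty_affine_quantized is not implemented on Windows."
    )
```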

Model Card for Whisper Tiny Swahili Distilled 8-bit (CPU Support)

Model Description

This model is a quantized version of a distilled Whisper Tiny model specifically trained for Swahili speech recognition. The model has been compressed using dynamic quantization (int8) to reduce its size while largely preserving performance.

  • Model Type: Automatic Speech Recognition (ASR)
  • Language: Swahili
  • Base Model: OpenAI Whisper Tiny
  • Quantization: Dynamic quantization (int8); runs on CPU only
  • Distillation: Knowledge distillation from Whisper Small

Intended Uses & Limitations

Intended Uses

  • Transcription of Swahili audio to text
  • Speech-to-text applications for Swahili language
  • Research in low-resource language speech recognition

Limitations

  • May perform poorly on dialects or accents not well-represented in training data
  • May struggle with technical terminology or domain-specific vocabulary
  • Performance may degrade with noisy audio recordings
  • Quantization may introduce slight degradation in transcription quality compared to the full-precision model

Training Data

The model was trained on Swahili speech data from the Fleurs-SLU dataset (see citations below):

  • Training set: 3.62 hours of Swahili audio
  • Test set: 0.60 hours of Swahili audio
  • Total dataset: 4.22 hours of Swahili audio

Training Procedure

Knowledge Distillation

The model was trained via knowledge distillation, with Whisper Small as the teacher and Whisper Tiny as the student. The distillation loss, sketched below the list, combines:

  • Soft targets from the teacher model (temperature-scaled logits)
  • Hard targets from the ground truth transcriptions
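A minimal sketch of such a combined objective, following the standard Hinton-style formulation; `temperature` and `alpha` here are illustrative defaults, not the actual training hyperparameters:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    # Soft-target term: KL divergence between the temperature-scaled
    # teacher and student distributions, scaled by T^2 (Hinton et al.)
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # Hard-target term: cross-entropy against the ground-truth tokens;
    # -100 is the padding convention used in Transformers label tensors
    hard = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )
    return alpha * soft + (1 - alpha) * hard
```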

Quantization

After distillation, the model was quantized using PyTorch's dynamic quantization (see the sketch after this list):

  • Linear layers were quantized to int8
  • Quantization was applied post-training (PTQ)
  • Model size was reduced while maintaining most of the performance
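This corresponds to PyTorch's `torch.quantization.quantize_dynamic` API. A minimal sketch, assuming the distilled float32 checkpoint is saved locally (the path is illustrative):

```python
import torch
from transformers import WhisperForConditionalGeneration

# Load the distilled float32 student (path is illustrative)
model = WhisperForConditionalGeneration.from_pretrained(
    "./whisper-tiny-swahili-distilled"
)

# Post-training dynamic quantization: nn.Linear weights are stored
# as int8, activations are quantized on the fly at inference time.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```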

Evaluation Results

Distilled Model Performance

The distilled model was evaluated every 250 steps over 1000 training steps, with the following metrics:

| Step | Training Loss | Validation Loss | WER (%) |
|------|---------------|-----------------|---------|
| 250  | 0.120300      | 0.711645        | 37.0963 |
| 500  | 0.014100      | 0.770001        | 33.0668 |
| 750  | 0.001000      | 0.792062        | 32.2354 |
| 1000 | 0.000700      | 0.803786        | 32.0754 |


Comparison with Original Model

| Model | WER (%) | Improvement |
|-------|---------|-------------|
| Original Whisper-Small (OpenAI)    | 103.10 | – |
| Fine-tuned Whisper-Small (Swahili) | 32.07  | 71.03-point drop; (103.10 − 32.07) / 103.10 ≈ 68.9% relative reduction |

Raw evaluation output at step 1000:

```
{'eval_loss': 0.8037856817245483, 'eval_wer': 32.075471698113205, 'eval_runtime': 74.3925, 'eval_samples_per_second': 2.07, 'eval_steps_per_second': 0.269, 'epoch': 19.607843137254903}
```

Training Statistics

  • Training runtime: 8323.69 seconds
  • Training samples per second: 3.844
  • Training steps per second: 0.24
  • Total FLOPs: 9.17e+18
  • Epochs: 39.22

Evaluation Statistics

  • Evaluation loss: 0.8844
  • Evaluation WER: 32.1394% (see the computation sketch after this list)
  • Evaluation runtime: 73.19 seconds
  • Evaluation samples per second: 2.104
  • Evaluation steps per second: 0.273
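For reference, WER figures like the one above can be computed with the Hugging Face `evaluate` library. A minimal sketch with placeholder strings (not the actual evaluation data):

```python
import evaluate

# Load the word-error-rate metric
wer_metric = evaluate.load("wer")

# Placeholder strings purely for illustration
predictions = ["hypothetical model transcription"]
references = ["hypothetical reference transcription"]

# `compute` returns a fraction; the card reports WER as a percentage
wer_percent = 100 * wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer_percent:.2f}%")
```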

Comparison on an Example Transcription

Swahili Whisper Speech-to-Text Transcription

Code Implementation

```python
import torch
import torchaudio
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# --- 1️⃣ Load processor (shared between models) ---
processor = WhisperProcessor.from_pretrained("openai/whisper-small")

# --- 2️⃣ Load models ---
# Fine-tuned Swahili Whisper-Small
finetuned_model_path = "./Ex02/whisper-small-swh/checkpoint-1000"
finetuned_model = WhisperForConditionalGeneration.from_pretrained(finetuned_model_path)
finetuned_model.generation_config.language = "swahili"
finetuned_model.generation_config.task = "transcribe"
finetuned_model.eval()

# Original OpenAI Whisper-Small
original_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
original_model.generation_config.language = "swahili"
original_model.generation_config.task = "transcribe"
original_model.eval()

# --- 3️⃣ Load audio file ---
audio_path = "./Ex02/Recording.wav"
waveform, sample_rate = torchaudio.load(audio_path)

# Resample if not 16kHz
if sample_rate != 16000:
    resampler = torchaudio.transforms.Resample(sample_rate, 16000)
    waveform = resampler(waveform)

# Prepare input features
input_features = processor(
    waveform.numpy()[0],
    sampling_rate=16000,
    return_tensors="pt"
).input_features

# --- 4️⃣ Transcribe ---
with torch.no_grad():
    # Fine-tuned
    predicted_ids_finetuned = finetuned_model.generate(input_features)
    transcription_finetuned = processor.batch_decode(predicted_ids_finetuned, skip_special_tokens=True)[0]

    # Original
    predicted_ids_original = original_model.generate(input_features)
    transcription_original = processor.batch_decode(predicted_ids_original, skip_special_tokens=True)[0]

# --- 5️⃣ Print results ---
print("\n=== Transcriptions ===")
print(f"Fine-tuned Swahili Whisper-Small: {transcription_finetuned}")
print(f"Original Whisper-Small (OpenAI): {transcription_original}")


How to Use

```python
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torchaudio
from huggingface_hub import login

# Optional: log in if the model is private (i.e., requests fail with HTTP 401)
login(token="hf_somekey")  # replace with your own token

# ============================================================
# 1️⃣ Load Processor and Quantized Model
# ============================================================
MODEL_ID = "adoamesh/whisper-tiny-swahili-distilled-8bit"

processor = WhisperProcessor.from_pretrained(MODEL_ID)
model = WhisperForConditionalGeneration.from_pretrained(MODEL_ID)
model.eval()

# Force CPU execution (dynamic int8 quantization runs on CPU only)
device = torch.device("cpu")
model.to(device)

# ============================================================
# 2️⃣ Load and Preprocess Audio
# ============================================================
audio_path = "Recording.wav"
waveform, sample_rate = torchaudio.load(audio_path)

# Resample to 16 kHz if necessary
if sample_rate != 16000:
    resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)
    waveform = resampler(waveform)

# Prepare input features
input_features = processor(
    waveform.squeeze().numpy(),  # remove channel dim if present
    sampling_rate=16000,
    return_tensors="pt"
).input_features.to(device)

# ============================================================
# 3️⃣ Generate Transcription
# ============================================================
with torch.no_grad():
    predicted_ids = model.generate(input_features)

# Decode text
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(f"\n🗣️  Transcription:\n{transcription}\n")
Results Output ranked (Human Eval)

=== Transcriptions ===

  • Fine-tuned Swahili Whisper-Small: Ni limwambia yulebindi kwamba na mpenda. 1️⃣
  • Original Whisper-Small (OpenAI): Nili mwabi aiyule bindi kwa mba nampe dha. 2️⃣

Model Architecture

The model follows the Whisper Tiny architecture (mirrored in the config sketch after this list):

  • Encoder-decoder transformer structure
  • 4 encoder layers and 4 decoder layers
  • Model dimension: 384
  • Feed-forward dimension: 1536
  • 6 attention heads
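These dimensions match the Whisper Tiny defaults in `transformers.WhisperConfig`. A sketch of the corresponding config, built from the values above rather than loaded from the actual checkpoint:

```python
from transformers import WhisperConfig

# Whisper Tiny dimensions as listed above (illustrative; not the
# exact config shipped with this checkpoint)
config = WhisperConfig(
    d_model=384,                # model dimension
    encoder_layers=4,
    decoder_layers=4,
    encoder_attention_heads=6,
    decoder_attention_heads=6,
    encoder_ffn_dim=1536,       # feed-forward dimension
    decoder_ffn_dim=1536,
)
print(config)
```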

Hardware Requirements

  • CPU: Any modern CPU
  • RAM: Minimum 2GB
  • Storage: ~150MB for the model

Ethical Considerations

  • The model should not be used for surveillance or discriminatory purposes
  • Users should be aware of potential biases in the training data
  • The model may not perform equally well across all Swahili dialects

Citation

If you use this model, please cite:

@misc{whisper-tiny-swahili-distilled-8bit,
  author = {Daniel Amemba Odhiambo},
  title = {Whisper Tiny Swahili Distilled 8-bit},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/adoamesh/whisper-tiny-swahili-distilled-8bit}}
}

Dataset Citations

Fleurs-SLU (Swahili speech data)

@misc{schmidt2025fleursslumassivelymultilingualbenchmark,
      title={Fleurs-SLU: A Massively Multilingual Benchmark for Spoken Language Understanding}, 
      author={Fabian David Schmidt and Ivan Vulić and Goran Glavaš and David Ifeoluwa Adelani},
      year={2025},
      eprint={2501.06117},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2501.06117}, 
}

@misc{adelani2023sib200,
      title={SIB-200: A Simple, Inclusive, and Big Evaluation Dataset for Topic Classification in 200+ Languages and Dialects}, 
      author={David Ifeoluwa Adelani and Hannah Liu and Xiaoyu Shen and Nikita Vassilyev and Jesujoba O. Alabi and Yanke Mao and Haonan Gao and Annie En-Shiun Lee},
      year={2023},
      eprint={2309.07445},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}



Acknowledgments

  • OpenAI for the original Whisper model
  • Fleurs-SLU for the Swahili speech data
  • Hugging Face for the Transformers library and model hosting

Contact Author on LinkedIn

